-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
Unrevert #18259 #18280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unrevert #18259 #18280
Conversation
This reverts commit 6d179b3.
It's true; the more we try to "handle invalid data correctly" the more performance we'll give up in the common case, which I don't think is a good trade-off. |
See my comment at: #18259 (comment) Have you tried that first? |
@nalimilan Sure, that might fix the immediate issue. It doesn't change that
On my machine, benchmarking reveals a 5% difference in favour of But yes, I agree that we should investigate why it isn't being hoisted. I'll look into that, but I think this change is still a good idea. |
a = [x for x in String([0xcf, 0x83, 0x83, 0x83, 0x83])] | ||
@test a[1] == 'σ' | ||
# we assume for performance that next/nextind return values past the end of a | ||
# string's underlying data; this helps with performance of `done`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this an "assumption"? Since this is only for String
, don't we control what next
and nextind
do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. How's "as an invariant"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, I don't understand why this comment is in the test file at all. Shouldn't any comments like this be with the done
method?
Second, I was thinking of something like # This implementation relies on the fact that next/nextind eventually return values past the end of a String's underlying data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new comment is still kind of misleading, since we "rely" on something that the current implementation doesn't guarantee for invalid UTF-8. It should rather mention that it only works for valid strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For invalid strings, what happens? Skipping invalid character data seems fine, for example. Or does it throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't throw an exception. It will just return false
if there is still stuff to consume, even if that stuff cannot be read as a valid character. This is in contrast to the old behaviour, where done
will return true
if all remaining string data is junk, and throw an exception if the entire string is junk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me clarify the changes in behaviour versus the previous behaviour with a few examples:
Say we have a string, "S", and append to it some continuation characters.
julia> s = string("S", String([0x83, 0x83, 0x83, 0x83]));
In both previous and current behaviour, start
will return 1, and done
and next
will give the expected result:
julia> start(s)
1
julia> done(s, 1)
false
julia> next(s, 1)
('S',2)
In the previous version, done
will now return true
, as the remaining data in the string is composed entirely of continuation characters, which endof
skips. Whereas now, done
will return false
, as it believes that the character at 2
should be a valid index. next
will then throw up on the index that isn't right.
This leads to the observed behaviour change:
- In the previous version,
collect(string("S", String([0x83, 0x83, 0x83, 0x83])))
returned just['S']
, discarding junk at the end of the string. - In the current version, the same will fail.
It is not clear to me which is better, and I suspect whether silently pretending bad data does not exist versus raising an error on encountering it are both undesirable in their own ways. Note that both previous and current version throw an exception on collect(string("S", String([0x83, 0x83, 0x83, 0x83]), "S"))
; the fact that only continuation characters end a string is the only reason the previous version was able to collect
this case. If it were the case that collect
never failed before this change, then I would be more cautious about allowing it to—but the fact is, in most cases where the string has invalid data, collect
will fail with an exception anyway. If it is not desired for collect
to fail, then that is an easy fix: simply modify next
to avoid throwing an exception. The fact that next
does explicitly throw an exception is evidence to me that failing was the expected and desired behaviour when string iteration was designed.
There is, to my knowledge, just one other case where the behaviour changes, and that is when endof
itself throws an exception. Thus in the previous version, a BoundsError
is generated by endof
on a string consisting only of continuation characters; in the new version, a UnicodeError
is generated by next
.
Anyhow, I really don't think any of these behaviours are documented, reliable, or should factor strongly. I personally prefer the new behaviour because it is more consistent with concatenation—the new behaviour guarantees that two strings that collect properly will concatenate into a string that collects properly; the old behaviour makes no such guarantee. Having whether invalid data is in the middle of a string, at the beginning, or at the end determine whether a string will collect properly is somewhat strange. But I really think performance considerations should take priority over how exactly invalid data is handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throwing an exception on iterating over characters in an invalid string seems like an okay behavior to me.
Basically, if you are in a situation where you are stuffing arbitrary binary data into a String
for some reason, then you usually shouldn't be iterating over characters at all.
(One thing that I find annoying is that show
barfs with an error on showing an arbitrary string, whereas I think it should give some useful information, at least in the REPL. But that is something for a different PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(See also #18296)
ce204b7
to
4121f0f
Compare
4121f0f
to
d89690e
Compare
Is this good to go? |
LGTM. |
Good to go here? |
Tentatively marking this as a candidate for backporting. |
Fix #18081 again.
The implementation of
next
forString
returns a value past the end of theString
's underlying data. Even in a black box, the only way fornext
to not do this is to know theString
's last index somehow, so any implementation that does not return a value past the end of theString
's underlying data is necessarily non-optimal. Thus I think it's reasonable to make this assumption.I threw out the test on
String([0xcf, 0x83, 0x83, 0x83, 0x83])
because the case where aString
contains invalid UTF-8 is really not worth dealing with. These strings already cause iteration problems anyway. Yes, this string now does notcollect
properly—but it wasn't really working well to begin with; sinceString([0xcf, 0x83, 0x83, 0x83, 0x83]) * "x"
did not collect properly even before this change.cc @KristofferC, @tkelman