Skip to content

Unrevert #18259 #18280

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 8, 2016
Merged

Conversation

TotalVerb
Copy link
Contributor

@TotalVerb TotalVerb commented Aug 29, 2016

Fix #18081 again.

The implementation of next for String returns a value past the end of the String's underlying data. Even in a black box, the only way for next to not do this is to know the String's last index somehow, so any implementation that does not return a value past the end of the String's underlying data is necessarily non-optimal. Thus I think it's reasonable to make this assumption.

I threw out the test on String([0xcf, 0x83, 0x83, 0x83, 0x83]) because the case where a String contains invalid UTF-8 is really not worth dealing with. These strings already cause iteration problems anyway. Yes, this string now does not collect properly—but it wasn't really working well to begin with; since String([0xcf, 0x83, 0x83, 0x83, 0x83]) * "x" did not collect properly even before this change.

cc @KristofferC, @tkelman

@tkelman tkelman added performance Must go faster unicode Related to unicode characters and encodings strings "Strings!" labels Aug 29, 2016
@JeffBezanson
Copy link
Member

It's true; the more we try to "handle invalid data correctly" the more performance we'll give up in the common case, which I don't think is a good trade-off.

@nalimilan
Copy link
Member

See my comment at: #18259 (comment) Have you tried that first?

@TotalVerb
Copy link
Contributor Author

TotalVerb commented Aug 29, 2016

@nalimilan Sure, that might fix the immediate issue. It doesn't change that endof is ridiculously expensive regardless, and that matters at least something for small strings. We really ought to avoid calling it unless we need to.

julia> function f()
           cs::Int32 = 0
           s = "こんにちは世界"
           e = endof(s)
           i = start(s)
           while i ≤ e
               v, i = next(s, i)
               cs = cs + Int32(v)
           end
           cs
       end
f (generic function with 1 method)

julia> f()
112003

julia> function g()
           cs::Int32 = 0
           s = "こんにちは世界"
           for c in s
               cs = cs + Int32(c)
           end
           cs
       end
g (generic function with 1 method)

julia> g()
112003

On my machine, benchmarking reveals a 5% difference in favour of g after this change, which is still nothing to scoff at—this difference is 4 times greater than the difference from adding @inbounds.

But yes, I agree that we should investigate why it isn't being hoisted. I'll look into that, but I think this change is still a good idea.

a = [x for x in String([0xcf, 0x83, 0x83, 0x83, 0x83])]
@test a[1] == 'σ'
# we assume for performance that next/nextind return values past the end of a
# string's underlying data; this helps with performance of `done`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an "assumption"? Since this is only for String, don't we control what next and nextind do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. How's "as an invariant"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, I don't understand why this comment is in the test file at all. Shouldn't any comments like this be with the done method?

Second, I was thinking of something like # This implementation relies on the fact that next/nextind eventually return values past the end of a String's underlying data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new comment is still kind of misleading, since we "rely" on something that the current implementation doesn't guarantee for invalid UTF-8. It should rather mention that it only works for valid strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For invalid strings, what happens? Skipping invalid character data seems fine, for example. Or does it throw an exception?

Copy link
Contributor Author

@TotalVerb TotalVerb Aug 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't throw an exception. It will just return false if there is still stuff to consume, even if that stuff cannot be read as a valid character. This is in contrast to the old behaviour, where done will return true if all remaining string data is junk, and throw an exception if the entire string is junk.

Copy link
Contributor Author

@TotalVerb TotalVerb Aug 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me clarify the changes in behaviour versus the previous behaviour with a few examples:

Say we have a string, "S", and append to it some continuation characters.

julia> s = string("S", String([0x83, 0x83, 0x83, 0x83]));

In both previous and current behaviour, start will return 1, and done and next will give the expected result:

julia> start(s)
1

julia> done(s, 1)
false

julia> next(s, 1)
('S',2)

In the previous version, done will now return true, as the remaining data in the string is composed entirely of continuation characters, which endof skips. Whereas now, done will return false, as it believes that the character at 2 should be a valid index. next will then throw up on the index that isn't right.

This leads to the observed behaviour change:

  • In the previous version, collect(string("S", String([0x83, 0x83, 0x83, 0x83]))) returned just ['S'], discarding junk at the end of the string.
  • In the current version, the same will fail.

It is not clear to me which is better, and I suspect whether silently pretending bad data does not exist versus raising an error on encountering it are both undesirable in their own ways. Note that both previous and current version throw an exception on collect(string("S", String([0x83, 0x83, 0x83, 0x83]), "S")); the fact that only continuation characters end a string is the only reason the previous version was able to collect this case. If it were the case that collect never failed before this change, then I would be more cautious about allowing it to—but the fact is, in most cases where the string has invalid data, collect will fail with an exception anyway. If it is not desired for collect to fail, then that is an easy fix: simply modify next to avoid throwing an exception. The fact that next does explicitly throw an exception is evidence to me that failing was the expected and desired behaviour when string iteration was designed.


There is, to my knowledge, just one other case where the behaviour changes, and that is when endof itself throws an exception. Thus in the previous version, a BoundsError is generated by endof on a string consisting only of continuation characters; in the new version, a UnicodeError is generated by next.


Anyhow, I really don't think any of these behaviours are documented, reliable, or should factor strongly. I personally prefer the new behaviour because it is more consistent with concatenation—the new behaviour guarantees that two strings that collect properly will concatenate into a string that collects properly; the old behaviour makes no such guarantee. Having whether invalid data is in the middle of a string, at the beginning, or at the end determine whether a string will collect properly is somewhat strange. But I really think performance considerations should take priority over how exactly invalid data is handled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing an exception on iterating over characters in an invalid string seems like an okay behavior to me.

Basically, if you are in a situation where you are stuffing arbitrary binary data into a String for some reason, then you usually shouldn't be iterating over characters at all.

(One thing that I find annoying is that show barfs with an error on showing an arbitrary string, whereas I think it should give some useful information, at least in the REPL. But that is something for a different PR.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(See also #18296)

@TotalVerb TotalVerb force-pushed the fw/test-next-endof-data branch from ce204b7 to 4121f0f Compare August 30, 2016 19:33
@TotalVerb
Copy link
Contributor Author

Is this good to go?

@stevengj
Copy link
Member

LGTM.

@KristofferC
Copy link
Member

Good to go here?

@stevengj stevengj merged commit 62c3bfa into JuliaLang:master Dec 8, 2016
@KristofferC
Copy link
Member

Tentatively marking this as a candidate for backporting.

tkelman pushed a commit that referenced this pull request Mar 1, 2017
* Revert "Revert #182599 "add faster done for strings" (#18275)"

This reverts commit 6d179b3.

* Test that next(s, endof(s)) > endof(s.data)

(cherry picked from commit 62c3bfa)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster strings "Strings!" unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants