Unrevert #18259 #18280

TotalVerb · 2016-08-29T18:19:52Z

Fix #18081 again.

The implementation of next for String returns a value past the end of the String's underlying data. Even in a black box, the only way for next to not do this is to know the String's last index somehow, so any implementation that does not return a value past the end of the String's underlying data is necessarily non-optimal. Thus I think it's reasonable to make this assumption.
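As a rough illustration of the argument above, here is a minimal sketch on plain byte vectors. It is a model, not the Base implementation: `is_continuation`, `last_char_index`, `done_old`, and `done_new` are hypothetical names introduced only for this example. Finding the last character index requires a backwards scan, whereas comparing against the byte length is a single comparison, and for valid UTF-8 the two checks agree at every character boundary.

```julia
# Hypothetical model (not Base code): a string is represented by its raw bytes.
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Index of the byte that starts the last character, found by scanning
# backwards over continuation bytes (roughly the work `endof` has to do).
function last_char_index(data::Vector{UInt8})
    i = length(data)
    while i > 0 && is_continuation(data[i])
        i -= 1
    end
    return i
end

# Old-style termination check: needs the backwards scan.
done_old(data::Vector{UInt8}, i::Int) = i > last_char_index(data)

# New-style termination check: one comparison against the byte length.
done_new(data::Vector{UInt8}, i::Int) = i > length(data)

data = Vector{UInt8}(codeunits("aσb"))   # bytes: 0x61 0xcf 0x83 0x62
boundaries = [1, 2, 4, 5]                 # starts of 'a', 'σ', 'b', then one past the end
agree = all(done_old(data, i) == done_new(data, i) for i in boundaries)
```

The agreement only holds because every index produced by decoding valid data is either a character start or one past the last byte, which is the property the PR relies on.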

I threw out the test on String([0xcf, 0x83, 0x83, 0x83, 0x83]) because the case where a String contains invalid UTF-8 is really not worth dealing with. These strings already cause iteration problems anyway. Yes, this string no longer collects properly, but it wasn't really working well to begin with, since String([0xcf, 0x83, 0x83, 0x83, 0x83]) * "x" did not collect properly even before this change.

cc @KristofferC, @tkelman

This reverts commit 6d179b3.

JeffBezanson · 2016-08-29T19:14:44Z

It's true; the more we try to "handle invalid data correctly" the more performance we'll give up in the common case, which I don't think is a good trade-off.

nalimilan · 2016-08-29T19:26:50Z

See my comment at #18259 (comment). Have you tried that first?

TotalVerb · 2016-08-29T19:36:44Z

@nalimilan Sure, that might fix the immediate issue. It doesn't change the fact that endof is ridiculously expensive regardless, and that matters at least somewhat for small strings. We really ought to avoid calling it unless we need to.

julia> function f()
           cs::Int32 = 0
           s = "こんにちは世界"
           e = endof(s)
           i = start(s)
           while i ≤ e
               v, i = next(s, i)
               cs = cs + Int32(v)
           end
           cs
       end
f (generic function with 1 method)

julia> f()
112003

julia> function g()
           cs::Int32 = 0
           s = "こんにちは世界"
           for c in s
               cs = cs + Int32(c)
           end
           cs
       end
g (generic function with 1 method)

julia> g()
112003

On my machine, benchmarking reveals a 5% difference in favour of g after this change, which is still nothing to scoff at: it is 4 times greater than the difference from adding @inbounds.

But yes, I agree that we should investigate why it isn't being hoisted. I'll look into that, but I think this change is still a good idea.

stevengj · 2016-08-30T15:01:37Z

test/strings/basic.jl

-a = [x for x in String([0xcf, 0x83, 0x83, 0x83, 0x83])]
-@test a[1] == 'σ'
+# we assume for performance that next/nextind return values past the end of a
+# string's underlying data; this helps with performance of `done`.


Why is this an "assumption"? Since this is only for String, don't we control what next and nextind do?

I agree. How's "as an invariant"?

First, I don't understand why this comment is in the test file at all. Shouldn't any comments like this be with the done method?

Second, I was thinking of something like # This implementation relies on the fact that next/nextind eventually return values past the end of a String's underlying data

The new comment is still kind of misleading, since we "rely" on something that the current implementation doesn't guarantee for invalid UTF-8. It should rather mention that it only works for valid strings.

For invalid strings, what happens? Skipping invalid character data seems fine, for example. Or does it throw an exception?

It won't throw an exception. done will just return false if there is still data to consume, even if that data cannot be read as a valid character. This is in contrast to the old behaviour, where done returns true if all remaining string data is junk, and throws an exception if the entire string is junk.

Let me clarify the changes in behaviour versus the previous behaviour with a few examples:

Say we have a string, "S", and append some continuation bytes to it.

julia> s = string("S", String([0x83, 0x83, 0x83, 0x83]));

In both previous and current behaviour, start will return 1, and done and next will give the expected result:

julia> start(s)
1

julia> done(s, 1)
false

julia> next(s, 1)
('S',2)

In the previous version, done(s, 2) returns true, as the remaining data in the string is composed entirely of continuation bytes, which endof skips. Now, done(s, 2) returns false, because index 2 is still within the string's underlying data. next then throws on that index, since no valid character starts there.

This leads to the observed behaviour change:

In the previous version, collect(string("S", String([0x83, 0x83, 0x83, 0x83]))) returned just ['S'], discarding junk at the end of the string.

In the current version, the same call throws an exception.
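The behaviour change can be sketched on a toy model that works on raw bytes. Everything here is an assumption made for illustration: `next_model`, `collect_model`, `done_old`, and `done_new` are hypothetical stand-ins, not the Base methods, and the decoder only checks that a character does not start on a continuation byte.

```julia
# Hypothetical model (not Base code) of the two termination rules.
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Decode the character starting at byte i; fail if i is a continuation byte.
function next_model(data::Vector{UInt8}, i::Int)
    b = data[i]
    is_continuation(b) && error("invalid UTF-8 at index $i")
    width = b < 0x80 ? 1 : b < 0xe0 ? 2 : b < 0xf0 ? 3 : 4
    return String(data[i:i+width-1])[1], i + width
end

# Start of the last character: skip trailing continuation bytes.
function last_char_index(data::Vector{UInt8})
    i = length(data)
    while i > 0 && is_continuation(data[i])
        i -= 1
    end
    return i
end

done_old(data, i) = i > last_char_index(data)   # previous rule
done_new(data, i) = i > length(data)            # rule after this change

# Collect characters using whichever termination rule is passed in.
function collect_model(data::Vector{UInt8}, done)
    out = Char[]
    i = 1
    while !done(data, i)
        c, i = next_model(data, i)
        push!(out, c)
    end
    return out
end

# Returns true if collecting with the given rule raises an error.
function fails(data, done)
    try
        collect_model(data, done)
        return false
    catch
        return true
    end
end

trailing = UInt8[0x53, 0x83, 0x83, 0x83, 0x83]   # "S" plus four continuation bytes
middle   = UInt8[0x53, 0x83, 0x53]               # junk in the middle of the string

old_result = collect_model(trailing, done_old)   # trailing junk is silently dropped
```

In this model `fails(trailing, done_new)` is true while `fails(trailing, done_old)` is false, and both rules fail on `middle`, which mirrors the distinction drawn in the comment.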

It is not clear to me which is better, and I suspect that silently pretending bad data does not exist and raising an error on encountering it are both undesirable in their own ways. Note that both the previous and the current version throw an exception on collect(string("S", String([0x83, 0x83, 0x83, 0x83]), "S")); the fact that the string ends with nothing but continuation bytes is the only reason the previous version was able to collect this case.

If it were the case that collect never failed before this change, then I would be more cautious about allowing it to. But the fact is that in most cases where the string has invalid data, collect will fail with an exception anyway. If it is not desired for collect to fail, then that is an easy fix: simply modify next to avoid throwing an exception. The fact that next does explicitly throw an exception is evidence to me that failing was the expected and desired behaviour when string iteration was designed.

There is, to my knowledge, just one other case where the behaviour changes: when endof itself throws an exception. In the previous version, a BoundsError is generated by endof on a string consisting only of continuation bytes; in the new version, a UnicodeError is generated by next.

Anyhow, I really don't think any of these behaviours are documented or reliable, or that they should factor strongly into the decision. I personally prefer the new behaviour because it is more consistent with concatenation: the new behaviour guarantees that two strings that collect properly will concatenate into a string that collects properly; the old behaviour makes no such guarantee. It is somewhat strange for the position of the invalid data (beginning, middle, or end of the string) to determine whether a string collects properly. But I really think performance considerations should take priority over how exactly invalid data is handled.

Throwing an exception on iterating over characters in an invalid string seems like an okay behavior to me.

Basically, if you are in a situation where you are stuffing arbitrary binary data into a String for some reason, then you usually shouldn't be iterating over characters at all.

(One thing that I find annoying is that show barfs with an error on showing an arbitrary string, whereas I think it should give some useful information, at least in the REPL. But that is something for a different PR.)

(See also #18296)

TotalVerb · 2016-09-29T02:07:12Z

Is this good to go?

stevengj · 2016-09-29T19:01:43Z

LGTM.

KristofferC · 2016-12-08T07:18:49Z

Good to go here?

KristofferC · 2016-12-08T16:19:54Z

Tentatively marking this as a candidate for backporting.

Revert "Revert #182599 "add faster done for strings" (JuliaLang#18275)"

579366b

This reverts commit 6d179b3.

tkelman added the performance (Must go faster), unicode (Related to unicode characters and encodings), and strings ("Strings!") labels Aug 29, 2016

TotalVerb mentioned this pull request Aug 29, 2016

Revert #182599 "add faster done for strings" #18275

Merged

stevengj reviewed Aug 30, 2016

TotalVerb force-pushed the fw/test-next-endof-data branch from ce204b7 to 4121f0f August 30, 2016 19:33

Test that next(s, endof(s)) > endof(s.data)

d89690e

TotalVerb force-pushed the fw/test-next-endof-data branch from 4121f0f to d89690e August 30, 2016 20:43

TotalVerb mentioned this pull request Sep 29, 2016

add faster done for strings #18259

Merged

nalimilan approved these changes Sep 29, 2016


stevengj merged commit 62c3bfa into JuliaLang:master Dec 8, 2016

KristofferC added the backport pending 0.5 label Dec 8, 2016

tkelman pushed a commit that referenced this pull request Mar 1, 2017

Unrevert #18259 (#18280)

eac77de

* Revert "Revert #182599 "add faster done for strings" (#18275)" This reverts commit 6d179b3. * Test that next(s, endof(s)) > endof(s.data) (cherry picked from commit 62c3bfa)

tkelman removed the backport pending 0.5 label Mar 5, 2017
