fix/json unicode offsets #2659
Conversation
**Codecov Report** ❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff            @@
##               main    #2659   +/-  ##
=========================================
  Coverage        46%      46%
- Complexity     6721     6724     +3
=========================================
  Files           794      794
  Lines         65912    65954    +42
  Branches       9888     9903    +15
=========================================
+ Hits          30824    30870    +46
+ Misses        32698    32690     -8
- Partials       2390     2394     +4
```
I think there's an uncommitted file?

Yes, there was an uncommitted file. Thanks
DavyLandman left a comment
I think this looks good, but I have the feeling it's duplicating code we already have in the ColumnOffset code. I suspect we'll be able to reduce the size of this PR (and thus new code to maintain) if we can reuse that.
// For every high surrogate we assume a low surrogate will follow,
// and we count only one of them for the character offset by increasing `shift`
Would this mean that a broken pair causes incorrect offsets? Is that fine?
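For context, the counting scheme the quoted comment describes can be sketched as a standalone example. This is a hypothetical illustration (`SurrogateShift` and `charOffsets` are made-up names, not the PR's code), and it also exhibits the broken-pair edge case asked about above:

```java
// Hypothetical sketch of surrogate-aware offset counting; not the PR's code.
// A Java char[] holds UTF-16 code units; a supplementary character occupies
// two chars (a high surrogate followed by a low surrogate) but should count
// as a single character offset in loc terms.
public class SurrogateShift {

    /** Count character offsets in a buffer, folding each surrogate pair into one. */
    public static int charOffsets(char[] buf, int len) {
        int shift = 0; // number of units folded into their high surrogate
        for (int i = 0; i < len; i++) {
            if (Character.isHighSurrogate(buf[i])) {
                // assume a low surrogate follows; count the pair once
                shift++;
            }
        }
        return len - shift;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (one code point, two UTF-16 units), 'b'
        char[] smiley = "a\uD83D\uDE00b".toCharArray();
        System.out.println(charOffsets(smiley, smiley.length)); // 3

        // A lone high surrogate (broken pair) is still assumed to start a pair,
        // so the count here is 2 instead of 3 -- the off-by-one the review
        // comment is asking about.
        char[] broken = {'x', '\uD83D', 'y'};
        System.out.println(charOffsets(broken, broken.length)); // 2
    }
}
```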
 * See the body of {@link JsonReader#fillBuffer(int minimum)} for the contract that we must satisfy and
 * the preconditions we are given at every call to {@link #read(char[], int, int)}.
 */
public static class OriginTrackingReader extends FilterReader {
A lot of this code reminds me of the ColumnMaps and LineColumnOffsetMap (and its implementation ArrayLineOffsetMap). They were migrated to rascal a while back, so maybe we can reuse that code here?
Perhaps LineColumnOffsetMap needs to learn a few new features, but it has been used for quite a few years in rascal-lsp for a very similar job.



Fixes unicode offsets for the JSON parser/validator:
- `src` keyword fields
- `loc` semantics of vallang and Rascal (offset, length, line, column)

This makes JSON parsers ready for use in an editor/UI context. From that perspective, this was a bug. From the "we had a reasonable JSON parser" perspective, this was an enhancement.
Instruments the OriginTrackingReader embedded in JSONValueReader to accurately deal with the presence of unicode surrogate pairs in the char buffer of the reader.
Note that unicode characters in comments shift the offsets just as much as unicode characters in string constants and field names.
- `pos` index into buffer compensated for surrogate pairs

The current solution still streams quickly and scales freely to very long JSON content, very long lines in JSON content, and very long comments or strings in JSON content.
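The overall approach can be sketched roughly as follows. This is a hypothetical `TrackingReader`, not the actual OriginTrackingReader in JSONValueReader: a FilterReader that counts code-point offsets rather than UTF-16 units by skipping low surrogates as the buffer streams past.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch of an origin-tracking reader; the real class in the PR
// differs in detail (and must also override read() and skip() to stay exact).
public class TrackingReader extends FilterReader {
    private int charOffset = 0; // offset in characters (code points), not UTF-16 units

    public TrackingReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        for (int i = off; i < off + n; i++) {
            // A surrogate pair contributes one character offset: count the
            // high surrogate, skip the low surrogate that follows it.
            if (!Character.isLowSurrogate(cbuf[i])) {
                charOffset++;
            }
        }
        return n;
    }

    public int getCharOffset() {
        return charOffset;
    }

    public static void main(String[] args) throws IOException {
        // {"a":"😀"} is 10 UTF-16 units but only 9 code points
        TrackingReader r = new TrackingReader(new StringReader("{\"a\":\"\uD83D\uDE00\"}"));
        char[] buf = new char[64];
        while (r.read(buf, 0, buf.length) != -1) {
            // drain the reader
        }
        System.out.println(r.getCharOffset()); // 9
    }
}
```

Note that this counting applies uniformly to the whole stream, which is why unicode characters in comments shift offsets exactly like those in strings and field names.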