fix/json unicode offsets #2659
Conversation
**Codecov Report** ❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff            @@
##               main    #2659   +/-  ##
=========================================
  Coverage        46%      46%
- Complexity     6721     6724     +3
=========================================
  Files           794      794
  Lines         65912    65954    +42
  Branches       9888     9903    +15
=========================================
+ Hits          30824    30870    +46
+ Misses        32698    32690     -8
- Partials       2390     2394     +4
```
I think there's an uncommitted file?

Yes, there was an uncommitted file. Thanks
DavyLandman left a comment
I think this looks good, but I have the feeling it's duplicating code we already have in the ColumnOffset code. I suspect we'll be able to reduce the size of this PR (and thus new code to maintain) if we can reuse that.
// For every high surrogate we assume a low surrogate will follow,
// and we count only one of them for the character offset by increasing `shift`
Would this mean that a broken pair causes incorrect offsets? Is that fine?
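For context, the counting scheme the quoted comment describes can be sketched as a standalone example. This is a hypothetical illustration (`SurrogateShift` and `charOffsets` are made-up names, not the PR's code), and it also exhibits the broken-pair edge case asked about above:

```java
// Hypothetical sketch of surrogate-aware offset counting; not the PR's code.
// A Java char[] holds UTF-16 code units; a supplementary character occupies
// two chars (a high surrogate followed by a low surrogate) but should count
// as a single character offset in loc terms.
public class SurrogateShift {

    /** Count character offsets in a buffer, folding each surrogate pair into one. */
    public static int charOffsets(char[] buf, int len) {
        int shift = 0; // number of units folded into their high surrogate
        for (int i = 0; i < len; i++) {
            if (Character.isHighSurrogate(buf[i])) {
                // assume a low surrogate follows; count the pair once
                shift++;
            }
        }
        return len - shift;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (one code point, two UTF-16 units), 'b'
        char[] smiley = "a\uD83D\uDE00b".toCharArray();
        System.out.println(charOffsets(smiley, smiley.length)); // 3

        // A lone high surrogate (broken pair) is still assumed to start a pair,
        // so the count here is 2 instead of 3 -- the off-by-one the review
        // comment is asking about.
        char[] broken = {'x', '\uD83D', 'y'};
        System.out.println(charOffsets(broken, broken.length)); // 2
    }
}
```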
 * See the body of {@link JsonReader#fillBuffer(int minimum)} for the contract that we must satisfy and
 * the preconditions we are given at every call to {@link #read(char[], int, int)}.
 */
public static class OriginTrackingReader extends FilterReader {
A lot of this code reminds me of the ColumnMaps and LineColumnOffsetMap (and its implementation ArrayLineOffsetMap). They were migrated to rascal a while back, so maybe we can reuse that code here?
Perhaps LineColumnOffsetMap needs to learn a few new features, but it has been used for quite a few years in rascal-lsp for a very similar job.



Fixes unicode offsets for the JSON parser/validator:
- `src` keyword fields
- `loc` semantics of vallang and Rascal (offset, length, line, column)

This makes JSON parsers ready for use in an editor/UI context. From that perspective, this was a bug. From the "we had a reasonable JSON parser" perspective, this was an enhancement.
Instruments the OriginTrackingReader embedded in JSONValueReader to accurately deal with the presence of unicode surrogate pairs in the char buffer of the reader.
Note that unicode characters in comments shift the offsets just as much as unicode characters in string constants and field names.
- `pos` index into buffer compensated for surrogate pairs

The current solution still streams quickly and scales freely to very long JSON content, very long lines in JSON content, and very long comments or strings in JSON content.
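The overall approach can be sketched roughly as follows. This is a hypothetical `TrackingReader`, not the actual OriginTrackingReader in JSONValueReader: a FilterReader that counts code-point offsets rather than UTF-16 units by skipping low surrogates as the buffer streams past.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch of an origin-tracking reader; the real class in the PR
// differs in detail (and must also override read() and skip() to stay exact).
public class TrackingReader extends FilterReader {
    private int charOffset = 0; // offset in characters (code points), not UTF-16 units

    public TrackingReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        for (int i = off; i < off + n; i++) {
            // A surrogate pair contributes one character offset: count the
            // high surrogate, skip the low surrogate that follows it.
            if (!Character.isLowSurrogate(cbuf[i])) {
                charOffset++;
            }
        }
        return n;
    }

    public int getCharOffset() {
        return charOffset;
    }

    public static void main(String[] args) throws IOException {
        // {"a":"😀"} is 10 UTF-16 units but only 9 code points
        TrackingReader r = new TrackingReader(new StringReader("{\"a\":\"\uD83D\uDE00\"}"));
        char[] buf = new char[64];
        while (r.read(buf, 0, buf.length) != -1) {
            // drain the reader
        }
        System.out.println(r.getCharOffset()); // 9
    }
}
```

Note that this counting applies uniformly to the whole stream, which is why unicode characters in comments shift offsets exactly like those in strings and field names.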