Parsing by tyxia · Pull Request #45641 · envoyproxy/envoy

tyxia · 2026-06-15T15:44:28Z

WuffsJsonCursor + Handler interface + tests.
The cursor is a self-contained library: it tokenizes JSON and fires callbacks.
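
To illustrate the "tokenize and fire callbacks" shape, here is a minimal sketch of a SAX-style handler with abort propagation. Every name in it (HandlerSketch, EventSketch, dispatchSketch, RecordingHandlerSketch) is hypothetical, and a pre-built event list stands in for the Wuffs tokenizer; the PR's actual Handler signatures may differ.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <variant>
#include <vector>

// Hypothetical callback interface; returning false asks the driver to stop.
class HandlerSketch {
public:
  virtual ~HandlerSketch() = default;
  virtual bool onKey(std::string_view key) = 0;
  virtual bool onBoolean(bool value) = 0;
};

// Pre-tokenized events stand in for a real tokenizer's output:
// a std::string is an object key, a bool is a boolean value.
using EventSketch = std::variant<std::string, bool>;

// Fires one callback per event and returns false as soon as a handler aborts.
inline bool dispatchSketch(const std::vector<EventSketch>& events, HandlerSketch& handler) {
  for (const EventSketch& ev : events) {
    const bool keep_going = std::holds_alternative<std::string>(ev)
                                ? handler.onKey(std::get<std::string>(ev))
                                : handler.onBoolean(std::get<bool>(ev));
    if (!keep_going) {
      return false;
    }
  }
  return true;
}

// Example handler: records keys and aborts on the first `false` boolean.
class RecordingHandlerSketch : public HandlerSketch {
public:
  bool onKey(std::string_view key) override {
    keys.emplace_back(key);
    return true;
  }
  bool onBoolean(bool value) override { return value; }
  std::vector<std::string> keys;
};
```

Returning bool from each callback keeps the abort path explicit: the driver stops at the first false and later events never reach the handler.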

repokitteh-read-only · 2026-06-15T15:44:34Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #45641 was opened by tyxia.

see: more, trace.

tyxia · 2026-06-15T18:25:32Z

/coverage

repokitteh-read-only · 2026-06-15T18:25:38Z

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/45641/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #45641 (comment) was created by @tyxia.

see: more, trace.

Signed-off-by: tyxia <tyxia@google.com>

Signed-off-by: tyxia <kobesummerrain@gmail.com>
Signed-off-by: tyxia <tyxia@google.com>

Cover six previously-uncovered lines in wuffs_json_cursor.cc:
- wuffs_done_ early-return (feed after document completes)
- string_chunk_active_=false when onStringChunk returns false on COPY token
- string_chunk_active_=false when onStringChunk returns false on UNICODE_CODE_POINT token
- onBoolean handler abort propagation
- onKey handler abort propagation
- token ring-buffer reset on short_write (50 pairs → ~301 tokens > kTokenBufLen=256)

Also adds nextSourcePosition() exercise.

Signed-off-by: tyxia <tyxia@google.com>

…turingHandler&

AbortStringChunkHandler derives from WuffsJsonCursor::Handler directly, so the narrower CapturingHandler& parameter caused a compile error. parse() only calls cursor.feed() and never accesses CapturingHandler members, so Handler& is the correct type.

Signed-off-by: tyxia <kobesummerrain@gmail.com>
Signed-off-by: tyxia <tyxia@google.com>

Signed-off-by: tyxia <tyxia@google.com>

Moves WuffsJsonCursor into Envoy::Json::Wuffs to scope it away from the broader Envoy::Json namespace.

Signed-off-by: tyxia <kobesummerrain@gmail.com>
Signed-off-by: tyxia <tyxia@google.com>

tyxia · 2026-06-17T13:28:41Z

/coverage

repokitteh-read-only · 2026-06-17T13:28:47Z

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/45641/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #45641 (comment) was created by @tyxia.

see: more, trace.

…odePoint 2/3-byte paths

The UnicodeEscapeMultiByteUtf8 test had literal É and 中 characters (UTF-8 bytes c3 89 and e4 b8 ad) in the JSON string, which Wuffs tokenizes as STRING COPY tokens: plain bytes forwarded directly to onStringChunk without calling encodeCodePoint at all. The test was originally written with \u00c9 and \u4e2d JSON escape sequences, which Wuffs emits as UNICODE_CODE_POINT tokens that go through encodeCodePoint. The literal characters were silently substituted at some point, leaving lines 34-41 of wuffs_json_cursor.cc (the 2-byte and 3-byte UTF-8 encoding paths) uncovered despite a test that appeared to test them.

Signed-off-by: tyxia <kobesummerrain@gmail.com>
Signed-off-by: tyxia <tyxia@google.com>
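
For reference, the 2-byte and 3-byte paths mentioned in this commit correspond to the standard UTF-8 bit layout. The sketch below is a generic encoder under the hypothetical name encodeCodePointSketch; it is not the PR's encodeCodePoint, whose signature and error handling are not shown in this conversation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in: appends the UTF-8 encoding of `cp` to `out`.
// Assumes `cp` is a valid Unicode scalar value (no surrogate check here).
inline void encodeCodePointSketch(uint32_t cp, std::string& out) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}
```

With this layout, U+00C9 takes the 2-byte branch and yields c3 89, and U+4E2D takes the 3-byte branch and yields e4 b8 ad, matching the byte values quoted in the commit message.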

Signed-off-by: tyxia <tyxia@google.com>

tyxia · 2026-06-17T18:35:00Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces WuffsJsonCursor, a streaming SAX-style JSON parser built on the Wuffs library, along with its Bazel dependencies and comprehensive unit tests. The feedback highlights a critical correctness bug where STRING, NUMBER, or LITERAL tokens split across chunk boundaries can cause a size_t underflow in token_start - chunk_base, potentially leading to crashes or data corruption. To resolve this, it is recommended to buffer uncommitted bytes and reconstruct split tokens using a helper method. Other suggestions include using absl::string_view directly in std::string::append to prevent potential undefined behavior, correcting typos in comments, and adding a test case to verify the handling of split numbers.


gemini-code-assist · 2026-06-17T18:38:20Z

+        const absl::string_view value_key = (depth_ < kMaxTrackedDepth && is_dict_[depth_])
+                                                ? absl::string_view(key_stack_[depth_])
+                                                : absl::string_view();
+        const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len);


Critical Correctness Bug: Out-of-bounds/Corruption on Split Chunks

When parsing streaming JSON, tokens like NUMBER or LITERAL can be split across chunk boundaries. Wuffs' coroutine-based decoder suspends and resumes, emitting the complete token only when it is fully parsed in a subsequent chunk. At that point, token_start will be less than chunk_base (the start of the current chunk), and token_len will represent the total length of the token across all chunks.

Using chunk.substr(token_start - chunk_base, token_len) when token_start < chunk_base causes token_start - chunk_base to underflow (since they are size_t), leading to an extremely large offset. This will throw std::out_of_range (crashing the filter) or return corrupted data.
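
The wraparound can be reproduced in isolation. The sketch below uses std::string_view in place of absl::string_view and two hypothetical helpers (unsignedOffset, tokenInChunk); the positions come from splitting {"n":1234} after the byte at offset 6, so the number token starts at 5 and the second chunk starts at 7.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>
#include <string_view>

// Plain unsigned subtraction, as in the quoted line. Wraps around when
// token_start < chunk_base instead of going negative.
inline std::size_t unsignedOffset(std::size_t token_start, std::size_t chunk_base) {
  return token_start - chunk_base;
}

// Hypothetical guarded lookup: returns the token bytes only if the token lies
// wholly inside `chunk`; otherwise std::nullopt. Positions are absolute.
inline std::optional<std::string_view> tokenInChunk(std::string_view chunk,
                                                    std::size_t chunk_base,
                                                    std::size_t token_start,
                                                    std::size_t token_len) {
  if (token_start < chunk_base) {
    return std::nullopt;  // Token began in an earlier chunk.
  }
  const std::size_t offset = token_start - chunk_base;
  if (offset > chunk.size() || token_len > chunk.size() - offset) {
    return std::nullopt;  // Token extends past the end of this chunk.
  }
  return chunk.substr(offset, token_len);
}
```

A guard like this only detects the split; recovering the bytes still requires keeping the earlier chunk's tail somewhere.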

Recommended Fix:

Introduce a std::string pending_bytes_ member in WuffsJsonCursor to buffer uncommitted bytes at the end of each feed() call.

At the end of feed(), calculate the number of uncommitted bytes: size_t uncommitted_len = (chunk_base + chunk.size()) - body_src_pos_; and store them in pending_bytes_.

Implement a helper method getRawBytes to reconstruct the token when token_start < chunk_base:

std::string WuffsJsonCursor::getRawBytes(size_t token_start, size_t token_len,
                                         absl::string_view chunk, size_t chunk_base) {
  if (token_start >= chunk_base) {
    return std::string(chunk.substr(token_start - chunk_base, token_len));
  }
  size_t bytes_from_pending = chunk_base - token_start;
  return pending_bytes_.substr(pending_bytes_.size() - bytes_from_pending) +
         std::string(chunk.substr(0, token_len - bytes_from_pending));
}

Update pending_bytes_ at the end of feed():

size_t uncommitted_len = (chunk_base + chunk.size()) - body_src_pos_;
if (uncommitted_len > 0) {
  if (uncommitted_len <= chunk.size()) {
    pending_bytes_ = chunk.substr(chunk.size() - uncommitted_len);
  } else {
    size_t from_prev = uncommitted_len - chunk.size();
    pending_bytes_ = pending_bytes_.substr(pending_bytes_.size() - from_prev) + std::string(chunk);
  }
} else {
  pending_bytes_.clear();
}
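
The buffering steps above can be exercised outside the cursor. The following is a self-contained sketch of the same idea, not the PR's code: SplitTokenBuffer, retain, and rawBytes are hypothetical names, std::string_view replaces absl::string_view, and consumed_pos plays the role of body_src_pos_.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: keeps the unconsumed tail of earlier chunks so that a
// token starting before the current chunk can be rebuilt.
class SplitTokenBuffer {
public:
  // Returns bytes [token_start, token_start + token_len) of the stream, given
  // the current chunk whose first byte sits at absolute position chunk_base.
  std::string rawBytes(std::size_t token_start, std::size_t token_len,
                       std::string_view chunk, std::size_t chunk_base) const {
    if (token_start >= chunk_base) {
      return std::string(chunk.substr(token_start - chunk_base, token_len));
    }
    const std::size_t from_pending = chunk_base - token_start;
    assert(from_pending <= pending_.size());
    if (token_len <= from_pending) {
      // Whole token lives in the retained tail.
      return pending_.substr(pending_.size() - from_pending, token_len);
    }
    std::string out = pending_.substr(pending_.size() - from_pending);
    out.append(chunk.substr(0, token_len - from_pending));
    return out;
  }

  // Call at the end of each feed: keep every byte at or after consumed_pos.
  void retain(std::string_view chunk, std::size_t chunk_base, std::size_t consumed_pos) {
    const std::size_t end = chunk_base + chunk.size();
    assert(consumed_pos <= end);
    const std::size_t keep = end - consumed_pos;
    if (keep <= chunk.size()) {
      pending_.assign(chunk.substr(chunk.size() - keep));
    } else {
      // The unconsumed region reaches back into bytes retained earlier.
      const std::size_t from_prev = keep - chunk.size();
      assert(from_prev <= pending_.size());
      pending_.erase(0, pending_.size() - from_prev);
      pending_.append(chunk);
    }
  }

private:
  std::string pending_;
};
```

For the split {"n":12 followed by 34}, calling retain on the first chunk with consumed_pos 5 keeps "12", and rawBytes(5, 4, "34}", 7) then returns "1234". The asserts make the sketch's assumption explicit: the retained tail must cover every byte the tokenizer has not yet committed.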

Suggested change

-        const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len);
+        const std::string raw = getRawBytes(token_start, token_len, chunk, chunk_base);

References

When capturing the return value of a function that returns by value, prefer assignment by value over a const reference (e.g., avoid 'const T& var = func()'). This allows the compiler to leverage Return Value Optimization (RVO) or move semantics, which is more efficient for modern types.

gemini-code-assist · 2026-06-17T18:38:20Z

+      // A single logical string may span multiple continued tokens if Wuffs
+      // fills token_buf_ before the closing quote; in_string_chain_ tracks mid-chain state.
+      case WUFFS_BASE__TOKEN__VBC__STRING: {
+        const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len);


This line suffers from the same streaming chunk boundary bug as the NUMBER/LITERAL tokens. If a STRING token is split across chunks, token_start - chunk_base will underflow when token_start < chunk_base. Use the getRawBytes helper method to safely reconstruct the token bytes.

Suggested change

-        const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len);
+        const std::string raw = getRawBytes(token_start, token_len, chunk, chunk_base);

References

When capturing the return value of a function that returns by value, prefer assignment by value over a const reference (e.g., avoid 'const T& var = func()'). This allows the compiler to leverage Return Value Optimization (RVO) or move semantics, which is more efficient for modern types.

gemini-code-assist · 2026-06-17T18:38:20Z

+  if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
+    out.append(raw.data(), raw.size());
+  }


Avoid passing raw pointers and sizes to std::string::append when an absl::string_view is already available. Passing raw.data() when it is null (which can happen for empty string views) to std::string::append(const char*, size_t) is technically undefined behavior in C++. Using out.append(raw) is safer, cleaner, and more idiomatic.

Suggested change

-  if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
-    out.append(raw.data(), raw.size());
-  }
+  if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
+    out.append(raw);
+  }
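
A small sketch of the suggested form, using std::string_view as a stand-in for absl::string_view and a hypothetical helper name (appendCopyBytes). It shows that appending a default-constructed view, whose data() is a null pointer, leaves the string unchanged without the caller handling a pointer at all.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper mirroring the suggested change: appends the raw bytes
// of a copy-type token using the string_view overload of append.
inline void appendCopyBytes(std::string& out, std::string_view raw) {
  // No pointer/length pair is spelled out at the call site, so an empty view
  // needs no special handling here.
  out.append(raw);
}
```

Calling appendCopyBytes(out, std::string_view()) is a no-op, while a non-empty view appends its bytes as usual.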

gemini-code-assist · 2026-06-17T18:38:20Z

+      // TODO(tyxia) Espace here to ensure any unicode token can be used, for example, for
+      // comparsion routing, logging purpose. This requires re-escape in the re-encode phase.
+      // Investigate later to see if escape and re-escape are needed.


Fix typos in the comment: 'Espace' -> 'Escape', and 'comparsion' -> 'comparison'.

Suggested change

-      // TODO(tyxia) Espace here to ensure any unicode token can be used, for example, for
-      // comparsion routing, logging purpose. This requires re-escape in the re-encode phase.
-      // Investigate later to see if escape and re-escape are needed.
+      // TODO(tyxia) Escape here to ensure any unicode token can be used, for example, for
+      // comparison routing, logging purpose. This requires re-escape in the re-encode phase.
+      // Investigate later to see if escape and re-escape are needed.

gemini-code-assist · 2026-06-17T18:38:20Z

+  EXPECT_EQ(h.paths[1], "messages[].role");
+}


Add a test case to verify that numbers split across chunk boundaries are correctly reconstructed and parsed without throwing out-of-bounds exceptions or returning corrupted data.

}

TEST(WuffsJsonCursorTest, SplitNumberAcrossChunks) {
  CapturingHandler h;
  WuffsJsonCursor cursor(h);
  EXPECT_TRUE(cursor.feed(R"({"n":12)", /*closed=*/false).ok());
  EXPECT_TRUE(cursor.feed(R"(34})", /*closed=*/true).ok());
  ASSERT_EQ(h.fields.size(), 1u);
  EXPECT_EQ(h.fields[0].raw_val, "1234");
}

} // namespace

Signed-off-by: tyxia <tyxia@google.com>

tyxia force-pushed the wuffs_parser branch from 0b06766 to 522f7ad Compare June 15, 2026 15:49

tyxia added 9 commits June 17, 2026 08:56

wuffs json

1c9c1bb

Signed-off-by: tyxia <tyxia@google.com>

update format

51eb338

Signed-off-by: tyxia <tyxia@google.com>

update format

e870691

Signed-off-by: tyxia <tyxia@google.com>

gcc build

ab44109

Signed-off-by: tyxia <tyxia@google.com>

update

9e9f6ec

Signed-off-by: tyxia <tyxia@google.com>

fix clang-format line wrapping in wuffs_json

d9c5cef

Signed-off-by: tyxia <kobesummerrain@gmail.com>
Signed-off-by: tyxia <tyxia@google.com>

fix

d144004

Signed-off-by: tyxia <tyxia@google.com>

tyxia force-pushed the wuffs_parser branch from 6aba29d to d144004 Compare June 17, 2026 12:59

refactor(wuffs_json): add Wuffs sub-namespace

2e8a9d7

Moves WuffsJsonCursor into Envoy::Json::Wuffs to scope it away from the broader Envoy::Json namespace.

Signed-off-by: tyxia <kobesummerrain@gmail.com>
Signed-off-by: tyxia <tyxia@google.com>

tyxia added 2 commits June 17, 2026 09:51

test coverage

525b780

Signed-off-by: tyxia <tyxia@google.com>

gemini-code-assist Bot reviewed Jun 17, 2026

View reviewed changes

repokitteh-read-only Bot added the deps Approval required for changes to Envoy's external dependencies label Jun 17, 2026

test coverage

1c3bd53

Signed-off-by: tyxia <tyxia@google.com>


tyxia wants to merge 13 commits into envoyproxy:main from tyxia:wuffs_parser
