Skip to content

Parsing#45641

Draft
tyxia wants to merge 13 commits into
envoyproxy:mainfrom
tyxia:wuffs_parser
Draft

Parsing#45641
tyxia wants to merge 13 commits into
envoyproxy:mainfrom
tyxia:wuffs_parser

Conversation

@tyxia

@tyxia tyxia commented Jun 15, 2026

Copy link
Copy Markdown
Member

WuffsJsonCursor + Handler interface + tests.
The cursor is a self-contained library: it tokenizes JSON and fires callbacks.

@repokitteh-read-only

Copy link
Copy Markdown

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #45641 was opened by tyxia.

see: more, trace.

@tyxia

tyxia commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

/coverage

@repokitteh-read-only

Copy link
Copy Markdown

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/45641/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #45641 (comment) was created by @tyxia.

see: more, trace.

tyxia added 9 commits June 17, 2026 08:56
Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <kobesummerrain@gmail.com>

Signed-off-by: tyxia <tyxia@google.com>
Cover six previously-uncovered lines in wuffs_json_cursor.cc:
- wuffs_done_ early-return (feed after document completes)
- string_chunk_active_=false when onStringChunk returns false on COPY token
- string_chunk_active_=false when onStringChunk returns false on UNICODE_CODE_POINT token
- onBoolean handler abort propagation
- onKey handler abort propagation
- token ring-buffer reset on short_write (50 pairs → ~301 tokens > kTokenBufLen=256)

Also adds nextSourcePosition() exercise.

Signed-off-by: tyxia <tyxia@google.com>
…turingHandler&

AbortStringChunkHandler derives from WuffsJsonCursor::Handler directly,
so the narrower CapturingHandler& parameter caused a compile error.
parse() only calls cursor.feed() — it never accesses CapturingHandler
members — so Handler& is the correct type.

Signed-off-by: tyxia <kobesummerrain@gmail.com>

Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
Moves WuffsJsonCursor into Envoy::Json::Wuffs to scope it away from
the broader Envoy::Json namespace.

Signed-off-by: tyxia <kobesummerrain@gmail.com>

Signed-off-by: tyxia <tyxia@google.com>
@tyxia

tyxia commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

/coverage

@repokitteh-read-only

Copy link
Copy Markdown

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/45641/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #45641 (comment) was created by @tyxia.

see: more, trace.

tyxia added 2 commits June 17, 2026 09:51
…odePoint 2/3-byte paths

The UnicodeEscapeMultiByteUtf8 test had literal É and 中 characters
(UTF-8 bytes c3 89 and e4 b8 ad) in the JSON string, which Wuffs
tokenizes as STRING COPY tokens — plain bytes forwarded directly to
onStringChunk without calling encodeCodePoint at all.

The test was originally written with É and 中 JSON escape
sequences, which Wuffs emits as UNICODE_CODE_POINT tokens that go
through encodeCodePoint. The literal characters were silently
substituted at some point, leaving lines 34-41 of
wuffs_json_cursor.cc (the 2-byte and 3-byte UTF-8 encoding paths)
uncovered despite a test that appeared to test them.

Signed-off-by: tyxia <kobesummerrain@gmail.com>

Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
@tyxia

tyxia commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces WuffsJsonCursor, a streaming SAX-style JSON parser built on the Wuffs library, along with its Bazel dependencies and comprehensive unit tests. The feedback highlights a critical correctness bug where STRING, NUMBER, or LITERAL tokens split across chunk boundaries can cause a size_t underflow in token_start - chunk_base, potentially leading to crashes or data corruption. To resolve this, it is recommended to buffer uncommitted bytes and reconstruct split tokens using a helper method. Other suggestions include using absl::string_view directly in std::string::append to prevent potential undefined behavior, correcting typos in comments, and adding a test case to verify the handling of split numbers.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

const absl::string_view value_key = (depth_ < kMaxTrackedDepth && is_dict_[depth_])
? absl::string_view(key_stack_[depth_])
: absl::string_view();
const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Critical Correctness Bug: Out-of-bounds/Corruption on Split Chunks

When parsing streaming JSON, tokens like NUMBER or LITERAL can be split across chunk boundaries. Wuffs' coroutine-based decoder suspends and resumes, emitting the complete token only when it is fully parsed in a subsequent chunk. At that point, token_start will be less than chunk_base (the start of the current chunk), and token_len will represent the total length of the token across all chunks.

Using chunk.substr(token_start - chunk_base, token_len) when token_start < chunk_base causes token_start - chunk_base to underflow (since they are size_t), leading to an extremely large offset. This will throw std::out_of_range (crashing the filter) or return corrupted data.

Recommended Fix:

  1. Introduce a std::string pending_bytes_ member in WuffsJsonCursor to buffer uncommitted bytes at the end of each feed() call.
  2. At the end of feed(), calculate the number of uncommitted bytes: size_t uncommitted_len = (chunk_base + chunk.size()) - body_src_pos_; and store them in pending_bytes_.
  3. Implement a helper method getRawBytes to reconstruct the token when token_start < chunk_base:
std::string WuffsJsonCursor::getRawBytes(size_t token_start, size_t token_len, absl::string_view chunk, size_t chunk_base) {
  if (token_start >= chunk_base) {
    return std::string(chunk.substr(token_start - chunk_base, token_len));
  }
  size_t bytes_from_pending = chunk_base - token_start;
  return pending_bytes_.substr(pending_bytes_.size() - bytes_from_pending) +
         std::string(chunk.substr(0, token_len - bytes_from_pending));
}
  1. Update pending_bytes_ at the end of feed():
  size_t uncommitted_len = (chunk_base + chunk.size()) - body_src_pos_;
  if (uncommitted_len > 0) {
    if (uncommitted_len <= chunk.size()) {
      pending_bytes_ = chunk.substr(chunk.size() - uncommitted_len);
    } else {
      size_t from_prev = uncommitted_len - chunk.size();
      pending_bytes_ = pending_bytes_.substr(pending_bytes_.size() - from_prev) + std::string(chunk);
    }
  } else {
    pending_bytes_.clear();
  }
Suggested change
const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len);
const std::string raw = getRawBytes(token_start, token_len, chunk, chunk_base);
References
  1. When capturing the return value of a function that returns by value, prefer assignment by value over a const reference (e.g., avoid 'const T& var = func()'). This allows the compiler to leverage Return Value Optimization (RVO) or move semantics, which is more efficient for modern types.

// A single logical string may span multiple continued tokens if Wuffs
// fills token_buf_ before the closing quote; in_string_chain_ tracks mid-chain state.
case WUFFS_BASE__TOKEN__VBC__STRING: {
const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This line suffers from the same streaming chunk boundary bug as the NUMBER/LITERAL tokens. If a STRING token is split across chunks, token_start - chunk_base will underflow when token_start < chunk_base. Use the getRawBytes helper method to safely reconstruct the token bytes.

Suggested change
const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len);
const std::string raw = getRawBytes(token_start, token_len, chunk, chunk_base);
References
  1. When capturing the return value of a function that returns by value, prefer assignment by value over a const reference (e.g., avoid 'const T& var = func()'). This allows the compiler to leverage Return Value Optimization (RVO) or move semantics, which is more efficient for modern types.

Comment on lines +23 to +25
if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
out.append(raw.data(), raw.size());
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid passing raw pointers and sizes to std::string::append when an absl::string_view is already available. Passing raw.data() when it is null (which can happen for empty string views) to std::string::append(const char*, size_t) is technically undefined behavior in C++. Using out.append(raw) is safer, cleaner, and more idiomatic.

Suggested change
if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
out.append(raw.data(), raw.size());
}
if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
out.append(raw);
}

Comment on lines +301 to +303
// TODO(tyxia) Espace here to ensure any unicode token can be used, for example, for
// comparsion routing, logging purpose. This requires re-escape in the re-encode phase.
// Investigate later to see if escape and re-escape are needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Fix typos in the comment: 'Espace' -> 'Escape', and 'comparsion' -> 'comparison'.

Suggested change
// TODO(tyxia) Espace here to ensure any unicode token can be used, for example, for
// comparsion routing, logging purpose. This requires re-escape in the re-encode phase.
// Investigate later to see if escape and re-escape are needed.
// TODO(tyxia) Escape here to ensure any unicode token can be used, for example, for
// comparison routing, logging purpose. This requires re-escape in the re-encode phase.
// Investigate later to see if escape and re-escape are needed.

Comment on lines +593 to +594
EXPECT_EQ(h.paths[1], "messages[].role");
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add a test case to verify that numbers split across chunk boundaries are correctly reconstructed and parsed without throwing out-of-bounds exceptions or returning corrupted data.

}

TEST(WuffsJsonCursorTest, SplitNumberAcrossChunks) {
  CapturingHandler h;
  WuffsJsonCursor cursor(h);
  EXPECT_TRUE(cursor.feed(R"({"n":12)", /*closed=*/false).ok());
  EXPECT_TRUE(cursor.feed(R"(34})", /*closed=*/true).ok());
  ASSERT_EQ(h.fields.size(), 1u);
  EXPECT_EQ(h.fields[0].raw_val, "1234");
}

} // namespace

@repokitteh-read-only repokitteh-read-only Bot added the deps Approval required for changes to Envoy's external dependencies label Jun 17, 2026
Signed-off-by: tyxia <tyxia@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deps Approval required for changes to Envoy's external dependencies

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant