Parsing#45641
Conversation
|
/coverage |
|
Coverage for this Pull Request will be rendered here: https://storage.googleapis.com/envoy-cncf-pr/45641/coverage/index.html For comparison, current coverage on https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html The coverage results are (re-)rendered each time the CI |
Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <kobesummerrain@gmail.com> Signed-off-by: tyxia <tyxia@google.com>
Cover six previously-uncovered lines in wuffs_json_cursor.cc: - wuffs_done_ early-return (feed after document completes) - string_chunk_active_=false when onStringChunk returns false on COPY token - string_chunk_active_=false when onStringChunk returns false on UNICODE_CODE_POINT token - onBoolean handler abort propagation - onKey handler abort propagation - token ring-buffer reset on short_write (50 pairs → ~301 tokens > kTokenBufLen=256) Also adds nextSourcePosition() exercise. Signed-off-by: tyxia <tyxia@google.com>
…turingHandler& AbortStringChunkHandler derives from WuffsJsonCursor::Handler directly, so the narrower CapturingHandler& parameter caused a compile error. parse() only calls cursor.feed() — it never accesses CapturingHandler members — so Handler& is the correct type. Signed-off-by: tyxia <kobesummerrain@gmail.com> Signed-off-by: tyxia <tyxia@google.com>
Moves WuffsJsonCursor into Envoy::Json::Wuffs to scope it away from the broader Envoy::Json namespace. Signed-off-by: tyxia <kobesummerrain@gmail.com> Signed-off-by: tyxia <tyxia@google.com>
|
/coverage |
|
Coverage for this Pull Request will be rendered here: https://storage.googleapis.com/envoy-cncf-pr/45641/coverage/index.html For comparison, current coverage on https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html The coverage results are (re-)rendered each time the CI |
…odePoint 2/3-byte paths The UnicodeEscapeMultiByteUtf8 test had literal É and 中 characters (UTF-8 bytes c3 89 and e4 b8 ad) in the JSON string, which Wuffs tokenizes as STRING COPY tokens — plain bytes forwarded directly to onStringChunk without calling encodeCodePoint at all. The test was originally written with É and 中 JSON escape sequences, which Wuffs emits as UNICODE_CODE_POINT tokens that go through encodeCodePoint. The literal characters were silently substituted at some point, leaving lines 34-41 of wuffs_json_cursor.cc (the 2-byte and 3-byte UTF-8 encoding paths) uncovered despite a test that appeared to test them. Signed-off-by: tyxia <kobesummerrain@gmail.com> Signed-off-by: tyxia <tyxia@google.com>
Signed-off-by: tyxia <tyxia@google.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces WuffsJsonCursor, a streaming SAX-style JSON parser built on the Wuffs library, along with its Bazel dependencies and comprehensive unit tests. The feedback highlights a critical correctness bug where STRING, NUMBER, or LITERAL tokens split across chunk boundaries can cause a size_t underflow in token_start - chunk_base, potentially leading to crashes or data corruption. To resolve this, it is recommended to buffer uncommitted bytes and reconstruct split tokens using a helper method. Other suggestions include using absl::string_view directly in std::string::append to prevent potential undefined behavior, correcting typos in comments, and adding a test case to verify the handling of split numbers.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const absl::string_view value_key = (depth_ < kMaxTrackedDepth && is_dict_[depth_]) | ||
| ? absl::string_view(key_stack_[depth_]) | ||
| : absl::string_view(); | ||
| const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len); |
There was a problem hiding this comment.
Critical Correctness Bug: Out-of-bounds/Corruption on Split Chunks
When parsing streaming JSON, tokens like NUMBER or LITERAL can be split across chunk boundaries. Wuffs' coroutine-based decoder suspends and resumes, emitting the complete token only when it is fully parsed in a subsequent chunk. At that point, token_start will be less than chunk_base (the start of the current chunk), and token_len will represent the total length of the token across all chunks.
Using chunk.substr(token_start - chunk_base, token_len) when token_start < chunk_base causes token_start - chunk_base to underflow (since they are size_t), leading to an extremely large offset. This will throw std::out_of_range (crashing the filter) or return corrupted data.
Recommended Fix:
- Introduce a
std::string pending_bytes_member inWuffsJsonCursorto buffer uncommitted bytes at the end of eachfeed()call. - At the end of
feed(), calculate the number of uncommitted bytes:size_t uncommitted_len = (chunk_base + chunk.size()) - body_src_pos_;and store them inpending_bytes_. - Implement a helper method
getRawBytesto reconstruct the token whentoken_start < chunk_base:
std::string WuffsJsonCursor::getRawBytes(size_t token_start, size_t token_len, absl::string_view chunk, size_t chunk_base) {
if (token_start >= chunk_base) {
return std::string(chunk.substr(token_start - chunk_base, token_len));
}
size_t bytes_from_pending = chunk_base - token_start;
return pending_bytes_.substr(pending_bytes_.size() - bytes_from_pending) +
std::string(chunk.substr(0, token_len - bytes_from_pending));
}- Update
pending_bytes_at the end offeed():
size_t uncommitted_len = (chunk_base + chunk.size()) - body_src_pos_;
if (uncommitted_len > 0) {
if (uncommitted_len <= chunk.size()) {
pending_bytes_ = chunk.substr(chunk.size() - uncommitted_len);
} else {
size_t from_prev = uncommitted_len - chunk.size();
pending_bytes_ = pending_bytes_.substr(pending_bytes_.size() - from_prev) + std::string(chunk);
}
} else {
pending_bytes_.clear();
}| const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len); | |
| const std::string raw = getRawBytes(token_start, token_len, chunk, chunk_base); |
References
- When capturing the return value of a function that returns by value, prefer assignment by value over a const reference (e.g., avoid 'const T& var = func()'). This allows the compiler to leverage Return Value Optimization (RVO) or move semantics, which is more efficient for modern types.
| // A single logical string may span multiple continued tokens if Wuffs | ||
| // fills token_buf_ before the closing quote; in_string_chain_ tracks mid-chain state. | ||
| case WUFFS_BASE__TOKEN__VBC__STRING: { | ||
| const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len); |
There was a problem hiding this comment.
This line suffers from the same streaming chunk boundary bug as the NUMBER/LITERAL tokens. If a STRING token is split across chunks, token_start - chunk_base will underflow when token_start < chunk_base. Use the getRawBytes helper method to safely reconstruct the token bytes.
| const absl::string_view raw = chunk.substr(token_start - chunk_base, token_len); | |
| const std::string raw = getRawBytes(token_start, token_len, chunk, chunk_base); |
References
- When capturing the return value of a function that returns by value, prefer assignment by value over a const reference (e.g., avoid 'const T& var = func()'). This allows the compiler to leverage Return Value Optimization (RVO) or move semantics, which is more efficient for modern types.
| if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) { | ||
| out.append(raw.data(), raw.size()); | ||
| } |
There was a problem hiding this comment.
Avoid passing raw pointers and sizes to std::string::append when an absl::string_view is already available. Passing raw.data() when it is null (which can happen for empty string views) to std::string::append(const char*, size_t) is technically undefined behavior in C++. Using out.append(raw) is safer, cleaner, and more idiomatic.
| if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) { | |
| out.append(raw.data(), raw.size()); | |
| } | |
| if (token_detail & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) { | |
| out.append(raw); | |
| } |
| // TODO(tyxia) Espace here to ensure any unicode token can be used, for example, for | ||
| // comparsion routing, logging purpose. This requires re-escape in the re-encode phase. | ||
| // Investigate later to see if escape and re-escape are needed. |
There was a problem hiding this comment.
Fix typos in the comment: 'Espace' -> 'Escape', and 'comparsion' -> 'comparison'.
| // TODO(tyxia) Espace here to ensure any unicode token can be used, for example, for | |
| // comparsion routing, logging purpose. This requires re-escape in the re-encode phase. | |
| // Investigate later to see if escape and re-escape are needed. | |
| // TODO(tyxia) Escape here to ensure any unicode token can be used, for example, for | |
| // comparison routing, logging purpose. This requires re-escape in the re-encode phase. | |
| // Investigate later to see if escape and re-escape are needed. |
| EXPECT_EQ(h.paths[1], "messages[].role"); | ||
| } |
There was a problem hiding this comment.
Add a test case to verify that numbers split across chunk boundaries are correctly reconstructed and parsed without throwing out-of-bounds exceptions or returning corrupted data.
}
TEST(WuffsJsonCursorTest, SplitNumberAcrossChunks) {
CapturingHandler h;
WuffsJsonCursor cursor(h);
EXPECT_TRUE(cursor.feed(R"({"n":12)", /*closed=*/false).ok());
EXPECT_TRUE(cursor.feed(R"(34})", /*closed=*/true).ok());
ASSERT_EQ(h.fields.size(), 1u);
EXPECT_EQ(h.fields[0].raw_val, "1234");
}
} // namespaceSigned-off-by: tyxia <tyxia@google.com>
WuffsJsonCursor + Handler interface + tests.
The cursor is a self-contained library: it tokenizes JSON and fires callbacks.