jlexer: reject raw control characters in string literals #441
Open
omkhar wants to merge 1 commit into
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
jlexer accepts raw, unescaped control characters (bytes 0x00–0x1F, including raw TAB, newline, and NUL) inside JSON string literals, in both string values and object keys. encoding/json rejects these per RFC 8259 §7 ("invalid character … in string literal").
This is a parser-differential / JSON-interoperability issue (same class as #375): a strict JSON validator placed in front of an easyjson consumer rejects a payload carrying an embedded raw newline or control byte, while easyjson accepts it and decodes the control byte into the Go string. That gives an attacker a smuggling primitive (e.g. log/record injection, content-filter evasion).
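The encoding/json half of this differential can be checked with the standard library alone. The sketch below is illustrative: the `payload` type and the `decode` helper are hypothetical names, not part of this PR, and the jlexer half is not shown because it needs easyjson-generated code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// payload is a hypothetical type mirroring the {"str": ...} shape.
type payload struct {
	Str string `json:"str"`
}

// decode runs encoding/json over data and returns the decoded field and error.
func decode(data []byte) (string, error) {
	var p payload
	err := json.Unmarshal(data, &p)
	return p.Str, err
}

func main() {
	// Interpreted Go string: "\t" becomes a single raw 0x09 byte in the input.
	raw := []byte("{\"str\":\"\t\"}")
	// Raw Go string: the input carries the two-byte JSON escape sequence \t.
	escaped := []byte(`{"str":"\t"}`)

	_, err := decode(raw)
	fmt.Println("raw tab rejected:", err != nil)

	s, err := decode(escaped)
	fmt.Printf("escaped tab: %q, err=%v\n", s, err)
}
```

The only difference between the two inputs is whether the tab reaches the decoder as one raw byte or as an escape sequence, which is the distinction the strict validator and the lenient consumer disagree on.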
Reproduce (current master):
{"str":"<0x09>"} (raw tab) decodes with a nil error into "\t"; encoding/json rejects the same bytes.
Fix
Reject raw bytes < 0x20 outside an escape in the string scanner (findStringLen / fetchString). Escaped sequences (e.g. \t, \n) remain accepted and decode normally, matching encoding/json (which also accepts escaped control chars but rejects raw ones).
jlexer/lexer.go: +22/-5 (one helper + a third named return on the existing scanner; single pass, no extra allocation).
go test ./... passes; no existing fixture relied on raw-control-char acceptance.
Open question on rollout: this tightens default parsing. If preserving maximum leniency by default is preferred, the check could instead be gated behind an opt-in generator flag in the spirit of -disallow_unknown_fields and the proposed -disallow_duplicate_fields (#375). Happy to rework it that way; I am flagging the trade-off rather than assuming. See also #72, #309 for prior validation-strictness work.
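For readers who have not opened the diff, the kind of check described under Fix can be sketched as a standalone scanner. This is not the code in jlexer/lexer.go: `stringLen`, `errControlChar`, and `errUnterminated` are hypothetical names, and the sketch assumes escape sequences are validated elsewhere, so it only skips the byte after a backslash.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for this sketch.
var (
	errControlChar  = errors.New("raw control character in string literal")
	errUnterminated = errors.New("unterminated string literal")
)

// stringLen scans data, which must begin just after the opening quote,
// and returns the index of the closing quote. It makes a single pass,
// does not allocate, and fails on any raw byte below 0x20.
func stringLen(data []byte) (int, error) {
	for i := 0; i < len(data); i++ {
		switch c := data[i]; {
		case c == '\\':
			i++ // skip the escaped byte so an escaped quote does not end the literal
		case c == '"':
			return i, nil
		case c < 0x20:
			return i, errControlChar
		}
	}
	return 0, errUnterminated
}

func main() {
	// Raw Go string: backslash followed by 't', i.e. an escaped tab.
	n, err := stringLen([]byte(`a\tb"`))
	fmt.Println(n, err)

	// Interpreted Go string: a single raw 0x09 byte inside the literal.
	n, err = stringLen([]byte("a\tb\""))
	fmt.Println(n, err)
}
```

Folding the check into the loop that already looks for the closing quote is what keeps the scan to one pass; a separate validation pass would be simpler to read but would touch every string twice.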