gh-150871: Speed up JSON string decoding for long ASCII strings #150872

Open
gaborbernat wants to merge 3 commits into python:main from gaborbernat:opt/json-swar-string-scan

@gaborbernat gaborbernat commented Jun 3, 2026

JSON documents whose payload is long text (log lines, description and content fields, base64 or embedded-document values) spend most of their decode time finding the end of each string. This change speeds that up by scanning eight bytes at a time instead of one, with no SIMD intrinsics and no CPU detection, and its output is byte-identical to that of the current decoder.

What we do now (scalar, one code point at a time)

The current scanstring_unicode inner loop:

for (next = end; next < len; next++) {
    d = PyUnicode_READ(kind, buf, next);   // read ONE character
    if (d == '"' || d == '\\') break;      // is it a terminator/escape?
    if (d <= 0x1f && strict) { /* raise the error */ }  // is it an illegal control char?
}

For a 64-character string with no escapes that is 64 iterations, each doing one read, up to three comparisons, and a loop branch: about 64 × 4 operations just to find the closing ". The CPU walks the string one byte at a time while its 64-bit registers sit mostly idle.

What SWAR does (8 bytes at a time, in one register)

SWAR is "SIMD within a register": load 8 bytes into a single uint64_t and test all 8 lanes at once with ordinary integer ops.

while (next + 8 <= len) {
    memcpy(&w, p + next, 8);                          // load 8 bytes as one 64-bit word
    mq = haszero(w ^ (0x22 * 0x0101010101010101));   // any lane == '"' ?
    ms = haszero(w ^ (0x5c * 0x0101010101010101));   // any lane == '\\' ?
    mc = strict ? haszero(w & 0xE0E0E0E0E0E0E0E0) : 0; // any lane < 0x20 ?
    if (mq | ms | mc) break;                          // something special in these 8 -> stop
    next += 8;                                        // nothing special -> skip all 8 at once
}
// then the scalar loop above runs to pin the exact byte and do the real work

Two tricks make this work:

  • Broadcast: 0x22 * 0x0101010101010101 puts " (0x22) into all 8 byte lanes. XOR-ing the word with it turns "this lane equals the quote character" into "this lane is zero".
  • haszero(v) = (v - 0x0101…) & ~v & 0x8080…: a zero lane borrows and lights its high bit, so the result is nonzero exactly when at least one lane is zero, with no false positives or negatives for the word as a whole. (A borrow can also light a 0x01 lane sitting directly above a zero lane, but the loop only tests the result against zero, so that does not matter.) For control characters it uses w & 0xE0E0…, since a byte is < 0x20 exactly when its top three bits are zero, again detected by haszero.
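The two tricks above can be tried out in a few lines of Python. This is only a model of the C arithmetic, not code from the patch: MASK64 emulates uint64_t wraparound, int.from_bytes stands in for the memcpy load, and the helper names (broadcast, word) are made up for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate uint64_t wraparound (assumption of this model)
ONES = 0x0101010101010101    # 0x01 in every byte lane
HIGHS = 0x8080808080808080   # 0x80 in every byte lane


def broadcast(byte):
    """Copy one byte value into all 8 lanes (hypothetical helper name)."""
    return (byte * ONES) & MASK64


def haszero(v):
    """Return nonzero iff at least one byte lane of v is 0x00."""
    return ((v - ONES) & MASK64) & (~v & MASK64) & HIGHS


def word(chunk):
    """Load 8 bytes as one little-endian 64-bit integer (stand-in for memcpy)."""
    assert len(chunk) == 8
    return int.from_bytes(chunk, "little")


w = word(b'ab"cdefg')
quote_hit = haszero(w ^ broadcast(0x22))  # nonzero: a '"' is present
slash_hit = haszero(w ^ broadcast(0x5C))  # zero: no backslash
ctrl_hit = haszero(w & broadcast(0xE0))   # zero: no byte below 0x20

# Borrow propagation: lane 0 is zero, lane 1 is 0x01 and gets flagged too.
# The word-level answer ("some lane is zero") is still correct.
spill = haszero(word(b"\x00\x01aaaaaa"))  # 0x8080
```

Because only zero versus nonzero is consulted, the spilled lane bit is harmless; locating the exact byte is left to the scalar loop.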

One loop iteration answers "do any of these 8 bytes need attention?" in about 6 integer ops. If not, it advances 8 bytes, so the 64-character string takes 8 iterations instead of 64.

The key design point: SWAR only skips, it never decides

When the mask is nonzero, meaning a ", \, or control character sits somewhere in the 8-byte window, the loop breaks and the original scalar loop re-scans those 8 bytes to find the exact position and do the actual work: terminate the string, handle the escape, or raise the error at the right index. Every decode decision stays on the proven scalar path. SWAR is purely a fast-forward over the runs of ordinary characters that make up the bulk of most strings.
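A small Python model of this "skip, then hand off" structure makes the property easy to check. It is a sketch under simplifying assumptions, not the patch itself: both functions only return the index of the first byte that needs attention (the real scalar loop also decodes escapes and raises errors), and the function names are invented here.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
ONES = 0x0101010101010101
HIGHS = 0x8080808080808080
QUOTES = 0x22 * ONES   # '"' in every lane
SLASHES = 0x5C * ONES  # '\\' in every lane
TOP3 = 0xE0 * ONES     # top three bits of every lane


def haszero(v):
    """Return nonzero iff at least one byte lane of v is 0x00."""
    return ((v - ONES) & MASK64) & (~v & MASK64) & HIGHS


def scalar_scan(buf, start, strict=True):
    """Reference path: index of the first special byte, or len(buf)."""
    for i in range(start, len(buf)):
        d = buf[i]
        if d == 0x22 or d == 0x5C or (strict and d <= 0x1F):
            return i
    return len(buf)


def swar_then_scalar_scan(buf, start, strict=True):
    """Fast-forward over clean 8-byte windows, then let scalar_scan decide."""
    nxt = start
    while nxt + 8 <= len(buf):
        w = int.from_bytes(buf[nxt:nxt + 8], "little")
        mq = haszero(w ^ QUOTES)
        ms = haszero(w ^ SLASHES)
        mc = haszero(w & TOP3) if strict else 0
        if mq | ms | mc:
            break      # something special in this window: stop skipping
        nxt += 8       # nothing special: skip all 8 bytes
    return scalar_scan(buf, nxt, strict)
```

Since the SWAR loop never returns a position of its own, any disagreement with scalar_scan could only come from skipping a window that contained a special byte, which is exactly what the masks rule out.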

Worked example

Chunk hello wo (8 ordinary ASCII bytes):

  • Now: 8 iterations, 8 reads, about 24 comparisons.
  • SWAR: 1 memcpy, 3 haszero tests, all zero, next += 8. One iteration.

Chunk lo","wor (a " at offset 2):

  • SWAR: mq is nonzero, so it breaks. The scalar loop walks l, o, hits " at offset 2: identical to today, reached after skipping the prior runs.
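The two chunks can be traced with a short Python model. As before, this is illustrative only: the trace function and its return values are invented for this example, and Python integers masked to 64 bits stand in for the C uint64_t word.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
ONES = 0x0101010101010101
HIGHS = 0x8080808080808080


def haszero(v):
    """Return nonzero iff at least one byte lane of v is 0x00."""
    return ((v - ONES) & MASK64) & (~v & MASK64) & HIGHS


def trace(chunk, strict=True):
    """Return ("skip", 8) for a clean window, else ("scalar", offset)."""
    w = int.from_bytes(chunk, "little")
    mq = haszero(w ^ (0x22 * ONES))                 # any '"' ?
    ms = haszero(w ^ (0x5C * ONES))                 # any '\\' ?
    mc = haszero(w & (0xE0 * ONES)) if strict else 0  # any byte < 0x20 ?
    if not (mq | ms | mc):
        return ("skip", 8)
    # Hand-off: the scalar walk pins the exact byte.
    for offset, d in enumerate(chunk):
        if d == 0x22 or d == 0x5C or (strict and d <= 0x1F):
            return ("scalar", offset)
    raise AssertionError("mask was nonzero but no special byte found")


print(trace(b"hello wo"))   # ('skip', 8)
print(trace(b'lo","wor'))   # ('scalar', 2)
```

Note that the space in the first chunk is 0x20, whose top three bits are not all zero, so it is not mistaken for a control character.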

Why it is a win, and its limits

| | Now (scalar) | SWAR |
| --- | --- | --- |
| Chars per iteration | 1 | 8 |
| Ops per 8 ordinary chars | ~32 | ~6 |
| Finds exact special char | itself | hands off to scalar |
| Representation handled | all (UCS-1/2/4) | UCS-1 only; UCS-2/4 fall back to scalar |

  • It helps in proportion to string length: large for long text and blob values, modest for medium strings, neutral for short keys where the loop barely runs.
  • It engages only for the 1-byte (ASCII/Latin-1) representation. The moment a string contains an emoji or CJK character it is UCS-2/UCS-4 and uses the unchanged scalar loop, which is why multi-kind input measures neutral, not regressed.
  • It uses the same masks CPython already applies for ASCII detection (ASCII_CHAR_MASK = 0x8080…, VECTOR_0101 = 0x0101… in unicodeobject.c, and UCS1_ASCII_CHAR_MASK in find_max_char.h), so it is not exotic for this codebase.

Mental model: today we ask "is this byte special?" once per byte; SWAR asks "is any of these eight bytes special?" once per eight, and drops to per-byte only when the answer is yes.

When and how this changes performance

json.loads, current decoder versus this change:

| Document shape | Effect |
| --- | --- |
| One long text field (~11 KB string) | 6.3x faster |
| Many 200-character ASCII string values | 4.5x faster |
| Realistic mixed records (short and medium strings) | 1.17x faster |
| Short keys, numbers, the pyperformance document | no change |
| Strings with emoji or other non-Latin-1 text | no change (scalar path) |

The standard pyperformance bm_json_loads document is short-string and dict dominated and shows no change. The benefit lands on documents whose strings are long.

Correctness

Output is byte-identical to the current decoder, error positions included. Verified three ways: the full test_json suite; a 347-input differential corpus (real-world JSON, plus a quote, backslash, raw control character, escape, and \uXXXX placed at every offset across the eight-byte window in all three string representations, plus surrogate pairs, lone surrogates, embedded nulls, and truncated escapes); and all 340 files of nst/JSONTestSuite (318 parsing and 22 transform, including the must-reject and implementation-defined cases). Every value and every raised error position matched the current implementation.
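For readers who want to try the offset-sweeping idea themselves, here is a minimal sketch. It is not the corpus used for this PR: it compares the pure-Python and C scanners inside one interpreter (json.decoder.py_scanstring and json.decoder.c_scanstring) rather than the old and new C decoders, it covers only valid inputs, and the case generator and its names are assumptions made for illustration.

```python
from json import decoder

# A trailing character that forces the input str into a wider representation.
FILLERS = {"1-byte": "", "2-byte": "\u4e2d", "4-byte": "\U0001F600"}
# A closing quote, a simple escape, a \uXXXX escape, and a surrogate pair.
SPECIALS = ['"', "\\n", "\\u00e9", "\\ud83d\\ude00"]


def make_case(offset, special, filler):
    """Build a JSON string literal with `special` after `offset` plain chars."""
    return '"' + "x" * offset + special + "tail" + filler + '"'


def compare_scanners(max_offset=24):
    """Return mismatches between the two scanners, or None without the C one."""
    py_scan, c_scan = decoder.py_scanstring, decoder.c_scanstring
    if c_scan is None:  # C accelerator not available in this build
        return None
    mismatches = []
    for offset in range(max_offset + 1):  # crosses three 8-byte windows
        for special in SPECIALS:
            for kind, filler in FILLERS.items():
                s = make_case(offset, special, filler)
                expected = tuple(py_scan(s, 1, True))
                actual = tuple(c_scan(s, 1, True))
                if expected != actual:
                    mismatches.append((offset, special, kind, expected, actual))
    return mismatches
```

An empty list means both scanners returned the same decoded value and the same end index for every case. Error paths are deliberately left out of this sketch.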

Benchmark

import json
import pyperf
long_ascii = json.dumps([("x"*200) for _ in range(200)])
text_blob  = json.dumps({"body": "lorem ipsum dolor sit amet " * 400})
short_keys = json.dumps({f"k{i}": i for i in range(2000)})
mixed_real = json.dumps([{"id":i,"name":f"user_{i}","email":f"u{i}@example.com","bio":"hello "*10} for i in range(300)])
multikind  = json.dumps(["emoji 😀 中文 текст "*20 for _ in range(200)])
docs = {"long_ascii_values": long_ascii, "huge_text_blob": text_blob,
        "short_keys": short_keys, "mixed_real": mixed_real, "multikind_emoji": multikind}
runner = pyperf.Runner()
for name, s in docs.items():
    runner.bench_func(f"loads/{name}", lambda s=s: json.loads(s))

References for the bit tricks: Sean Anderson, Bit Twiddling Hacks (zero byte, byte equal to n, byte less than n); Henry S. Warren Jr., Hacker's Delight, 2nd ed., chapter 6.

It is not the SIMD parsing backend from #142915: it adds no intrinsics, no CPU detection, and no build configuration, and it does not depend on #125022.

Resolves #150871.

scanstring_unicode scans each JSON string one character at a time for the
closing quote, a backslash, or a control character. For the one-byte
(ASCII/Latin-1) representation, skip eight bytes at a time with a word-at-a-time
test using the same masks Objects/unicodeobject.c applies for ASCII scanning;
the existing per-character loop then pins the exact byte and performs every
decode decision. Two-byte and four-byte strings keep the current loop.

Output is byte-identical, verified against test_json, a 347-input differential
corpus, and all 340 nst/JSONTestSuite files. Long ASCII string values decode up
to 6.3x faster; short keys, numbers, and non-Latin-1 strings are unaffected.

Cover long runs that cross the scan windows with a terminator, backslash
escape and \uXXXX escape at every offset in 1-byte and wider strings, plus
strict and non-strict control-character handling at the window boundaries.

@gaborbernat

Added test_long_string_scan_paths to test_json/test_decode.py (runs under both the Python and C decoders): a terminator, backslash escape, and \uXXXX escape at every offset across the scan windows in 1-byte and wider strings, plus strict and non-strict control-character handling at the window boundaries.
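The control-character part of that coverage can be approximated from the public API alone. The snippet below is not the test added in this PR, just a sketch of the same idea using json.loads; the helper name and its parameters are hypothetical.

```python
import json


def control_char_outcome(offset, ctrl="\x01", widen=""):
    """Decode a string with `ctrl` placed after `offset` ordinary characters.

    Returns (strict outcome, value decoded with strict=False). `widen` may
    carry a non-Latin-1 character to force the wider string representations.
    """
    text = "a" * offset + ctrl + "b" * 9 + widen
    doc = '"' + text + '"'
    try:
        json.loads(doc)  # strict=True is the default
        outcome = "accepted"
    except json.JSONDecodeError:
        outcome = "rejected"
    return outcome, json.loads(doc, strict=False)


# Sweep the control character across three 8-byte windows.
for offset in range(25):
    outcome, value = control_char_outcome(offset)
    assert outcome == "rejected"
    assert value == "a" * offset + "\x01" + "b" * 9
```

Sweeping the offset past 8 and 16 places the control character on both sides of each window boundary, which is where an off-by-one in the skip loop would show up.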

