gh-150871: Speed up JSON string decoding for long ASCII strings by gaborbernat · Pull Request #150872 · python/cpython

gaborbernat · 2026-06-03T16:28:59Z

JSON documents whose payload is long text (log lines, description and content fields, base64 or embedded-document values) spend most of their decode time finding the end of each string. This change speeds that up by scanning eight bytes at a time instead of one, with no SIMD intrinsics and no CPU detection, and its output is byte-identical to the current decoder's.

What we do now (scalar, one code point at a time)

The current scanstring_unicode inner loop:

for (next = end; next < len; next++) {
    d = PyUnicode_READ(kind, buf, next);   // read ONE character
    if (d == '"' || d == '\\') break;      // is it a terminator/escape?
    if (d <= 0x1f && strict) { error }     // is it an illegal control char?
}

For a 64-character string with no escapes that is 64 iterations, each doing one read, up to three comparisons, and a loop branch: about 64 × 4 operations just to find the closing ". The CPU walks the string one byte at a time while its 64-bit registers sit mostly idle.
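To make the iteration count concrete, here is a minimal Python model of the one-byte-at-a-time scan. It is only an illustration: `scalar_scan` and its iteration counter are hypothetical names introduced here, and the real loop is the C code in scanstring_unicode.

```python
def scalar_scan(buf: bytes, start: int = 0, strict: bool = True) -> tuple[int, int]:
    """Model of the per-character loop (hypothetical helper, not CPython code).

    Returns (stop_index, iterations). stop_index is the first byte that is
    '"', a backslash, or (when strict) a control character below 0x20; it
    equals len(buf) when no such byte exists.
    """
    iterations = 0
    i = start
    while i < len(buf):
        iterations += 1
        d = buf[i]                     # read ONE byte
        if d == 0x22 or d == 0x5C:     # closing quote or start of an escape
            break
        if strict and d <= 0x1F:       # raw control character
            break
        i += 1
    return i, iterations


# 64 ordinary bytes followed by the closing quote.
print(scalar_scan(b"x" * 64 + b'"'))   # (64, 65): 64 skipped bytes plus the read that finds '"'
```

The count grows linearly with the string length, which is the cost the word-at-a-time scan is meant to reduce.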

What SWAR does (8 bytes at a time, in one register)

SWAR is "SIMD within a register": load 8 bytes into a single uint64_t and test all 8 lanes at once with ordinary integer ops.

while (next + 8 <= len) {
    memcpy(&w, p + next, 8);                          // load 8 bytes as one 64-bit word
    mq = haszero(w ^ (0x22 * 0x0101010101010101));   // any lane == '"' ?
    ms = haszero(w ^ (0x5c * 0x0101010101010101));   // any lane == '\\' ?
    mc = strict ? haszero(w & 0xE0E0E0E0E0E0E0E0) : 0; // any lane < 0x20 ?
    if (mq | ms | mc) break;                          // something special in these 8 -> stop
    next += 8;                                        // nothing special -> skip all 8 at once
}
// then the scalar loop above runs to pin the exact byte and do the real work

Two tricks make this work:

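Broadcast: 0x22 * 0x0101010101010101 copies " (0x22) into all 8 byte lanes. XOR-ing the word with that constant turns "this lane equals 0x22" into "this lane is zero".
haszero(v) = (v - 0x0101…) & ~v & 0x8080…: a zero lane borrows and sets its high bit, and & ~v clears the lanes whose high bit was already set. The result is nonzero exactly when at least one lane is zero, so as a yes/no test it has no false positives or negatives. A borrow out of a zero lane can also flag an adjacent higher lane that holds 0x01, so the set bits do not pinpoint the position; that is harmless here because the loop only tests for nonzero. The control-character test masks each lane with 0xE0, since a byte is < 0x20 exactly when its top three bits are zero, which haszero again detects.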
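The two tricks can be tried out in Python by masking integers to 64 bits. This is a sketch under that assumption, not the C code from the PR; `broadcast`, `load_le`, and the constant names are hypothetical helpers introduced for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF   # emulate uint64_t wrap-around on Python ints
ONES = 0x0101010101010101     # 0x01 in every byte lane
HIGHS = 0x8080808080808080    # 0x80 in every byte lane


def broadcast(byte: int) -> int:
    """Copy one byte value into all 8 lanes of a 64-bit word."""
    return (byte * ONES) & MASK64


def haszero(v: int) -> int:
    """Nonzero if and only if at least one byte lane of v is zero."""
    return ((v - ONES) & MASK64) & (~v & MASK64) & HIGHS


def load_le(chunk: bytes) -> int:
    """Load up to 8 bytes as one little-endian integer (stands in for memcpy)."""
    return int.from_bytes(chunk, "little")


w = load_le(b"abc\"defg")
print(hex(broadcast(0x22)))                      # 0x2222222222222222
print(bool(haszero(w ^ broadcast(0x22))))        # True: one lane equals '"'
print(bool(haszero(w ^ broadcast(0x5C))))        # False: no backslash in the word
print(bool(haszero(w & broadcast(0xE0))))        # False: no byte below 0x20
```

Only the truth value of each result is used, which matches how the loop in the PR description treats the masks.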

One loop iteration answers "do any of these 8 bytes need attention?" in about 6 integer ops. If not, it advances 8 bytes, so the 64-character string takes 8 iterations instead of 64.

The key design point: SWAR only skips, it never decides

When the mask is nonzero, meaning a ", a \, or a control character sits somewhere in the 8-byte window, the loop breaks and the original scalar loop re-scans those 8 bytes to find the exact position and do the actual work: terminate the string, handle the escape, or raise the error at the right index. Every decode decision stays on the proven scalar path. SWAR is purely a fast-forward over the runs of ordinary characters that make up the bulk of most strings.
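The skip-then-hand-off structure can be modelled in a few lines of Python. This is a sketch, assuming masked 64-bit integers in place of uint64_t; `scalar_find` and `swar_find` are hypothetical names, and the final loop simply checks that both return the same stop index for a special byte placed at every offset across three windows.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
ONES = 0x0101010101010101
HIGHS = 0x8080808080808080
QUOTES = 0x22 * ONES          # '"' in every lane
BACKSLASHES = 0x5C * ONES     # '\' in every lane
TOP3 = 0xE0 * ONES            # top three bits of every lane


def haszero(v: int) -> int:
    return ((v - ONES) & MASK64) & (~v & MASK64) & HIGHS


def scalar_find(buf: bytes, i: int, strict: bool) -> int:
    """Per-byte scan: the only place where a stop position is decided."""
    while i < len(buf):
        d = buf[i]
        if d in (0x22, 0x5C) or (strict and d <= 0x1F):
            break
        i += 1
    return i


def swar_find(buf: bytes, i: int, strict: bool) -> int:
    """Skip whole 8-byte windows of ordinary bytes, then defer to scalar_find."""
    while i + 8 <= len(buf):
        w = int.from_bytes(buf[i:i + 8], "little")
        hit = haszero(w ^ QUOTES) | haszero(w ^ BACKSLASHES)
        if strict:
            hit |= haszero(w & TOP3)
        if hit:
            break             # something special in this window: stop skipping
        i += 8                # nothing special: skip all 8 bytes
    return scalar_find(buf, i, strict)


# Place each special byte at every offset and compare against the scalar scan.
for special in (0x22, 0x5C, 0x0A):
    for offset in range(24):
        data = bytearray(b"a" * 24)
        data[offset] = special
        for strict in (True, False):
            assert swar_find(bytes(data), 0, strict) == scalar_find(bytes(data), 0, strict)
```

Because `swar_find` never returns a position itself, any disagreement with the scalar scan could only come from skipping a window it should not have skipped.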

Worked example

Chunk hello wo (8 ordinary ASCII bytes):

Now: 8 iterations, 8 reads, about 24 comparisons.
SWAR: 1 memcpy, 3 haszero tests, all zero, next += 8. One iteration.

Chunk lo","wor (a " at offset 2):

SWAR: mq is nonzero, so it breaks. The scalar loop walks l, o, hits " at offset 2: identical to today, reached after skipping the prior runs.
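The two chunks can be traced with a small Python helper. This is an illustrative sketch with strict mode assumed on; `trace` is a hypothetical function, and the names mq, ms, and mc mirror the masks in the loop shown earlier.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
ONES = 0x0101010101010101
HIGHS = 0x8080808080808080


def haszero(v: int) -> int:
    return ((v - ONES) & MASK64) & (~v & MASK64) & HIGHS


def trace(chunk: bytes) -> dict:
    """Report which masks fire for one 8-byte window and where scalar would stop."""
    assert len(chunk) == 8
    w = int.from_bytes(chunk, "little")
    mq = haszero(w ^ (0x22 * ONES))    # any lane == '"' ?
    ms = haszero(w ^ (0x5C * ONES))    # any lane == '\' ?
    mc = haszero(w & (0xE0 * ONES))    # any lane < 0x20 ? (strict mode assumed)
    skip = not (mq | ms | mc)
    handoff = None
    if not skip:
        # The scalar loop, not the mask, finds the exact offset.
        handoff = next(i for i, d in enumerate(chunk)
                       if d in (0x22, 0x5C) or d <= 0x1F)
    return {"mq": bool(mq), "ms": bool(ms), "mc": bool(mc),
            "skip": skip, "handoff": handoff}


print(trace(b"hello wo"))   # all masks False, skip is True
print(trace(b'lo","wor'))   # mq is True, scalar hand-off stops at offset 2
```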

Why it is a win, and its limits

	Now (scalar)	SWAR
Chars per iteration	1	8
Ops per 8 ordinary chars	~32	~6
Finds exact special char	itself	hands off to scalar
Representation handled	all (UCS-1/2/4)	UCS-1 only; UCS-2/4 fall back to scalar

It helps in proportion to string length: large for long text and blob values, modest for medium strings, neutral for short keys where the loop barely runs.
It engages only for the 1-byte (ASCII/Latin-1) representation. The moment a string contains an emoji or CJK character it is UCS-2/UCS-4 and uses the unchanged scalar loop, which is why multi-kind input measures neutral, not regressed.
It uses the same masks CPython already applies for ASCII detection (ASCII_CHAR_MASK = 0x8080…, VECTOR_0101 = 0x0101… in unicodeobject.c, and UCS1_ASCII_CHAR_MASK in find_max_char.h), so it is not exotic for this codebase.
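Which representation a given string uses can be observed from Python. The sketch below assumes CPython's compact string layout (PEP 393) and infers the per-character width from sys.getsizeof; `bytes_per_char` is a hypothetical helper, and other Python implementations may report different sizes.

```python
import sys


def bytes_per_char(sample: str) -> int:
    """Infer the storage width (1, 2 or 4 bytes) CPython picks for a string.

    Pads the sample with two different amounts of ASCII and divides the size
    difference by the extra length, so the fixed object header cancels out.
    """
    shorter = sample + "a" * 100
    longer = sample + "a" * 200
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // 100


print(bytes_per_char("plain ascii"))   # 1: eligible for the word-at-a-time scan
print(bytes_per_char("caf\u00e9"))     # 1: Latin-1 is still one byte per character
print(bytes_per_char("\u4e2d\u6587"))  # 2: CJK forces the two-byte representation
print(bytes_per_char("\U0001F600"))    # 4: emoji forces the four-byte representation
```

A single wide character anywhere in the string widens every character in it, which is why such inputs take the unchanged scalar loop from start to finish.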

Mental model: today we ask "is this byte special?" once per byte; SWAR asks "is any of these eight bytes special?" once per eight, and drops to per-byte only when the answer is yes.

When and how this changes performance

json.loads, current decoder versus this change:

Document shape	Effect
One long text field (~11 KB string)	6.3x faster
Many 200-character ASCII string values	4.5x faster
Realistic mixed records (short and medium strings)	1.17x faster
Short keys, numbers, the pyperformance document	no change
Strings with emoji or other non-Latin-1 text	no change (scalar path)

The standard pyperformance bm_json_loads document is short-string and dict dominated and shows no change. The benefit lands on documents whose strings are long.

Correctness

Output is byte-identical to the current decoder, error positions included. Verified three ways: the full test_json suite; a 347-input differential corpus (real-world JSON, plus a quote, backslash, raw control character, escape, and \uXXXX placed at every offset across the eight-byte window in all three string representations, plus surrogate pairs, lone surrogates, embedded nulls, and truncated escapes); and all 340 files of nst/JSONTestSuite (318 parsing and 22 transform, including the must-reject and implementation-defined cases). Every value and every raised error position matched the current implementation.
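An offset sweep of the kind described can be sketched with the public json API alone. This is not the 347-input corpus from the PR; it is a hypothetical, much smaller check (`roundtrip_sweep` and `raw_control_sweep` are illustrative names) that places a special character at every offset across three 8-byte windows and compares against a known expected value.

```python
import json


def roundtrip_sweep(max_len: int = 24) -> int:
    """Round-trip strings with one special character at each offset."""
    checked = 0
    specials = ('"', "\\", "\n", "\u00e9", "\u4e2d", "\U0001F600")
    for special in specials:
        for offset in range(max_len):
            text = "a" * offset + special + "b" * (max_len - offset - 1)
            # ensure_ascii=True yields an all-ASCII document with \uXXXX escapes;
            # ensure_ascii=False keeps wide characters raw in the document.
            for ensure_ascii in (True, False):
                doc = json.dumps(text, ensure_ascii=ensure_ascii)
                assert json.loads(doc) == text
                checked += 1
    return checked


def raw_control_sweep(max_len: int = 24) -> int:
    """A raw tab must be rejected in strict mode and kept in non-strict mode."""
    checked = 0
    for offset in range(max_len):
        body = "a" * offset + "\t" + "b" * (max_len - offset - 1)
        doc = '"' + body + '"'
        try:
            json.loads(doc)
        except json.JSONDecodeError:
            pass
        else:
            raise AssertionError("strict mode accepted a raw control character")
        assert json.loads(doc, strict=False) == body
        checked += 1
    return checked


print(roundtrip_sweep(), raw_control_sweep())
```

A check like this only confirms decoded values and that errors are raised; comparing exact error positions against the previous decoder, as the PR reports doing, needs a build of both versions.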

Benchmark

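import json
import pyperf

# One synthetic document per string shape reported under "When and how this changes performance".
long_ascii = json.dumps(["x" * 200 for _ in range(200)])
text_blob = json.dumps({"body": "lorem ipsum dolor sit amet " * 400})
short_keys = json.dumps({f"k{i}": i for i in range(2000)})
mixed_real = json.dumps([{"id": i, "name": f"user_{i}", "email": f"u{i}@example.com", "bio": "hello " * 10} for i in range(300)])
# Note: json.dumps defaults to ensure_ascii=True, so the non-ASCII text below is
# written as \uXXXX escapes and this document is pure ASCII on input; pass
# ensure_ascii=False to feed json.loads a UCS-2/UCS-4 string instead.
multikind = json.dumps(["emoji 😀 中文 текст " * 20 for _ in range(200)])
docs = {"long_ascii_values": long_ascii, "huge_text_blob": text_blob,
        "short_keys": short_keys, "mixed_real": mixed_real, "multikind_emoji": multikind}
runner = pyperf.Runner()
for name, s in docs.items():
    # Bind s as a default argument so each lambda keeps its own document.
    runner.bench_func(f"loads/{name}", lambda s=s: json.loads(s))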

References for the bit tricks: Sean Anderson, Bit Twiddling Hacks (zero byte, byte equal to n, byte less than n); Henry S. Warren Jr., Hacker's Delight, 2nd ed., chapter 6.

This change is not the SIMD parsing backend from #142915: it adds no intrinsics, no CPU detection, and no build configuration, and it does not depend on #125022.

Resolves #150871.

Issue: Speed up JSON string decoding for documents with long string values #150871

scanstring_unicode scans each JSON string one character at a time for the closing quote, a backslash, or a control character. For the one-byte (ASCII/Latin-1) representation, skip eight bytes at a time with a word-at-a-time test using the same masks Objects/unicodeobject.c applies for ASCII scanning; the existing per-character loop then pins the exact byte and performs every decode decision. Two-byte and four-byte strings keep the current loop. Output is byte-identical, verified against test_json, a 347-input differential corpus, and all 340 nst/JSONTestSuite files. Long ASCII string values decode up to 6.3x faster; short keys, numbers, and non-Latin-1 strings are unaffected.

gaborbernat · 2026-06-03T22:05:14Z

Added test_long_string_scan_paths to test_json/test_decode.py (runs under both the Python and C decoders): a terminator, backslash escape, and \uXXXX escape at every offset across the scan windows in 1-byte and wider strings, plus strict and non-strict control-character handling at the window boundaries.

bedevere-app Bot mentioned this pull request Jun 3, 2026

Speed up JSON string decoding for documents with long string values #150871

Open

Reword the scan comment now that it is a proposed change

c17ac21

This was referenced Jun 3, 2026

Speed up JSON string encoding for documents with long string values #150875

Open

gh-150875: Speed up JSON string encoding for long ASCII strings #150876

Open

gaborbernat marked this pull request as ready for review June 3, 2026 18:22

bedevere-app Bot added the awaiting review label Jun 3, 2026

This was referenced Jun 3, 2026

Speed up JSON string encoding with ensure_ascii=False for long string values #150878

Open

gh-150878: Speed up json.dumps(ensure_ascii=False) for long strings #150879

Open

Add tests exercising the string-decode scan paths

8a7c6df

Cover long runs that cross the scan windows with a terminator, backslash escape and \uXXXX escape at every offset in 1-byte and wider strings, plus strict and non-strict control-character handling at the window boundaries.

gaborbernat wants to merge 3 commits into python:main from gaborbernat:opt/json-swar-string-scan
