Proposal
The C JSON decoder finds the end of every string by reading one character at a time, checking each for the closing quote, a backslash, or a control character (scanstring_unicode in Modules/_json.c). For a string with a long run of ordinary characters, such as a log line, a text field, or a base64 or embedded-document value, that loop does one read and several comparisons per byte while the machine's 64-bit registers sit mostly idle.
The proposal is to scan the one-byte (ASCII/Latin-1) representation eight bytes at a time: load eight bytes into a single machine word and test all eight at once for the closing quote, a backslash, or a control character. A run of ordinary characters then advances eight bytes per step. At the first byte that needs attention, the existing per-character loop takes over, so every decode decision stays on the current path. Two-byte and four-byte strings (anything containing a non-Latin-1 character) keep the current loop unchanged.
In one line: today the decoder asks "is this byte special?" once per byte; this asks "is any of these eight bytes special?" once per eight, and drops to per-byte only when the answer is yes.
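The word-at-a-time test can be sketched in Python with integers masked to 64 bits. This is only an illustration of the arithmetic, not the patch itself: the real change would live in C inside scanstring_unicode, and the names `word_has_special` and `find_special` are hypothetical. It assumes strict mode (control characters count as special) and the standard "has a byte below n" trick built on the 0x0101… and 0x8080… masks.

```python
ONES = 0x0101010101010101   # 0x01 in every byte lane
HIGHS = 0x8080808080808080  # 0x80 in every byte lane
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate a 64-bit register


def _has_byte_below(word, n):
    """Nonzero iff some byte of the 64-bit word is < n (valid for 0 <= n <= 128)."""
    return ((word - ONES * n) & ~word & HIGHS) & MASK64


def word_has_special(word):
    """True iff any of the eight bytes is '"', a backslash, or a control character."""
    quote = _has_byte_below(word ^ (ONES * 0x22), 1)      # lane becomes 0 where byte == '"'
    backslash = _has_byte_below(word ^ (ONES * 0x5C), 1)  # lane becomes 0 where byte == '\\'
    control = _has_byte_below(word, 0x20)                 # strict-mode class check
    return (quote | backslash | control) != 0


def find_special(data, start=0):
    """Index of the first byte needing attention, or len(data) if there is none."""
    i, n = start, len(data)
    # Fast path: skip eight ordinary bytes per step.
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        if word_has_special(word):
            break
        i += 8
    # Slow path: the per-byte loop pins down the exact position.
    while i < n:
        b = data[i]
        if b == 0x22 or b == 0x5C or b < 0x20:
            break
        i += 1
    return i
```

Here the fast path only answers "any special byte in this word?" and leaves the exact position to the per-byte loop, which mirrors the hand-off to the existing loop described above. Latin-1 bytes at or above 0x80 are treated as ordinary because the `& ~word` term clears their lanes.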
How this differs from the SIMD backend in #142915
This is not the SIMD parsing architecture declined in #142915. It uses no SIMD intrinsics, no runtime CPU detection, and no build configuration. It relies only on portable 64-bit integer arithmetic, with the same 0x0101… / 0x8080… masks that Objects/unicodeobject.c already applies for ASCII scanning. It changes one function and adds no infrastructure, so it does not depend on #125022 and needs no PEP.
The single-character find_char from fastsearch.h (adopted for the SRE prefix scanner in #148729) does not fit here: a JSON string scan stops at the first of three different bytes, and the strict-mode control-character test is a character-class check that a single-character search cannot express. A word-at-a-time mask handles the class in one operation.
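The three stop conditions are visible from Python through the public json API. The snippet below is a small illustration (the helper `try_loads` is hypothetical): a closing quote ends the string, a backslash starts an escape, and a raw control character is rejected only in strict mode, which is why the scan needs a class test rather than a search for one fixed byte.

```python
import json


def try_loads(doc, **kwargs):
    """Return ('ok', value) or ('error', message) for one JSON document."""
    try:
        return ("ok", json.loads(doc, **kwargs))
    except json.JSONDecodeError as exc:
        return ("error", exc.msg)


closing_quote = try_loads('"plain text"')              # scan stops at the closing quote
escaped = try_loads('"tab\\there"')                    # scan stops at the backslash escape
raw_strict = try_loads('"tab\there"')                  # raw tab: rejected in strict mode
raw_lenient = try_loads('"tab\there"', strict=False)   # raw tab: accepted when strict=False
```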
When it helps, and when it does not
Measured json.loads speedups against the current decoder:
| Document shape | Effect |
| --- | --- |
| One long text field (~11 KB string) | 6.3x faster |
| Many 200-character ASCII string values | 4.5x faster |
| Realistic mixed records (short and medium strings) | 1.17x faster |
| Short keys, numbers, the pyperformance document | no change |
| Strings with emoji or other non-Latin-1 text | no change (scalar path) |
The standard pyperformance bm_json_loads document is short-string and dict dominated, so it shows no change. The benefit is specific to documents whose payload is long text.
Correctness
The decoded output is byte-identical to the current decoder. A proof-of-concept patch is validated against test_json, a 347-input differential corpus (real-world JSON plus a special character placed at every offset across the eight-byte window, in all three string representations), and all 340 files of nst/JSONTestSuite (318 parsing plus 22 transform). Every value and every error position matches.
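The offset-sweep part of such a differential corpus could be generated along these lines. This is a sketch under assumptions, not the corpus used for the patch: the filler characters, the offset range, the tail length, and the names `build_corpus`, `outcome`, and `run_corpus` are all illustrative. Each document places one raw or escaped special at a chosen offset, using a filler character that forces the one-byte, two-byte, or four-byte string representation.

```python
import json

# Raw specials stop the scan; the escaped forms are their valid spellings.
TOKENS = ['"', "\\", "\n", '\\"', "\\\\", "\\n"]
# One filler per string representation: 1-byte (Latin-1), 2-byte, 4-byte.
FILLERS = {"latin1": "\u00e9", "ucs2": "\u4e2d", "ucs4": "\U0001F600"}


def build_corpus(max_offset=16, tail=9):
    """Place each token at every offset 0..max_offset, spanning two 8-byte windows."""
    corpus = []
    for kind, filler in FILLERS.items():
        for offset in range(max_offset + 1):
            for token in TOKENS:
                doc = '"' + filler * offset + token + filler * tail + '"'
                corpus.append((kind, offset, token, doc))
    return corpus


def outcome(doc):
    """Decoded value, or the error message and position, for one document."""
    try:
        return ("ok", json.loads(doc))
    except json.JSONDecodeError as exc:
        return ("error", exc.msg, exc.pos)


def run_corpus():
    """Map every document to its outcome under the running interpreter."""
    return {doc: outcome(doc) for _, _, _, doc in build_corpus()}
```

Running `run_corpus()` under the base and the patched interpreter and comparing the two mappings (for example after dumping them to files) would flag any difference in a decoded value or an error position.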
Relation to other json work
Independent of, and complementary to, the active number-parsing (#150639) and encoder (#150827) changes.
A proof-of-concept PR follows.
Benchmark
Built base and patched interpreters from this branch's main ancestor and the patch, ran the same script under each, and compared with pyperf compare_to (A/B by swapping Lib/json/decoder.py on the same build; macOS arm64, non-PGO).
import json, pyperf
# Ceiling probe: vary string length & kind to expose where SWAR string-scan helps.
long_ascii = json.dumps([("x"*200) for _ in range(200)]) # long ASCII values -> max win
text_blob = json.dumps({"body": "lorem ipsum dolor sit amet " * 400}) # one huge string
short_keys = json.dumps({f"k{i}": i for i in range(2000)}) # short keys -> minimal win
mixed_real = json.dumps([{"id":i,"name":f"user_{i}","email":f"u{i}@example.com","bio":"hello "*10} for i in range(300)])
multikind = json.dumps(["emoji 😀 中文 текст "*20 for _ in range(200)]) # UCS-2/4 -> scalar fallback (neutral check)
# pyperformance dataset
SD={'key1':0,'key2':True,'key3':'value','key4':'foo','key5':'string'}
ND={'key1':0,'key2':SD,'key3':'value','key4':SD,'key5':SD,'key':'ąćż'}
ppset=[(json.dumps({}),2000),(json.dumps(SD),1000),(json.dumps(ND),1000),(json.dumps([ND]*1000),1)]
docs = {"long_ascii_values": long_ascii, "huge_text_blob": text_blob,
        "short_keys": short_keys, "mixed_real": mixed_real, "multikind_emoji": multikind}
runner = pyperf.Runner()
for name, s in docs.items():
    # Bind s as a default argument so each lambda keeps its own document.
    runner.bench_func(f"loads/{name}", lambda s=s: json.loads(s))
def pp(items):
    # Replay the pyperformance workload: decode each document n times.
    for s, n in items:
        for _ in range(n):
            json.loads(s)
runner.bench_func("loads/pyperformance", pp, ppset)
Linked PRs