Speed up JSON string decoding for documents with long string values

### Proposal

The C JSON decoder finds the end of every string by reading one character at a time, checking each for the closing quote, a backslash, or a control character (`scanstring_unicode` in `Modules/_json.c`). For a string with a long run of ordinary characters, such as a log line, a text field, or a base64 or embedded-document value, that loop does one read and several comparisons per byte while the machine's 64-bit registers sit mostly idle.

The proposal is to scan the one-byte (ASCII/Latin-1) representation eight bytes at a time: load eight bytes into a single machine word and test all eight at once for the closing quote, a backslash, or a control character. A run of ordinary characters then advances eight bytes per step. At the first byte that needs attention, the existing per-character loop takes over, so every decode decision stays on the current path. Two-byte and four-byte strings (anything containing a non-Latin-1 character) keep the current loop unchanged.

In one line: today the decoder asks "is this byte special?" once per byte; this asks "is any of these eight bytes special?" once per eight, and drops to per-byte only when the answer is yes.
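The stride-then-fallback idea can be modeled in a few lines of Python. This is a minimal sketch, not the proof-of-concept patch: the real change is C inside `scanstring_unicode`, and the names `ONES`, `HIGHS`, `any_lane_below`, `word_has_special`, and `find_first_special` are hypothetical. It assumes the well-known "subtract, and-not, mask the high bits" word trick, which may differ in detail from what the patch does.

```python
ONES = 0x0101010101010101    # 0x01 repeated in every byte lane (sketch's own constant)
HIGHS = 0x8080808080808080   # 0x80 repeated in every byte lane (sketch's own constant)
MASK64 = 0xFFFFFFFFFFFFFFFF  # Python ints are unbounded, so emulate a 64-bit register


def any_lane_below(word: int, n: int) -> bool:
    """True iff at least one of the eight byte lanes of word is < n (0 < n <= 128)."""
    diff = (word - ONES * n) & MASK64
    return (diff & ~word & HIGHS) != 0


def word_has_special(word: int) -> bool:
    """True iff any lane is '"' (0x22), a backslash (0x5C), or a control char (< 0x20)."""
    quote = word ^ (ONES * 0x22)    # lanes equal to '"' become 0x00
    bslash = word ^ (ONES * 0x5C)   # lanes equal to '\' become 0x00
    return (any_lane_below(quote, 1)
            or any_lane_below(bslash, 1)
            or any_lane_below(word, 0x20))


def find_first_special(buf: bytes, start: int = 0) -> int:
    """Index of the first byte needing attention, or len(buf) if there is none."""
    i, n = start, len(buf)
    # Fast path: one question per eight bytes.
    while i + 8 <= n:
        word = int.from_bytes(buf[i:i + 8], "little")
        if word_has_special(word):
            break
        i += 8
    # Slow path: the per-byte loop pinpoints the byte and covers the tail.
    while i < n:
        b = buf[i]
        if b == 0x22 or b == 0x5C or b < 0x20:
            return i
        i += 1
    return n
```

The word test only answers yes or no; locating the exact byte is left to the per-byte loop, which mirrors the description above where the existing loop takes over at the first byte that needs attention.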

### How this differs from the SIMD backend in #142915

This is not the SIMD parsing architecture declined in #142915. It uses no SIMD intrinsics, no runtime CPU detection, and no build configuration. It relies only on portable 64-bit integer arithmetic, with the same `0x0101…` / `0x8080…` masks that `Objects/unicodeobject.c` already applies for ASCII scanning. It changes one function and adds no infrastructure, so it does not depend on #125022 and needs no PEP.

The single-character `find_char` from `fastsearch.h` (adopted for the SRE prefix scanner in #148729) does not fit here: a JSON string scan stops at the first of three different bytes, and the strict-mode control-character test is a character-class check that a single-character search cannot express. A word-at-a-time mask handles the class in one operation.
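To make the mismatch concrete, here is a small Python illustration. It uses `bytes.find` as a stand-in for a single-character search (it is not `find_char` from `fastsearch.h`), and both function names are hypothetical. Expressing the JSON stop condition with single-character searches takes one pass per candidate byte, while the class test states it once per position.

```python
def first_special_by_single_searches(buf: bytes, start: int = 0) -> int:
    """Stop position computed with one single-character search per candidate byte."""
    targets = [0x22, 0x5C, *range(0x20)]  # quote, backslash, 32 control characters
    hits = [buf.find(bytes([t]), start) for t in targets]  # 34 separate passes
    found = [h for h in hits if h != -1]
    return min(found, default=len(buf))


def first_special_by_class_test(buf: bytes, start: int = 0) -> int:
    """Stop position computed with one combined test per byte."""
    for i in range(start, len(buf)):
        b = buf[i]
        if b == 0x22 or b == 0x5C or b < 0x20:
            return i
    return len(buf)
```

Both functions return the same index, but each of the 34 searches in the first can run well past the real stop point before reporting its own hit or a miss.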

### When it helps, and when it does not

Measured `json.loads` speedups against the current decoder:

| Document shape | Effect |
|---|---|
| One long text field (~11 KB string) | 6.3x faster |
| Many 200-character ASCII string values | 4.5x faster |
| Realistic mixed records (short and medium strings) | 1.17x faster |
| Short keys, numbers, the pyperformance document | no change |
| Strings with emoji or other non-Latin-1 text | no change (scalar path) |

The standard `pyperformance bm_json_loads` document is short-string and dict dominated, so it shows no change. The benefit is specific to documents whose payload is long text.

### Correctness

The decoded output is byte-identical to the current decoder. A proof-of-concept patch is validated against `test_json`, a 347-input differential corpus (real-world JSON plus a special character placed at every offset across the eight-byte window, in all three string representations), and all 340 files of `nst/JSONTestSuite` (318 parsing plus 22 transform). Every value and every error position matches.
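The offset-sweep part of that corpus can be sketched as follows. This is an assumption-laden stand-in, not the actual harness: it compares the stock C scanner (`json.decoder.c_scanstring`) with the pure-Python `json.decoder.py_scanstring` rather than a patched build with an unpatched one, it checks only the decoded value, the end index, and whether an error is raised (error positions are not compared here), and `FILLERS`, `SPECIALS`, `outcome`, and `differential_mismatches` are hypothetical names.

```python
import json
import json.decoder as jd

reference = jd.py_scanstring
# c_scanstring is None when the C accelerator is unavailable.
candidate = jd.c_scanstring or jd.py_scanstring

# One filler character per string representation (1-, 2-, and 4-byte).
FILLERS = {"one-byte": "x", "two-byte": "中", "four-byte": "😀"}
# Escapes that must decode, plus raw control characters that strict mode rejects.
SPECIALS = ['\\"', "\\\\", "\\n", "\\u0041", "\x01", "\t"]


def outcome(scan, text):
    """Result of scanning a quoted string body, or a marker if it is rejected."""
    try:
        return ("ok", scan(text, 1, True))
    except json.JSONDecodeError:
        return ("error",)


def differential_mismatches(width=16):
    """Place each special at every offset of a window and collect disagreements."""
    mismatches = []
    for kind, filler in FILLERS.items():
        for special in SPECIALS:
            for offset in range(width + 1):
                body = filler * offset + special + filler * (width - offset)
                text = '"' + body + '"'
                if outcome(reference, text) != outcome(candidate, text):
                    mismatches.append((kind, special, offset))
    return mismatches
```

A window of 16 characters covers two full eight-byte words, so each special lands in every lane both before and after a word boundary.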

### Relation to other json work

This change is independent of, and complementary to, the active number-parsing (#150639) and encoder (#150827) changes.

A proof-of-concept PR follows.

### Benchmark

The base interpreter was built from this branch's `main` ancestor and the patched interpreter from the patch. The same script ran under each, and the results were compared with `pyperf compare_to` (A/B by swapping `Lib/json/decoder.py` on the same build; macOS arm64, non-PGO).

```python
import json

import pyperf

# Ceiling probe: vary string length and kind to show where the SWAR string scan helps.
long_ascii = json.dumps(["x" * 200 for _ in range(200)])  # long ASCII values -> max win
text_blob = json.dumps({"body": "lorem ipsum dolor sit amet " * 400})  # one huge string
short_keys = json.dumps({f"k{i}": i for i in range(2000)})  # short keys -> minimal win
mixed_real = json.dumps([
    {"id": i, "name": f"user_{i}", "email": f"u{i}@example.com", "bio": "hello " * 10}
    for i in range(300)
])
# Non-Latin-1 text, intended as the UCS-2/UCS-4 scalar-fallback (neutral) check.
# Note: json.dumps defaults to ensure_ascii=True, so these characters are written
# as \uXXXX escapes; pass ensure_ascii=False to feed raw non-Latin-1 text.
multikind = json.dumps(["emoji 😀 中文 текст " * 20 for _ in range(200)])

# pyperformance dataset: (document, repeat count) pairs
SD = {'key1': 0, 'key2': True, 'key3': 'value', 'key4': 'foo', 'key5': 'string'}
ND = {'key1': 0, 'key2': SD, 'key3': 'value', 'key4': SD, 'key5': SD, 'key': 'ąćż'}
ppset = [(json.dumps({}), 2000), (json.dumps(SD), 1000),
         (json.dumps(ND), 1000), (json.dumps([ND] * 1000), 1)]

docs = {"long_ascii_values": long_ascii, "huge_text_blob": text_blob,
        "short_keys": short_keys, "mixed_real": mixed_real, "multikind_emoji": multikind}


def pp(items):
    """Decode every pyperformance document its configured number of times."""
    for s, n in items:
        for _ in range(n):
            json.loads(s)


runner = pyperf.Runner()
for name, s in docs.items():
    # Bind s as a default argument so each lambda keeps its own document.
    runner.bench_func(f"loads/{name}", lambda s=s: json.loads(s))
runner.bench_func("loads/pyperformance", pp, ppset)
```


### Linked PRs
* gh-150872
