Speed up JSON string encoding for documents with long string values #150875

@gaborbernat

Feature or enhancement

Proposal

json.dumps escapes each string by first scanning it one character at a time to compute the escaped size (ascii_escape_size in Modules/_json.c); when nothing needs escaping, write_escaped_ascii then copies the string verbatim. For a long string with no characters that need escaping, which is the common case for text values, log messages, and other long content, that per-character sizing scan is pure overhead before the verbatim copy.
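To make the cost of that sizing pass concrete, here is a rough Python model of what a per-character size computation looks like in ensure_ascii=True mode. The function name `escaped_size` and the constant `TWO_CHAR_ESCAPES` are hypothetical, and this is a sketch of the observable behaviour, not a transcription of ascii_escape_size.

```python
import json

# Characters that json.dumps writes as a two-character escape such as \n or \".
TWO_CHAR_ESCAPES = frozenset('\\"\b\f\n\r\t')


def escaped_size(s: str) -> int:
    """Model of the sizing pass: length of the quoted, escaped output."""
    size = 2  # opening and closing quote
    for ch in s:
        c = ord(ch)
        if 0x20 <= c <= 0x7E and ch not in '\\"':
            size += 1   # copied verbatim
        elif ch in TWO_CHAR_ESCAPES:
            size += 2   # backslash plus one character
        elif c >= 0x10000:
            size += 12  # surrogate pair: two \uXXXX escapes
        else:
            size += 6   # a single \uXXXX escape
    return size
```

For an escape-free string the loop only ever takes the first branch, so the result is always `len(s) + 2`; for example, `escaped_size("x" * 200)` is 202, which is what the proposed fast path would return without visiting each character.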

The proposal is to detect the no-escape case on the one-byte (ASCII/Latin-1) representation eight bytes at a time. Load eight bytes into a single machine word and test all eight at once for a character that needs escaping (c < 0x20, c > 0x7e, c == '"', or c == '\\'). When a long run has none, return the verbatim size directly. A length guard keeps short strings, such as the typical dict key, on the existing per-character loop, where the eight-byte path would not pay for its setup. Strings that need escaping, and two-byte and four-byte strings (anything with a non-Latin-1 character), keep the current path.
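The eight-bytes-at-a-time test can be sketched in Python by emulating 64-bit word arithmetic on integers. This is an illustration of the general technique, not the patch itself: the helper names and the `MIN_SWAR_LEN` threshold of 32 are assumptions chosen for the example, and the real change would be C code inside Modules/_json.c.

```python
ONES = 0x0101010101010101    # 0x01 in every byte lane
HIGHS = 0x8080808080808080   # 0x80 in every byte lane
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate C uint64_t wraparound
MIN_SWAR_LEN = 32            # hypothetical guard; the real threshold is not stated


def _has_less(w: int, n: int) -> int:
    """Nonzero if any byte of w is < n (valid for 0 <= n <= 128)."""
    return ((w - ONES * n) & MASK64) & (w ^ MASK64) & HIGHS


def _has_more(w: int, n: int) -> int:
    """Nonzero if any byte of w is > n (valid for 0 <= n <= 127)."""
    return (((w + ONES * (127 - n)) & MASK64) | w) & HIGHS


def word_needs_escape(w: int) -> bool:
    """True if any of the eight bytes is < 0x20, > 0x7e, '"' or '\\'."""
    return bool(
        _has_less(w, 0x20)
        | _has_more(w, 0x7E)
        | _has_less(w ^ (ONES * 0x22), 1)   # a zero lane means byte == '"'
        | _has_less(w ^ (ONES * 0x5C), 1)   # a zero lane means byte == '\\'
    )


def scalar_needs_escape(b: int) -> bool:
    """Per-byte reference check, used for short strings and tails."""
    return b < 0x20 or b > 0x7E or b == 0x22 or b == 0x5C


def is_escape_free(data: bytes) -> bool:
    """True if no byte of a one-byte (Latin-1) string needs escaping."""
    i, n = 0, len(data)
    if n >= MIN_SWAR_LEN:  # short strings stay on the per-byte loop
        while i + 8 <= n:
            if word_needs_escape(int.from_bytes(data[i:i + 8], "little")):
                return False
            i += 8
    return not any(scalar_needs_escape(b) for b in data[i:])
```

The individual lane tests can raise false flags in lanes above a true hit because of borrow or carry propagation, but the any-byte answer is still exact, which is all the no-escape check needs. When `is_escape_free(s.encode("latin-1"))` is true, the verbatim size is simply `len(s) + 2`.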

This is the encode-side counterpart to the decode-side scan in #150871 (PR #150872). The two touch different code paths, so they are separate changes.

How this differs from the SIMD backend in #142915

It is not the SIMD parsing architecture declined in #142915. It uses no SIMD intrinsics, no runtime CPU detection, and no build configuration, only portable 64-bit integer arithmetic with the same 0x0101… / 0x8080… masks that Objects/unicodeobject.c already applies for ASCII scanning. It changes one function and adds no infrastructure, so it does not depend on #125022 and needs no PEP.

When it helps, and when it does not

Measured json.dumps speedups against the current encoder:

| Document shape | Effect |
| --- | --- |
| One long text field (~11 KB string) | 5.3x faster |
| Many 200-character ASCII string values | 3.1x faster |
| Realistic mixed records (short and medium strings) | 1.3x faster |
| Short keys, strings that need escaping, the pyperformance document | no change |
| Strings with emoji or other non-Latin-1 text | no change (scalar path) |

The benefit is specific to documents whose payload is long, escape-free text. The short-string guard keeps key-heavy documents unaffected.

Correctness

The encoded output is byte-identical to the current encoder. The patch was validated against test_json and a 199-case differential corpus: strings that place each escape-relevant character (", \, control characters, 0x7f, and non-Latin-1 characters) at every offset across the eight-byte window, in both ensure_ascii=True and ensure_ascii=False modes. Every output matched.
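A differential check of this kind might look like the sketch below. It is not the 199-case corpus from the issue: the probe list, the padding, and the function names are assumptions. It compares the pure-Python string encoders against whichever encoders the interpreter actually uses, via the undocumented module-level names in `json.encoder`.

```python
from json import encoder

# Hypothetical probe set: one representative per escape-relevant class.
PROBES = ['"', "\\", "\x00", "\x1f", "\n", "\x7f", "\xe9", "\u4e2d", "\U0001f600"]


def build_corpus(window: int = 8, pad: int = 24) -> list[str]:
    """Place each probe at every offset within an eight-byte window."""
    cases = []
    for probe in PROBES:
        for offset in range(window):
            head = "x" * (pad + offset)
            tail = "x" * (pad + window - offset)
            cases.append(head + probe + tail)
    return cases


def find_mismatches(cases: list[str]) -> list[tuple[str, str]]:
    """Return (reference encoder name, input) for every differing output."""
    pairs = [
        # ensure_ascii=True path
        (encoder.py_encode_basestring_ascii, encoder.encode_basestring_ascii),
        # ensure_ascii=False path
        (encoder.py_encode_basestring, encoder.encode_basestring),
    ]
    return [
        (ref.__name__, s)
        for ref, active in pairs
        for s in cases
        if ref(s) != active(s)
    ]
```

Running `find_mismatches(build_corpus())` under the base and the patched interpreter should return an empty list in both cases if the output is unchanged.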

A proof-of-concept PR follows.

Benchmark

Built base and patched interpreters from this branch's main ancestor and the patch, ran the same script under each, and compared with pyperf compare_to (A/B by swapping Lib/json/encoder.py on the same build; macOS arm64, non-PGO).

```python
import json
import pyperf

long_ascii = ["x" * 200 for _ in range(200)]                # long no-escape ASCII values
text_blob = {"body": "lorem ipsum dolor sit amet " * 400}   # one huge no-escape string
escaped = ['a"b\\c\n' * 30 for _ in range(200)]             # escape-heavy
short_keys = {f"k{i}": i for i in range(2000)}              # short keys
mixed_real = [                                              # short and medium strings
    {"id": i, "name": f"user_{i}", "email": f"u{i}@x.com", "bio": "hello world " * 10}
    for i in range(300)
]
nonascii = ["café 😀 中文 " * 20 for _ in range(200)]        # UCS-2/4 (scalar path)
objs = {
    "long_ascii": long_ascii, "text_blob": text_blob, "escaped": escaped,
    "short_keys": short_keys, "mixed_real": mixed_real, "nonascii": nonascii,
}

runner = pyperf.Runner()
for name, obj in objs.items():
    # Bind obj as a default argument so each lambda keeps its own document.
    runner.bench_func(f"dumps/{name}", lambda o=obj: json.dumps(o))
```

Labels: extension-modules (C modules in the Modules dir), performance (Performance or resource usage), type-feature (A feature request or enhancement)