Skip materializing parsed_number_string_t spans on the hot path (addresses #384) #386

Open: fcostaoliveira wants to merge 2 commits into fastfloat:main from redis-performance:pr/lazy-spans-coldpath

fcostaoliveira commented Jun 3, 2026

Summary

parsed_number_string_t carries two span<UC const> members (integer, fraction) that are read only on the rare slow paths: digit_comp and the >19-significant-digit truncation recompute. They are nevertheless written on every parse, which forces the ~56/64-byte struct to be materialized and marshaled through the by-value return. On the hot path that surfaces as backend/store pressure. This addresses #384 ("the structure parsed_number_string_t is ... probably too fat for its own good").

What changed

  • parse_number_string gains a runtime bool store_spans = true parameter (default keeps every existing caller unchanged). When false, the integer/fraction span stores and the span-reading >19-digit recompute are skipped.
  • from_chars_float_advanced parses with store_spans = false, attempts Clinger + Eisel-Lemire inline, and routes the two rare slow branches (too_many_digits, am.power2 < 0) to a single fastfloat_noinline (noinline+cold) helper that re-parses with spans and calls the unchanged from_chars_advanced.
  • New fastfloat_noinline macro in float_common.h.
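
The shape of the first bullet can be pictured with a toy scanner. This is a minimal sketch, not fast_float's code: `toy_span`, `toy_parsed`, and `toy_parse` are hypothetical stand-ins, and the parser accepts only unsigned `ddd[.ddd]` input with no overflow handling. It only illustrates a defaulted runtime flag that gates the span stores while the mantissa and exponent are computed identically either way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for span<UC const>; not fast_float's type.
struct toy_span {
  const char *ptr;
  std::size_t len;
};

// Hypothetical stand-in for parsed_number_string_t.
struct toy_parsed {
  std::uint64_t mantissa = 0;
  std::int64_t exponent = 0;        // power of ten applied to mantissa
  bool valid = false;
  toy_span integer{nullptr, 0};     // written only when store_spans is true
  toy_span fraction{nullptr, 0};    // written only when store_spans is true
};

inline bool toy_is_digit(char c) { return c >= '0' && c <= '9'; }

// Toy scanner for "ddd[.ddd]". The default keeps existing callers unchanged;
// passing false skips the span stores. Overflow is NOT handled (sketch only).
inline toy_parsed toy_parse(const char *first, const char *last,
                            bool store_spans = true) {
  assert(first <= last);
  toy_parsed out;
  const char *p = first;
  const char *int_begin = p;
  while (p != last && toy_is_digit(*p)) {
    out.mantissa = out.mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
    ++p;
  }
  const char *int_end = p;
  const char *frac_begin = p;
  if (p != last && *p == '.') {
    ++p;
    frac_begin = p;
    while (p != last && toy_is_digit(*p)) {
      out.mantissa = out.mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
      ++p;
    }
    out.exponent = -static_cast<std::int64_t>(p - frac_begin);
  }
  if (store_spans) {  // the only work the flag removes
    out.integer = {int_begin, static_cast<std::size_t>(int_end - int_begin)};
    out.fraction = {frac_begin, static_cast<std::size_t>(p - frac_begin)};
  }
  out.valid = (int_end != int_begin) || (p != frac_begin);
  return out;
}
```

A caller that needs the spans later can simply call `toy_parse` again with `store_spans = true`; both calls produce the same mantissa and exponent.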

Two deliberate choices:

  • Runtime flag, not a template parameter — a template would create a second instantiation of the whole scanner whose icache cost wipes out the gain.
  • noinline cold slow path — the rare re-parse must stay out of line, or the force-inlined hot scanner gets duplicated into the caller; that bloats the hot frame and lengthens the loop-carried dependency chain, which regresses some targets (notably ARM gcc) even though it removes the spill.
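
The second choice, routing the rare branch to an out-of-line cold helper, can be sketched on a toy integer parser. Everything here is hypothetical (`demo_noinline`, `demo_result`, `parse_slow`, `parse_u64`); the macro only mirrors the attributes named in the Notes and is not claimed to match the PR's definition. The hot function handles inputs of up to 19 digits, which cannot overflow 64 bits, and hands longer inputs to the helper.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical macro; an assumption about how such a macro could be written.
#if defined(_MSC_VER) && !defined(__clang__)
#define demo_noinline __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define demo_noinline __attribute__((noinline, cold))
#else
#define demo_noinline
#endif

struct demo_result {
  std::uint64_t value;
  bool used_slow_path;
};

// Rare branch: more than 19 digits. Kept out of line so its body is emitted
// once instead of being copied into every inlined caller. Saturates on overflow.
demo_noinline demo_result parse_slow(const char *first, const char *last) {
  const std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
  std::uint64_t v = 0;
  for (const char *p = first; p != last; ++p) {
    const std::uint64_t d = static_cast<std::uint64_t>(*p - '0');
    if (v > (max - d) / 10) {
      return {max, true};
    }
    v = v * 10 + d;
  }
  return {v, true};
}

// Hot path: assumes the input consists only of ASCII digits (sketch only).
inline demo_result parse_u64(const char *first, const char *last) {
  assert(first <= last);
  if (last - first > 19) {
    return parse_slow(first, last);  // a plain call, not an inlined copy
  }
  std::uint64_t v = 0;
  for (const char *p = first; p != last; ++p) {
    v = v * 10 + static_cast<std::uint64_t>(*p - '0');
  }
  return {v, false};
}
```

The hot loop stays small because the only trace of the slow branch in the caller is a length compare and a call.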

Public from_chars / from_chars_advanced / parsed_number_string_t are unchanged.

Performance

Per-parser microbench (from_chars<double> in a tight loop over each dataset, median of 5 runs, pinned core). Each cell shows base → this PR in MB/s (Δ%), where base is the current tip:

| Target | random | canada | mesh |
|---|---|---|---|
| ARM Neoverse-V2 (Graviton4), gcc | 1087 → 1907 (+75%) | 948 → 1645 (+73%) | 503 → 1489 (+196%) |
| ARM Neoverse-V2 (Graviton4), clang | 1347 → 1449 (+8%) | 1049 → 1135 (+8%) | 879 → 976 (+11%) |
| Intel Ice Lake (Xeon 8360Y), gcc | 1142 → 1358 (+19%) | 973 → 1138 (+17%) | 814 → 955 (+17%) |
| Intel Cascade Lake (Xeon 6248), gcc | 681 → 800 (+18%) | 599 → 705 (+18%) | 448 → 548 (+22%) |
| Intel Cascade Lake (Xeon 6248), clang | 528 → 595 (+13%) | 431 → 528 (+23%) | 311 → 365 (+17%) |

float mirrors double (e.g. ARM gcc float: +72% / +71% / +172%). The win is largest where the base codegen spilled the struct most (ARM gcc); clang baselines that already partly avoided the spill gain less. (Drift-controlled: the unchanged ffc-vs-fast_float control row was flat across base/patch on these nodes.)

Intel TMA (top-down), Ice Lake (Xeon 8360Y), short floats (mesh)

Isolated fast_float microbench under perf stat -M TopdownL1/L2:

| Metric | base | this PR |
|---|---|---|
| Backend-Bound | 26.0% | 2.2% |
| Retiring | 60.3% | 77.3% |
| Pipeline slots (TOPDOWN.SLOTS) | 37.2 B | 23.7 B (−36%) |
| Wall time (same work) | 2.41 s | 1.53 s (−37%) |

The base spends 26% of pipeline slots backend-bound on the span spill; this PR collapses that to 2.2% and lifts retiring to 77%, with 36% fewer issued slots. That is the microarchitectural mechanism behind the throughput numbers above.

Correctness

  • Full float-exhaustive suite passes: exhaustive32, exhaustive32_64, exhaustive32_midpoint, random64 — all "all ok".
  • A 2³² single-precision sweep is byte-identical to the current tip.
  • Core + supplemental tests pass under -Werror -Wall -Wextra -Wconversion.

Equivalence reasoning: there are two re-parse triggers. (1) With store_spans=false and more than 19 digits, the mantissa is left un-truncated, but too_many_digits is set and the caller re-parses with spans before reading it. (2) The am.power2<0 re-parse re-runs Clinger, but Clinger is a pure function of (mantissa, exponent, negative, T), none of which store_spans affects when !too_many_digits. A Clinger attempt that failed on the hot path therefore fails again, and digit_comp reproduces the original result from the re-materialized spans. answer.ptr/ec are set identically on every path.
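
The purity argument can be made concrete with a simplified Clinger-style fast path for double. This is a textbook sketch under stated assumptions (round-to-nearest mode, double-precision evaluation of the multiply/divide), not the library's implementation, which has additional cases; `toy_clinger` and `kPow10` are hypothetical names. The result depends only on the three arguments, so repeating the call after a re-parse that yields the same mantissa and exponent gives the same outcome, including the same failure.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Simplified Clinger-style fast path (assumption: round-to-nearest, no
// extended-precision intermediates). Returns nullopt when the inputs fall
// outside the range where a single exact multiply or divide suffices.
inline std::optional<double> toy_clinger(std::uint64_t mantissa,
                                         std::int64_t exponent,
                                         bool negative) {
  // Powers of ten up to 10^22 are exactly representable as double.
  static constexpr double kPow10[] = {
      1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
      1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
  constexpr std::uint64_t max_mantissa = std::uint64_t(1) << 53;
  if (mantissa > max_mantissa || exponent < -22 || exponent > 22) {
    return std::nullopt;  // caller must fall back to a slower algorithm
  }
  double v = static_cast<double>(mantissa);  // exact: mantissa <= 2^53
  v = exponent < 0 ? v / kPow10[-exponent] : v * kPow10[exponent];
  return negative ? -v : v;
}
```

No state outside the arguments is read, which is the property the equivalence argument relies on.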

Notes

  • fastfloat_noinline expands to __attribute__((noinline, cold)) or __declspec(noinline), depending on the compiler. The attribute is ignored during constant evaluation, and the constexpr from_chars tests pass on gcc 13.3 and clang 18.1 (the MSVC/Alpine/MINGW CI here is also green).
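
The interaction with constant evaluation can be checked in isolation. In this minimal sketch (C++14 or later), `demo_cold_noinline` and `digit_count` are hypothetical names, and the macro is an assumption modeled on the attributes listed above. It shows that a function carrying the attribute can still be evaluated inside a `static_assert`.

```cpp
#include <cassert>

// Hypothetical macro modeled on the attributes named above.
#if defined(_MSC_VER) && !defined(__clang__)
#define demo_cold_noinline __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define demo_cold_noinline __attribute__((noinline, cold))
#else
#define demo_cold_noinline
#endif

// The attribute only affects code generation for runtime calls; the compiler
// still evaluates the function at compile time when a constant is required.
demo_cold_noinline constexpr int digit_count(unsigned v) {
  int n = 1;
  while (v >= 10) {
    v /= 10;
    ++n;
  }
  return n;
}

static_assert(digit_count(12345u) == 5,
              "evaluated at compile time despite noinline");
```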

parsed_number_string_t carries two span<UC const> members (integer, fraction)
that are only read on the rare slow paths (digit_comp, and the >19-significant-
digit truncation recompute). Materializing them on every parse forces the ~56/64-
byte struct to be written out and marshaled through the by-value return, which
shows up as backend/store pressure on the hot path.

This adds a runtime `store_spans` flag (default true, so all existing callers are
unchanged) to parse_number_string; from_chars_float_advanced parses with it false,
attempts the Clinger and Eisel-Lemire fast paths inline, and only re-parses with
spans on the two rare slow branches. The re-parse is pushed into a single
`fastfloat_noinline` (noinline+cold) helper so the force-inlined hot scanner is
emitted once rather than duplicated into the caller (without this the extra inline
copies regress some targets, e.g. ARM gcc, by bloating the hot frame and lengthening
the loop-carried dependency chain).

A runtime flag is used deliberately rather than a template parameter: a template
would create a second instantiation of the whole scanner whose icache cost wipes
out the gain.

Measured (per-parser microbench, median of 5, pinned core), fast_float from_chars
<double>/<float>, vs the current tip:
  - Intel Ice Lake (Xeon 8360Y): +17-19% (gcc), Intel TMA shows backend-bound
    26.0% -> 2.2% and retiring 60.3% -> 77.3% on short floats (the eliminated span
    spill), with -36% pipeline slots.
  - Intel Cascade Lake (Xeon 6248): +18-22% (gcc), +13-23% (clang).
  - ARM Neoverse-V2 (Graviton4): +73-196% (gcc), +8-11% (clang) -- the struct spill
    dominated the gcc hot loop there.
Correctness: the full float exhaustive suite (exhaustive32, exhaustive32_64,
exhaustive32_midpoint, random64) passes, and a 2^32 sweep is byte-identical to the
current tip. Public from_chars / from_chars_advanced / parsed_number_string_t are
unchanged.