Skip materializing parsed_number_string_t spans on the hot path (addresses #384) by fcostaoliveira · Pull Request #386 · fastfloat/fast_float

fcostaoliveira · 2026-06-03T08:32:26Z

Summary

parsed_number_string_t carries two span<UC const> members (integer, fraction) that are read only on the rare slow paths — digit_comp, and the >19-significant-digit truncation recompute. But they are written on every parse, which forces the ~56/64-byte struct to be materialized and marshaled through the by-value return. On the hot path that surfaces as backend/store pressure. This addresses #384 ("the structure parsed_number_string_t is ... probably too fat for its own good").

What changed

parse_number_string gains a runtime bool store_spans = true parameter (default keeps every existing caller unchanged). When false, the integer/fraction span stores and the span-reading >19-digit recompute are skipped.
from_chars_float_advanced parses with store_spans = false, attempts Clinger + Eisel-Lemire inline, and routes the two rare slow branches (too_many_digits, am.power2 < 0) to a single fastfloat_noinline (noinline+cold) helper that re-parses with spans and calls the unchanged from_chars_advanced.
New fastfloat_noinline macro in float_common.h.

Two deliberate choices:

Runtime flag, not a template parameter — a template would create a second instantiation of the whole scanner whose icache cost wipes out the gain.
noinline cold slow path — the rare re-parse must stay out of line, or the force-inlined hot scanner gets duplicated into the caller; that bloats the hot frame and lengthens the loop-carried dependency chain, which regresses some targets (notably ARM gcc) even though it removes the spill.

Public from_chars / from_chars_advanced / parsed_number_string_t are unchanged.
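The shape of the change can be sketched on a toy digit scanner. This is a minimal illustration under stated assumptions, not fast_float's code: `toy_parse`, `toy_slow`, `toy_convert`, `toy_parsed`, `toy_result` and `TOY_NOINLINE` are hypothetical names, and the "slow path" here only truncates to the first 19 digits, standing in for the real span-reading recompute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for the PR's fastfloat_noinline macro.
#if defined(_MSC_VER)
#define TOY_NOINLINE __declspec(noinline)
#else
#define TOY_NOINLINE __attribute__((noinline, cold))
#endif

struct toy_parsed {
  std::uint64_t mantissa = 0;
  bool too_many_digits = false;
  std::string_view integer{}; // written only when store_spans is true
};

struct toy_result {
  std::uint64_t leading; // value of the first (up to) 19 digits
  std::size_t dropped;   // digits beyond the 19th
  bool took_slow_path;
};

// Scan leading decimal digits; the span store is skipped on request.
inline toy_parsed toy_parse(std::string_view s, bool store_spans = true) {
  toy_parsed r;
  std::size_t i = 0;
  while (i < s.size() && s[i] >= '0' && s[i] <= '9') {
    // Unsigned arithmetic wraps past 19 digits; too_many_digits flags that case.
    r.mantissa = r.mantissa * 10 + std::uint64_t(s[i] - '0');
    ++i;
  }
  r.too_many_digits = i > 19;
  if (store_spans) r.integer = s.substr(0, i);
  return r;
}

// Rare path, kept out of line: re-parse with spans and read them.
TOY_NOINLINE toy_result toy_slow(std::string_view s) {
  toy_parsed p = toy_parse(s, /*store_spans=*/true);
  assert(p.integer.size() > 19);
  std::uint64_t v = 0;
  for (std::size_t i = 0; i < 19; ++i)
    v = v * 10 + std::uint64_t(p.integer[i] - '0');
  return {v, p.integer.size() - 19, true};
}

// Hot path: parse without spans, fall back only when the flag is set.
inline toy_result toy_convert(std::string_view s) {
  toy_parsed p = toy_parse(s, /*store_spans=*/false);
  if (!p.too_many_digits) return {p.mantissa, 0, false};
  return toy_slow(s);
}
```

The default `store_spans = true` is what keeps existing callers of the scanner unchanged; only the new hot entry point passes `false`.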

Performance

Per-parser microbench (from_chars<double> in a tight loop over each dataset, median of 5, pinned core). Each cell is base → this PR in MB/s (Δ%), where base is the current tip:

Target	random	canada	mesh
ARM Neoverse-V2 (Graviton4) gcc	1087 → 1907 (+75%)	948 → 1645 (+73%)	503 → 1489 (+196%)
ARM Neoverse-V2 (Graviton4) clang	1347 → 1449 (+8%)	1049 → 1135 (+8%)	879 → 976 (+11%)
Intel Ice Lake (Xeon 8360Y) gcc	1142 → 1358 (+19%)	973 → 1138 (+17%)	814 → 955 (+17%)
Intel Cascade Lake (Xeon 6248) gcc	681 → 800 (+18%)	599 → 705 (+18%)	448 → 548 (+22%)
Intel Cascade Lake (Xeon 6248) clang	528 → 595 (+13%)	431 → 528 (+23%)	311 → 365 (+17%)

float mirrors double (e.g. ARM gcc float: +72% / +71% / +172%). The win is largest where the base codegen spilled the struct most (ARM gcc); clang baselines that already partly avoided the spill gain less. (Drift-controlled: the unchanged ffc-vs-fast_float control row was flat across base/patch on these nodes.)
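For readers who want to reproduce numbers of this kind, a harness along the following lines would do. It is a sketch, not the benchmark used for the table: `throughput_mbs`, `median_of_5` and `delta_percent` are hypothetical helpers, 1 MB is assumed to mean 10^6 bytes, and core pinning is left to the caller (for example via the OS scheduler).

```cpp
#include <algorithm>
#include <array>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Median of five samples: sort a copy and take the middle element.
inline double median_of_5(std::array<double, 5> v) {
  std::sort(v.begin(), v.end());
  return v[2];
}

// Relative change in percent, as in the "(Δ%)" annotations.
inline double delta_percent(double base, double patched) {
  return (patched - base) / base * 100.0;
}

// Run `parse` over every line five times and report the median MB/s.
// Assumes each pass takes long enough for steady_clock to measure.
template <class Parse>
double throughput_mbs(const std::vector<std::string>& lines, Parse parse) {
  std::size_t bytes = 0;
  for (const auto& l : lines) bytes += l.size();

  std::array<double, 5> samples{};
  double sink = 0; // accumulate results so the loop is not optimized away
  for (double& s : samples) {
    auto t0 = std::chrono::steady_clock::now();
    for (const auto& l : lines) sink += parse(l);
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    s = double(bytes) / 1e6 / secs;
  }
  volatile double keep = sink;
  (void)keep;
  return median_of_5(samples);
}
```

`parse` is any callable taking a `const std::string&` and returning a `double`, for instance a lambda wrapping the parser under test. As a check on the table, `delta_percent(503, 1489)` is about 196, matching the ARM gcc `mesh` cell.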

Intel TMA (top-down), Ice Lake (Xeon 8360Y), short floats (`mesh`)

Isolated fast_float microbench under perf stat -M TopdownL1/L2:

	base	this PR
Backend-Bound	26.0%	2.2%
Retiring	60.3%	77.3%
pipeline slots (TOPDOWN.SLOTS)	37.2 B	23.7 B (−36%)
wall time (same work)	2.41 s	1.53 s (−37%)

The base spends 26% of pipeline slots backend-bound on the span spill; this PR collapses that to 2.2% and lifts retiring to 77%, with 36% fewer issued slots. That is the microarchitectural mechanism behind the throughput numbers above.

Correctness

Full float-exhaustive suite passes: exhaustive32, exhaustive32_64, exhaustive32_midpoint, random64 — all "all ok".
A 2³² single-precision sweep is byte-identical to the current tip.
Core + supplemental tests pass under -Werror -Wall -Wextra -Wconversion.

Equivalence reasoning: with store_spans=false and more than 19 digits, the mantissa is left un-truncated, but too_many_digits is set and the caller re-parses before reading it. The am.power2<0 re-parse re-runs Clinger, but Clinger is a pure function of (mantissa, exponent, negative, T), and store_spans does not affect any of those when !too_many_digits. So a Clinger that failed on the hot path fails again, and digit_comp reproduces the original result via the re-materialized spans. answer.ptr/ec are set identically on every path.
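The purity argument can be made concrete with a simplified Clinger fast path for double. This is a textbook-style sketch, not fast_float's implementation: `clinger_double` is a hypothetical name, the bounds (mantissa at most 2^53, exponent in [-22, 22]) are the classic ones, and it assumes round-to-nearest double arithmetic without extended intermediate precision.

```cpp
#include <cstdint>
#include <optional>

// Returns the exact result when the fast path applies, nullopt otherwise.
// The outcome depends only on the three arguments, never on parser state.
inline std::optional<double> clinger_double(std::uint64_t mantissa,
                                            std::int64_t exponent,
                                            bool negative) {
  // Powers of ten up to 1e22 are exactly representable as doubles.
  static constexpr double powers_of_ten[] = {
      1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
      1e8,  1e9,  1e10, 1e11, 1e12, 1e13, 1e14, 1e15,
      1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
  if (mantissa > (std::uint64_t(1) << 53)) return std::nullopt;
  if (exponent < -22 || exponent > 22) return std::nullopt;
  double v = double(mantissa); // exact, since mantissa <= 2^53
  // One correctly rounded multiply or divide of two exact operands.
  v = exponent < 0 ? v / powers_of_ten[-exponent] : v * powers_of_ten[exponent];
  return negative ? -v : v;
}
```

Because nothing here reads spans, calling it again after a re-parse with the same (mantissa, exponent, negative) gives the same answer, including the same failure.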

Notes

fastfloat_noinline is __attribute__((noinline, cold)) (GNU-style) / __declspec(noinline) (MSVC-style). The attribute is ignored during constant evaluation: the constexpr from_chars tests pass on gcc 13.3 and clang 18.1, and the MSVC/Alpine/MINGW CI here is green.
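A quick way to see that the attribute does not interfere with constant evaluation is to put it on a constexpr function and use that function in a static_assert. The macro name `SKETCH_NOINLINE` and the function `digit_value` are made up for this illustration; the actual definition in float_common.h may differ.

```cpp
// Hypothetical macro, modeled on the two spellings named in the note above.
#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline, cold))
#else
#define SKETCH_NOINLINE
#endif

// noinline/cold affect code generation for runtime calls only.
SKETCH_NOINLINE constexpr int digit_value(char c) {
  return (c >= '0' && c <= '9') ? c - '0' : -1;
}

// Evaluated by the compiler; the attribute plays no role here.
static_assert(digit_value('7') == 7, "usable in constant evaluation");
```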

parsed_number_string_t carries two span<UC const> members (integer, fraction) that are only read on the rare slow paths (digit_comp, and the >19-significant-digit truncation recompute). Materializing them on every parse forces the ~56/64-byte struct to be written out and marshaled through the by-value return, which shows up as backend/store pressure on the hot path.

This adds a runtime `store_spans` flag (default true, so all existing callers are unchanged) to parse_number_string; from_chars_float_advanced parses with it false, attempts the Clinger and Eisel-Lemire fast paths inline, and only re-parses with spans on the two rare slow branches. The re-parse is pushed into a single `fastfloat_noinline` (noinline+cold) helper so the force-inlined hot scanner is emitted once rather than duplicated into the caller (without this the extra inline copies regress some targets, e.g. ARM gcc, by bloating the hot frame and lengthening the loop-carried dependency chain).

A runtime flag is used deliberately rather than a template parameter: a template would create a second instantiation of the whole scanner whose icache cost wipes out the gain.

Measured (per-parser microbench, median of 5, pinned core), fast_float from_chars<double>/<float>, vs the current tip:

- Intel Ice Lake (Xeon 8360Y): +17-19% (gcc). Intel TMA shows backend-bound 26.0% -> 2.2% and retiring 60.3% -> 77.3% on short floats (the eliminated span spill), with -36% pipeline slots.
- Intel Cascade Lake (Xeon 6248): +18-22% (gcc), +13-23% (clang).
- ARM Neoverse-V2 (Graviton4): +73-196% (gcc), +8-11% (clang) -- the struct spill dominated the gcc hot loop there.

Correctness: the full float exhaustive suite (exhaustive32, exhaustive32_64, exhaustive32_midpoint, random64) passes, and a 2^32 sweep is byte-identical to the current tip.

Public from_chars / from_chars_advanced / parsed_number_string_t are unchanged.

fcostaoliveira added 2 commits June 3, 2026 09:30

clang-format (clang-format-17 comment reflow + signature wrap; no sem…

3067491

…antic change)

fcostaoliveira wants to merge 2 commits into fastfloat:main from redis-performance:pr/lazy-spans-coldpath
