Skip materializing parsed_number_string_t spans on the hot path (addresses #384) #386
Open
fcostaoliveira wants to merge 2 commits into
parsed_number_string_t carries two span<UC const> members (integer, fraction)
that are only read on the rare slow paths (digit_comp, and the >19-significant-
digit truncation recompute). Materializing them on every parse forces the ~56/64-
byte struct to be written out and marshaled through the by-value return, which
shows up as backend/store pressure on the hot path.
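To make the size argument concrete, here is a minimal sketch that compares a result struct with and without two span members. The types `Span`, `Lean` and `Fat` are hypothetical stand-ins with an illustrative layout, not fast_float's real definitions; the exact byte counts depend on the target ABI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a pointer+length span (not fast_float's span).
struct Span {
  const char *ptr;
  std::size_t len;
};

// Fields the fast paths need (illustrative layout only).
struct Lean {
  std::int64_t exponent;
  std::uint64_t mantissa;
  const char *lastmatch;
  bool negative;
  bool valid;
  bool too_many_digits;
};

// Same fields plus the two spans that only the slow paths read.
struct Fat {
  std::int64_t exponent;
  std::uint64_t mantissa;
  const char *lastmatch;
  bool negative;
  bool valid;
  bool too_many_digits;
  Span integer;
  Span fraction;
};

// Extra bytes every by-value return has to carry for the two spans.
constexpr std::size_t span_overhead_bytes() {
  return sizeof(Fat) - sizeof(Lean);
}
```

On common ABIs the difference is exactly two spans' worth of storage, which is the part of the struct the hot path never reads.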
This adds a runtime `store_spans` flag (default true, so all existing callers are
unchanged) to parse_number_string; from_chars_float_advanced parses with it false,
attempts the Clinger and Eisel-Lemire fast paths inline, and only re-parses with
spans on the two rare slow branches. The re-parse is pushed into a single
`fastfloat_noinline` (noinline+cold) helper so the force-inlined hot scanner is
emitted once rather than duplicated into the caller (without this the extra inline
copies regress some targets, e.g. ARM gcc, by bloating the hot frame and lengthening
the loop-carried dependency chain).
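For readers unfamiliar with the attribute spelling, a noinline+cold macro can be sketched as below. `SKETCH_NOINLINE` and `slow_digit_sum` are hypothetical names used for illustration; the PR's own macro is `fastfloat_noinline`, and its exact definition may differ. The `static_assert` shows that such an attribute does not prevent a `constexpr` function from being evaluated at compile time.

```cpp
#include <cassert>

// Hypothetical macro name; an assumption, not the PR's actual definition.
#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline, cold))
#else
#define SKETCH_NOINLINE
#endif

// A rarely taken helper kept out of line at run time.
SKETCH_NOINLINE constexpr int slow_digit_sum(int n) {
  return n == 0 ? 0 : (n % 10) + slow_digit_sum(n / 10);
}

// The attribute only affects code generation; constant evaluation still works.
static_assert(slow_digit_sum(1234) == 10, "usable in constant evaluation");
```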
A runtime flag is used deliberately rather than a template parameter: a template
would create a second instantiation of the whole scanner whose icache cost wipes
out the gain.
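The overall shape of the change can be sketched with a toy scanner. Everything here (`DigitSpan`, `Parsed`, `scan_digits`, `reparse_with_spans`, `parse_hot`) is a hypothetical simplification, not fast_float's code: the scanner takes a plain runtime flag, the hot caller passes `false`, and only the rare branch re-parses with spans through a separate helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins; not fast_float's real types.
struct DigitSpan {
  const char *ptr = nullptr;
  std::size_t len = 0;
};

struct Parsed {
  std::uint64_t mantissa = 0;
  bool too_many_digits = false;
  DigitSpan integer; // filled only when store_spans is true
};

// One scanner, one instantiation: the flag is an ordinary runtime argument.
inline Parsed scan_digits(const char *first, const char *last,
                          bool store_spans = true) {
  Parsed out;
  const char *p = first;
  while (p != last && *p >= '0' && *p <= '9') {
    // Unsigned arithmetic may wrap for long inputs; the flag below covers that.
    out.mantissa = out.mantissa * 10 + std::uint64_t(*p - '0');
    ++p;
  }
  const std::size_t digits = std::size_t(p - first);
  out.too_many_digits = digits > 19; // toy rule: counts leading zeros too
  if (store_spans) {
    out.integer.ptr = first;
    out.integer.len = digits;
  }
  return out;
}

// Rare path: re-parse with spans. The PR marks its helper noinline+cold;
// the attribute is omitted here to keep the sketch portable.
Parsed reparse_with_spans(const char *first, const char *last) {
  return scan_digits(first, last, /*store_spans=*/true);
}

// Hot path: skip the span stores, fall back only when needed.
Parsed parse_hot(const char *first, const char *last) {
  Parsed fast = scan_digits(first, last, /*store_spans=*/false);
  if (!fast.too_many_digits) {
    return fast;
  }
  return reparse_with_spans(first, last);
}
```

Short inputs never touch the span fields; inputs with more than 19 digits pay for a second scan, which is acceptable only because that branch is rare.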
Measured (per-parser microbench, median of 5, pinned core), fast_float from_chars
<double>/<float>, vs the current tip:
- Intel Ice Lake (Xeon 8360Y): +17-19% (gcc), Intel TMA shows backend-bound
26.0% -> 2.2% and retiring 60.3% -> 77.3% on short floats (the eliminated span
spill), with -36% pipeline slots.
- Intel Cascade Lake (Xeon 6248): +18-22% (gcc), +13-23% (clang).
- ARM Neoverse-V2 (Graviton4): +73-196% (gcc), +8-11% (clang) -- the struct spill
dominated the gcc hot loop there.
Correctness: the full float exhaustive suite (exhaustive32, exhaustive32_64,
exhaustive32_midpoint, random64) passes, and a 2^32 sweep is byte-identical to the
current tip. Public from_chars / from_chars_advanced / parsed_number_string_t are
unchanged.
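A bit-for-bit comparison of two parsers over 32-bit patterns can be sketched as follows. This is not the project's test harness: `ParseFn`, `parse_with_strtof` and `count_mismatches` are hypothetical names, the decimal strings are produced with `snprintf("%.9g")`, and a `stride` parameter is added so that a partial sweep finishes quickly (a stride of 1 would cover all 2^32 patterns).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Hypothetical parser signature used only for this sketch.
using ParseFn = float (*)(const char *);

inline float parse_with_strtof(const char *s) {
  return std::strtof(s, nullptr);
}

// Counts bit patterns on which the two parsers disagree bit-for-bit.
// stride must be at least 1.
std::uint64_t count_mismatches(ParseFn a, ParseFn b, std::uint32_t stride) {
  std::uint64_t mismatches = 0;
  for (std::uint64_t bits = 0; bits <= 0xFFFFFFFFull; bits += stride) {
    const std::uint32_t word = std::uint32_t(bits);
    float f;
    std::memcpy(&f, &word, sizeof f);
    if (f != f) {
      continue; // skip NaNs: their payload does not survive printing
    }
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.9g", double(f));
    const float ra = a(buf);
    const float rb = b(buf);
    std::uint32_t ba, bb;
    std::memcpy(&ba, &ra, sizeof ba);
    std::memcpy(&bb, &rb, sizeof bb);
    if (ba != bb) {
      ++mismatches;
    }
  }
  return mismatches;
}
```

Comparing raw bits rather than using `==` distinguishes `+0.0f` from `-0.0f`, which a value comparison would treat as equal.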
Summary
`parsed_number_string_t` carries two `span<UC const>` members (`integer`, `fraction`) that are read only on the rare slow paths: `digit_comp`, and the >19-significant-digit truncation recompute. But they are written on every parse, which forces the ~56/64-byte struct to be materialized and marshaled through the by-value return. On the hot path that surfaces as backend/store pressure. This addresses #384 ("the structure `parsed_number_string_t` is ... probably too fat for its own good").

What changed
- `parse_number_string` gains a runtime `bool store_spans = true` parameter (the default keeps every existing caller unchanged). When `false`, the `integer`/`fraction` span stores and the span-reading >19-digit recompute are skipped.
- `from_chars_float_advanced` parses with `store_spans = false`, attempts Clinger + Eisel-Lemire inline, and routes the two rare slow branches (`too_many_digits`, `am.power2 < 0`) to a single `fastfloat_noinline` (noinline+cold) helper that re-parses with spans and calls the unchanged `from_chars_advanced`.
- `fastfloat_noinline` macro in `float_common.h`.

Two deliberate choices:
- Runtime flag rather than a template parameter: a template would create a second instantiation of the whole scanner, whose icache cost wipes out the gain.
- `noinline cold` slow path: the rare re-parse must stay out of line, or the force-inlined hot scanner gets duplicated into the caller; that bloats the hot frame and lengthens the loop-carried dependency chain, which regresses some targets (notably ARM gcc) even though it removes the spill.

Public `from_chars` / `from_chars_advanced` / `parsed_number_string_t` are unchanged.

Performance
Per-parser microbench (`from_chars<double>` in a tight loop over each dataset, median of 5, pinned core). base → this PR, MB/s (Δ%), vs current tip:

`float` mirrors `double` (e.g. ARM gcc float: +72% / +71% / +172%). The win is largest where the base codegen spilled the struct most (ARM gcc); clang baselines that already partly avoided the spill gain less. (Drift-controlled: the unchanged `ffc`-vs-`fast_float` control row was flat across base/patch on these nodes.)

Intel TMA (top-down), Ice Lake (Xeon 8360Y), short floats (`mesh`)
Isolated fast_float microbench under `perf stat -M TopdownL1/L2`:

The base spends 26% of pipeline slots backend-bound on the span spill; this PR collapses that to 2.2% and lifts retiring to 77%, with 36% fewer issued slots. That is the microarchitectural mechanism behind the throughput numbers above.

Correctness
- `exhaustive32`, `exhaustive32_64`, `exhaustive32_midpoint`, `random64`: all "all ok".
- `-Werror -Wall -Wextra -Wconversion`.

Equivalence reasoning: when `store_spans=false` and there are >19 digits, the mantissa is left un-truncated but `too_many_digits` is set and the caller re-parses before reading it. The `am.power2<0` re-parse re-runs Clinger, but Clinger is a pure function of `(mantissa, exponent, negative, T)`, which `store_spans` does not affect for `!too_many_digits`, so a Clinger that failed on the hot path fails again, and `digit_comp` reproduces the original result via the re-materialized spans. `answer.ptr`/`ec` are set identically on every path.

Notes
- `__attribute__((noinline, cold))` / `__declspec(noinline)`; the attribute is ignored during constant evaluation, and the constexpr `from_chars` tests pass on gcc 13.3 and clang 18.1 (and the MSVC/Alpine/MINGW CI here is green).
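
The equivalence argument under Correctness leans on the Clinger fast path being a pure function of its inputs. A minimal sketch of that fast path for `double` is shown below, assuming round-to-nearest mode and no extended-precision intermediate results; `clinger_fast_path` and `kPow10` are hypothetical names, and the real fast_float implementation handles more cases than this.

```cpp
#include <cassert>
#include <cstdint>

// Powers of ten that a double represents exactly: 10^0 through 10^22.
static const double kPow10[] = {
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};

// Returns true and writes `out` when the fast path applies; returns false
// otherwise. The result depends only on the arguments, so a call that fails
// once fails again for the same (mantissa, exp10, negative).
bool clinger_fast_path(std::uint64_t mantissa, std::int64_t exp10,
                       bool negative, double &out) {
  if (mantissa > (std::uint64_t(1) << 53)) {
    return false; // mantissa not exactly representable as a double
  }
  if (exp10 < -22 || exp10 > 22) {
    return false; // power of ten not exactly representable
  }
  double value = double(mantissa); // exact conversion
  if (exp10 < 0) {
    value /= kPow10[-exp10]; // one correctly rounded division
  } else {
    value *= kPow10[exp10]; // one correctly rounded multiplication
  }
  out = negative ? -value : value;
  return true;
}
```

Because both operands are exact, the single multiplication or division rounds once, which is why the result is correct whenever the function returns true.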