Summary
While benchmarking on a Graviton4 (AWS m8g.metal-24xl, aarch64) I noticed that, for
short floats in GCC builds, a large share of from_chars→double time is
spent marshaling the parsed_number_string_t rather than parsing or doing the
Clinger/Eisel-Lemire math. Filing as a discussion (not a PR) since any fix touches the
core path and the public struct; your call whether it's worth pursuing.
Measurement
Isolated profile (a tiny driver that only calls fast_float::from_chars in a loop over
the mesh corpus, short floats like 12.345), g++ 13.3 -O3 -march=native,
perf source-line attribution:
- ascii_number.h: return answer; (end of parse_number_string, returning
parsed_number_string_t by value): ~44%
- parse_number.h: answer.ptr = pns.lastmatch; (start of from_chars_advanced,
consuming it): ~30%
- the Clinger range-check / multiply lines: <1.5% each
The hot instructions are stp/ldp stack pairs — GCC returns the ~56-72-byte struct
via the AArch64 sret path and stores/reloads it. On short inputs the parse work is tiny,
so this fixed cost dominates. clang is much less affected on the same box.
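To make the size argument concrete, here is a minimal sketch of a struct with the same kinds of fields. The names span_sketch and parsed_sketch are hypothetical, and the field list is an approximation of parsed_number_string_t, not a copy of the header. Under the AArch64 procedure call standard, a composite larger than 16 bytes is returned through caller-provided memory rather than in registers, which is consistent with the stp/ldp traffic above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the library's span type: pointer plus length.
struct span_sketch {
  const char* ptr;
  std::size_t length;
};

// Hypothetical stand-in approximating parsed_number_string_t.
// The real struct may have more fields; this is only for sizing.
struct parsed_sketch {
  std::int64_t exponent;
  std::uint64_t mantissa;
  const char* lastmatch;
  bool negative;
  bool valid;
  bool too_many_digits;
  span_sketch integer;   // only read by the slow path
  span_sketch fraction;  // only read by the slow path
};

// Larger than 16 bytes, so it cannot come back in a register pair.
static_assert(sizeof(parsed_sketch) > 16,
              "returned through memory, not registers");

// Even with both spans removed, the remaining fields exceed 16 bytes.
static_assert(sizeof(parsed_sketch) - 2 * sizeof(span_sketch) > 16,
              "dropping the spans alone does not reach the threshold");
```

Printing sizeof(parsed_sketch) on the target in question is a quick way to compare this approximation against the ~56-72-byte figure.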
Why it's not a trivial fix
- The integer/fraction spans (~32 bytes) are only used by the slow path
(digit_comp), but they're carried in the struct that's returned on every call.
- I tried the cheap thing: stop reading the spans back inside parse_number_string's
\>19-digit block (use locals) so GCC could DSE the span stores on the fast path. It
made no difference: the spans are still stored into the returned struct (the slow
path needs them), so GCC still spills it.
- Slimming the struct doesn't reach the 16-byte register-return threshold on AArch64,
and parsed_number_string_t + from_chars_advanced(pns&, value) are a documented
public hook, so changing the layout is an API/ABI concern.
- The only thing that fully removes it is an internal fused parse→compute fast path
(compute mantissa/exponent as locals, attempt Clinger before ever materializing the
struct; build the struct only when falling to the slow path) — while leaving the
public struct and overload untouched.
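The fused idea in the last bullet could look roughly like the sketch below. try_fused_fast_path is a hypothetical name, not an existing fast_float function, and the parser is deliberately simplified: it handles only [-]digits[.digits], and assumes the default round-to-nearest mode. Anything else (exponent notation, more than 19 digits, a significand above 2^53) returns false, which is where the existing struct-based path would take over.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of fast_float. Keeps mantissa and exponent
// in locals and attempts the Clinger fast path without building a struct.
// Returns false when the caller should fall back to the existing path.
inline bool try_fused_fast_path(const char* first, const char* last,
                                double& value, const char*& end) {
  // Powers of ten that are exactly representable as doubles.
  static const double pow10[] = {1e0,  1e1,  1e2,  1e3,  1e4,  1e5,
                                 1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
                                 1e12, 1e13, 1e14, 1e15, 1e16, 1e17,
                                 1e18, 1e19, 1e20, 1e21, 1e22};
  const char* p = first;
  bool negative = false;
  if (p != last && *p == '-') { negative = true; ++p; }

  std::uint64_t mantissa = 0;  // decimal significand
  int digits = 0;              // total digits consumed
  int frac_digits = 0;         // digits after the decimal point

  while (p != last && *p >= '0' && *p <= '9') {
    if (digits == 19) return false;  // might overflow 64 bits: fall back
    mantissa = mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
    ++digits; ++p;
  }
  if (p != last && *p == '.') {
    ++p;
    while (p != last && *p >= '0' && *p <= '9') {
      if (digits == 19) return false;
      mantissa = mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
      ++digits; ++frac_digits; ++p;
    }
  }
  if (digits == 0) return false;
  // Exponent notation is not handled in this sketch: fall back.
  if (p != last && (*p == 'e' || *p == 'E')) return false;

  // Clinger conditions: significand exact in a double, power of ten exact.
  if (mantissa > (std::uint64_t(1) << 53) || frac_digits > 22) return false;

  double v = static_cast<double>(mantissa) / pow10[frac_digits];
  value = negative ? -v : v;
  end = p;
  return true;
}
```

In a real change the false branch would call the current parse_number_string / from_chars_advanced pair, so the public struct and overload stay as they are; whether GCC actually keeps the locals in registers here would still need to be confirmed with perf.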
Question
Is this worth pursuing, and if so do you prefer (a) an internal fused fast path beside
the existing API, or (b) leaving it? Before any PR I'd want to confirm it's a win (or
neutral) across Intel/AMD/Apple × gcc/clang/MSVC, not just Graviton — happy to gather
that if you're interested. (Context: this came out of the work behind #381/#382/#383.)