GCC: parsed_number_string marshaling dominates short-float parsing on aarch64 #384

@fcostaoliveira

Summary

While benchmarking on a Graviton4 (AWS m8g.metal-24xl, aarch64) I noticed that on
short floats compiled with GCC, a large share of from_chars<double> time is
spent marshaling the parsed_number_string_t rather than parsing or doing the
Clinger/Eisel-Lemire math. Filing as a discussion (not a PR) since any fix touches the
core path and the public struct — your call whether it's worth pursuing.

Measurement

Isolated profile (a tiny driver that only calls fast_float::from_chars in a loop over
the mesh corpus — short floats like 12.345), g++ 13.3 -O3 -march=native,
perf source-line attribution:

  • ascii_number.h, line "return answer;" (end of parse_number_string, returning
    parsed_number_string_t by value): ~44%
  • parse_number.h, line "answer.ptr = pns.lastmatch;" (start of from_chars_advanced,
    consuming it): ~30%
  • the Clinger range-check / multiply lines: <1.5% each

The hot instructions are stp/ldp stack pairs — GCC returns the ~56-72-byte struct
via the AArch64 sret path and stores/reloads it. On short inputs the parse work is tiny,
so this fixed cost dominates. clang is much less affected on the same box.
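
For readers who want to see why the struct goes through memory, here is a minimal sketch. The types pns_like and span_like and the function make_pns are hypothetical stand-ins I wrote for illustration; they approximate the fields discussed here and are not the library's actual definitions. Under the AArch64 procedure call standard, a plain aggregate larger than 16 bytes is returned through a caller-provided buffer rather than in registers, which is the store/reload pattern described above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of a pointer+length span (assumption, not fast_float's type).
struct span_like {
  const char* ptr;
  std::size_t length;
};

// Hypothetical mirror of the parse result (assumption: field set is approximate).
struct pns_like {
  std::int64_t exponent;
  std::uint64_t mantissa;
  const char* lastmatch;
  bool negative;
  bool valid;
  bool too_many_digits;
  span_like integer;   // only needed by the slow path
  span_like fraction;  // only needed by the slow path
};

// Well above the 16-byte limit for returning an aggregate in registers.
static_assert(sizeof(pns_like) > 16, "returned via memory on AArch64");

// Returning by value: the caller reserves stack space for the result,
// the callee stores every field into it, and the caller loads them back.
pns_like make_pns(const char* first, const char* last) {
  pns_like answer{};
  answer.integer = {first, static_cast<std::size_t>(last - first)};
  answer.lastmatch = last;
  answer.valid = true;
  return answer;
}
```

Whether the stores and reloads survive optimization depends on inlining and on the compiler, which would be consistent with clang being less affected.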

Why it's not a trivial fix

  • The integer/fraction spans (~32 bytes) are only used by the slow path
    (digit_comp), but they're carried in the struct that's returned on every call.
  • I tried the cheap thing — stop reading the spans back inside parse_number_string's
    >19-digit block (use locals) so GCC could DSE the span stores on the fast path. It
    made no difference: the spans are still stored into the returned struct (the slow
    path needs them), so GCC still spills it.
  • Slimming the struct doesn't reach the 16-byte register-return threshold on AArch64,
    and parsed_number_string_t + from_chars_advanced(pns&, value) are a documented
    public hook, so changing the layout is an API/ABI concern.
  • The only thing that fully removes it is an internal fused parse→compute fast path
    (compute mantissa/exponent as locals, attempt Clinger before ever materializing the
    struct; build the struct only when falling to the slow path) — while leaving the
    public struct and overload untouched.
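
To make the fused idea in the last bullet concrete, here is a rough sketch under stated assumptions. The function fused_fast_path is hypothetical and is not an existing fast_float entry point. It handles only an optional minus sign, digits, and an optional fraction (no exponent part, no inf/nan), uses C++17 for brevity, and assumes round-to-nearest with plain double evaluation. Mantissa and exponent stay in locals, the Clinger fast path is attempted directly, and an empty result means "fall back to the existing path", which is where the struct would be built.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical fused parse+compute fast path (illustrative sketch only).
inline std::optional<double> fused_fast_path(std::string_view s) {
  // Powers of ten that are exactly representable as double.
  static constexpr double pow10[] = {
      1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
      1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
  const char* p = s.data();
  const char* end = p + s.size();
  bool negative = false;
  if (p != end && *p == '-') { negative = true; ++p; }

  std::uint64_t mantissa = 0;  // kept in a local, never stored to a struct
  std::int64_t exponent = 0;   // decimal exponent, also a local
  int digits = 0;
  for (; p != end && *p >= '0' && *p <= '9'; ++p, ++digits)
    mantissa = mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
  if (p != end && *p == '.') {
    ++p;
    for (; p != end && *p >= '0' && *p <= '9'; ++p, ++digits, --exponent)
      mantissa = mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
  }

  // Anything this sketch does not handle goes to the (not shown) slow path.
  if (digits == 0 || p != end) return std::nullopt;
  // More than 19 digits may have wrapped the 64-bit mantissa.
  if (digits > 19) return std::nullopt;
  // Clinger preconditions: exact mantissa and exact power of ten.
  if (mantissa > (std::uint64_t{1} << 53) || exponent < -22) return std::nullopt;

  double value = static_cast<double>(mantissa);
  // Only non-positive exponents occur here, since no exponent part is parsed.
  if (exponent < 0) value /= pow10[-exponent];
  return negative ? -value : value;
}
```

A caller would try fused_fast_path first and, on an empty result, run the existing parse_number_string plus from_chars_advanced sequence unchanged, so the public struct and overload are left as they are. Whether this helps outside GCC on aarch64 is exactly what the cross-platform measurements would need to show.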

Question

Is this worth pursuing, and if so do you prefer (a) an internal fused fast path beside
the existing API, or (b) leaving it? Before any PR I'd want to confirm it's a win (or
neutral) across Intel/AMD/Apple × gcc/clang/MSVC, not just Graviton — happy to gather
that if you're interested. (Context: this came out of the work behind #381/#382/#383.)
