GCC: parsed_number_string marshaling dominates short-float parsing on aarch64

### Summary

While benchmarking on a Graviton4 (AWS `m8g.metal-24xl`, aarch64) I noticed that,
**for short floats in GCC builds**, a large share of `from_chars`→`double` time is
spent marshaling `parsed_number_string_t` rather than parsing or doing the
Clinger/Eisel-Lemire math. Filing as a discussion (not a PR) since any fix touches the
core path and the public struct; your call whether it's worth pursuing.

### Measurement

Isolated profile (a tiny driver that only calls `fast_float::from_chars` in a loop over
the `mesh` corpus — short floats like `12.345`), `g++ 13.3 -O3 -march=native`,
`perf` source-line attribution:

- `ascii_number.h` `return answer;` (end of `parse_number_string`, returning
  `parsed_number_string_t` by value): **~44%**
- `parse_number.h` `answer.ptr = pns.lastmatch;` (start of `from_chars_advanced`,
  consuming it): **~30%**
- the Clinger range-check / multiply lines: **<1.5% each**
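
For anyone wanting to reproduce this, a driver of the kind described above could look roughly like the sketch below. `run_corpus` and its parameters are hypothetical names, not the actual benchmark code; the parser is passed in as a callable so the harness stays self-contained, and the intent is to plug in a thin wrapper around `fast_float::from_chars(first, last, value)`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical harness: calls parse(first, last, out) on every line, `iters`
// times, and returns a checksum so the compiler cannot discard the parse work.
template <class Parse>
double run_corpus(std::vector<std::string> const &lines, std::size_t iters,
                  Parse parse) {
  double sum = 0.0;
  for (std::size_t i = 0; i < iters; ++i) {
    for (std::string const &s : lines) {
      double v = 0.0;
      parse(s.data(), s.data() + s.size(), v);
      sum += v;
    }
  }
  return sum;
}
```

Keeping the loop body this small is deliberate: with nothing else in the profile, the per-call fixed costs show up directly in the source-line attribution.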

The hot instructions are `stp`/`ldp` stack pairs: the struct is roughly 56 to 72 bytes,
well over the 16-byte limit for returning an aggregate in registers, so GCC returns it
through memory (the AArch64 sret path) and then stores and reloads it. On short inputs
the parse work is tiny, so this fixed cost dominates. `clang` is much less affected on
the same box.
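
To make the size argument concrete, here is a minimal sketch with a hypothetical mirror of the struct. The field names follow the ones mentioned in this issue, but `parsed_number_mirror`, `slim_mirror`, and `span_mirror` are illustrative stand-ins and the real layout in fast_float may differ. It only checks sizes against the 16-byte register-return limit.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a pointer+length span.
struct span_mirror {
  char const *ptr;
  std::size_t length;
};

// Hypothetical mirror of the parse result (assumed layout, not the real one).
struct parsed_number_mirror {
  std::int64_t exponent;
  std::uint64_t mantissa;
  char const *lastmatch;
  bool negative;
  bool valid;
  bool too_many_digits;
  span_mirror integer;   // only needed by the slow path
  span_mirror fraction;  // only needed by the slow path
};

// The same struct with the two spans removed.
struct slim_mirror {
  std::int64_t exponent;
  std::uint64_t mantissa;
  char const *lastmatch;
  bool negative;
  bool valid;
  bool too_many_digits;
};

// Aggregates larger than 16 bytes are returned through memory on AArch64.
constexpr std::size_t kRegisterReturnLimit = 16;

constexpr bool returned_in_memory(std::size_t bytes) {
  return bytes > kRegisterReturnLimit;
}

static_assert(returned_in_memory(sizeof(parsed_number_mirror)),
              "full struct is returned through memory");
static_assert(returned_in_memory(sizeof(slim_mirror)),
              "dropping the spans alone is not enough");
```

Under these assumptions, even the slimmed variant stays above the limit, since mantissa, exponent, and `lastmatch` already exceed 16 bytes on a 64-bit target.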

### Why it's not a trivial fix

- The `integer`/`fraction` spans (~32 bytes) are only used by the slow path
  (`digit_comp`), but they're carried in the struct that's returned on every call.
- I tried the cheap thing — stop reading the spans back inside `parse_number_string`'s
  `>19-digit` block (use locals) so GCC could DSE the span stores on the fast path. It
  made **no difference**: the spans are still stored into the returned struct (the slow
  path needs them), so GCC still spills it.
- Slimming the struct doesn't reach the 16-byte register-return threshold on AArch64,
  and `parsed_number_string_t` + `from_chars_advanced(pns&, value)` are a documented
  public hook, so changing the layout is an API/ABI concern.
- The only thing that fully removes it is an internal fused parse→compute fast path
  (compute mantissa/exponent as locals, attempt Clinger before ever materializing the
  struct; build the struct only when falling to the slow path) — while leaving the
  public struct and overload untouched.
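
The fused shape from the last bullet can be sketched as follows. This is a toy under stated assumptions, not a proposed patch: `parse_fused`, `wide_parse_result`, and `slow_path` are hypothetical names, the parser handles only `[-]digits[.digits]` (no exponent suffix, no inf/nan), and `std::strtod` stands in for the real Eisel-Lemire / `digit_comp` fallback.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical wide result; built only when the fast path does not apply.
struct wide_parse_result {
  std::int64_t exponent;
  std::uint64_t mantissa;
  char const *first;
  char const *lastmatch;
  bool negative;
  char const *int_begin, *int_end;    // digit spans for the slow path
  char const *frac_begin, *frac_end;
};

// Stand-in slow path: strtod replaces the library's real fallback here.
inline double slow_path(wide_parse_result const &r) {
  std::string buf(r.first, r.lastmatch);
  return std::strtod(buf.c_str(), nullptr);
}

inline bool parse_fused(char const *first, char const *last, double &value) {
  assert(first <= last);
  static constexpr double pow10[] = {
      1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
      1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
  char const *p = first;
  bool negative = false;
  if (p != last && *p == '-') { negative = true; ++p; }

  // Mantissa and exponent live in locals, not in a returned struct.
  std::uint64_t mantissa = 0;
  std::int64_t exponent = 0;
  int digits = 0;
  char const *int_begin = p;
  for (; p != last && *p >= '0' && *p <= '9'; ++p, ++digits)
    mantissa = mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
  char const *int_end = p;
  char const *frac_begin = p, *frac_end = p;
  if (p != last && *p == '.') {
    frac_begin = ++p;
    for (; p != last && *p >= '0' && *p <= '9'; ++p, ++digits, --exponent)
      mantissa = mantissa * 10 + static_cast<std::uint64_t>(*p - '0');
    frac_end = p;
  }
  if (digits == 0) return false;

  // Clinger fast path: mantissa and the power of ten are both exact doubles,
  // so a single division is correctly rounded. exponent is <= 0 in this toy.
  if (digits <= 19 && mantissa <= (std::uint64_t{1} << 53) && exponent >= -22) {
    double d = static_cast<double>(mantissa) / pow10[-exponent];
    value = negative ? -d : d;
    return true;
  }

  // Only now materialize the wide struct for the slow path.
  wide_parse_result r{exponent, mantissa, first, p, negative,
                      int_begin, int_end, frac_begin, frac_end};
  value = slow_path(r);
  return true;
}
```

The point of the shape is that the common short-input case returns before any wide aggregate exists, so there is nothing for the compiler to spill; whether that holds in the real code path is exactly what the cross-platform measurements would need to confirm.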

### Question

Is this worth pursuing, and if so do you prefer (a) an internal fused fast path beside
the existing API, or (b) leaving it? Before any PR I'd want to confirm it's a win (or
neutral) across Intel/AMD/Apple × gcc/clang/MSVC, not just Graviton — happy to gather
that if you're interested. (Context: this came out of the work behind #381/#382/#383.)

