Unroll the integer-part digit scan (straight-line for the common 1-5 digit case) #381

Merged: lemire merged 1 commit into fastfloat:main from redis-performance:pr/integer-scan-unroll on Jun 1, 2026

@fcostaoliveira
Contributor

The integer part of a number is scanned one byte at a time, while the fractional
part already uses the 8-digit SWAR loop (loop_parse_if_eight_digits). Integer parts
are usually short (1–5 digits), so the loop back-edge is a large share of the cost.
This peels the first five iterations into straight-line ifs and falls through to the
original loop for longer inputs. The arithmetic is unchanged (i = 10*i + digit), so
behavior is identical; one file, +29/−6, in the UC-templated path.
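
To make the transformation concrete, here is a minimal sketch of the byte-at-a-time loop next to a five-iteration peel. It is an illustration, not the patch itself: the names `is_digit`, `push_digit`, `scan_loop`, and `scan_peeled` are hypothetical, and the plain `char` interface is an assumption (the real code is templated on the character type).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for this sketch; they are not fast_float's API.
inline bool is_digit(char c) {
  return static_cast<unsigned char>(c - '0') <= 9;
}

inline void push_digit(std::uint64_t &i, char c) {
  i = 10 * i + static_cast<std::uint64_t>(c - '0');
}

// Baseline: one digit per iteration, one loop back-edge per digit.
inline const char *scan_loop(const char *p, const char *pend,
                             std::uint64_t &i) {
  while (p != pend && is_digit(*p)) {
    push_digit(i, *p);
    ++p;
  }
  return p;
}

// Peeled: the first five digits are handled by nested straight-line ifs;
// only runs longer than five digits reach the original loop.
inline const char *scan_peeled(const char *p, const char *pend,
                               std::uint64_t &i) {
  assert(p <= pend);
  if (p != pend && is_digit(*p)) {
    push_digit(i, *p); ++p;
    if (p != pend && is_digit(*p)) {
      push_digit(i, *p); ++p;
      if (p != pend && is_digit(*p)) {
        push_digit(i, *p); ++p;
        if (p != pend && is_digit(*p)) {
          push_digit(i, *p); ++p;
          if (p != pend && is_digit(*p)) {
            push_digit(i, *p); ++p;
            while (p != pend && is_digit(*p)) {
              push_digit(i, *p);
              ++p;
            }
          }
        }
      }
    }
  }
  return p;
}
```

Both functions apply the same `i = 10*i + digit` update in the same order and stop at the same position, so they can only differ in the control flow the compiler emits.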

Benchmark: m8g.metal-24xl (Graviton4), -O3 -march=native,
simple_fastfloat_benchmark, from_chars->double, base vs patch measured
back-to-back (mean of 2 runs):

| dataset | gcc 13 | clang 18 |
|---|---|---|
| canada.txt | +3.1% | +2.8% |
| mesh.txt | +5.4% | +5.1% |
| random [0,1] | ~0% | ~0% |

random is 0.xxx (a 1-digit integer part), so it is unaffected, as expected. No
regression on any input.

For completeness I also tried reusing loop_parse_if_eight_digits for the integer
part, and a counted for (k < 5) loop; both were slower here (the 8-digit SWAR setup
does not pay off for short integer parts, and clang optimized the counted loop less
well), so this keeps the explicit peel.
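
For reference, the counted-loop alternative mentioned above might look roughly like the sketch below. The exact form that was benchmarked is not shown in this PR, so the shape of the loop and the names `is_digit` and `scan_counted` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for this sketch; not fast_float's API.
inline bool is_digit(char c) {
  return static_cast<unsigned char>(c - '0') <= 9;
}

// Counted variant: a bounded for loop covers the first five digits and
// relies on the compiler to unroll it; longer runs continue in the while.
inline const char *scan_counted(const char *p, const char *pend,
                                std::uint64_t &i) {
  assert(p <= pend);
  for (int k = 0; k < 5; ++k) {
    if (p == pend || !is_digit(*p)) {
      return p; // short integer part: done before the general loop
    }
    i = 10 * i + static_cast<std::uint64_t>(*p - '0');
    ++p;
  }
  while (p != pend && is_digit(*p)) {
    i = 10 * i + static_cast<std::uint64_t>(*p - '0');
    ++p;
  }
  return p;
}
```

This form is shorter to read, but whether it turns into straight-line code is left to the optimizer, which is consistent with the compiler-dependent results reported here.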

Tests: FASTFLOAT_TEST 14/14 and FASTFLOAT_EXHAUSTIVE (exhaustive32 / 32_64 /
midpoint / long variants) all pass. Builds clean on gcc and clang at C++11 and C++20
under -Werror -Wall -Wextra -Weffc++ -Wconversion -Wsign-conversion -Wshadow,
clang-format clean. No new multi-byte reads, so big-endian (s390x) is unaffected.

…digit case)

parse_number_string scans the integer part one byte at a time in a while loop,
while the fraction already uses the 8-digit SWAR loop. Most integer parts are
1-5 digits, so the loop back-edge dominates. Peel the first five iterations into
nested ifs, falling through to the original while for longer runs. Semantics are
identical (i = 10*i + digit, advancing p); no behavior change.

AWS m8g.metal-24xl (Graviton4), -O3 -march=native, simple_fastfloat_benchmark,
from_chars->double. base vs patch measured back-to-back, mean of 2 runs:
  canada: gcc +3.1%, clang +2.8%
  mesh:   gcc +5.4%, clang +5.1%
  random: ~flat (1-digit integer part)
No regression; gcc and clang agree.

Alternatives benchmarked and rejected: reusing loop_parse_if_eight_digits for the
integer part regressed 5-8% (integer parts are too short for 8-digit SWAR setup);
a counted for(k<5) loop matched on gcc but clang optimized it worse (canada -0.9%).
The explicit peel is the only form solidly positive on both compilers.
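
To illustrate why the eight-digit SWAR path has a fixed setup cost, the sketch below follows the commonly used technique: load eight bytes, test that all are ASCII digits, then combine them with multiplies. Whether this matches the internals of loop_parse_if_eight_digits is an assumption; the names `load_le64`, `is_eight_digits`, `parse_eight_digits`, and `scan_swar` are hypothetical. The load is written with shifts so the sketch does not depend on host byte order.

```cpp
#include <cassert>
#include <cstdint>

// Assemble eight bytes with the first character in the lowest byte.
inline std::uint64_t load_le64(const char *p) {
  std::uint64_t v = 0;
  for (int k = 0; k < 8; ++k) {
    v |= static_cast<std::uint64_t>(static_cast<unsigned char>(p[k]))
         << (8 * k);
  }
  return v;
}

// True when every byte of v is in '0'..'9'.
inline bool is_eight_digits(std::uint64_t v) {
  return (((v + 0x4646464646464646ULL) | (v - 0x3030303030303030ULL)) &
          0x8080808080808080ULL) == 0;
}

// Convert eight ASCII digits (already validated) to their numeric value.
inline std::uint32_t parse_eight_digits(std::uint64_t v) {
  const std::uint64_t mask = 0x000000FF000000FFULL;
  const std::uint64_t mul1 = 0x000F424000000064ULL; // 100 + (1000000 << 32)
  const std::uint64_t mul2 = 0x0000271000000001ULL; // 1 + (10000 << 32)
  v -= 0x3030303030303030ULL;
  v = (v * 10) + (v >> 8); // adjacent digit pairs become two-digit values
  v = (((v & mask) * mul1) + (((v >> 16) & mask) * mul2)) >> 32;
  return static_cast<std::uint32_t>(v);
}

// Consume digits eight at a time; a caller would finish any remaining
// digits with a byte-at-a-time scan.
inline const char *scan_swar(const char *p, const char *pend,
                             std::uint64_t &i) {
  assert(p <= pend);
  while (pend - p >= 8 && is_eight_digits(load_le64(p))) {
    i = i * 100000000ULL + parse_eight_digits(load_le64(p));
    p += 8;
  }
  return p;
}
```

For an integer part of one to five digits the eight-byte check fails (or there are fewer than eight bytes left), so the load and test are pure overhead before the byte loop runs anyway, which fits the regression reported for this alternative.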
Member

@lemire left a comment

Will merge once tests pass.
