From 7589a4fea569d5bebd2a474136e8ec9e00ce0335 Mon Sep 17 00:00:00 2001
From: fcostaoliveira
Date: Mon, 1 Jun 2026 10:55:04 +0100
Subject: [PATCH] Add a 4-digit SWAR follow-up to loop_parse_if_eight_digits (clang)

After the 8-digit SWAR block loop, consume a remaining 4-7 digit run in one
read4_to_u32 + parse_four_digits_unrolled step instead of byte-by-byte
(reusing the existing 4-digit helpers). The parsed result is identical; this
is purely a faster way to consume the same digits.

Gated to clang: on gcc the extra 4-digit check regresses inputs whose
remainder is < 4 digits (e.g. the 17-digit fraction of uniform [0,1] -> -3%
on 'random'), because the check becomes pure overhead there; clang does not
show that.

m8g.metal-24xl (Graviton4), -O3 -march=native, simple_fastfloat_benchmark,
from_chars->double, clang 18, base vs patch back-to-back (2 samples):
canada.txt +11.7%, mesh.txt +7.4%, random ~flat. No regression.
---
 include/fast_float/ascii_number.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/fast_float/ascii_number.h b/include/fast_float/ascii_number.h
index 12c2fddc..1c9c8575 100644
--- a/include/fast_float/ascii_number.h
+++ b/include/fast_float/ascii_number.h
@@ -266,6 +266,21 @@ loop_parse_if_eight_digits(char const *&p, char const *const pend,
             p)); // in rare cases, this will overflow, but that's ok
     p += 8;
   }
+  // Consume a remaining 4-7 digit run in a single SWAR step instead of
+  // byte-by-byte (reuses the existing 4-digit helpers). The parsed result is
+  // identical either way. Gated to clang: on gcc the extra 4-digit check
+  // regresses inputs whose remainder is shorter than 4 digits (it becomes pure
+  // overhead there); clang does not show that.
+#if defined(__clang__)
+  if ((pend - p) >= 4) {
+    uint32_t const val4 = read4_to_u32(p);
+    if (is_made_of_four_digits_fast(val4)) {
+      i = i * 10000 +
+          parse_four_digits_unrolled(val4); // may overflow, that's ok
+      p += 4;
+    }
+  }
+#endif
 }
 
 enum class parse_error {
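
Illustrative note (not part of the patch): the 4-digit SWAR step described
above can be sketched in isolation roughly as follows. This is a minimal,
standalone sketch, not the library's implementation. The names load4_le,
all_four_are_digits, four_digits_to_value and parse_digit_run are
hypothetical stand-ins chosen for this example, and the 4-byte load is
assembled byte by byte so the sketch does not depend on host endianness.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 4-byte load: packs p[0..3] so that p[0] lands
// in the lowest byte, regardless of host endianness.
inline uint32_t load4_le(char const *p) {
  return uint32_t(uint8_t(p[0])) | (uint32_t(uint8_t(p[1])) << 8) |
         (uint32_t(uint8_t(p[2])) << 16) | (uint32_t(uint8_t(p[3])) << 24);
}

// True only if every byte is in '0'..'9'. Adding 0x46 sets the high bit of a
// byte above '9'; subtracting 0x30 sets it (via borrow) for a byte below '0'.
inline bool all_four_are_digits(uint32_t v) {
  return (((v + 0x46464646u) | (v - 0x30303030u)) & 0x80808080u) == 0;
}

// Converts four packed ASCII digits to their value (0..9999).
// Precondition: all_four_are_digits(v) is true.
inline uint32_t four_digits_to_value(uint32_t v) {
  v -= 0x30303030u;                 // each byte is now 0..9
  v = v * 10 + (v >> 8);            // bytes 0 and 2 now hold 2-digit pairs
  return (((v & 0x00FF00FFu) * 0x00640001u) >> 16) & 0xFFFFu;
}

// Parses the leading digit run in [p, pend): one 4-digit SWAR step when
// possible, then a byte-by-byte tail. Advances p past the consumed digits.
inline uint64_t parse_digit_run(char const *&p, char const *pend) {
  uint64_t i = 0;
  if (pend - p >= 4) {
    uint32_t const v = load4_le(p);
    if (all_four_are_digits(v)) {
      i = four_digits_to_value(v);
      p += 4;
    }
  }
  while (p != pend && *p >= '0' && *p <= '9') {
    i = i * 10 + uint64_t(*p - '0');
    ++p;
  }
  return i;
}
```

In this sketch the SWAR step either consumes exactly four digits or nothing,
so the byte-by-byte tail sees the same remaining input it would have seen
without it; the step only changes how quickly the digits are consumed.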