Skip to content

perf(specifier): byte-level dispatch in require_without_parse#246

Closed
stormslowly wants to merge 2 commits into
bench/stable-short-specifierfrom
perf/specifier-head-byte-dispatch
Closed

perf(specifier): byte-level dispatch in require_without_parse#246
stormslowly wants to merge 2 commits into
bench/stable-short-specifierfrom
perf/specifier-head-byte-dispatch

Conversation

@stormslowly
Copy link
Copy Markdown
Collaborator

@stormslowly stormslowly commented May 27, 2026

Why

Path::new(specifier).components().next() runs the full std::path::Components state machine — Windows-prefix detection, Component enum construction, UTF-8 boundary checks — on every resolve just to look at the first character. The separator and ./.. markers are always single-byte ASCII, so the dispatch can be done by direct byte inspection.

Base

This PR now targets bench/stable-short-specifier (#247), which splits the short specifier benches into their own [[bench]] binary. That gives CodSpeed a stable baseline for the per-case deltas so any change here reflects the parser, not binary-layout luck.

What

  • SpecifierHead + classify_specifier_head in src/specifier.rs, with #[inline] so LTO inlines into require_without_parse (previous attempt without #[inline] showed cross-module-call regression on x86_64).
  • require_without_parse matches on SpecifierHead instead of Components::next.
  • Windows drive-letter / UNC specifiers still fall back to the std API.

Maintainability

A proptest property test (classify_matches_components_for_ascii, classify_matches_components_for_utf8) asserts that for any ASCII input on all platforms — and any UTF-8 input on Unix — classify_specifier_head agrees with Path::new(s).components().next()'s dispatch. Future changes that drift either side will fail the test.

Test plan

  • cargo test --lib --features __internal_bench -- --skip 'tests::pnp::' — 142 / 142 pass
  • cargo clippy --all-features -- -D warnings — clean
  • CodSpeed run on top of the stable-bench base — isolates opt3's real per-case impact

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 27, 2026

Merging this PR will not alter performance

✅ 90 untouched benchmarks
⏩ 84 skipped benchmarks1


Comparing perf/specifier-head-byte-dispatch (4742d9d) with bench/stable-short-specifier (68da90c)

Open in CodSpeed

Footnotes

  1. 84 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@stormslowly stormslowly force-pushed the perf/specifier-head-byte-dispatch branch from 4f0e742 to a7f2f10 Compare May 27, 2026 05:58
@stormslowly stormslowly changed the base branch from main to bench/stable-short-specifier May 27, 2026 05:58
@stormslowly stormslowly force-pushed the bench/stable-short-specifier branch from 4e33856 to f5fd078 Compare May 27, 2026 06:26
@stormslowly stormslowly force-pushed the perf/specifier-head-byte-dispatch branch from a7f2f10 to 27cdda3 Compare May 27, 2026 06:26
@stormslowly stormslowly force-pushed the bench/stable-short-specifier branch from f5fd078 to 4e33856 Compare May 27, 2026 06:42
@stormslowly stormslowly force-pushed the perf/specifier-head-byte-dispatch branch from 27cdda3 to a7f2f10 Compare May 27, 2026 06:42
@rstackjs rstackjs deleted a comment from codspeed-hq Bot May 27, 2026
@stormslowly
Copy link
Copy Markdown
Collaborator Author

@codspeedbot explain this regression

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 27, 2026

@codspeedbot explain this regression

Based on my thorough analysis of the flamegraph data from both the HEAD and BASE runs, here's my explanation of this regression. This is an analysis-only response — no code changes are needed.


Summary: This is not a real code regression. It is a measurement artifact caused by the HEAD and BASE runs executing on different hardware.

Environment Mismatch

The CodSpeed report itself flagged this:

BASE HEAD
CPU AMD EPYC 7763 64-Core Intel Xeon Platinum 8370C

CodSpeed's simulation mode uses Valgrind/callgrind to model instruction and memory costs. The cache simulation model is sensitive to the underlying CPU's cache hierarchy (sizes, associativity, line sizes), so identical code can produce different simulated memory costs when run on different hardware.

Flamegraph Breakdown

Comparing Specifier::parse self time between the two runs:

Component BASE HEAD Delta
Instructions 68.3 ns 68.3 ns ±0 (identical)
Caches 26.4 ns 27.8 ns +1.4 ns
Memory 527.8 ns 555.6 ns +27.8 ns
Self total 622.5 ns 651.7 ns +29.2 ns

The instruction count — the component that directly reflects your code change — is exactly the same in both runs. The entire ~29 ns regression comes from the simulated memory access cost, which shifted due to the different CPU's cache model.

The overall benchmark delta (816.7 ns → 845.8 ns = +29.1 ns) lines up precisely with this memory simulation difference in Specifier::parse.

Conclusion

Your byte-level dispatch optimization is instruction-neutral on this specific benchmark case (rw/hash-only), which makes sense — this case hits the "normal module" fast path where classify_specifier_head replaces a Components::next() call that was already cheap for this input. The -3.45% regression is entirely an artifact of cross-CPU cache simulation differences and can be safely acknowledged/dismissed.

To get a clean apples-to-apples comparison, re-running CI so both HEAD and BASE land on the same hardware class would eliminate this noise.

Summary
  • Warming up the performance wizard...
  • Analyzing regression — Flamegraph comparison complete — regression explained
  • Handling comment — Comment handling finished successfully

Was this response helpful? React with 👍 or 👎 to help us improve CodSpeedBot.

`Path::new(specifier).components().next()` runs the full `Components`
state machine (Windows-prefix detection, `Component` enum construction,
UTF-8 boundary checks) on every resolve just to look at the first
character. The separator and `.`/`..` markers are always single-byte
ASCII, so the dispatch can be done by direct byte inspection.

This is one slice of a larger ASCII fast-path effort. The standalone
gain is small because the resolver only hits this dispatch once per
call; the bigger wins from removing `Components` from `normalize`,
`normalize_with`, and parent walks will land in follow-up PRs.

Callgrind on `benches/resolver.rs::resolver[single-thread]`:
  Baseline:    92,893,808 Ir
  After:       92,695,778 Ir
  Delta:       -0.21%

Maintainability:
- `SpecifierHead` + `classify_specifier_head` live in `src/specifier.rs`
  next to the other specifier code, not at the bottom of `lib.rs`.
- A `proptest` property test asserts that, for any ASCII input on all
  platforms (and any UTF-8 input on Unix), `classify_specifier_head`
  agrees with `Path::new(s).components().next()`'s dispatch. Future
  changes to either side that diverge will fail this test.
CodSpeed CI (x86_64) reported small regressions on the resolver hot
benches (+0.57% on single-thread, +0.40% on multi-threaded resolve)
despite a local Ir improvement on arm64. The body itself does less work
than the previous inline `Path::components()` match; the regression
matched a per-resolve call cost, suggesting LTO didn't inline the
cross-module call on x86_64. `#[inline]` restores the per-call
inlining the old code had for free.
@stormslowly stormslowly force-pushed the perf/specifier-head-byte-dispatch branch from a7f2f10 to 4742d9d Compare May 27, 2026 09:15
@stormslowly
Copy link
Copy Markdown
Collaborator Author

It's not worth gaining little performance by making it a little unreadable

stormslowly added a commit that referenced this pull request May 28, 2026
On Unix, an OsStr is raw bytes and '/', '.', '..' are always single-byte
ASCII, so the heavy std::path::Components state machine (Windows-prefix
detection, Component enum construction, double-ended-iter bookkeeping) is
unnecessary for the resolver's hot paths.

Changes:
- src/path.rs: unix_normalize and unix_normalize_with replace the
  Components-based PathUtil::normalize / normalize_with on Unix.
- src/cache.rs: byte-level parent_path replaces Path::parent in
  Cache::value; join_last_segment replaces Path::strip_prefix +
  normalize_with in realpath_uncached, since parent.path is already a
  strict byte prefix of self.path.

Note: an earlier draft of this PR also rewrote require_without_parse to
dispatch on the specifier head via a byte table. That hunk was dropped
(see #246) because the maintenance cost of keeping a parallel impl of
Path::components in sync with std outweighed the small isolated win.
The remaining changes here cover the dominant Ir cost paths.

138/138 non-PNP unit tests pass. The 6 PNP tests already fail on main
without these changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant