perf(specifier): byte-level dispatch in require_without_parse by stormslowly · Pull Request #246 · rstackjs/rspack-resolver

stormslowly · 2026-05-27T01:45:42Z

Why

Path::new(specifier).components().next() runs the full std::path::Components state machine — Windows-prefix detection, Component enum construction, UTF-8 boundary checks — on every resolve just to look at the first character. The separator and ./.. markers are always single-byte ASCII, so the dispatch can be done by direct byte inspection.

Base

This PR now targets bench/stable-short-specifier (#247), which splits the short specifier benches into their own [[bench]] binary. That gives CodSpeed a stable baseline for the per-case deltas so any change here reflects the parser, not binary-layout luck.

What

SpecifierHead + classify_specifier_head in src/specifier.rs, with #[inline] so LTO inlines into require_without_parse (previous attempt without #[inline] showed cross-module-call regression on x86_64).
require_without_parse matches on SpecifierHead instead of Components::next.
Windows drive-letter / UNC specifiers still fall back to the std API.

Maintainability

A proptest property test (classify_matches_components_for_ascii, classify_matches_components_for_utf8) asserts that for any ASCII input on all platforms — and any UTF-8 input on Unix — classify_specifier_head agrees with Path::new(s).components().next()'s dispatch. Future changes that drift either side will fail the test.

Test plan

cargo test --lib --features __internal_bench -- --skip 'tests::pnp::' — 142 / 142 pass
cargo clippy --all-features -- -D warnings — clean
CodSpeed run on top of the stable-bench base — isolates opt3's real per-case impact

codspeed-hq · 2026-05-27T01:52:16Z

Merging this PR will not alter performance

✅ 90 untouched benchmarks
⏩ 84 skipped benchmarks¹

_{Comparing perf/specifier-head-byte-dispatch (4742d9d) with bench/stable-short-specifier (68da90c)}

84 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

stormslowly · 2026-05-27T08:29:50Z

@codspeedbot explain this regression

codspeed-hq · 2026-05-27T08:29:55Z

@codspeedbot explain this regression

Based on my thorough analysis of the flamegraph data from both the HEAD and BASE runs, here's my explanation of this regression. This is an analysis-only response — no code changes are needed.

Summary: This is not a real code regression. It is a measurement artifact caused by the HEAD and BASE runs executing on different hardware.

Environment Mismatch

The CodSpeed report itself flagged this:

	BASE	HEAD
CPU	AMD EPYC 7763 64-Core	Intel Xeon Platinum 8370C

CodSpeed's simulation mode uses Valgrind/callgrind to model instruction and memory costs. The cache simulation model is sensitive to the underlying CPU's cache hierarchy (sizes, associativity, line sizes), so identical code can produce different simulated memory costs when run on different hardware.

Flamegraph Breakdown

Comparing Specifier::parse self time between the two runs:

Component	BASE	HEAD	Delta
Instructions	68.3 ns	68.3 ns	±0 (identical)
Caches	26.4 ns	27.8 ns	+1.4 ns
Memory	527.8 ns	555.6 ns	+27.8 ns
Self total	622.5 ns	651.7 ns	+29.2 ns

The instruction count — the component that directly reflects your code change — is exactly the same in both runs. The entire ~29 ns regression comes from the simulated memory access cost, which shifted due to the different CPU's cache model.

The overall benchmark delta (816.7 ns → 845.8 ns = +29.1 ns) lines up precisely with this memory simulation difference in Specifier::parse.

Conclusion

Your byte-level dispatch optimization is instruction-neutral on this specific benchmark case (rw/hash-only), which makes sense — this case hits the "normal module" fast path where classify_specifier_head replaces a Components::next() call that was already cheap for this input. The -3.45% regression is entirely an artifact of cross-CPU cache simulation differences and can be safely acknowledged/dismissed.

To get a clean apples-to-apples comparison, re-running CI so both HEAD and BASE land on the same hardware class would eliminate this noise.

Summary

Warming up the performance wizard...
Analyzing regression — Flamegraph comparison complete — regression explained
Handling comment — Comment handling finished successfully

_{Was this response helpful? React with 👍 or 👎 to help us improve CodSpeedBot.}

`Path::new(specifier).components().next()` runs the full `Components` state machine (Windows-prefix detection, `Component` enum construction, UTF-8 boundary checks) on every resolve just to look at the first character. The separator and `.`/`..` markers are always single-byte ASCII, so the dispatch can be done by direct byte inspection. This is one slice of a larger ASCII fast-path effort. The standalone gain is small because the resolver only hits this dispatch once per call; the bigger wins from removing `Components` from `normalize`, `normalize_with`, and parent walks will land in follow-up PRs. Callgrind on `benches/resolver.rs::resolver[single-thread]`: Baseline: 92,893,808 Ir After: 92,695,778 Ir Delta: -0.21% Maintainability: - `SpecifierHead` + `classify_specifier_head` live in `src/specifier.rs` next to the other specifier code, not at the bottom of `lib.rs`. - A `proptest` property test asserts that, for any ASCII input on all platforms (and any UTF-8 input on Unix), `classify_specifier_head` agrees with `Path::new(s).components().next()`'s dispatch. Future changes to either side that diverge will fail this test.

CodSpeed CI (x86_64) reported small regressions on the resolver hot benches (+0.57% on single-thread, +0.40% on multi-threaded resolve) despite a local Ir improvement on arm64. The body itself does less work than the previous inline `Path::components()` match; the regression matched a per-resolve call cost, suggesting LTO didn't inline the cross-module call on x86_64. `#[inline]` restores the per-call inlining the old code had for free.

stormslowly · 2026-05-27T18:16:00Z

It's not worth gaining little performance by making it a little unreadable

On Unix, an OsStr is raw bytes and '/', '.', '..' are always single-byte ASCII, so the heavy std::path::Components state machine (Windows-prefix detection, Component enum construction, double-ended-iter bookkeeping) is unnecessary for the resolver's hot paths. Changes: - src/path.rs: unix_normalize and unix_normalize_with replace the Components-based PathUtil::normalize / normalize_with on Unix. - src/cache.rs: byte-level parent_path replaces Path::parent in Cache::value; join_last_segment replaces Path::strip_prefix + normalize_with in realpath_uncached, since parent.path is already a strict byte prefix of self.path. Note: an earlier draft of this PR also rewrote require_without_parse to dispatch on the specifier head via a byte table. That hunk was dropped (see #246) because the maintenance cost of keeping a parallel impl of Path::components in sync with std outweighed the small isolated win. The remaining changes here cover the dominant Ir cost paths. 138/138 non-PNP unit tests pass. The 6 PNP tests already fail on main without these changes.

stormslowly mentioned this pull request May 27, 2026

chore(bench): split specifier microbenches into a separate bench binary #247

Merged

3 tasks

stormslowly force-pushed the perf/specifier-head-byte-dispatch branch from 4f0e742 to a7f2f10 Compare May 27, 2026 05:58

stormslowly changed the base branch from main to bench/stable-short-specifier May 27, 2026 05:58

stormslowly force-pushed the bench/stable-short-specifier branch from 4e33856 to f5fd078 Compare May 27, 2026 06:26

stormslowly force-pushed the perf/specifier-head-byte-dispatch branch from a7f2f10 to 27cdda3 Compare May 27, 2026 06:26

stormslowly force-pushed the bench/stable-short-specifier branch from f5fd078 to 4e33856 Compare May 27, 2026 06:42

stormslowly force-pushed the perf/specifier-head-byte-dispatch branch from 27cdda3 to a7f2f10 Compare May 27, 2026 06:42

rstackjs deleted a comment from codspeed-hq Bot May 27, 2026

stormslowly added 2 commits May 27, 2026 17:14

stormslowly force-pushed the perf/specifier-head-byte-dispatch branch from a7f2f10 to 4742d9d Compare May 27, 2026 09:15

stormslowly closed this May 27, 2026

stormslowly mentioned this pull request May 28, 2026

perf: byte-level path handling on Unix to bypass Components iterator #245

Draft

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(specifier): byte-level dispatch in require_without_parse#246

perf(specifier): byte-level dispatch in require_without_parse#246
stormslowly wants to merge 2 commits into
bench/stable-short-specifierfrom
perf/specifier-head-byte-dispatch

stormslowly commented May 27, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented May 27, 2026 •

edited

Loading

Uh oh!

stormslowly commented May 27, 2026

Uh oh!

codspeed-hq Bot commented May 27, 2026 •

edited

Loading

Uh oh!

stormslowly commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stormslowly commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

Base

What

Maintainability

Test plan

Uh oh!

codspeed-hq Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

Uh oh!

stormslowly commented May 27, 2026

Uh oh!

codspeed-hq Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Environment Mismatch

Flamegraph Breakdown

Conclusion

Uh oh!

stormslowly commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

stormslowly commented May 27, 2026 •

edited

Loading

codspeed-hq Bot commented May 27, 2026 •

edited

Loading

codspeed-hq Bot commented May 27, 2026 •

edited

Loading