perf(specifier): byte-level dispatch in require_without_parse#246
perf(specifier): byte-level dispatch in require_without_parse#246stormslowly wants to merge 2 commits into
Conversation
Merging this PR will not alter performance
Comparing Footnotes
|
4f0e742 to
a7f2f10
Compare
4e33856 to
f5fd078
Compare
a7f2f10 to
27cdda3
Compare
f5fd078 to
4e33856
Compare
27cdda3 to
a7f2f10
Compare
|
@codspeedbot explain this regression |
Based on my thorough analysis of the flamegraph data from both the HEAD and BASE runs, here's my explanation of this regression. This is an analysis-only response — no code changes are needed. Summary: This is not a real code regression. It is a measurement artifact caused by the HEAD and BASE runs executing on different hardware. Environment MismatchThe CodSpeed report itself flagged this:
CodSpeed's simulation mode uses Valgrind/callgrind to model instruction and memory costs. The cache simulation model is sensitive to the underlying CPU's cache hierarchy (sizes, associativity, line sizes), so identical code can produce different simulated memory costs when run on different hardware. Flamegraph BreakdownComparing
The instruction count — the component that directly reflects your code change — is exactly the same in both runs. The entire ~29 ns regression comes from the simulated memory access cost, which shifted due to the different CPU's cache model. The overall benchmark delta (816.7 ns → 845.8 ns = +29.1 ns) lines up precisely with this memory simulation difference in ConclusionYour byte-level dispatch optimization is instruction-neutral on this specific benchmark case ( To get a clean apples-to-apples comparison, re-running CI so both HEAD and BASE land on the same hardware class would eliminate this noise. Summary
Was this response helpful? React with 👍 or 👎 to help us improve CodSpeedBot. |
`Path::new(specifier).components().next()` runs the full `Components` state machine (Windows-prefix detection, `Component` enum construction, UTF-8 boundary checks) on every resolve just to look at the first character. The separator and `.`/`..` markers are always single-byte ASCII, so the dispatch can be done by direct byte inspection. This is one slice of a larger ASCII fast-path effort. The standalone gain is small because the resolver only hits this dispatch once per call; the bigger wins from removing `Components` from `normalize`, `normalize_with`, and parent walks will land in follow-up PRs. Callgrind on `benches/resolver.rs::resolver[single-thread]`: Baseline: 92,893,808 Ir After: 92,695,778 Ir Delta: -0.21% Maintainability: - `SpecifierHead` + `classify_specifier_head` live in `src/specifier.rs` next to the other specifier code, not at the bottom of `lib.rs`. - A `proptest` property test asserts that, for any ASCII input on all platforms (and any UTF-8 input on Unix), `classify_specifier_head` agrees with `Path::new(s).components().next()`'s dispatch. Future changes to either side that diverge will fail this test.
CodSpeed CI (x86_64) reported small regressions on the resolver hot benches (+0.57% on single-thread, +0.40% on multi-threaded resolve) despite a local Ir improvement on arm64. The body itself does less work than the previous inline `Path::components()` match; the regression matched a per-resolve call cost, suggesting LTO didn't inline the cross-module call on x86_64. `#[inline]` restores the per-call inlining the old code had for free.
a7f2f10 to
4742d9d
Compare
|
It's not worth gaining little performance by making it a little unreadable |
On Unix, an OsStr is raw bytes and '/', '.', '..' are always single-byte ASCII, so the heavy std::path::Components state machine (Windows-prefix detection, Component enum construction, double-ended-iter bookkeeping) is unnecessary for the resolver's hot paths. Changes: - src/path.rs: unix_normalize and unix_normalize_with replace the Components-based PathUtil::normalize / normalize_with on Unix. - src/cache.rs: byte-level parent_path replaces Path::parent in Cache::value; join_last_segment replaces Path::strip_prefix + normalize_with in realpath_uncached, since parent.path is already a strict byte prefix of self.path. Note: an earlier draft of this PR also rewrote require_without_parse to dispatch on the specifier head via a byte table. That hunk was dropped (see #246) because the maintenance cost of keeping a parallel impl of Path::components in sync with std outweighed the small isolated win. The remaining changes here cover the dominant Ir cost paths. 138/138 non-PNP unit tests pass. The 6 PNP tests already fail on main without these changes.
Why
Path::new(specifier).components().next()runs the fullstd::path::Componentsstate machine — Windows-prefix detection,Componentenum construction, UTF-8 boundary checks — on every resolve just to look at the first character. The separator and./..markers are always single-byte ASCII, so the dispatch can be done by direct byte inspection.Base
This PR now targets
bench/stable-short-specifier(#247), which splits the short specifier benches into their own[[bench]]binary. That gives CodSpeed a stable baseline for the per-case deltas so any change here reflects the parser, not binary-layout luck.What
SpecifierHead+classify_specifier_headinsrc/specifier.rs, with#[inline]so LTO inlines intorequire_without_parse(previous attempt without#[inline]showed cross-module-call regression on x86_64).require_without_parsematches onSpecifierHeadinstead ofComponents::next.Maintainability
A
proptestproperty test (classify_matches_components_for_ascii,classify_matches_components_for_utf8) asserts that for any ASCII input on all platforms — and any UTF-8 input on Unix —classify_specifier_headagrees withPath::new(s).components().next()'s dispatch. Future changes that drift either side will fail the test.Test plan
cargo test --lib --features __internal_bench -- --skip 'tests::pnp::'— 142 / 142 passcargo clippy --all-features -- -D warnings— clean