perf(resolver): micro-optimize single-thread hot path #233
Merging this PR will improve performance by 32.74%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Memory | resolver[multi-thread] | 11.2 MB | 8.9 MB | +26.13% |
| ⚡ | Simulation | resolver[[single-threaded]resolve with many extensions] | 131.5 ms | 96.9 ms | +35.72% |
| ⚡ | Simulation | resolver[multi-thread] | 59.5 ms | 42.8 ms | +39.01% |
| ⚡ | Simulation | resolver[pnp resolve] | 265.1 µs | 246.3 µs | +7.62% |
| ⚡ | Simulation | resolver[single-thread] | 52.2 ms | 37.7 ms | +38.55% |
| ⚡ | Simulation | resolver[resolve from symlinks] | 160.4 ms | 104 ms | +54.2% |
Comparing perf/micro-opt-resolver (0c51d29) with main (c8af902)
…e-thread
Combines three small wins on the single-thread bench:
1. Byte-level specifier dispatch in require_without_parse: avoids the std Path::Components walk just to pick the require_* branch on every resolve. Behavior is preserved for unix; windows keeps the std parser only for drive-prefix detection.
2. Raw byte Path eq for the cache DashSet key on unix: mirrors the existing raw-byte hash (#226) and skips std Components iteration on every cache lookup.
3. Skip format!(".{subpath}") at four package_exports/resolve sites when subpath is empty (the common bare-specifier case like '@scope/pkg'). Removes one String alloc per resolve in the common path.
Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):
resolver/single-thread:
accesses: 143,727,872 -> 142,544,544 (-0.82%)
estimated_cycles: 176,488,172 -> 174,865,264 (-0.92%)
resolver/[single-threaded]resolve with many extensions:
accesses: 325,700,719 -> 323,358,526 (-0.72%)
estimated_cycles: 380,377,449 -> 377,302,346 (-0.81%)
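The first and third wins in this commit can be illustrated with a minimal sketch. The variant names of `SpecifierKind`, the `classify` function, and the `exports_key` helper are hypothetical; the PR's real enum and call sites may differ, and Windows drive-prefix handling is deliberately left out.

```rust
use std::borrow::Cow;

/// Hypothetical classification of a module specifier (variant names are assumptions).
#[derive(Debug, PartialEq, Eq)]
enum SpecifierKind {
    Absolute,
    Relative,
    Hash,
    Bare,
}

/// Classify by inspecting at most the first three bytes,
/// without building a std::path::Components iterator.
fn classify(specifier: &str) -> SpecifierKind {
    match specifier.as_bytes() {
        [b'/', ..] => SpecifierKind::Absolute,
        [b'#', ..] => SpecifierKind::Hash,
        [b'.'] | [b'.', b'/', ..] | [b'.', b'.'] | [b'.', b'.', b'/', ..] => {
            SpecifierKind::Relative
        }
        _ => SpecifierKind::Bare,
    }
}

/// Build the "." + subpath key, allocating only when subpath is non-empty.
fn exports_key(subpath: &str) -> Cow<'_, str> {
    if subpath.is_empty() {
        // Common bare-specifier case: no String allocation.
        Cow::Borrowed(".")
    } else {
        Cow::Owned(format!(".{subpath}"))
    }
}
```

Returning a `Cow` keeps a single return type for both branches while the empty-subpath branch stays allocation-free.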
normalize_with walks the subpath components and pushes each one onto a PathBuf seeded from self.to_path_buf(). The seeded PathBuf has capacity == self.len() so every pushed component (separator + bytes) forced at least one Vec regrow + memcpy of the existing path.
Switch to PathBuf::with_capacity(self.len() + subpath.len() + 1) and push self once up front. The worst-case capacity covers self, the separator, and the full subpath, so the loop body's pushes never grow.
Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):
resolver/single-thread:
accesses: 142,544,544 -> 140,995,923 (-1.09% step; -1.90% vs baseline)
estimated_cycles: 174,865,264 -> 173,031,518 (-1.05% step; -1.96% vs baseline)
resolver/[single-threaded]resolve with many extensions:
accesses: 323,358,526 -> 323,689,695 (+0.10% step; -0.62% vs baseline)
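A minimal sketch of the preallocation idea described in this commit. The function name `normalize_with_prealloc` and its simplified component handling are assumptions; the crate's actual normalize_with may treat roots, prefixes, and `..` differently.

```rust
use std::path::{Component, Path, PathBuf};

/// Simplified stand-in for normalize_with (hypothetical name and semantics).
/// The buffer is sized once for base + separator + subpath, so the pushes
/// in the loop do not have to grow it.
fn normalize_with_prealloc(base: &Path, subpath: &Path) -> PathBuf {
    let capacity = base.as_os_str().len() + subpath.as_os_str().len() + 1;
    let mut out = PathBuf::with_capacity(capacity);
    out.push(base);
    for component in subpath.components() {
        match component {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            Component::Normal(part) => out.push(part),
            // Absolute subpaths are out of scope for this sketch and are ignored.
            Component::RootDir | Component::Prefix(_) => {}
        }
    }
    out
}
```

The capacity is a worst-case bound: `.` and `..` components only shorten the result, so the final path never exceeds it.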
cache.value(self.path.join("node_modules")) and the package.json
lookup both rely on std::Path::join, which does self.to_path_buf()
(exact-size alloc) followed by .push(sub) — guaranteed to trigger a
Vec regrow + memcpy of the just-allocated bytes on every call.
Introduce path_join_preallocated, which calls PathBuf::with_capacity(base.len +
sub.len + 1) before pushing, so the push never regrows. Use it at the
two hottest join sites (cached_node_modules' walk and package_json's
get_or_try_init).
Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):
resolver/single-thread:
accesses: 140,995,923 -> 140,702,086 (-0.21% step; -2.11% vs baseline)
estimated_cycles: 173,031,518 -> 172,695,646 (-0.19% step; -2.15% vs baseline)
resolver/[single-threaded]resolve with many extensions:
accesses: 323,689,695 -> 322,333,719 (-0.42% step; -1.03% vs baseline)
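
The helper described in this commit message could look roughly like the sketch below. The signature is an assumption (the real path_join_preallocated may take different argument types); only the sizing idea comes from the commit text.

```rust
use std::path::{Path, PathBuf};

/// Join `base` and `sub` into a buffer sized up front (assumed signature).
/// Path::join would first clone `base` at its exact size and then push.
fn path_join_preallocated(base: &Path, sub: &str) -> PathBuf {
    // base bytes + one separator + sub bytes
    let mut out = PathBuf::with_capacity(base.as_os_str().len() + sub.len() + 1);
    out.push(base);
    out.push(sub);
    out
}
```

For a relative `sub` such as "node_modules" or "package.json", the result is the same path that `base.join(sub)` produces.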
Cache::value's recursion calls Path::parent for every cache miss to chain up to the root, and std::Path::parent builds a Components iterator just to walk one step back. The bench shows parse_next_component_back weighs ~2M Ir on resolver/single-thread alone.
Add path_parent_unix that scans the raw bytes once for the last non-separator and the previous separator, matching std's exact semantics (verified with a new test against std::Path::parent across absolute, relative, trailing-slash, repeated-slash, and root cases). Cache::value uses it on cfg(unix), keeping the std path for windows.
Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):
resolver/single-thread:
accesses: 140,702,086 -> 139,041,327 (-1.18% step; -3.26% vs baseline)
estimated_cycles: 172,695,646 -> 171,064,577 (-0.94% step; -3.07% vs baseline)
resolver/[single-threaded]resolve with many extensions:
accesses: 322,333,719 -> 316,116,887 (-1.93% step; -2.94% vs baseline)
…o symlinks
In CachedPathImpl::realpath, when the parent's canonical path matches the parent's stored path byte-for-byte, no symlinks were found anywhere up the chain. Cache None in that case instead of building Some(normalize_with(...)). The outer wrapper already falls back to self.path on None, so behavior is identical for the common (no-symlinks) input shape while skipping one PathBuf allocation per cached path on first realpath.
Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):
resolver/single-thread:
accesses: 139,041,327 -> 138,958,385 (-0.06% step; -3.32% vs baseline)
estimated_cycles: 171,064,577 -> 170,887,380 (-0.10% step; -3.17% vs baseline)
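The fast path described in this commit can be sketched in isolation. Both function names and the flat argument list are hypothetical; the real CachedPathImpl::realpath works on cached structs and also has to deal with the leaf itself being a symlink, which this sketch ignores.

```rust
use std::path::{Path, PathBuf};

/// Decide what to cache as the canonical path of `parent/file_name`.
/// Returns None when the parent chain contained no symlinks, so no
/// PathBuf has to be allocated for the common case.
fn cached_canonical(
    parent_stored: &Path,
    parent_canonical: &Path,
    file_name: &str,
) -> Option<PathBuf> {
    if parent_canonical.as_os_str() == parent_stored.as_os_str() {
        // Byte-identical parent: the stored path is already canonical.
        return None;
    }
    Some(parent_canonical.join(file_name))
}

/// Outer wrapper: fall back to the stored path when nothing was cached.
fn resolve_realpath<'a>(own_path: &'a Path, cached: Option<&'a Path>) -> &'a Path {
    cached.unwrap_or(own_path)
}
```

Because the wrapper already treats None as "use the stored path", callers see the same result either way.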
Pull request overview
This PR targets micro-optimizations in the resolver’s single-thread hot path, primarily by reducing allocations and avoiding repeated std::path::Components walks during dispatch, cache probing, and parent traversal.
Changes:
- Add byte-level specifier classification to reduce dispatch overhead in require_without_parse.
- Introduce preallocated path join/normalize helpers and use them in cache hot paths.
- Add a Unix byte-level Path::parent implementation and adjust cache recursion/equality to reduce Components iteration.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/path.rs | Adds path_join_preallocated and Unix path_parent_unix, plus preallocation in normalize_with and a std-equivalence test for path_parent_unix. |
| src/lib.rs | Adds SpecifierKind dispatch via byte checks and avoids repeated format! allocations for "." + subpath in package exports resolution. |
| src/cache.rs | Uses Unix byte-parent in Cache::value, swaps hot join call sites to path_join_preallocated, adds a no-symlink realpath fast-path, and uses raw-byte equality for cache lookup keys on Unix. |
| .gitignore | Ignores optimization-artifacts/. |
    let parent_unchanged = parent_path.as_path() == &*parent.path;
    if parent_unchanged {
        return Ok(None);
    }
    return Ok(Some(
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e6285b0c7
    let last_non_slash = bytes.iter().rposition(|&b| b != b'/')?;
    let trimmed = &bytes[..=last_non_slash];
    let parent_end = trimmed
        .iter()
        .rposition(|&b| b == b'/')
        .map_or(0, |slash_pos| {
            bytes[..slash_pos]
                .iter()
                .rposition(|&b| b != b'/')
                .map_or_else(|| usize::from(bytes.first() == Some(&b'/')), |p| p + 1)
        });
    Some(Path::new(OsStr::from_bytes(&bytes[..parent_end])))
Preserve Path::parent semantics for . components
path_parent_unix does not match std::path::Path::parent when the path contains . segments (for example, /a/. and a/.), because it only trims slashes and never normalizes . components. Since Cache::value now uses this helper to build parent links, non-normalized inputs can traverse the wrong ancestor chain and keep non-canonical /. segments in subsequent realpath results. Please handle . components the same way as std (or fall back to std for those cases) before using this as a drop-in replacement.
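The divergence described in this review comment can be reproduced with a small, unix-only sketch. `path_parent_bytes` copies the logic of the snippet under review; `path_parent_with_fallback` and `same_bytes` are hypothetical helpers, shown as one possible way to defer to std when a `.` segment is present, not as the fix the PR adopted.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Byte-level parent, mirroring the reviewed snippet (unix only).
fn path_parent_bytes(path: &Path) -> Option<&Path> {
    let bytes = path.as_os_str().as_bytes();
    let last_non_slash = bytes.iter().rposition(|&b| b != b'/')?;
    let trimmed = &bytes[..=last_non_slash];
    let parent_end = trimmed
        .iter()
        .rposition(|&b| b == b'/')
        .map_or(0, |slash_pos| {
            bytes[..slash_pos]
                .iter()
                .rposition(|&b| b != b'/')
                .map_or_else(|| usize::from(bytes.first() == Some(&b'/')), |p| p + 1)
        });
    Some(Path::new(OsStr::from_bytes(&bytes[..parent_end])))
}

/// Hypothetical guard: use std whenever any segment is exactly ".".
fn path_parent_with_fallback(path: &Path) -> Option<&Path> {
    let bytes = path.as_os_str().as_bytes();
    if bytes.split(|&b| b == b'/').any(|segment| segment == b".") {
        return path.parent();
    }
    path_parent_bytes(path)
}

/// Compare two results byte-for-byte (Path's own == compares components).
fn same_bytes(a: Option<&Path>, b: Option<&Path>) -> bool {
    a.map(|p| p.as_os_str()) == b.map(|p| p.as_os_str())
}
```

For `/a/.`, std skips the trailing `.` and returns `/`, while the byte-level version returns `/a`; the guard trades a segment scan for agreement with std on such inputs.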
FileSystemOs's async trait methods previously called tokio::fs::*, which
internally spawn_blocking + acquire a semaphore + park/unpark per syscall.
The bench shows this scheduling layer costs ~20M Ir per single-thread
iteration on tokio runtime internals alone — dwarfing the actual stat/read
work.
Switch to sync std::fs::{metadata, symlink_metadata, read, read_to_string}
inside the async fn body. The trait signature is unchanged, callers
still await normally, and canonicalize was already sync (dunce::canonicalize).
Tradeoff: each fs call now blocks the runtime thread for the duration of
the syscall (microseconds). Multi-thread tokio users will lose some
concurrency overlap relative to the spawn_blocking model, but the
wins on per-call overhead are large enough that swc/oxc/ripgrep all
make the same tradeoff. wasm target keeps its existing std::fs path.
Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):
resolver/single-thread:
accesses: 138,958,385 -> 100,380,124 (-27.76% step; -30.16% vs baseline)
estimated_cycles: 170,887,380 -> 124,181,874 (-27.34% step; -29.64% vs baseline)
resolver/[single-threaded]resolve with many extensions:
accesses: ~316M -> 216,209,862 (~-31% step; -33.63% vs baseline)
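
A minimal sketch of what "sync std::fs inside the async fn body" means in practice. `SyncFs`, `NoopWake`, and `poll_once` are hypothetical names used only for illustration; the real FileSystemOs implements a trait and runs under tokio.

```rust
use std::future::Future;
use std::io;
use std::path::Path;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for FileSystemOs.
struct SyncFs;

impl SyncFs {
    /// Async signature, synchronous body: there is no await point,
    /// so the future finishes on its first poll.
    async fn read_to_string(&self, path: &Path) -> io::Result<String> {
        std::fs::read_to_string(path)
    }
}

/// Waker that does nothing, enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Poll a future exactly once, without any runtime.
fn poll_once<F: Future>(fut: F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    fut.as_mut().poll(&mut cx)
}
```

Polling once and getting `Poll::Ready` shows both sides of the tradeoff: no scheduler round trip per call, but the polling thread is blocked for the duration of the syscall.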
Why
Hunting for a >=5% reduction on the resolver CodSpeed bench, focused on the single-thread case (resolve ~880 real npm specifiers off a cleared cache). Final result: -30.16% accesses, -29.64% estimated_cycles.
Measurement Setup
- macOS+linux/arm64 via Docker Desktop arm64 (the micro-opt skill wrapper).
- cargo codspeed run --bench resolver -m simulation "single-thread" (matches both single-thread and [single-threaded]resolve with many extensions).
- Metrics: accesses (Ir + Dr + Dw), estimated_cycles = accesses + 5*l1_misses + 100*ll_misses (CodSpeed's CPU-simulation formula).
Final Result vs Baseline
- resolver/single-thread accesses: 143,727,872 -> 100,380,124 (-30.16%)
- resolver/single-thread estimated_cycles: 176,488,172 -> 124,181,874 (-29.64%)
Per-Commit Progress
- Byte-level dispatch + raw-byte Path::eq on unix + skip format!(".{subpath}") alloc when empty
- Preallocate normalize_with output
- path_join_preallocated for hot node_modules/package.json joins
- path_parent_unix in Cache::value (+ equivalence test vs std)
- Skip normalize_with alloc in realpath when no symlinks in chain
- Sync std::fs in FileSystemOs
Cumulative on resolver/single-thread: -30.16% accesses, -29.64% estimated_cycles.
What the changes do
- Byte-level specifier dispatch in require_without_parse: avoids the std Path::Components walk just to pick the require_* branch on every resolve.
- Raw-byte Path::eq for the Cache::value DashSet lookup key on unix: mirrors the existing raw-byte hash (perf(cache): hash CachedPath by raw bytes on unix #226) and sidesteps std Components iteration on every cache lookup.
- Skip the format!(".{subpath}") allocation at four package_exports_resolve sites when subpath is empty (the common bare-specifier case like @scope/pkg).
- Preallocate the normalize_with output: PathBuf::with_capacity(self.len + sub.len + 1) then push(self) once, so the loop body's push(component) never has to regrow.
- path_join_preallocated helper for the two hottest Path::join sites (cached_node_modules and package_json lookup): same idea, pre-size so std's push never grows.
- path_parent_unix for Cache::value recursion on unix: std::path::Path::parent builds a Components iterator for one step back; the byte-level version mirrors std's semantics exactly (verified by a new equivalence test).
- Realpath fast path when the chain has no symlinks: cache None so the outer wrapper falls back to self.path directly, skipping a normalize_with allocation.
- Sync std::fs in FileSystemOs: replace tokio::fs::* with std::fs::* inside the async fn bodies. The bench shows tokio's spawn_blocking + semaphore + park/unpark adds ~20M Ir per single-thread iteration in pure scheduling overhead, dwarfing the actual syscall work. The trait signature is unchanged. Tradeoff: blocks the runtime thread for the syscall duration (microseconds). Other Rust resolvers (swc, oxc) make the same tradeoff.
What was deliberately skipped
- Arc<CachedPathImpl> (combining Box<Path> + the Arc into one alloc): significant complexity for marginal additional gain after the sync-fs win.
Notes
- … path_parent_unix).
- codspeed run -m simulation is blocked by setarch --personality; measurements use direct callgrind with codspeed's measure.rs flags. CI re-measurement under linux/amd64 should reproduce.
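
The metric definitions quoted in the measurement setup, and the headline percentages, can be checked with a few lines of arithmetic. The `Counters` struct and function names are hypothetical; the formula is the one stated in this PR description, not an independent statement of CodSpeed's internals.

```rust
/// Raw callgrind-style counters (hypothetical container type).
struct Counters {
    ir: u64,
    dr: u64,
    dw: u64,
    l1_misses: u64,
    ll_misses: u64,
}

/// accesses = Ir + Dr + Dw, as quoted in the PR description.
fn accesses(c: &Counters) -> u64 {
    c.ir + c.dr + c.dw
}

/// estimated_cycles = accesses + 5*l1_misses + 100*ll_misses.
fn estimated_cycles(c: &Counters) -> u64 {
    accesses(c) + 5 * c.l1_misses + 100 * c.ll_misses
}

/// Relative change from `base` to `head`, in percent (negative means fewer).
fn percent_change(base: u64, head: u64) -> f64 {
    (head as f64 - base as f64) / base as f64 * 100.0
}
```

Plugging in the resolver/single-thread totals from the commit messages, `percent_change(143_727_872, 100_380_124)` comes out at about -30.16 and `percent_change(176_488_172, 124_181_874)` at about -29.64, matching the cumulative figures reported above.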