
perf(resolver): micro-optimize single-thread hot path #233

Open
stormslowly wants to merge 7 commits into main from perf/micro-opt-resolver
@stormslowly stormslowly commented May 21, 2026

Why

Hunting for a >=5% reduction on the resolver CodSpeed bench, focused on the single-thread case (resolve ~880 real npm specifiers off a cleared cache). Final result: -30.16% accesses, -29.64% estimated_cycles.

Measurement Setup

  • Mode: direct callgrind 3.22 (bundled docker image lacks the standalone CodSpeed runner; cargo-codspeed builds the bench, valgrind wraps it directly via codspeed-criterion-compat client requests).
  • Platform: macOS+linux/arm64 via Docker Desktop arm64 (the micro-opt skill wrapper).
  • Bench: cargo codspeed run --bench resolver -m simulation "single-thread" (matches both single-thread and [single-threaded]resolve with many extensions).
  • Primary metrics: accesses (Ir + Dr + Dw), estimated_cycles = accesses + 5*l1_misses + 100*ll_misses (CodSpeed's CPU-simulation formula).
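The two primary metrics can be written out as a small sketch. This only restates the formula as given in the list above; the struct and field names are illustrative assumptions and are not taken from CodSpeed or callgrind.

```rust
/// Cache-simulation counters for one benchmark run.
/// Field names are hypothetical; the source only names Ir, Dr, Dw,
/// l1_misses and ll_misses.
#[derive(Debug, Clone, Copy)]
pub struct CacheSim {
    pub ir: u64,        // instruction reads
    pub dr: u64,        // data reads
    pub dw: u64,        // data writes
    pub l1_misses: u64, // first-level cache misses
    pub ll_misses: u64, // last-level cache misses
}

impl CacheSim {
    /// accesses = Ir + Dr + Dw
    pub fn accesses(&self) -> u64 {
        self.ir + self.dr + self.dw
    }

    /// estimated_cycles = accesses + 5*l1_misses + 100*ll_misses,
    /// exactly as stated in the measurement setup.
    pub fn estimated_cycles(&self) -> u64 {
        self.accesses() + 5 * self.l1_misses + 100 * self.ll_misses
    }
}

/// Relative change in percent; negative means fewer accesses or cycles.
pub fn percent_delta(before: u64, after: u64) -> f64 {
    (after as f64 - before as f64) / before as f64 * 100.0
}
```

With the single-thread access counts reported in this PR, percent_delta(143_727_872, 100_380_124) comes out at roughly -30.16.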

Final Result vs Baseline

| Bench | Accesses Before | Accesses After | Δ | Est. Cycles Before | Est. Cycles After | Δ |
| --- | --- | --- | --- | --- | --- | --- |
| single-thread | 143,727,872 | 100,380,124 | -30.16% | 176,488,172 | 124,181,874 | -29.64% |
| [single-threaded]resolve with many extensions | 325,700,719 | 216,209,862 | -33.63% | 380,377,449 | ~256M | ~-32.7% |

Per-Commit Progress

| Commit | Bench | Mode | Acc Before | Acc After | Acc Δ (step) | Cyc Before | Cyc After | Cyc Δ (step) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f76bab1 | single-thread | callgrind @ macOS+linux/arm64 | 143,727,872 | 142,544,544 | -0.82% | 176,488,172 | 174,865,264 | -0.92% | byte specifier dispatch + raw-byte Path::eq on unix + skip format!(".{subpath}") alloc when empty |
| 6e9b24a | single-thread | callgrind @ macOS+linux/arm64 | 142,544,544 | 140,995,923 | -1.09% | 174,865,264 | 173,031,518 | -1.05% | preallocate normalize_with output |
| 601d27a | single-thread | callgrind @ macOS+linux/arm64 | 140,995,923 | 140,702,086 | -0.21% | 173,031,518 | 172,695,646 | -0.19% | path_join_preallocated for hot node_modules / package.json joins |
| e828403 | single-thread | callgrind @ macOS+linux/arm64 | 140,702,086 | 139,041,327 | -1.18% | 172,695,646 | 171,064,577 | -0.94% | byte-level path_parent_unix in Cache::value (+ equivalence test vs std) |
| 8e6285b | single-thread | callgrind @ macOS+linux/arm64 | 139,041,327 | 138,958,385 | -0.06% | 171,064,577 | 170,887,380 | -0.10% | skip normalize_with alloc in realpath when no symlinks in chain |
| 0c51d29 | single-thread | callgrind @ macOS+linux/arm64 | 138,958,385 | 100,380,124 | -27.76% | 170,887,380 | 124,181,874 | -27.34% | sync std::fs in FileSystemOs |

Cumulative on resolver/single-thread: -30.16% accesses, -29.64% estimated_cycles.

What the changes do

  1. Byte-level specifier dispatch in require_without_parse — avoids the std Path::Components walk just to pick the require_* branch on every resolve.
  2. Raw-byte Path::eq for the Cache::value DashSet lookup key on unix — mirrors the existing raw-byte hash (perf(cache): hash CachedPath by raw bytes on unix #226) and sidesteps std Components iteration on every cache lookup.
  3. Skip format!(".{subpath}") allocation at four package_exports_resolve sites when subpath is empty (the common bare-specifier case like @scope/pkg).
  4. Preallocate normalize_with output: PathBuf::with_capacity(self.len + sub.len + 1) then push(self) once, so the loop body's push(component) never has to regrow.
  5. path_join_preallocated helper for the two hottest Path::join sites (cached_node_modules and package_json lookup) — same idea: pre-size so std's push never grows.
  6. Byte-level path_parent_unix for Cache::value recursion on unix — std::path::Path::parent builds a Components iterator for one step back; the byte-level version mirrors std's semantics exactly (verified by a new equivalence test).
  7. No-symlink realpath fast-path — when the parent chain produces no canonical change, cache None so the outer wrapper falls back to self.path directly, skipping a normalize_with allocation.
  8. Sync std::fs in FileSystemOs — replace tokio::fs::* with std::fs::* inside the async fn bodies. The bench shows tokio's spawn_blocking + semaphore + park/unpark adds ~20M Ir per single-thread iteration in pure scheduling overhead, dwarfing the actual syscall work. The trait signature is unchanged. Tradeoff: blocks the runtime thread for the syscall duration (microseconds). Other Rust resolvers (swc, oxc) make the same tradeoff.
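Item 3 in the list above can be illustrated with a borrowed-or-owned return type. This is a minimal sketch of the idea only: the helper name exports_key is hypothetical, and the PR may well skip the allocation in a different way (for example with a plain if/else at each call site).

```rust
use std::borrow::Cow;

/// Builds the "." + subpath key used for an exports lookup.
/// Hypothetical helper, not the resolver's actual code.
///
/// For a bare specifier such as "@scope/pkg" the subpath is empty, so the
/// key is just "." and a static string can be borrowed without allocating.
pub fn exports_key(subpath: &str) -> Cow<'_, str> {
    if subpath.is_empty() {
        // Common case: no String allocation at all.
        Cow::Borrowed(".")
    } else {
        // Rare case: pay for the allocation only when there is a subpath.
        Cow::Owned(format!(".{subpath}"))
    }
}
```

Callers that only need a &str can dereference the result, so the empty-subpath case stays allocation-free end to end.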

What was deliberately skipped

Notes

  • The 6 pre-existing PnP test failures (fixture environment, not code) also reproduce on baseline; passing tests go from 128 to 129 (added equivalence test for path_parent_unix).
  • Local arm64 Docker codspeed run -m simulation is blocked by setarch --personality; measurements use direct callgrind with codspeed's measure.rs flags. CI re-measurement under linux/amd64 should reproduce.

codspeed-hq Bot commented May 21, 2026

Merging this PR will improve performance by 32.74%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 6 improved benchmarks
✅ 6 untouched benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Memory | resolver[multi-thread] | 11.2 MB | 8.9 MB | +26.13% |
| Simulation | resolver[[single-threaded]resolve with many extensions] | 131.5 ms | 96.9 ms | +35.72% |
| Simulation | resolver[multi-thread] | 59.5 ms | 42.8 ms | +39.01% |
| Simulation | resolver[pnp resolve] | 265.1 µs | 246.3 µs | +7.62% |
| Simulation | resolver[single-thread] | 52.2 ms | 37.7 ms | +38.55% |
| Simulation | resolver[resolve from symlinks] | 160.4 ms | 104 ms | +54.2% |


Comparing perf/micro-opt-resolver (0c51d29) with main (c8af902)


…e-thread

Combines three small wins on the single-thread bench:

1. Byte-level specifier dispatch in require_without_parse — avoids the
   std Path::Components walk just to pick the require_* branch on every
   resolve. Behavior is preserved for unix; windows keeps the std parser
   only for drive-prefix detection.

2. Raw byte Path eq for the cache DashSet key on unix — mirrors the
   existing raw-byte hash (#226) and skips std Components iteration on
   every cache lookup.

3. Skip format!(".{subpath}") at four package_exports/resolve sites
   when subpath is empty (the common bare-specifier case like
   '@scope/pkg'). Removes one String alloc per resolve in the common
   path.
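The first item can be sketched as a match on the leading bytes of the specifier. The file summary later in this conversation mentions a SpecifierKind type, but its variants are not shown anywhere in the PR text, so the variants and the classify function below are assumptions made for illustration. The sketch covers unix-style specifiers only and leaves out the windows drive-prefix handling that the commit message says stays on the std parser.

```rust
/// Hypothetical classification of a module specifier.
/// Variant names are illustrative and not taken from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpecifierKind {
    Empty,
    Absolute, // "/abs/path"
    Relative, // ".", "..", "./x", "../x"
    Hash,     // "#internal"
    Bare,     // "pkg", "@scope/pkg", ".hidden"
}

/// Picks the branch by inspecting at most three leading bytes,
/// without building a std::path::Components iterator.
pub fn classify(specifier: &str) -> SpecifierKind {
    match specifier.as_bytes() {
        [] => SpecifierKind::Empty,
        [b'/', ..] => SpecifierKind::Absolute,
        [b'.'] | [b'.', b'/', ..] | [b'.', b'.'] | [b'.', b'.', b'/', ..] => {
            SpecifierKind::Relative
        }
        [b'#', ..] => SpecifierKind::Hash,
        _ => SpecifierKind::Bare,
    }
}
```

Note that a name such as ".hidden" falls through to Bare here, because only ".", "..", "./" and "../" prefixes are treated as relative in this sketch.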

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

resolver/single-thread:
  accesses: 143,727,872 -> 142,544,544 (-0.82%)
  estimated_cycles: 176,488,172 -> 174,865,264 (-0.92%)

resolver/[single-threaded]resolve with many extensions:
  accesses: 325,700,719 -> 323,358,526 (-0.72%)
  estimated_cycles: 380,377,449 -> 377,302,346 (-0.81%)

normalize_with walks the subpath components and pushes each one onto a
PathBuf seeded from self.to_path_buf(). The seeded PathBuf has capacity
== self.len() so every pushed component (separator + bytes) forced at
least one Vec regrow + memcpy of the existing path.

Switch to PathBuf::with_capacity(self.len() + subpath.len() + 1) and
push self once up front. The worst-case capacity covers self, the
separator, and the full subpath, so the loop body's pushes never grow.
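A simplified sketch of the preallocation described above. The real normalize_with is not shown in this conversation, so the function below is a stand-in that assumes a relative subpath and handles "." and ".." in the most direct way; only the with_capacity sizing is taken from the commit message.

```rust
use std::path::{Component, Path, PathBuf};

/// Stand-in for normalize_with (assumed behaviour, relative subpaths only).
pub fn normalize_with_prealloc(base: &Path, subpath: &Path) -> PathBuf {
    // Worst case: all of base, one separator, and all of subpath.
    let cap = base.as_os_str().len() + subpath.as_os_str().len() + 1;
    let mut out = PathBuf::with_capacity(cap);
    // Push base once up front; this fits in the reserved buffer.
    out.push(base);
    for component in subpath.components() {
        match component {
            // "." adds nothing.
            Component::CurDir => {}
            // ".." drops the last pushed component.
            Component::ParentDir => {
                out.pop();
            }
            // A normal segment is appended; capacity was reserved above,
            // so this push does not need to regrow the buffer.
            Component::Normal(segment) => out.push(segment),
            // Absolute subpaths are outside this sketch's assumptions;
            // std's push replaces the whole path in that case.
            Component::RootDir | Component::Prefix(_) => out.push(component.as_os_str()),
        }
    }
    out
}
```

For comparison, seeding from base.to_path_buf() gives a buffer sized exactly to base, so the first pushed component already has to reallocate.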

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

resolver/single-thread:
  accesses: 142,544,544 -> 140,995,923 (-1.09% step; -1.90% vs baseline)
  estimated_cycles: 174,865,264 -> 173,031,518 (-1.05% step; -1.96% vs baseline)

resolver/[single-threaded]resolve with many extensions:
  accesses: 323,358,526 -> 323,689,695 (+0.10% step; -0.62% vs baseline)

cache.value(self.path.join("node_modules")) and the package.json
lookup both rely on std::path::Path::join, which does self.to_path_buf()
(exact-size alloc) followed by .push(sub), which is guaranteed to trigger a
Vec regrow + memcpy of the just-allocated bytes on every call.

Introduce path_join_preallocated, which calls PathBuf::with_capacity(base.len +
sub.len + 1) before pushing, so the push never regrows. Use it at the
two hottest join sites (cached_node_modules' walk and package_json's
get_or_try_init).
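The helper described above might look like the following. The name path_join_preallocated comes from the PR, but the signature is an assumption since the actual definition is not shown here.

```rust
use std::path::{Path, PathBuf};

/// Joins base and sub with a single allocation.
/// Signature assumed for illustration.
pub fn path_join_preallocated(base: &Path, sub: &str) -> PathBuf {
    // Reserve room for base, one separator, and sub up front.
    let mut out = PathBuf::with_capacity(base.as_os_str().len() + sub.len() + 1);
    out.push(base);
    // Appends the separator and sub into the already reserved space.
    out.push(sub);
    out
}
```

The result compares equal to base.join(sub); the only intended difference is that the buffer is sized once instead of being allocated at the length of base and then grown.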

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

resolver/single-thread:
  accesses: 140,995,923 -> 140,702,086 (-0.21% step; -2.11% vs baseline)
  estimated_cycles: 173,031,518 -> 172,695,646 (-0.19% step; -2.15% vs baseline)

resolver/[single-threaded]resolve with many extensions:
  accesses: 323,689,695 -> 322,333,719 (-0.42% step; -1.03% vs baseline)

Cache::value's recursion calls Path::parent on every cache miss to
chain up to the root, and std::path::Path::parent builds a Components
iterator just to walk one step back. The bench shows parse_next_component_back
accounts for ~2M Ir on resolver/single-thread alone.

Add path_parent_unix, which scans the raw bytes once for the last
non-separator and the previous separator, matching std's exact
semantics (verified with a new test against std::path::Path::parent across
absolute, relative, trailing-slash, repeated-slash, and root cases).
Cache::value uses it on cfg(unix), keeping the std path for windows.

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

resolver/single-thread:
  accesses: 140,702,086 -> 139,041,327 (-1.18% step; -3.26% vs baseline)
  estimated_cycles: 172,695,646 -> 171,064,577 (-0.94% step; -3.07% vs baseline)

resolver/[single-threaded]resolve with many extensions:
  accesses: 322,333,719 -> 316,116,887 (-1.93% step; -2.94% vs baseline)

…o symlinks

In CachedPathImpl::realpath, when the parent's canonical path matches
the parent's stored path byte-for-byte, no symlinks were found anywhere
up the chain. Cache None in that case instead of building Some(normalize_with(...)).
The outer wrapper already falls back to self.path on None, so behavior
is identical for the common (no-symlinks) input shape while skipping
one PathBuf allocation per cached path on first realpath.
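The fast path can be sketched with a cached Option that stays None when nothing changed. Everything below (the Entry struct, canonical_for_child, the byte comparison) is a hypothetical stand-in for CachedPathImpl, and it assumes the child itself has already been found not to be a symlink.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical cached entry: `canonical` is None when the canonical
/// form is identical to `path`.
pub struct Entry {
    pub path: PathBuf,
    pub canonical: Option<PathBuf>,
}

impl Entry {
    /// Outer wrapper: fall back to the stored path when nothing was cached.
    pub fn realpath(&self) -> &Path {
        self.canonical.as_deref().unwrap_or(self.path.as_path())
    }
}

/// Decides what to cache for a child entry, given the parent's stored
/// path and the parent's canonical path.
pub fn canonical_for_child(
    parent_stored: &Path,
    parent_canonical: &Path,
    child_name: &str,
) -> Option<PathBuf> {
    // Same bytes means no symlink was resolved anywhere up the chain,
    // so the child's own path is already canonical: cache None and
    // skip building a new PathBuf.
    if parent_canonical.as_os_str() == parent_stored.as_os_str() {
        return None;
    }
    // Otherwise rebuild the child under the resolved parent.
    Some(parent_canonical.join(child_name))
}
```

The allocation is only paid on the branch where a symlink actually moved the parent, which matches the trade-off described in the commit message.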

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

resolver/single-thread:
  accesses: 139,041,327 -> 138,958,385 (-0.06% step; -3.32% vs baseline)
  estimated_cycles: 171,064,577 -> 170,887,380 (-0.10% step; -3.17% vs baseline)

@stormslowly stormslowly marked this pull request as ready for review May 21, 2026 23:19
Copilot AI review requested due to automatic review settings May 21, 2026 23:19

Copilot AI left a comment


Pull request overview

This PR targets micro-optimizations in the resolver’s single-thread hot path, primarily by reducing allocations and avoiding repeated std::path::Components walks during dispatch, cache probing, and parent traversal.

Changes:

  • Add byte-level specifier classification to reduce dispatch overhead in require_without_parse.
  • Introduce preallocated path join/normalize helpers and use them in cache hot paths.
  • Add a Unix byte-level Path::parent implementation and adjust cache recursion/equality to reduce Components iteration.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/path.rs | Adds path_join_preallocated and Unix path_parent_unix, plus preallocation in normalize_with and a std-equivalence test for path_parent_unix. |
| src/lib.rs | Adds SpecifierKind dispatch via byte checks and avoids repeated format! allocations for "." + subpath in package exports resolution. |
| src/cache.rs | Uses Unix byte-parent in Cache::value, swaps hot join call sites to path_join_preallocated, adds a no-symlink realpath fast-path, and uses raw-byte equality for cache lookup keys on Unix. |
| .gitignore | Ignores optimization-artifacts/. |


Comment thread src/cache.rs
Comment on lines +269 to 273
let parent_unchanged = parent_path.as_path() == &*parent.path;
if parent_unchanged {
    return Ok(None);
}
return Ok(Some(

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e6285b0c7


Comment thread src/path.rs
Comment on lines +44 to +55
let last_non_slash = bytes.iter().rposition(|&b| b != b'/')?;
let trimmed = &bytes[..=last_non_slash];
let parent_end = trimmed
    .iter()
    .rposition(|&b| b == b'/')
    .map_or(0, |slash_pos| {
        bytes[..slash_pos]
            .iter()
            .rposition(|&b| b != b'/')
            .map_or_else(|| usize::from(bytes.first() == Some(&b'/')), |p| p + 1)
    });
Some(Path::new(OsStr::from_bytes(&bytes[..parent_end])))

P2: Preserve Path::parent semantics for . components

path_parent_unix does not match std::path::Path::parent when the path contains . segments (for example, /a/. and a/.), because it only trims slashes and never normalizes . components. Since Cache::value now uses this helper to build parent links, non-normalized inputs can traverse the wrong ancestor chain and keep non-canonical /. segments in subsequent realpath results. Please handle . components the same way as std (or fall back to std for those cases) before using this as a drop-in replacement.
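The mismatch can be reproduced with a small unix-only sketch. The body of parent_bytes is copied from the snippet under review; the function names, the signature, and the fallback guard are assumptions added for illustration and are not what the PR ships.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Byte-level parent. Body taken from the reviewed snippet; the name
/// and signature are assumed.
pub fn parent_bytes(path: &Path) -> Option<&Path> {
    let bytes = path.as_os_str().as_bytes();
    // Index of the last byte that is not a separator; None for "" and "/".
    let last_non_slash = bytes.iter().rposition(|&b| b != b'/')?;
    let trimmed = &bytes[..=last_non_slash];
    let parent_end = trimmed
        .iter()
        .rposition(|&b| b == b'/')
        .map_or(0, |slash_pos| {
            // Drop any run of separators before the last component.
            bytes[..slash_pos]
                .iter()
                .rposition(|&b| b != b'/')
                .map_or_else(|| usize::from(bytes.first() == Some(&b'/')), |p| p + 1)
        });
    Some(Path::new(OsStr::from_bytes(&bytes[..parent_end])))
}

/// Hypothetical guard: defer to std whenever a segment is exactly ".",
/// since std normalizes those away and the byte scan does not.
pub fn parent_with_fallback(path: &Path) -> Option<&Path> {
    let bytes = path.as_os_str().as_bytes();
    if bytes.split(|&b| b == b'/').any(|segment| segment == b".") {
        return path.parent();
    }
    parent_bytes(path)
}
```

For "/a/." the byte scan yields "/a" while std yields "/", which is the divergence described in the review. The guard adds one extra pass over the bytes, so whether it keeps the measured win would need to be re-benchmarked.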


FileSystemOs's async trait methods previously called tokio::fs::*, which
internally calls spawn_blocking, acquires a semaphore, and parks/unparks
per syscall. The bench shows this scheduling layer costs ~20M Ir per
single-thread iteration on tokio runtime internals alone, dwarfing the
actual stat/read work.

Switch to sync std::fs::{metadata, symlink_metadata, read, read_to_string}
inside the async fn body. The trait signature is unchanged, callers
still await normally, and canonicalize was already sync (dunce::canonicalize).

Tradeoff: each fs call now blocks the runtime thread for the duration of
the syscall (microseconds). Multi-thread tokio users will lose some
concurrency overlap relative to the spawn_blocking model, but the
wins on per-call overhead are large enough that swc/oxc/ripgrep all
make the same tradeoff. wasm target keeps its existing std::fs path.
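A std-only sketch of the shape of this change: an async fn whose body is a plain blocking std::fs call. The free function read_to_string and the poll_once helper are illustrative assumptions, not the FileSystemOs trait from the PR. The helper polls the future exactly once with a no-op waker to show that it completes immediately, which is the same property that makes it block the calling thread for the duration of the syscall.

```rust
use std::future::Future;
use std::io;
use std::path::Path;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Async signature, synchronous body: there is no await point inside,
/// so the first poll runs the syscall to completion on the caller's thread.
pub async fn read_to_string(path: &Path) -> io::Result<String> {
    std::fs::read_to_string(path)
}

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` exactly once. Returns Some(output) if it finished on that
/// first poll, None if it would need to be woken and polled again.
pub fn poll_once<F: Future>(fut: F) -> Option<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(output) => Some(output),
        Poll::Pending => None,
    }
}
```

A future built on tokio::fs would typically return Pending on the first poll while the blocking pool does the work; the version above never yields, which is where both the saved scheduling overhead and the lost overlap come from.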

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

resolver/single-thread:
  accesses: 138,958,385 -> 100,380,124 (-27.76% step; -30.16% vs baseline)
  estimated_cycles: 170,887,380 -> 124,181,874 (-27.34% step; -29.64% vs baseline)

resolver/[single-threaded]resolve with many extensions:
  accesses: ~316M -> 216,209,862 (~-31% step; -33.63% vs baseline)