perf(resolver): micro-optimize single-thread hot path by stormslowly · Pull Request #233 · rstackjs/rspack-resolver

stormslowly · 2026-05-21T18:50:31Z

Why

Hunting for a >=5% reduction on the resolver CodSpeed bench, focused on the single-thread case (resolve ~880 real npm specifiers off a cleared cache). Final result: -30.16% accesses, -29.64% estimated_cycles.

Measurement Setup

Mode: direct callgrind 3.22 (bundled docker image lacks the standalone CodSpeed runner; cargo-codspeed builds the bench, valgrind wraps it directly via codspeed-criterion-compat client requests).
Platform: macOS+linux/arm64 via Docker Desktop arm64 (the micro-opt skill wrapper).
Bench: cargo codspeed run --bench resolver -m simulation "single-thread" (matches both single-thread and [single-threaded]resolve with many extensions).
Primary metrics: accesses (Ir + Dr + Dw), estimated_cycles = accesses + 5*l1_misses + 100*ll_misses (CodSpeed's CPU-simulation formula).
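
As a rough illustration of the two primary metrics, the sketch below computes `accesses` and `estimated_cycles` from raw counters using the formula quoted above. The `Counters` struct, its field names, and `percent_change` are hypothetical helpers for this example, not part of CodSpeed or the resolver.

```rust
/// Raw event counts from one callgrind-style run.
/// Field names are illustrative, not CodSpeed's actual schema.
#[derive(Debug, Clone, Copy)]
pub struct Counters {
    pub ir: u64,        // instructions read
    pub dr: u64,        // data reads
    pub dw: u64,        // data writes
    pub l1_misses: u64, // first-level cache misses
    pub ll_misses: u64, // last-level cache misses
}

impl Counters {
    /// accesses = Ir + Dr + Dw
    pub fn accesses(&self) -> u64 {
        self.ir + self.dr + self.dw
    }

    /// estimated_cycles = accesses + 5 * l1_misses + 100 * ll_misses
    pub fn estimated_cycles(&self) -> u64 {
        self.accesses() + 5 * self.l1_misses + 100 * self.ll_misses
    }
}

/// Relative change in percent; negative means a reduction.
pub fn percent_change(before: u64, after: u64) -> f64 {
    (after as f64 - before as f64) / before as f64 * 100.0
}
```

Feeding the single-thread totals reported in this PR into `percent_change` (143,727,872 to 100,380,124 accesses, 176,488,172 to 124,181,874 estimated cycles) gives roughly -30.16% and -29.64%.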

Final Result vs Baseline

Bench	Accesses Before	Accesses After	Accesses Δ	Est. Cycles Before	Est. Cycles After	Est. Cycles Δ
single-thread	143,727,872	100,380,124	-30.16%	176,488,172	124,181,874	-29.64%
[single-threaded]resolve with many extensions	325,700,719	216,209,862	-33.63%	380,377,449	~256M	~-32.7%

Per-Commit Progress

Commit	Bench	Mode	Acc Before	Acc After	Acc Δ (step)	Cyc Before	Cyc After	Cyc Δ (step)	Notes
`f76bab1`	single-thread	callgrind @ macOS+linux/arm64	143,727,872	142,544,544	-0.82%	176,488,172	174,865,264	-0.92%	byte specifier dispatch + raw-byte `Path::eq` on unix + skip `format!(".{subpath}")` alloc when empty
`6e9b24a`	single-thread	callgrind @ macOS+linux/arm64	142,544,544	140,995,923	-1.09%	174,865,264	173,031,518	-1.05%	preallocate `normalize_with` output
`601d27a`	single-thread	callgrind @ macOS+linux/arm64	140,995,923	140,702,086	-0.21%	173,031,518	172,695,646	-0.19%	`path_join_preallocated` for hot `node_modules` / `package.json` joins
`e828403`	single-thread	callgrind @ macOS+linux/arm64	140,702,086	139,041,327	-1.18%	172,695,646	171,064,577	-0.94%	byte-level `path_parent_unix` in `Cache::value` (+ equivalence test vs std)
`8e6285b`	single-thread	callgrind @ macOS+linux/arm64	139,041,327	138,958,385	-0.06%	171,064,577	170,887,380	-0.10%	skip `normalize_with` alloc in realpath when no symlinks in chain
`0c51d29`	single-thread	callgrind @ macOS+linux/arm64	138,958,385	100,380,124	-27.76%	170,887,380	124,181,874	-27.34%	sync `std::fs` in `FileSystemOs`

Cumulative on resolver/single-thread: -30.16% accesses, -29.64% estimated_cycles.

What the changes do

Byte-level specifier dispatch in require_without_parse — avoids the std Path::Components walk just to pick the require_* branch on every resolve.
Raw-byte Path::eq for the Cache::value DashSet lookup key on unix — mirrors the existing raw-byte hash (perf(cache): hash CachedPath by raw bytes on unix #226) and sidesteps std Components iteration on every cache lookup.
Skip format!(".{subpath}") allocation at four package_exports_resolve sites when subpath is empty (the common bare-specifier case like @scope/pkg).
Preallocate normalize_with output — PathBuf::with_capacity(self.len + sub.len + 1) then push(self) once, so the loop body's push(component) never has to regrow.
path_join_preallocated helper for the two hottest Path::join sites (cached_node_modules and package_json lookup) — same idea: pre-size so std's push never grows.
Byte-level path_parent_unix for Cache::value recursion on unix — std::path::Path::parent builds a Components iterator for one step back; the byte-level version mirrors std's semantics exactly (verified by a new equivalence test).
No-symlink realpath fast-path — when the parent chain produces no canonical change, cache None so the outer wrapper falls back to self.path directly, skipping a normalize_with allocation.
Sync std::fs in FileSystemOs — replace tokio::fs::* with std::fs::* inside the async fn bodies. The bench shows tokio's spawn_blocking + semaphore + park/unpark adds ~20M Ir per single-thread iteration in pure scheduling overhead, dwarfing the actual syscall work. The trait signature is unchanged. Tradeoff: blocks the runtime thread for the syscall duration (microseconds). Other Rust resolvers (swc, oxc) make the same tradeoff.
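
The empty-subpath shortcut from the list above can be sketched with `Cow`, so the common bare-specifier case borrows a static `"."` instead of allocating. The helper name `dot_subpath` and the assumption that a non-empty subpath starts with `/` are illustrative; the actual call sites in `package_exports_resolve` may be shaped differently.

```rust
use std::borrow::Cow;

/// Builds the `"." + subpath` key used for an exports lookup.
/// Hypothetical helper: borrows a static "." when `subpath` is empty,
/// and only pays for a `String` allocation otherwise.
fn dot_subpath(subpath: &str) -> Cow<'_, str> {
    if subpath.is_empty() {
        // Common case (e.g. `@scope/pkg`): no allocation.
        Cow::Borrowed(".")
    } else {
        // Rare case (e.g. `@scope/pkg/feature`): same result as before.
        Cow::Owned(format!(".{subpath}"))
    }
}
```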

What was deliberately skipped

Alias loop accelerator — perf(alias): short-circuit load_alias with a 1+2 byte prefix index #225's prefix index was reverted in Revert "perf(alias): short-circuit load_alias with a 1+2 byte prefix index" #230 with a preference for trie-style matching.
Custom-DST Arc<CachedPathImpl> (combining Box<Path> + the Arc into one alloc) — significant complexity for marginal additional gain after the sync-fs win.

Notes

Six pre-existing PnP test failures (fixture environment, not code) also reproduce on baseline; the passing count goes from 128 to 129 (added equivalence test for path_parent_unix).
Local arm64 Docker codspeed run -m simulation is blocked by setarch --personality; measurements use direct callgrind with codspeed's measure.rs flags. CI re-measurement under linux/amd64 should reproduce.

codspeed-hq · 2026-05-21T18:56:57Z

Merging this PR will improve performance by 32.74%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.


⚡ 6 improved benchmarks
✅ 6 untouched benchmarks

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Memory	`resolver[multi-thread]`	11.2 MB	8.9 MB	+26.13%
⚡	Simulation	`resolver[[single-threaded]resolve with many extensions]`	131.5 ms	96.9 ms	+35.72%
⚡	Simulation	`resolver[multi-thread]`	59.5 ms	42.8 ms	+39.01%
⚡	Simulation	`resolver[pnp resolve]`	265.1 µs	246.3 µs	+7.62%
⚡	Simulation	`resolver[single-thread]`	52.2 ms	37.7 ms	+38.55%
⚡	Simulation	`resolver[resolve from symlinks]`	160.4 ms	104 ms	+54.2%


Comparing perf/micro-opt-resolver (0c51d29) with main (c8af902)

…e-thread

Combines three small wins on the single-thread bench:

1. Byte-level specifier dispatch in require_without_parse: avoids the std Path::Components walk just to pick the require_* branch on every resolve. Behavior is preserved for unix; windows keeps the std parser only for drive-prefix detection.
2. Raw byte Path eq for the cache DashSet key on unix: mirrors the existing raw-byte hash (#226) and skips std Components iteration on every cache lookup.
3. Skip format!(".{subpath}") at four package_exports/resolve sites when subpath is empty (the common bare-specifier case like '@scope/pkg'). Removes one String alloc per resolve in the common path.

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

  resolver/single-thread:
    accesses: 143,727,872 -> 142,544,544 (-0.82%)
    estimated_cycles: 176,488,172 -> 174,865,264 (-0.92%)

  resolver/[single-threaded]resolve with many extensions:
    accesses: 325,700,719 -> 323,358,526 (-0.72%)
    estimated_cycles: 380,377,449 -> 377,302,346 (-0.81%)
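
A minimal sketch of byte-level specifier dispatch on unix, assuming a four-way split. The name `SpecifierKind` appears in the review summary for `src/lib.rs`, but the variants and the `classify` function here are hypothetical, and Windows drive-prefix handling is deliberately left out.

```rust
/// Which `require_*` branch a specifier should take.
/// Variant names are assumptions made for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum SpecifierKind {
    Absolute, // "/abs/path"
    Relative, // ".", "..", "./x", "../x"
    Hash,     // "#internal" (package imports)
    Bare,     // "pkg", "@scope/pkg"
}

/// Classifies by peeking at the leading bytes only, with no
/// `Path::components()` walk and no allocation.
fn classify(specifier: &str) -> SpecifierKind {
    match specifier.as_bytes() {
        [b'/', ..] => SpecifierKind::Absolute,
        [b'#', ..] => SpecifierKind::Hash,
        [b'.'] | [b'.', b'/', ..] | [b'.', b'.'] | [b'.', b'.', b'/', ..] => {
            SpecifierKind::Relative
        }
        _ => SpecifierKind::Bare,
    }
}
```

Note that a name such as `.hidden` falls through to `Bare` in this sketch, because only `.` or `..` followed by a separator (or end of input) counts as relative.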

normalize_with walks the subpath components and pushes each one onto a PathBuf seeded from self.to_path_buf(). The seeded PathBuf has capacity == self.len() so every pushed component (separator + bytes) forced at least one Vec regrow + memcpy of the existing path.

Switch to PathBuf::with_capacity(self.len() + subpath.len() + 1) and push self once up front. The worst-case capacity covers self, the separator, and the full subpath, so the loop body's pushes never grow.

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

  resolver/single-thread:
    accesses: 142,544,544 -> 140,995,923 (-1.09% step; -1.90% vs baseline)
    estimated_cycles: 174,865,264 -> 173,031,518 (-1.05% step; -1.96% vs baseline)

  resolver/[single-threaded]resolve with many extensions:
    accesses: 323,358,526 -> 323,689,695 (+0.10% step; -0.62% vs baseline)
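
The preallocation idea can be sketched as follows. This is a simplified stand-in for `normalize_with`, not the crate's implementation: the function name and the handling of each component kind are assumptions, and only the capacity strategy mirrors the commit message.

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `sub` onto `base`, resolving `..` lexically.
/// The output buffer is sized once for the worst case
/// (base + separator + full subpath), so pushes in the loop do not regrow it.
fn normalize_with_preallocated(base: &Path, sub: &Path) -> PathBuf {
    let capacity = base.as_os_str().len() + sub.as_os_str().len() + 1;
    let mut out = PathBuf::with_capacity(capacity);
    out.push(base); // single copy of the base path
    for component in sub.components() {
        match component {
            Component::ParentDir => {
                out.pop();
            }
            Component::Normal(name) => out.push(name),
            // Simplification for this sketch: ignore `.`, roots, and prefixes.
            Component::CurDir | Component::RootDir | Component::Prefix(_) => {}
        }
    }
    out
}
```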

cache.value(self.path.join("node_modules")) and the package.json lookup both rely on std::path::Path::join, which does self.to_path_buf() (exact-size alloc) followed by .push(sub), guaranteed to trigger a Vec regrow + memcpy of the just-allocated bytes on every call.

Introduce path_join_preallocated, which calls PathBuf::with_capacity(base.len + sub.len + 1) before pushing, so the push never regrows. Use it at the two hottest join sites (cached_node_modules' walk and package_json's get_or_try_init).

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

  resolver/single-thread:
    accesses: 140,995,923 -> 140,702,086 (-0.21% step; -2.11% vs baseline)
    estimated_cycles: 173,031,518 -> 172,695,646 (-0.19% step; -2.15% vs baseline)

  resolver/[single-threaded]resolve with many extensions:
    accesses: 323,689,695 -> 322,333,719 (-0.42% step; -1.03% vs baseline)
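
A possible shape for the helper is shown below. The name `path_join_preallocated` comes from the commit message, but this signature and body are assumptions for illustration only.

```rust
use std::path::{Path, PathBuf};

/// Equivalent to `base.join(sub)`, but sizes the buffer for
/// base + separator + sub up front so the second push does not regrow.
fn path_join_preallocated(base: &Path, sub: &str) -> PathBuf {
    let mut out = PathBuf::with_capacity(base.as_os_str().len() + sub.len() + 1);
    out.push(base);
    out.push(sub);
    out
}
```

The result should compare equal to `base.join(sub)` for the same inputs; only the allocation pattern differs.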

Cache::value's recursion calls Path::parent for every cache miss to chain up to the root, and std::path::Path::parent builds a Components iterator just to walk one step back. The bench shows parse_next_component_back weighs ~2M Ir on resolver/single-thread alone.

Add path_parent_unix that scans the raw bytes once for the last non-separator and the previous separator, matching std's exact semantics (verified with a new test against std::path::Path::parent across absolute, relative, trailing-slash, repeated-slash, and root cases). Cache::value uses it on cfg(unix), keeping the std path for windows.

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

  resolver/single-thread:
    accesses: 140,702,086 -> 139,041,327 (-1.18% step; -3.26% vs baseline)
    estimated_cycles: 172,695,646 -> 171,064,577 (-0.94% step; -3.07% vs baseline)

  resolver/[single-threaded]resolve with many extensions:
    accesses: 322,333,719 -> 316,116,887 (-1.93% step; -2.94% vs baseline)

…o symlinks

In CachedPathImpl::realpath, when the parent's canonical path matches the parent's stored path byte-for-byte, no symlinks were found anywhere up the chain. Cache None in that case instead of building Some(normalize_with(...)). The outer wrapper already falls back to self.path on None, so behavior is identical for the common (no-symlinks) input shape while skipping one PathBuf allocation per cached path on first realpath.

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

  resolver/single-thread:
    accesses: 139,041,327 -> 138,958,385 (-0.06% step; -3.32% vs baseline)
    estimated_cycles: 171,064,577 -> 170,887,380 (-0.10% step; -3.17% vs baseline)
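
The fast path can be illustrated with two small free functions. Both names and signatures are hypothetical, and the sketch assumes the entry itself has already been checked and is not a symlink; the real logic lives inside `CachedPathImpl::realpath`.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Returns the canonical path to cache, or `None` when the parent chain
/// contained no symlinks (canonical parent == stored parent, byte for byte).
fn realpath_override(
    parent_stored: &Path,
    parent_canonical: &Path,
    file_name: &OsStr,
) -> Option<PathBuf> {
    if parent_canonical.as_os_str() == parent_stored.as_os_str() {
        // Nothing changed up the chain: skip the PathBuf allocation.
        return None;
    }
    Some(parent_canonical.join(file_name))
}

/// Outer wrapper: a cached `None` means "the stored path is already real".
fn realpath<'a>(stored: &'a Path, cached: Option<&'a Path>) -> &'a Path {
    cached.unwrap_or(stored)
}
```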

Copilot

Pull request overview

This PR targets micro-optimizations in the resolver’s single-thread hot path, primarily by reducing allocations and avoiding repeated std::path::Components walks during dispatch, cache probing, and parent traversal.

Changes:

Add byte-level specifier classification to reduce dispatch overhead in require_without_parse.
Introduce preallocated path join/normalize helpers and use them in cache hot paths.
Add a Unix byte-level Path::parent implementation and adjust cache recursion/equality to reduce Components iteration.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

File	Description
`src/path.rs`	Adds `path_join_preallocated` and Unix `path_parent_unix`, plus preallocation in `normalize_with` and a std-equivalence test for `path_parent_unix`.
`src/lib.rs`	Adds `SpecifierKind` dispatch via byte checks and avoids repeated `format!` allocations for `"." + subpath` in package exports resolution.
`src/cache.rs`	Uses Unix byte-parent in `Cache::value`, swaps hot `join` call sites to `path_join_preallocated`, adds a no-symlink realpath fast-path, and uses raw-byte equality for cache lookup keys on Unix.
`.gitignore`	Ignores `optimization-artifacts/`.


+            let parent_unchanged = parent_path.as_path() == &*parent.path;
+            if parent_unchanged {
+              return Ok(None);
+            }
            return Ok(Some(


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e6285b0c7


chatgpt-codex-connector · 2026-05-21T23:26:38Z

+  let last_non_slash = bytes.iter().rposition(|&b| b != b'/')?;
+  let trimmed = &bytes[..=last_non_slash];
+  let parent_end = trimmed
+    .iter()
+    .rposition(|&b| b == b'/')
+    .map_or(0, |slash_pos| {
+      bytes[..slash_pos]
+        .iter()
+        .rposition(|&b| b != b'/')
+        .map_or_else(|| usize::from(bytes.first() == Some(&b'/')), |p| p + 1)
+    });
+  Some(Path::new(OsStr::from_bytes(&bytes[..parent_end])))


Preserve Path::parent semantics for . components

path_parent_unix does not match std::path::Path::parent when the path contains . segments (for example, /a/. and a/.), because it only trims slashes and never normalizes . components. Since Cache::value now uses this helper to build parent links, non-normalized inputs can traverse the wrong ancestor chain and keep non-canonical /. segments in subsequent realpath results. Please handle . components the same way as std (or fall back to std for those cases) before using this as a drop-in replacement.
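
One conservative way to act on this suggestion is to fall back to std whenever any `.` segment is present. In the sketch below the body of `path_parent_unix` is copied from the diff above, but its signature is an assumption, and `has_cur_dir_segment` and `parent_with_fallback` are hypothetical helpers. The extra byte scan has its own cost, so whether this keeps the measured win would need re-benchmarking.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Byte-level parent; body taken from the reviewed diff, signature assumed.
fn path_parent_unix(path: &Path) -> Option<&Path> {
    let bytes = path.as_os_str().as_bytes();
    let last_non_slash = bytes.iter().rposition(|&b| b != b'/')?;
    let trimmed = &bytes[..=last_non_slash];
    let parent_end = trimmed
        .iter()
        .rposition(|&b| b == b'/')
        .map_or(0, |slash_pos| {
            bytes[..slash_pos]
                .iter()
                .rposition(|&b| b != b'/')
                .map_or_else(|| usize::from(bytes.first() == Some(&b'/')), |p| p + 1)
        });
    Some(Path::new(OsStr::from_bytes(&bytes[..parent_end])))
}

/// True if any `/`-separated segment is exactly `.`.
fn has_cur_dir_segment(path: &Path) -> bool {
    path.as_os_str()
        .as_bytes()
        .split(|&b| b == b'/')
        .any(|segment| segment == b".")
}

/// Uses the byte-level fast path only when no `.` segment is present;
/// otherwise defers to `std::path::Path::parent`.
fn parent_with_fallback(path: &Path) -> Option<&Path> {
    if has_cur_dir_segment(path) {
        path.parent()
    } else {
        path_parent_unix(path)
    }
}
```

For `/a/.`, the byte-level helper yields `/a` while std yields `/`, which is the mismatch described above; the fallback variant follows std in that case.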


FileSystemOs's async trait methods previously called tokio::fs::*, which internally spawn_blocking + acquire a semaphore + park/unpark per syscall. The bench shows this scheduling layer costs ~20M Ir per single-thread iteration on tokio runtime internals alone, dwarfing the actual stat/read work.

Switch to sync std::fs::{metadata, symlink_metadata, read, read_to_string} inside the async fn body. The trait signature is unchanged, callers still await normally, and canonicalize was already sync (dunce::canonicalize).

Tradeoff: each fs call now blocks the runtime thread for the duration of the syscall (microseconds). Multi-thread tokio users will lose some concurrency overlap relative to the spawn_blocking model, but the wins on per-call overhead are large enough that swc/oxc/ripgrep all make the same tradeoff. wasm target keeps its existing std::fs path.

Bench (callgrind / CodSpeed CPU simulation formula, macOS+linux/arm64):

  resolver/single-thread:
    accesses: 138,958,385 -> 100,380,124 (-27.76% step; -30.16% vs baseline)
    estimated_cycles: 170,887,380 -> 124,181,874 (-27.34% step; -29.64% vs baseline)

  resolver/[single-threaded]resolve with many extensions:
    accesses: ~316M -> 216,209,862 (~-31% step; -33.63% vs baseline)
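
A dependency-free sketch of the pattern: `async fn` bodies that call `std::fs` directly, so the future completes on its first poll without a scheduler hop. The type name `FileSystemOs` comes from the PR, but these inherent methods are illustrative and are not the crate's trait; `poll_once` and `NoopWake` are helpers invented here to avoid pulling in a runtime.

```rust
use std::future::Future;
use std::io;
use std::path::Path;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

pub struct FileSystemOs;

impl FileSystemOs {
    /// Async signature, synchronous body: blocks the calling thread
    /// for the duration of the syscall and never yields.
    pub async fn read_to_string(&self, path: &Path) -> io::Result<String> {
        std::fs::read_to_string(path)
    }

    pub async fn metadata(&self, path: &Path) -> io::Result<std::fs::Metadata> {
        std::fs::metadata(path)
    }
}

/// Waker that does nothing; enough for futures that never return Pending.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future exactly once; `Some` means it finished without yielding.
pub fn poll_once<F: Future>(future: F) -> Option<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(future);
    match future.as_mut().poll(&mut cx) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None,
    }
}
```

Because the body never awaits, a single poll is enough to obtain the result, which is the property that removes the per-call scheduling overhead described above, at the cost of blocking the polling thread.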

chore: ignore optimization-artifacts/ used by micro-opt profiling

939cd99

stormslowly added 5 commits May 22, 2026 03:10

stormslowly marked this pull request as ready for review May 21, 2026 23:19

Copilot AI review requested due to automatic review settings May 21, 2026 23:19

Copilot started reviewing on behalf of stormslowly May 21, 2026 23:19 View session

Copilot AI reviewed May 21, 2026


Comment thread src/cache.rs

Comment on lines +269 to 273

let parent_unchanged = parent_path.as_path() == &*parent.path;
if parent_unchanged {
  return Ok(None);
}
return Ok(Some(

chatgpt-codex-connector Bot reviewed May 21, 2026

stormslowly wants to merge 7 commits into main from perf/micro-opt-resolver

2 participants
