refactor(walltime): hook-based CPU isolation for the walltime runner#427
refactor(walltime): hook-based CPU isolation for the walltime runner#427GuillaumeLagrange wants to merge 1 commit into
Conversation
Greptile SummaryThis PR moves walltime CPU isolation into a single hook-aware lifecycle. The main changes are:
Confidence Score: 4/5This is close, but these isolation paths should be fixed before merging.
src/executor/wall_time/isolation.rs Important Files Changed
Reviews (5): Last reviewed commit: "refactor(walltime): hook-based CPU isola..." | Re-trigger Greptile |
Merging this PR will not alter performance
|
a15e7ed to
40f9325
Compare
| IsolationMode::Cgroup { scope_dir } => { | ||
| // Move the runner (and the profiler it later spawns) onto the | ||
| // system cores before wrapping the benchmark for the bench cores. | ||
| place_runner_in_support(scope_dir)?; |
There was a problem hiding this comment.
Support leaf race In cgroup mode, this moves the runner into
<scope>/support/cgroup.procs before the code waits for any leaf to exist. If the privileged setup creates support and bench asynchronously, this write can fail with a missing support/cgroup.procs path even though the cgroup layout would be ready shortly after. The run then exits before the benchmark starts, so the support leaf needs the same readiness handling as the bench leaf.
40f9325 to
d2f2793
Compare
| Ok(o) if o.status.success() => {} | ||
| Ok(o) => debug!( | ||
| "{hook} exited {}: {}", | ||
| o.status, | ||
| String::from_utf8_lossy(&o.stderr) | ||
| ), | ||
| Err(e) => debug!("failed to run {hook}: {e}"), | ||
| } |
There was a problem hiding this comment.
This branch logs a failing codspeed-pre-bench hook and still selects hook isolation. When the pre hook fails because the support cgroup is not ready, permissions are wrong, or the hook install is partial, the runner continues with codspeed-wrap-bench and runs the profiler without sudo. The walltime run can then complete with the runner or profiler outside the intended support leaf, which produces incorrectly isolated benchmark results instead of a clear failure or fallback.
Replace the runner's built-in CPU-isolation mechanisms with a single,
machine-driven one. The runner previously hard-coded `systemd-run --scope` and a
`CODSPEED_ISOLATION=CGROUP:<dir>` mode with per-spawn cgroup-dir plumbing, split
across a `HookScriptsGuard` (which ran the pre/post-bench hooks) and
`isolation.rs` (which did the wrapping).
Now a single `Isolation` type owns the whole lifecycle: `resolve()` runs the
pre-bench hook, `wrap_bench()` pins the benchmark leaf, and `Drop` runs
post-bench. Cpuset logic lives on the machine behind three hooks
(`codspeed-{pre,wrap,post}-bench`); the runner only invokes them and is otherwise
oblivious to how cores are attributed. Discovery is by hook presence:
- an executable `codspeed-wrap-bench` selects the hook path — unprivileged, and
the benchmark stays a descendant of the profiler so it records without sudo;
- its absence falls back to `systemd-run --scope --slice=codspeed.slice`, so
hosts without the hook keep working unchanged.
The pre-bench hook is invoked with the runner's PID so the machine places the
runner (and the profiler it spawns) onto the system cores; the runner makes no
cgroup writes of its own. The profiler's `wrap_command` flag is renamed
`isolate` -> `requires_sudo`, now true only for the systemd fallback.
Refs COD-3012
Co-Authored-By: Claude <noreply@anthropic.com>
757ec17 to
6151305
Compare
| if std::env::var(ISOLATION_ENV).is_ok_and(|v| v == "false") { | ||
| return Isolation::None; | ||
| } |
There was a problem hiding this comment.
True Override Still Ignored
CODSPEED_ISOLATION=false disables isolation again, but CODSPEED_ISOLATION=true still no longer forces it. On a Linux host without codspeed-wrap-bench and without passwordless sudo, this path falls through to Isolation::None, so a job that explicitly requested isolation runs unpinned and produces noisy walltime results. The positive override needs to be handled separately from the auto-detected fallback.
| run_pre_bench(); | ||
| return Isolation::Hooks; | ||
| } |
There was a problem hiding this comment.
Pre Hook Failure Continues This branch returns hook isolation even if
codspeed-pre-bench fails. run_pre_bench() has no result, and run_hook() only logs failures at debug level, so a machine with an executable wrap hook but a missing or failing pre hook still runs the benchmark through hook mode. In that case the runner and profiler may never be moved to the support cores, and the run can finish with contaminated walltime results instead of failing or falling back.
Replace the runner's built-in CPU-isolation mechanisms with a single, machine-driven one.
The runner previously hard-coded
systemd-run --scopeand, in a first iteration, aCODSPEED_ISOLATION=CGROUP:<dir>mode with per-spawn cgroup-dir plumbing — with the work split across aHookScriptsGuard(which ran the pre/post-bench hooks) andisolation.rs(which did the wrapping).Now a single
Isolationtype owns the whole lifecycle:resolve()runs the pre-bench hook,wrap_bench()pins the benchmark leaf, andDropruns post-bench. Cpuset logic lives on the machine behind three hooks (codspeed-{pre,wrap,post}-bench); the runner only invokes them and is otherwise oblivious to how cores are attributed.Discovery is by hook presence:
codspeed-wrap-benchselects the hook path — unprivileged, and the benchmark stays a descendant of the profiler so it records without sudo;systemd-run --scope --slice=codspeed.slice, so hosts without the hook keep working unchanged.requires_sudois now true only for that fallback, so the profiler wrapping is effectively unchanged. The pre-bench hook is invoked with the runner's PID so the machine can place the runner (and the profiler it spawns) onto the system cores — the runner makes no cgroup writes of its own.Refs COD-3012