fix(distributed): gate per-worker SwiftShader probe to worker 0 only#956
Conversation
After #916 moved `assertSwiftShader` from `renderChunk()`'s eager probe session into `executeWorkerTask`, every parallel worker began running its own `chrome://gpu` / canvas-WebGL probe. At `chunkWorkerCount=6` (texture launch at chunks=3) that's 6 concurrent CDP page-loads per chunk × 3 chunks = 18 simultaneous probes. Bench data on dev (12 producer pods × 22 vCPU) showed c=3 worst-case wall-clock at 67.3s, 24.7s above c=6 worst (42.6s) — pod_total inflates 100s → 147s uniformly across all three chunks per slow iter, the signature of cluster-level CDP contention rather than within-pod contention. Workers within a chunk share the same Chrome binary, flags, and OS/driver state on a single pod, so worker 0's success is representative for the rest. Gate the probe via `shouldVerifyWorkerGpu(workerId, config)` so only worker 0 navigates to the probe page; workers 1..N-1 skip it. The fail-fast contract still holds at the chunk level (worker 0 still aborts the chunk if SwiftShader didn't load) — just without the concurrent CDP traffic. Expected wall-clock impact: c=3 worst drops from ~67s to in line with c=6 worst (~42-44s). c=6 (3 workers/pod) and c=8 (2 workers/pod) should see smaller wins; c=12 (1 worker/pod, sequential branch) is unaffected. Closes #955.
miguel-heygen
left a comment
There was a problem hiding this comment.
APPROVE — clean fix for a real performance regression.
Verified:
shouldVerifyWorkerGpu(workerId, config)correctly gates SwiftShader probe to worker 0 only in software GPU modeworkerIdis assigned sequentially from 0 indistributeFrames, so worker 0 is always the first in every chunk- Sequential branch (
chunkWorkerCount === 1) still probes unconditionally — correct Partial<EngineConfig> | undefinedparameter handling is sound (optional chaining)- 4 test cases cover the essential matrix (worker 0/non-zero × software/hardware/undefined)
- Workers 1..N-1 that start before worker 0's probe completes will have their frames discarded on failure — same behavior as before, no regression
The bench data showing ~24s contention at c=3 from 18 simultaneous probe navigations makes the fix well-motivated.
jrusso1020
left a comment
There was a problem hiding this comment.
Reviewed the diff and the full files (parallelCoordinator.ts, parallelCoordinator.test.ts, renderChunk.ts) plus the assertSwiftShader contract and the surrounding chunk/worker plumbing. The fix does what the PR body claims; gate logic is correct and the unit tests cover the helper well. One stale-comment nit and one minor behavioral observation below.
Correctness
shouldVerifyWorkerGpu(workerId, config)returnstrueiffbrowserGpuMode === "software" && workerId === 0. Matches the PR description.- Fail-fast at chunk granularity is preserved: when worker 0 hits a non-SwiftShader backend it throws
SwiftShaderAssertionErrorbeforeinitializeSession— no composition frames are captured on worker 0.executeParallelCapturecollects that as a result witherror, then throws[Parallel] Capture failed: …, which the chunk treats as a hard failure (and the worker adapter classifies as non-retryable viaBROWSER_GPU_NOT_SOFTWARE). The byte-identical-retry contract therefore still holds: workers 1..N-1 may have captured frames, but the chunk is discarded. - Workers within a chunk share the same Chrome binary, flags, OS image, and SwiftShader libs (single pod, single launcher), so worker 0's verification is genuinely representative. Per-worker drift would require a SwiftShader downgrade mid-chunk, which the current launch model can't produce.
- Sequential branch in
renderChunk.ts(around line 492,chunkWorkerCount === 1) is untouched — it still callsassertSwiftShaderagainst its own probe session beforeinitializeSession. Good.
Minor: workers 1..N-1 don't short-circuit on worker-0 failure
When worker 0 throws the assertion, executeParallelCapture uses Promise.all over executeWorkerTask calls that catch internally and return {error}. Workers 1..N-1 keep running to completion before the chunk-level error surfaces — wasted capture work on the same pod when SwiftShader is downgraded. Not a correctness issue (output is discarded), and there's no analogous pre-existing case in the c=1 branch (no siblings), so this is fine to defer. Flagging only because the new comment block on lines 220-234 emphasizes "worker 0 aborts before frames are captured" without noting that the other workers don't.
If you want to short-circuit later, the lightweight path is wiring an AbortController through signal and aborting from worker 0's failure inside executeWorkerTask — but that's a follow-up, not a blocker.
Nit: stale inline comment in renderChunk.ts
The upstream comment at lines 469-477 was updated to reflect "worker 0 only", but the inline comment block at lines 509-511 was missed:
hyperframes/packages/producer/src/services/distributed/renderChunk.ts
Lines 509 to 511 in da46de6
// chunkWorkerCount > 1: skip the probe entirely. Each parallel worker
// creates its own session and runs `assertSwiftShader` before its
// first frame.
After this PR that's no longer true — only worker 0 runs assertSwiftShader. Suggest tweaking the second sentence to something like "…each parallel worker creates its own session; worker 0 runs assertSwiftShader before its first frame (workers 1..N-1 piggy-back on that verification — see parallelCoordinator.ts:shouldVerifyWorkerGpu)."
Tests
The 4 new shouldVerifyWorkerGpu cases cover the matrix (worker-0 + software → true; non-zero workers + software → false; hardware/empty config → false; undefined config → false). That's the right coverage for a pure helper; the wired-up behavior in executeWorkerTask is covered by the Docker golden-baseline shards as noted.
Verdict
COMMENT — the gate logic is right and ships safely. Only the stale comment in renderChunk.ts lines 509-511 is worth fixing; happy for it to land as a follow-up or as a single-line tweak before merge.
vanceingalls
left a comment
There was a problem hiding this comment.
Clean, narrowly-scoped perf fix that lines up with the bench data in the PR body. The design — one SwiftShader probe per chunk gated to worker 0 — is the right call: workers within a chunk share the same Chrome binary, flags, and OS/driver state, so worker 0's verdict is representative. Tests pin the predicate's behavior across the four interesting axes (worker id × config presence × GPU mode), including the undefined-config edge case.
Calibrated strengths
- Pure-predicate factoring (
shouldVerifyWorkerGpuatpackages/engine/src/services/parallelCoordinator.ts:189) keepsexecuteWorkerTaskreadable and makes the rule unit-testable without spinning a Puppeteer page. The 4-case test (parallelCoordinator.test.ts:79-101) covers the right corners. - The renderChunk comment at
packages/producer/src/services/distributed/renderChunk.ts:469-477is updated in lockstep with the gate semantics so the eager-pre-probe rationale stays accurate. Easy thing to forget; nice catch. distributeFramesguarantees aworkerId === 0for every non-empty chunk (sequentialfor (let i = 0; ...)with thestartFrame >= rangeStart + totalFramesshort-circuit), so the gate is structurally safe — no risk of a chunk where worker 0 is missing.
Findings
- nit — PR-body wording overstates fail-fast: "worker 0 aborts the chunk before any frames are captured."
executeWorkerTaskcatches theSwiftShaderAssertionErrorand returns it onWorkerResult.error(parallelCoordinator.ts:274-284), andexecuteParallelCapturewaits onPromise.allbefore throwing (parallelCoordinator.ts:264-274). On a real downgrade, workers 1..N-1 will keep running their capture loop; incaptureStreamingStage's parallel branch (captureStreamingStage.ts:182), their frames stream straight into the encoder before worker 0's failure surfaces. The byte-identical-retry contract still holds — the chunk-level throw discards the output and the retry overwrites it — but "aborts before any frames are captured" is imprecise. Worth a one-liner in the gate's docblock noting that the per-worker abort is task-scoped, not chunk-scoped, and the retry contract is what carries the safety guarantee. No code change needed; clarification only. - nit —
config?: Partial<EngineConfig>is broader than the predicate needs.config?: Partial<Pick<EngineConfig, \"browserGpuMode\">>would tighten the contract and let future refactors of unrelatedEngineConfigfields skip recompiling the test. The current callers all pass full configs, so this is purely a typing nit. - nit — no integration test pins the wiring (i.e.,
executeWorkerTaskactually consultsshouldVerifyWorkerGpuinstead of an inline check). The predicate test alone wouldn't catch a regression where the if-statement atparallelCoordinator.ts:232reverts toconfig?.browserGpuMode === \"software\". Acceptable given the one-line call site, but worth a follow-up if this surface grows.
Verdict: APPROVE
Reasoning: Correct design, evidence-backed perf justification, predicate is small and tested. The fail-fast wording in the PR body is a nit on the description, not on the code; the byte-identical-retry contract is preserved by executeParallelCapture throwing on any worker error before encode succeeds.
Review by Vai
Summary
Closes #955.
After #916 moved
assertSwiftShaderfromrenderChunk()'s eager probe session intoexecuteWorkerTask, every parallel worker began running its ownchrome://gpu/ canvas-WebGL probe. AtchunkWorkerCount=6(texture-launch at chunks=3) that's 6 concurrent CDP page-loads per chunk × 3 chunks = 18 simultaneous probes hitting the dev fleet at once.Bench data on dev (12 producer pods × 22 vCPU,
--chunks 3,6,8,12 --iterations 5, sidecar v0.6.16 = post-#916 / pre-this PR):c=3 is 24.7s above c=6 worst (well above the 10s "real regression" threshold). The slow iters at c=3 show pod_total inflating from ~100s to ~147s uniformly across all three chunks per iter — that's the signature of cluster-level CDP contention rather than within-pod contention.
Fix
Workers within a chunk share the same Chrome binary, flags, and OS/driver state on a single pod, so worker 0's success is representative. A small helper
shouldVerifyWorkerGpu(workerId, config)returnstrueiffbrowserGpuMode === "software" && workerId === 0;executeWorkerTaskuses it instead of the inline check. Workers 1..N-1 skip the probe entirely.The fail-fast contract still holds at the chunk level: if SwiftShader didn't load on a pod, worker 0 aborts the chunk before any frames are captured. Workers 1..N-1 piggy-back on that guarantee.
Expected impact
Will validate on dev after release + DEV_DEPLOY and update #955 with the post-fix bench numbers.
Test plan
shouldVerifyWorkerGpu(worker-0/software → true, non-zero workers / non-software config / undefined config → false).bun run test src/services/parallelCoordinator.test.tspasses 11/11.bun run --cwd packages/engine buildandbun run --cwd packages/producer buildclean.bunx oxlint+bunx oxfmt --checkclean on the three touched files.🤖 Generated with Claude Code