fix(distributed): gate per-worker SwiftShader probe to worker 0 only#956

Merged
jrusso1020 merged 1 commit into
mainfrom
05-19-fix_distributed_gate_per_worker_swiftshader_probe_to_worker_0
May 19, 2026

@jrusso1020
Collaborator

Summary

Closes #955.

After #916 moved assertSwiftShader from renderChunk()'s eager probe session into executeWorkerTask, every parallel worker began running its own chrome://gpu / canvas-WebGL probe. At chunkWorkerCount=6 (texture-launch at chunks=3) that's 6 concurrent CDP page-loads per chunk × 3 chunks = 18 simultaneous probes hitting the dev fleet at once.

Bench data on dev (12 producer pods × 22 vCPU, --chunks 3,6,8,12 --iterations 5, sidecar v0.6.16 = post-#916 / pre-this PR):

| chunks | chunkWorkerCount | worst total | median total | std |
|---|---|---|---|---|
| 3 | 6 | 67.3s | 52.9s | 7.5s |
| 6 | 3 | 42.6s | 41.3s | 0.6s |
| 8 | 2 | 41.9s | 40.0s | 1.1s |
| 12 | 1 | 38.9s | 38.6s | 0.2s |

The c=3 worst case (67.3s) is 24.7s above the c=6 worst case (42.6s), well above the 10s "real regression" threshold. In the slow iterations at c=3, pod_total inflates from ~100s to ~147s uniformly across all three chunks of the iteration, which is the signature of cluster-level CDP contention rather than within-pod contention.
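The probe arithmetic in the summary can be sketched as follows. This is illustrative only: `BenchRow` and `probeCounts` are hypothetical names, and the totals assume every chunk in a run issues its probes at the same time and that the render is in software GPU mode.

```typescript
// Illustrative arithmetic only; BenchRow and probeCounts are hypothetical names.
interface BenchRow {
  chunks: number;
  chunkWorkerCount: number;
}

// Probes issued per chunk and across all chunks, assuming every chunk runs at once.
function probeCounts(row: BenchRow, gatedToWorkerZero: boolean) {
  // Ungated: every worker probes. Gated: only worker 0 probes.
  const perChunk = gatedToWorkerZero ? 1 : row.chunkWorkerCount;
  return { perChunk, total: perChunk * row.chunks };
}

const c3: BenchRow = { chunks: 3, chunkWorkerCount: 6 };
console.log(probeCounts(c3, false)); // { perChunk: 6, total: 18 }
console.log(probeCounts(c3, true)); // { perChunk: 1, total: 3 }
```

Under those assumptions, gating takes the c=3 configuration from 18 probe page-loads down to 3, one per chunk.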

Fix

Workers within a chunk share the same Chrome binary, flags, and OS/driver state on a single pod, so worker 0's success is representative. A small helper shouldVerifyWorkerGpu(workerId, config) returns true iff browserGpuMode === "software" && workerId === 0; executeWorkerTask uses it instead of the inline check. Workers 1..N-1 skip the probe entirely.

The fail-fast contract still holds at the chunk level: if SwiftShader didn't load on a pod, worker 0 aborts the chunk before any frames are captured. Workers 1..N-1 piggy-back on that guarantee.
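A minimal sketch of the gate as described above, not the merged code. The `EngineConfig` shape here is a stand-in that carries only the field the predicate reads, `"hardware"` is an assumed name for the non-software mode, and `planWorkerSteps` is a hypothetical helper that only shows where the gate would sit relative to session setup.

```typescript
// Hypothetical stand-in for the real EngineConfig; only the field the gate reads.
// "hardware" is an assumed name for the non-software mode.
interface EngineConfig {
  browserGpuMode: "software" | "hardware";
}

// True only for worker 0 in software GPU mode; every other case skips the probe.
function shouldVerifyWorkerGpu(
  workerId: number,
  config?: Partial<EngineConfig>,
): boolean {
  return config?.browserGpuMode === "software" && workerId === 0;
}

// Hypothetical call site: the probe, when it runs, precedes session initialization.
function planWorkerSteps(workerId: number, config?: Partial<EngineConfig>): string[] {
  const steps: string[] = [];
  if (shouldVerifyWorkerGpu(workerId, config)) steps.push("assertSwiftShader");
  steps.push("initializeSession", "captureFrames");
  return steps;
}

console.log(planWorkerSteps(0, { browserGpuMode: "software" }));
console.log(planWorkerSteps(1, { browserGpuMode: "software" }));
```

Keeping the rule in a pure function means it can be tested without launching a browser.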

Expected impact

  • c=3 worst-case should drop from ~67s into the c=6 cluster (~42-44s).
  • c=6 / c=8 should see smaller wins.
  • c=12 is unaffected (sequential branch, no parallel workers).

Will validate on dev after release + DEV_DEPLOY and update #955 with the post-fix bench numbers.

Test plan

  • 4 new unit cases on shouldVerifyWorkerGpu (worker-0/software → true, non-zero workers / non-software config / undefined config → false). bun run test src/services/parallelCoordinator.test.ts passes 11/11.
  • bun run --cwd packages/engine build and bun run --cwd packages/producer build clean.
  • bunx oxlint + bunx oxfmt --check clean on the three touched files.
  • Docker golden baselines unchanged (the gate doesn't alter the captured frames; it just skips a probe page-load on workers 1..N-1).

🤖 Generated with Claude Code

After #916 moved `assertSwiftShader` from `renderChunk()`'s eager probe
session into `executeWorkerTask`, every parallel worker began running its
own `chrome://gpu` / canvas-WebGL probe. At `chunkWorkerCount=6` (texture
launch at chunks=3) that's 6 concurrent CDP page-loads per chunk × 3
chunks = 18 simultaneous probes. Bench data on dev (12 producer pods × 22
vCPU) showed c=3 worst-case wall-clock at 67.3s, 24.7s above c=6 worst
(42.6s) — pod_total inflates 100s → 147s uniformly across all three
chunks per slow iter, the signature of cluster-level CDP contention
rather than within-pod contention.

Workers within a chunk share the same Chrome binary, flags, and OS/driver
state on a single pod, so worker 0's success is representative for the
rest. Gate the probe via `shouldVerifyWorkerGpu(workerId, config)` so
only worker 0 navigates to the probe page; workers 1..N-1 skip it. The
fail-fast contract still holds at the chunk level (worker 0 still aborts
the chunk if SwiftShader didn't load) — just without the concurrent CDP
traffic.

Expected wall-clock impact: c=3 worst drops from ~67s to in line with
c=6 worst (~42-44s). c=6 (3 workers/pod) and c=8 (2 workers/pod) should
see smaller wins; c=12 (1 worker/pod, sequential branch) is unaffected.

Closes #955.
Collaborator

@miguel-heygen miguel-heygen left a comment

APPROVE — clean fix for a real performance regression.

Verified:

  • shouldVerifyWorkerGpu(workerId, config) correctly gates SwiftShader probe to worker 0 only in software GPU mode
  • workerId is assigned sequentially from 0 in distributeFrames, so worker 0 is always the first in every chunk
  • Sequential branch (chunkWorkerCount === 1) still probes unconditionally — correct
  • Partial<EngineConfig> | undefined parameter handling is sound (optional chaining)
  • 4 test cases cover the essential matrix (worker 0/non-zero × software/hardware/undefined)
  • Workers 1..N-1 that start before worker 0's probe completes will have their frames discarded on failure — same behavior as before, no regression

The bench data showing ~24s contention at c=3 from 18 simultaneous probe navigations makes the fix well-motivated.

Collaborator Author

@jrusso1020 jrusso1020 left a comment

Reviewed the diff and the full files (parallelCoordinator.ts, parallelCoordinator.test.ts, renderChunk.ts) plus the assertSwiftShader contract and the surrounding chunk/worker plumbing. The fix does what the PR body claims; gate logic is correct and the unit tests cover the helper well. One stale-comment nit and one minor behavioral observation below.

Correctness

  • shouldVerifyWorkerGpu(workerId, config) returns true iff browserGpuMode === "software" && workerId === 0. Matches the PR description.
  • Fail-fast at chunk granularity is preserved: when worker 0 hits a non-SwiftShader backend it throws SwiftShaderAssertionError before initializeSession — no composition frames are captured on worker 0. executeParallelCapture collects that as a result with error, then throws [Parallel] Capture failed: …, which the chunk treats as a hard failure (and the worker adapter classifies as non-retryable via BROWSER_GPU_NOT_SOFTWARE). The byte-identical-retry contract therefore still holds: workers 1..N-1 may have captured frames, but the chunk is discarded.
  • Workers within a chunk share the same Chrome binary, flags, OS image, and SwiftShader libs (single pod, single launcher), so worker 0's verification is genuinely representative. Per-worker drift would require a SwiftShader downgrade mid-chunk, which the current launch model can't produce.
  • Sequential branch in renderChunk.ts (around line 492, chunkWorkerCount === 1) is untouched — it still calls assertSwiftShader against its own probe session before initializeSession. Good.

Minor: workers 1..N-1 don't short-circuit on worker-0 failure

When worker 0 throws the assertion, executeParallelCapture uses Promise.all over executeWorkerTask calls that catch internally and return {error}. Workers 1..N-1 keep running to completion before the chunk-level error surfaces — wasted capture work on the same pod when SwiftShader is downgraded. Not a correctness issue (output is discarded), and there's no analogous pre-existing case in the c=1 branch (no siblings), so this is fine to defer. Flagging only because the new comment block on lines 220-234 emphasizes "worker 0 aborts before frames are captured" without noting that the other workers don't.

If you want to short-circuit later, the lightweight path is wiring an AbortController through signal and aborting from worker 0's failure inside executeWorkerTask — but that's a follow-up, not a blocker.
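One possible shape for that follow-up, as a simplified model rather than the real coordinator. `runWorker`, `runChunk`, and this `WorkerResult` are hypothetical names, the per-frame loop stands in for the actual capture work, and the controller is passed directly instead of through an existing `signal` parameter.

```typescript
// Sketch only: runWorker, runChunk, and WorkerResult are hypothetical names.
interface WorkerResult {
  workerId: number;
  framesCaptured: number;
  error?: string;
}

async function runWorker(
  workerId: number,
  frames: number,
  probeFails: boolean,
  controller: AbortController,
): Promise<WorkerResult> {
  let framesCaptured = 0;
  try {
    if (workerId === 0 && probeFails) {
      // Stands in for SwiftShaderAssertionError; signal siblings to stop.
      controller.abort();
      throw new Error("SwiftShader not active");
    }
    for (let f = 0; f < frames; f++) {
      // Yield so siblings interleave, then stop early if worker 0 aborted.
      await Promise.resolve();
      if (controller.signal.aborted) break;
      framesCaptured++;
    }
    return { workerId, framesCaptured };
  } catch (err) {
    // Failures are reported on the result, mirroring the catch-and-return pattern.
    return { workerId, framesCaptured, error: (err as Error).message };
  }
}

async function runChunk(workerCount: number, frames: number, probeFails: boolean) {
  const controller = new AbortController();
  const results = await Promise.all(
    Array.from({ length: workerCount }, (_, id) => runWorker(id, frames, probeFails, controller)),
  );
  const failed = results.find((r) => r.error !== undefined);
  if (failed) throw new Error(`[Parallel] Capture failed: ${failed.error}`);
  return results;
}
```

The chunk still fails through the same collect-then-throw path; the only change is that sibling workers check the signal between frames and stop early instead of running to completion.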

Nit: stale inline comment in renderChunk.ts

The upstream comment at lines 469-477 was updated to reflect "worker 0 only", but the inline comment block at lines 509-511 was missed:

// at a `frameTimeTicks` it had just advanced to.
}
// chunkWorkerCount > 1: skip the probe entirely. Each parallel worker
// creates its own session and runs `assertSwiftShader` before its
// first frame.

After this PR that's no longer true — only worker 0 runs assertSwiftShader. Suggest tweaking the second sentence to something like "…each parallel worker creates its own session; worker 0 runs assertSwiftShader before its first frame (workers 1..N-1 piggy-back on that verification — see parallelCoordinator.ts:shouldVerifyWorkerGpu)."

Tests

The 4 new shouldVerifyWorkerGpu cases cover the matrix (worker-0 + software → true; non-zero workers + software → false; hardware/empty config → false; undefined config → false). That's the right coverage for a pure helper; the wired-up behavior in executeWorkerTask is covered by the Docker golden-baseline shards as noted.

Verdict

COMMENT — the gate logic is right and ships safely. Only the stale comment in renderChunk.ts lines 509-511 is worth fixing; happy for it to land as a follow-up or as a single-line tweak before merge.

Collaborator

@vanceingalls vanceingalls left a comment

Clean, narrowly-scoped perf fix that lines up with the bench data in the PR body. The design — one SwiftShader probe per chunk gated to worker 0 — is the right call: workers within a chunk share the same Chrome binary, flags, and OS/driver state, so worker 0's verdict is representative. Tests pin the predicate's behavior across the four interesting axes (worker id × config presence × GPU mode), including the undefined-config edge case.

Calibrated strengths

  • Pure-predicate factoring (shouldVerifyWorkerGpu at packages/engine/src/services/parallelCoordinator.ts:189) keeps executeWorkerTask readable and makes the rule unit-testable without spinning a Puppeteer page. The 4-case test (parallelCoordinator.test.ts:79-101) covers the right corners.
  • The renderChunk comment at packages/producer/src/services/distributed/renderChunk.ts:469-477 is updated in lockstep with the gate semantics so the eager-pre-probe rationale stays accurate. Easy thing to forget; nice catch.
  • distributeFrames guarantees a workerId === 0 for every non-empty chunk (sequential for (let i = 0; ...) with the startFrame >= rangeStart + totalFrames short-circuit), so the gate is structurally safe — no risk of a chunk where worker 0 is missing.
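The structural guarantee in the last point can be illustrated with a hypothetical reconstruction. `distributeFramesSketch` and `WorkerRange` are invented names, and the even `Math.ceil` split is an assumption; only the sequential loop from 0 and the `startFrame >= rangeStart + totalFrames` short-circuit come from the description above.

```typescript
// Hypothetical reconstruction; the real distributeFrames may split frames differently.
interface WorkerRange {
  workerId: number;
  startFrame: number;
  endFrame: number; // exclusive
}

function distributeFramesSketch(
  rangeStart: number,
  totalFrames: number,
  workerCount: number,
): WorkerRange[] {
  const ranges: WorkerRange[] = [];
  const perWorker = Math.ceil(totalFrames / workerCount);
  for (let i = 0; i < workerCount; i++) {
    const startFrame = rangeStart + i * perWorker;
    // Short-circuit: no frames left for this worker or any later one.
    if (startFrame >= rangeStart + totalFrames) break;
    const endFrame = Math.min(startFrame + perWorker, rangeStart + totalFrames);
    ranges.push({ workerId: i, startFrame, endFrame });
  }
  return ranges;
}

// Four frames across six requested workers: ids 0..3 are assigned, starting at 0.
console.log(distributeFramesSketch(100, 4, 6).map((r) => r.workerId));
```

Because ids are handed out in loop order and the loop can only stop early, any chunk with at least one frame gets a worker 0.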

Findings

  • nit: the PR-body wording overstates fail-fast: "worker 0 aborts the chunk before any frames are captured." executeWorkerTask catches the SwiftShaderAssertionError and returns it on WorkerResult.error (parallelCoordinator.ts:274-284), and executeParallelCapture waits on Promise.all before throwing (parallelCoordinator.ts:264-274). On a real downgrade, workers 1..N-1 will keep running their capture loop; in captureStreamingStage's parallel branch (captureStreamingStage.ts:182), their frames stream straight into the encoder before worker 0's failure surfaces. The byte-identical-retry contract still holds, because the chunk-level throw discards the output and the retry overwrites it, but "aborts before any frames are captured" is imprecise. Worth a one-liner in the gate's docblock noting that the per-worker abort is task-scoped, not chunk-scoped, and that the retry contract is what carries the safety guarantee. No code change needed; clarification only.
  • nit: config?: Partial<EngineConfig> is broader than the predicate needs. config?: Partial<Pick<EngineConfig, "browserGpuMode">> would tighten the contract and let future refactors of unrelated EngineConfig fields skip recompiling the test. The current callers all pass full configs, so this is purely a typing nit.
  • nit: no integration test pins the wiring (i.e., that executeWorkerTask actually consults shouldVerifyWorkerGpu instead of an inline check). The predicate test alone wouldn't catch a regression where the if-statement at parallelCoordinator.ts:232 reverts to config?.browserGpuMode === "software". Acceptable given the one-line call site, but worth a follow-up if this surface grows.
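The typing nit could look roughly like this. The `EngineConfig` below is a hypothetical stand-in: `chunkWorkerCount` and `outputDir` are invented fields included only to show what `Pick` leaves out, and `GpuGateConfig` is an assumed alias name.

```typescript
// Hypothetical EngineConfig; the extra fields only show what Pick excludes.
interface EngineConfig {
  browserGpuMode: "software" | "hardware";
  chunkWorkerCount: number;
  outputDir: string;
}

// Narrowed parameter type: the predicate can only see browserGpuMode.
type GpuGateConfig = Partial<Pick<EngineConfig, "browserGpuMode">>;

function shouldVerifyWorkerGpu(workerId: number, config?: GpuGateConfig): boolean {
  return config?.browserGpuMode === "software" && workerId === 0;
}

// A full config is still assignable, so callers passing full configs need no change.
const full: EngineConfig = {
  browserGpuMode: "software",
  chunkWorkerCount: 6,
  outputDir: "/tmp/out",
};
console.log(shouldVerifyWorkerGpu(0, full)); // true
console.log(shouldVerifyWorkerGpu(0, { browserGpuMode: "hardware" })); // false
```

With the narrowed type, tests can pass a one-field object literal and do not depend on the rest of the config shape.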

Verdict: APPROVE
Reasoning: Correct design, evidence-backed perf justification, predicate is small and tested. The fail-fast wording in the PR body is a nit on the description, not on the code; the byte-identical-retry contract is preserved by executeParallelCapture throwing on any worker error before encode succeeds.

Review by Vai

@jrusso1020 jrusso1020 merged commit 17f47f3 into main May 19, 2026
40 checks passed
@jrusso1020 jrusso1020 deleted the 05-19-fix_distributed_gate_per_worker_swiftshader_probe_to_worker_0 branch May 19, 2026 02:30
Development

Successfully merging this pull request may close these issues.

perf(distributed): c=3 worst-case regression — per-worker SwiftShader probe contention at 6 workers/pod