fix(distributed): gate per-worker SwiftShader probe to worker 0 only by jrusso1020 · Pull Request #956 · heygen-com/hyperframes

jrusso1020 · 2026-05-19T02:06:14Z

Summary

Closes #955.

After #916 moved assertSwiftShader from renderChunk()'s eager probe session into executeWorkerTask, every parallel worker began running its own chrome://gpu / canvas-WebGL probe. At chunkWorkerCount=6 (texture-launch at chunks=3) that's 6 concurrent CDP page-loads per chunk × 3 chunks = 18 simultaneous probes hitting the dev fleet at once.

Bench data on dev (12 producer pods × 22 vCPU, --chunks 3,6,8,12 --iterations 5, sidecar v0.6.16 = post-#916 / pre-this PR):

chunks	chunkWorkerCount	worst total	median total	std
3	6	67.3s	52.9s	7.5s
6	3	42.6s	41.3s	0.6s
8	2	41.9s	40.0s	1.1s
12	1	38.9s	38.6s	0.2s

c=3 is 24.7s above c=6 worst (well above the 10s "real regression" threshold). The slow iters at c=3 show pod_total inflating from ~100s to ~147s uniformly across all three chunks per iter — that's the signature of cluster-level CDP contention rather than within-pod contention.

Fix

Workers within a chunk share the same Chrome binary, flags, and OS/driver state on a single pod, so worker 0's success is representative. A small helper shouldVerifyWorkerGpu(workerId, config) returns true iff browserGpuMode === "software" && workerId === 0; executeWorkerTask uses it instead of the inline check. Workers 1..N-1 skip the probe entirely.

The fail-fast contract still holds at the chunk level: if SwiftShader didn't load on a pod, worker 0 aborts the chunk before any frames are captured. Workers 1..N-1 piggy-back on that guarantee.

Expected impact

c=3 worst-case should drop from ~67s into the c=6 cluster (~42-44s).
c=6 / c=8 should see smaller wins.
c=12 is unaffected (sequential branch, no parallel workers).

Will validate on dev after release + DEV_DEPLOY and update #955 with the post-fix bench numbers.

Test plan

4 new unit cases on shouldVerifyWorkerGpu (worker-0/software → true, non-zero workers / non-software config / undefined config → false). bun run test src/services/parallelCoordinator.test.ts passes 11/11.
bun run --cwd packages/engine build and bun run --cwd packages/producer build clean.
bunx oxlint + bunx oxfmt --check clean on the three touched files.
Docker golden baselines unchanged (the gate doesn't alter the captured frames; it just skips a probe page-load on workers 1..N-1).

🤖 Generated with Claude Code

After #916 moved `assertSwiftShader` from `renderChunk()`'s eager probe session into `executeWorkerTask`, every parallel worker began running its own `chrome://gpu` / canvas-WebGL probe. At `chunkWorkerCount=6` (texture launch at chunks=3) that's 6 concurrent CDP page-loads per chunk × 3 chunks = 18 simultaneous probes. Bench data on dev (12 producer pods × 22 vCPU) showed c=3 worst-case wall-clock at 67.3s, 24.7s above c=6 worst (42.6s) — pod_total inflates 100s → 147s uniformly across all three chunks per slow iter, the signature of cluster-level CDP contention rather than within-pod contention. Workers within a chunk share the same Chrome binary, flags, and OS/driver state on a single pod, so worker 0's success is representative for the rest. Gate the probe via `shouldVerifyWorkerGpu(workerId, config)` so only worker 0 navigates to the probe page; workers 1..N-1 skip it. The fail-fast contract still holds at the chunk level (worker 0 still aborts the chunk if SwiftShader didn't load) — just without the concurrent CDP traffic. Expected wall-clock impact: c=3 worst drops from ~67s to in line with c=6 worst (~42-44s). c=6 (3 workers/pod) and c=8 (2 workers/pod) should see smaller wins; c=12 (1 worker/pod, sequential branch) is unaffected. Closes #955.

miguel-heygen

APPROVE — clean fix for a real performance regression.

Verified:

shouldVerifyWorkerGpu(workerId, config) correctly gates SwiftShader probe to worker 0 only in software GPU mode
workerId is assigned sequentially from 0 in distributeFrames, so worker 0 is always the first in every chunk
Sequential branch (chunkWorkerCount === 1) still probes unconditionally — correct
Partial<EngineConfig> | undefined parameter handling is sound (optional chaining)
4 test cases cover the essential matrix (worker 0/non-zero × software/hardware/undefined)
Workers 1..N-1 that start before worker 0's probe completes will have their frames discarded on failure — same behavior as before, no regression

The bench data showing ~24s contention at c=3 from 18 simultaneous probe navigations makes the fix well-motivated.

jrusso1020

Reviewed the diff and the full files (parallelCoordinator.ts, parallelCoordinator.test.ts, renderChunk.ts) plus the assertSwiftShader contract and the surrounding chunk/worker plumbing. The fix does what the PR body claims; gate logic is correct and the unit tests cover the helper well. One stale-comment nit and one minor behavioral observation below.

Correctness

shouldVerifyWorkerGpu(workerId, config) returns true iff browserGpuMode === "software" && workerId === 0. Matches the PR description.
Fail-fast at chunk granularity is preserved: when worker 0 hits a non-SwiftShader backend it throws SwiftShaderAssertionError before initializeSession — no composition frames are captured on worker 0. executeParallelCapture collects that as a result with error, then throws [Parallel] Capture failed: …, which the chunk treats as a hard failure (and the worker adapter classifies as non-retryable via BROWSER_GPU_NOT_SOFTWARE). The byte-identical-retry contract therefore still holds: workers 1..N-1 may have captured frames, but the chunk is discarded.
Workers within a chunk share the same Chrome binary, flags, OS image, and SwiftShader libs (single pod, single launcher), so worker 0's verification is genuinely representative. Per-worker drift would require a SwiftShader downgrade mid-chunk, which the current launch model can't produce.
Sequential branch in renderChunk.ts (around line 492, chunkWorkerCount === 1) is untouched — it still calls assertSwiftShader against its own probe session before initializeSession. Good.

Minor: workers 1..N-1 don't short-circuit on worker-0 failure

When worker 0 throws the assertion, executeParallelCapture uses Promise.all over executeWorkerTask calls that catch internally and return {error}. Workers 1..N-1 keep running to completion before the chunk-level error surfaces — wasted capture work on the same pod when SwiftShader is downgraded. Not a correctness issue (output is discarded), and there's no analogous pre-existing case in the c=1 branch (no siblings), so this is fine to defer. Flagging only because the new comment block on lines 220-234 emphasizes "worker 0 aborts before frames are captured" without noting that the other workers don't.

If you want to short-circuit later, the lightweight path is wiring an AbortController through signal and aborting from worker 0's failure inside executeWorkerTask — but that's a follow-up, not a blocker.

Nit: stale inline comment in renderChunk.ts

The upstream comment at lines 469-477 was updated to reflect "worker 0 only", but the inline comment block at lines 509-511 was missed:

hyperframes/packages/producer/src/services/distributed/renderChunk.ts

Lines 509 to 511 in da46de6

    
             // at a `frameTimeTicks` it had just advanced to. 
        
           } 
        
           // chunkWorkerCount > 1: skip the probe entirely. Each parallel worker

// chunkWorkerCount > 1: skip the probe entirely. Each parallel worker
// creates its own session and runs `assertSwiftShader` before its
// first frame.

After this PR that's no longer true — only worker 0 runs assertSwiftShader. Suggest tweaking the second sentence to something like "…each parallel worker creates its own session; worker 0 runs assertSwiftShader before its first frame (workers 1..N-1 piggy-back on that verification — see parallelCoordinator.ts:shouldVerifyWorkerGpu)."

Tests

The 4 new shouldVerifyWorkerGpu cases cover the matrix (worker-0 + software → true; non-zero workers + software → false; hardware/empty config → false; undefined config → false). That's the right coverage for a pure helper; the wired-up behavior in executeWorkerTask is covered by the Docker golden-baseline shards as noted.

Verdict

COMMENT — the gate logic is right and ships safely. Only the stale comment in renderChunk.ts lines 509-511 is worth fixing; happy for it to land as a follow-up or as a single-line tweak before merge.

vanceingalls

Clean, narrowly-scoped perf fix that lines up with the bench data in the PR body. The design — one SwiftShader probe per chunk gated to worker 0 — is the right call: workers within a chunk share the same Chrome binary, flags, and OS/driver state, so worker 0's verdict is representative. Tests pin the predicate's behavior across the four interesting axes (worker id × config presence × GPU mode), including the undefined-config edge case.

Calibrated strengths

Pure-predicate factoring (shouldVerifyWorkerGpu at packages/engine/src/services/parallelCoordinator.ts:189) keeps executeWorkerTask readable and makes the rule unit-testable without spinning a Puppeteer page. The 4-case test (parallelCoordinator.test.ts:79-101) covers the right corners.
The renderChunk comment at packages/producer/src/services/distributed/renderChunk.ts:469-477 is updated in lockstep with the gate semantics so the eager-pre-probe rationale stays accurate. Easy thing to forget; nice catch.
distributeFrames guarantees a workerId === 0 for every non-empty chunk (sequential for (let i = 0; ...) with the startFrame >= rangeStart + totalFrames short-circuit), so the gate is structurally safe — no risk of a chunk where worker 0 is missing.

Findings

nit — PR-body wording overstates fail-fast: "worker 0 aborts the chunk before any frames are captured." executeWorkerTask catches the SwiftShaderAssertionError and returns it on WorkerResult.error (parallelCoordinator.ts:274-284), and executeParallelCapture waits on Promise.all before throwing (parallelCoordinator.ts:264-274). On a real downgrade, workers 1..N-1 will keep running their capture loop; in captureStreamingStage's parallel branch (captureStreamingStage.ts:182), their frames stream straight into the encoder before worker 0's failure surfaces. The byte-identical-retry contract still holds — the chunk-level throw discards the output and the retry overwrites it — but "aborts before any frames are captured" is imprecise. Worth a one-liner in the gate's docblock noting that the per-worker abort is task-scoped, not chunk-scoped, and the retry contract is what carries the safety guarantee. No code change needed; clarification only.
nit — config?: Partial<EngineConfig> is broader than the predicate needs. config?: Partial<Pick<EngineConfig, \"browserGpuMode\">> would tighten the contract and let future refactors of unrelated EngineConfig fields skip recompiling the test. The current callers all pass full configs, so this is purely a typing nit.
nit — no integration test pins the wiring (i.e., executeWorkerTask actually consults shouldVerifyWorkerGpu instead of an inline check). The predicate test alone wouldn't catch a regression where the if-statement at parallelCoordinator.ts:232 reverts to config?.browserGpuMode === \"software\". Acceptable given the one-line call site, but worth a follow-up if this surface grows.

Verdict: APPROVE
Reasoning: Correct design, evidence-backed perf justification, predicate is small and tested. The fail-fast wording in the PR body is a nit on the description, not on the code; the byte-identical-retry contract is preserved by executeParallelCapture throwing on any worker error before encode succeeds.

Review by Vai

miguel-heygen approved these changes May 19, 2026

View reviewed changes

jrusso1020 commented May 19, 2026

View reviewed changes

vanceingalls approved these changes May 19, 2026

View reviewed changes

jrusso1020 merged commit 17f47f3 into main May 19, 2026
40 checks passed

jrusso1020 deleted the 05-19-fix_distributed_gate_per_worker_swiftshader_probe_to_worker_0 branch May 19, 2026 02:30

vanceingalls mentioned this pull request May 19, 2026

refactor: delete orphan declarations flagged by fallow #949

Merged

jrusso1020 mentioned this pull request May 19, 2026

fix(producer): treat 4xx from Google Fonts as deterministic "not served", not as failClosed trigger #957

Merged

vanceingalls mentioned this pull request May 19, 2026

docs(lambda): document webm support + simplify-review fixes #953

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(distributed): gate per-worker SwiftShader probe to worker 0 only#956

fix(distributed): gate per-worker SwiftShader probe to worker 0 only#956
jrusso1020 merged 1 commit into
mainfrom
05-19-fix_distributed_gate_per_worker_swiftshader_probe_to_worker_0

jrusso1020 commented May 19, 2026

Uh oh!

miguel-heygen left a comment

Uh oh!

jrusso1020 left a comment

Uh oh!

vanceingalls left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	// at a `frameTimeTicks` it had just advanced to.
	}
	// chunkWorkerCount > 1: skip the probe entirely. Each parallel worker

Conversation

jrusso1020 commented May 19, 2026

Summary

Fix

Expected impact

Test plan

Uh oh!

miguel-heygen left a comment

Choose a reason for hiding this comment

Uh oh!

jrusso1020 left a comment

Choose a reason for hiding this comment

Correctness

Minor: workers 1..N-1 don't short-circuit on worker-0 failure

Nit: stale inline comment in renderChunk.ts

Tests

Verdict

Uh oh!

vanceingalls left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants