perf(distributed): parallelize chunk capture across multiple workers by jrusso1020 · Pull Request #906 · heygen-com/hyperframes

jrusso1020 · 2026-05-16T22:08:02Z

Summary

The distributed renderChunk primitive hardcoded workerCount: 1 and captureStage explicitly forbade workerCount > 1 when frameRange was set, with the comment:

"Distributed chunk workers fan out at the activity layer; reduce workerCount to 1 when passing frameRange."

The assumption was that orchestration-layer fan-out (Temporal / Lambda / K8s Jobs / SSH) saturates available CPU on its own. In practice, adopters that deploy chunks onto multi-core hosts (8–24 vCPU is the standard producer-worker pod sizing) end up pinning only ~3-4 cores per chunk while the rest sit idle: chunk-level fan-out at the orchestration layer gives each pod one chunk at a time, but the chunk render itself is single-threaded.

This PR lifts the restriction by plumbing the chunk's frame range through the parallel-capture path. The frameRange field already existed on runCaptureStage's input; it just wasn't honored by the parallel branch. Now both branches produce a byte-equivalent framesDir (contiguous 0-indexed frame_<i>.{ext} within the chunk's range), and chunks pick up calculateOptimalWorkers auto-sizing the same way the in-process renderer already does.

Validation

Measured against a real 1080p / 30fps / 22-second shader-heavy composition on a 22 vCPU Temporal producer-worker pod. Tested before/after the fix with the same composition, same fixture, same pod sizing:

| | Before (`workerCount: 1`) | After (auto) | Δ |
|---|---|---|---|
| In-process baseline | 83.8s | 83.8s | n/a |
| Distributed `chunks=1` | 117.0s | 90.6s | -23% |
| Distributed `chunks=4` | 70.3s | 59.3s | -16% wall |
| chunk p50 | 40.3s | 35.9s | -11% |
| chunk p95 (gates wall) | 63.6s | 53.2s | -16% |
| chunk total pod-time | 123.8s | 110.9s | -10% |

Distributed chunks=4 is now 29% faster than in-process on this real workload — the first wall-clock win for the distributed path on per-frame-heavy compositions ≤1 min.
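For readers checking the deltas, here is a minimal sketch of the arithmetic. The helper name `percentChange` is made up for illustration; the inputs are the wall-clock seconds from the table.

```typescript
// Relative change from `before` to `after`, in percent (negative means faster).
// `percentChange` is a hypothetical helper, not part of the codebase.
function percentChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

// Wall-clock seconds copied from the validation table.
const chunks1Delta = percentChange(117.0, 90.6); // about -22.6
const chunks4Delta = percentChange(70.3, 59.3); // about -15.6
const vsInProcess = percentChange(83.8, 59.3); // about -29.2
```

The last value is the source of the "29% faster than in-process" figure: it compares the distributed `chunks=4` wall time after the change against the in-process baseline.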

The deployed bundle's runtime confirms the auto-sized count via the new chunkWorkerCount log: on this 22 vCPU pod the chunk now picks 6 workers (the defaultSafeMaxWorkers ceiling for that core count) instead of the prior hardcoded 1.

Wire-up

WorkerTask.outputFrameOffset — optional offset subtracted from the absolute frame index when computing the captured file's name. Default 0 (in-process contract; file name == absolute index). Distributed chunks set this to the chunk's startFrame so file names land 0-indexed within the chunk's range, matching the sequential chunk-capture contract and the encoder's expectation that frames are read sequentially without an -start_number override.
distributeFrames(totalFrames, workerCount, workDir, rangeStart=0) — offsets both startFrame/endFrame (used for per-frame time math on the page's virtual clock) by rangeStart, and threads outputFrameOffset = rangeStart onto each task. With rangeStart=0 it is a no-op for in-process renders.
executeWorkerTask — uses i - (task.outputFrameOffset ?? 0) for the captured file name, leaving the per-frame TIME computation (i * fps.den) / fps.num untouched so the page's virtual clock is unchanged. The streaming callback (onFrameBuffer) still receives the absolute index i so the streaming encoder sequences frames against the composition's timeline; only the disk file name uses the offset.
executeDiskCaptureWithAdaptiveRetry({ frameRangeStart? }) — accepts the chunk's absolute startFrame and forwards it to distributeFrames and buildMissingFrameRetryBatches. Default undefined preserves the in-process contract.
buildMissingFrameRetryBatches(ranges, ..., rangeStart=0) — findMissingFrameRanges walks LOCAL 0-indexed file names; the retry batch translates the local missing-range pair back to ABSOLUTE composition indices for WorkerTask.startFrame/endFrame and sets outputFrameOffset = rangeStart so the retried capture writes back to the same local file name.
captureStage — drops the assert; passes frameRangeStart: frameRange?.startFrame to the parallel branch so workers land on absolute composition frame indices for time math while file names stay 0-indexed within the chunk range. Docstring updated to reflect that the parallel branch is now supported.
renderChunk — workerCount: 1 → workerCount: calculateOptimalWorkers(framesInChunk, undefined, cfg). Matches the in-process renderer's worker selection (resolveRenderWorkerCount → calculateOptimalWorkers) minus the capture-cost calibration reduction, which would require plumbing the chunk's compiled metadata through and is left as a follow-up (current code degrades gracefully — heavy shader chunks will still fan out at the configured worker count; the existing adaptive-retry path in executeDiskCaptureWithAdaptiveRetry reduces workers if compositor contention surfaces as CDP timeouts).
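A minimal sketch of the offset contract described above, under stated assumptions. The field names `startFrame`, `endFrame` and `outputFrameOffset` come from the PR text. The even-split partitioning, the end-exclusive `endFrame`, the omitted `workDir` parameter and the helper names `distributeFramesSketch` and `planTask` are illustrative only and may not match the real implementation.

```typescript
// Hypothetical shapes; only the three frame fields are named in the PR.
interface WorkerTaskSketch {
  workerId: number;
  startFrame: number; // absolute composition index, inclusive
  endFrame: number; // absolute composition index, exclusive (assumed)
  outputFrameOffset?: number; // subtracted from the absolute index for the file name
}

interface Fps {
  num: number;
  den: number;
}

// Split `totalFrames` into contiguous per-worker tasks, shifted by `rangeStart`.
// With rangeStart = 0 the tasks are what an in-process render would get.
function distributeFramesSketch(
  totalFrames: number,
  workerCount: number,
  rangeStart = 0,
): WorkerTaskSketch[] {
  const tasks: WorkerTaskSketch[] = [];
  const perWorker = Math.ceil(totalFrames / workerCount);
  for (let w = 0; w < workerCount; w++) {
    const localStart = w * perWorker;
    const localEnd = Math.min(localStart + perWorker, totalFrames);
    if (localStart >= localEnd) break; // more workers than frames
    tasks.push({
      workerId: w,
      startFrame: rangeStart + localStart,
      endFrame: rangeStart + localEnd,
      outputFrameOffset: rangeStart,
    });
  }
  return tasks;
}

// For each frame in a task: the local file index and the virtual-clock time.
// The time uses the absolute index i; only the file index applies the offset.
function planTask(
  task: WorkerTaskSketch,
  fps: Fps,
): Array<{ fileIndex: number; timeSec: number }> {
  const out: Array<{ fileIndex: number; timeSec: number }> = [];
  for (let i = task.startFrame; i < task.endFrame; i++) {
    out.push({
      fileIndex: i - (task.outputFrameOffset ?? 0),
      timeSec: (i * fps.den) / fps.num,
    });
  }
  return out;
}
```

In this sketch a 100-frame chunk starting at composition frame 50 writes file indices 0 through 99, while the page clock still sees frames 50 through 149.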

Backwards compatibility

Every change is gated on a parameter that defaults to the prior behavior:

In-process callers (executeRenderJob) pass no frameRangeStart, so rangeStart === 0, outputFrameOffset defaults to 0, and the file-name math collapses to i (the prior absolute-index contract).
The framesDir contract (frame_0..frame_(totalFrames-1)) and the WorkerTask interface are extended, not replaced.
The sequential chunk-capture branch in captureStage was always emitting 0-indexed local file names; the parallel branch now matches.

Follow-ups

Skip probeSession creation when workerCount > 1: the parallel branch closes the pre-warmed session during stage entry, so the ~3-5s warmup is wasted under auto-sizing. Deferred to a separate change.
Wire shader-cost calibration through to chunks — pass the chunk's compiled metadata (hasShaderTransitions, renderModeHints) so chunks pick up the same reduction the in-process path uses for high-compositor-cost compositions.
Chunk-level work balancing — the texture-launch benchmark showed chunk1 took 53s while chunk2 took 16s (the slowest chunk gates wall time). Smarter partitioning that balances per-frame cost — not just frame count — would close more of the gap to the theoretical N×-speedup ceiling.

Test plan

bun test packages/engine/src/services/parallelCoordinator.test.ts — 7/7 pass
bun test packages/producer/src/services/distributed/{renderChunk,plan}.test.ts — 24/24 pass (full distributed suite green — see renderChunk.test.ts)
bun test packages/producer/src/services/renderOrchestrator.test.ts — 56/57 (the one fail is a pre-existing Windows-only path-escape test, unrelated to this change)
bunx oxlint + bunx oxfmt --check on changed files — clean
Live end-to-end against dev Temporal Cloud: in-process baseline + chunks=1,4 distributed runs all produce valid mp4s with identical frame counts and resolutions. Numbers above.

The distributed `renderChunk` primitive hardcoded `workerCount: 1` and `captureStage` explicitly forbade `workerCount > 1` when `frameRange` was set, with the comment: "Distributed chunk workers fan out at the activity layer; reduce workerCount to 1 when passing frameRange." The assumption was that orchestration-layer fan-out (Temporal / Lambda / K8s Jobs / SSH) saturates the available CPU on its own. In practice adopters that deploy chunks onto multi-core hosts (8-24 vCPU is the standard producer-worker pod sizing) end up pinning only ~3-4 cores per chunk while the rest sit idle: chunk-level fan-out at the orchestration layer gives each pod one chunk at a time, and the chunk render itself was single-threaded.

Validated against a real 1080p / 30fps / 22-second shader-heavy composition on a 22-vCPU Temporal pod: each chunk rendered at 165-273ms per frame (vs 94-98ms for the in-process streaming render which runs `workerCount=2` by default). The slowest chunk gates total wall-clock under parallel chunk fan-out, so the 2-3x per-frame gap compounds and `distributed` was net-slower than `in-process` on every composition smaller than ~5min of texture-class content. Lifting the restriction is a measured ~2x per-chunk speedup with no contract change at the framesDir or encoder layer.

Wire-up:

* `WorkerTask.outputFrameOffset`: optional offset subtracted from the absolute frame index when computing the captured file's name. Default 0 (the in-process contract; file name == absolute index). Distributed chunks set this to the chunk's startFrame so file names land 0-indexed within the chunk's range, matching the sequential chunk-capture contract and the encoder's expectation that frames are read sequentially without an `-start_number` override.
* `distributeFrames(totalFrames, workerCount, workDir, rangeStart=0)`: offsets both `startFrame`/`endFrame` (used for per-frame time math on the page's virtual clock) by `rangeStart`, and threads `outputFrameOffset = rangeStart` onto each task it emits. With the default `rangeStart=0` it is a no-op for in-process renders.
* `executeWorkerTask`: uses `i - (task.outputFrameOffset ?? 0)` for the captured file name, leaving the per-frame TIME computation `(i * fps.den) / fps.num` untouched so the page's virtual clock is unchanged.
* `executeDiskCaptureWithAdaptiveRetry({ frameRangeStart? })`: accepts the chunk's absolute startFrame and forwards it to `distributeFrames` and `buildMissingFrameRetryBatches`. Default `undefined` preserves the in-process contract.
* `buildMissingFrameRetryBatches(ranges, ..., rangeStart=0)`: `findMissingFrameRanges` walks LOCAL 0-indexed file names; the retry batch translates the local missing-range pair back to ABSOLUTE composition indices for `WorkerTask.startFrame/endFrame` and sets `outputFrameOffset = rangeStart` so the retried capture writes back to the same local file name.
* `captureStage`: drops the assert; passes `frameRangeStart: frameRange?.startFrame` to the parallel branch so workers land on absolute composition frame indices for time math while file names stay 0-indexed within the chunk range. Docstring updated to reflect that the parallel branch is now supported.
* `renderChunk`: `workerCount: 1` → `workerCount: 2`. The pre-warmed `probeSession` is consumed only by the sequential branch; the parallel branch closes it during stage entry and creates its own worker sessions. Documented as a follow-up: skip probeSession creation when `workerCount > 1` to recover the ~3-5s warmup cost.

Backwards compatibility: every change is gated on a parameter that defaults to the prior behavior. In-process callers (`executeRenderJob`) pass no `frameRangeStart`, so `rangeStart === 0`, `outputFrameOffset` defaults to 0, and the file-name math collapses to the prior `i` value. The framesDir contract (`frame_0..frame_(totalFrames-1)`) and the WorkerTask interface are extended, not replaced.

Tests: 24 pass / 0 fail across the distributed test suite (renderChunk, plan, assemble, planFormatBanlist, planSizeCap, publicExports). 7 pass / 0 fail in `parallelCoordinator.test.ts`. The renderOrchestrator suite has one pre-existing Windows-only failure (`writeCompiledArtifacts — external assets on Windows drive-letter paths`) unrelated to this change; the other 56 tests pass.

Refs: distributed-vs-inprocess benchmark thread at heygen-com/experiment-framework#36950

…rkers

Match the in-process renderer's worker selection instead of hardcoding 2. `calculateOptimalWorkers(framesInChunk, undefined, cfg)` is the same call `resolveRenderWorkerCount` makes under the hood, minus the capture-cost calibration reduction (which would require plumbing the chunk's compiled metadata through; left as a follow-up).

For a typical 22-vCPU producer-worker pod with `cfg.concurrency: "auto"` this resolves to ~6 workers for a 240-frame chunk (capped by `defaultSafeMaxWorkers() = max(6, min(16, floor(cpuCount/8)))`), matching what `executeRenderJob` (the in-process path) already does. The prior hardcoded `workerCount: 2` was a safe-minimum starting point that undersized chunks vs prod's auto behavior.

Tests: 12/12 pass in `renderChunk.test.ts` (unchanged: the test suite mocks the inner runCaptureStage call so workerCount selection is opaque to it).
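As a worked check of the ceiling quoted in this commit message, the sketch below evaluates the expression exactly as written there. `safeMaxWorkersAsQuoted` is a hypothetical stand-in; the real `defaultSafeMaxWorkers` may read the CPU count differently or apply other limits.

```typescript
// Evaluates max(6, min(16, floor(cpuCount / 8))) as quoted in the commit message.
// Hypothetical stand-in, not the actual defaultSafeMaxWorkers implementation.
function safeMaxWorkersAsQuoted(cpuCount: number): number {
  return Math.max(6, Math.min(16, Math.floor(cpuCount / 8)));
}

const on22Cores = safeMaxWorkersAsQuoted(22); // floor(22/8) = 2, min(16, 2) = 2, max(6, 2) = 6
const on64Cores = safeMaxWorkersAsQuoted(64); // floor(64/8) = 8, so 8
const on256Cores = safeMaxWorkersAsQuoted(256); // floor(256/8) = 32, clamped to 16
```

Taken literally, the expression returns 6 for any host below 56 cores, which is consistent with the 6 workers reported for the 22-vCPU pod.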

Review pass on the parallel-capture frame-range change. Four targeted cleanups identified by code-quality and efficiency review agents:

1. Add the missing `frameRange.endFrame - frameRange.startFrame === totalFrames` assert. The parallel branch forwards `totalFrames` separately from `frameRangeStart`; a caller passing mismatched values would have got a silently wrong distribution. The sequential branch already implicitly relied on this via its `rangeFrames = rangeEnd - rangeStart` arithmetic.
2. Collapse three near-duplicate docstrings (on `WorkerTask.outputFrameOffset`, `executeDiskCaptureWithAdaptiveRetry.frameRangeStart`, and `runCaptureStage`'s `frameRange`) so only the WorkerTask field carries the full contract. The other two cross-reference it.
3. Drop the WHAT-narrating comments inside `executeWorkerTask`'s per-frame loop. The variable names (`fileFrameIdx = i - outputOffset`) already say what the line does; the only remaining comment flags the non-obvious contract that the streaming callback gets the absolute index.
4. Trim the 30-line `chunkWorkerCount` block in `renderChunk` to one paragraph explaining the one non-obvious thing (why we use `calculateOptimalWorkers` directly instead of `resolveRenderWorkerCount`). The probeSession-wasted-on-parallel acknowledgement stays as a 3-line follow-up flag. Skipping it was investigated in this pass, but the SwiftShader probe is safety-critical and has no per-worker equivalent, so it is deferred to a separate change with proper per-worker assertion plumbing.

Tests + format + lint clean:

* `bun test parallelCoordinator.test.ts`: 7/7
* `bun test distributed/{renderChunk,plan}.test.ts`: 24/24
* `bunx oxfmt` + `bunx oxlint`: clean
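A minimal sketch of the kind of precondition the first cleanup describes, assuming an end-exclusive `frameRange`. The function name, error messages and the integer check are illustrative assumptions; the commit message only states the size-equals-`totalFrames` assert.

```typescript
// Hypothetical shape; the PR only names startFrame and endFrame.
interface ChunkFrameRange {
  startFrame: number;
  endFrame: number; // assumed exclusive, so endFrame - startFrame is the frame count
}

// Throws if the range cannot describe exactly `totalFrames` frames.
// The integer check is an extra hardening assumed here, not taken from the commit.
function assertFrameRangeMatches(range: ChunkFrameRange, totalFrames: number): void {
  const { startFrame, endFrame } = range;
  if (!Number.isInteger(startFrame) || !Number.isInteger(endFrame)) {
    throw new RangeError(`frameRange bounds must be integers, got [${startFrame}, ${endFrame})`);
  }
  if (startFrame < 0 || endFrame <= startFrame) {
    throw new RangeError(`frameRange must be non-negative and non-empty, got [${startFrame}, ${endFrame})`);
  }
  if (endFrame - startFrame !== totalFrames) {
    throw new RangeError(
      `frameRange covers ${endFrame - startFrame} frames but totalFrames is ${totalFrames}`,
    );
  }
}
```

Failing loudly here matters because `totalFrames` and the range start travel separately into the parallel branch, so a mismatch would otherwise only show up as misnamed or missing frame files.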

vanceingalls

One-line summary: lifts the workerCount: 1 hardcode on distributed chunks by plumbing outputFrameOffset / frameRangeStart through the parallel-capture path so chunk workers can fan out the same way the in-process renderer does, with the file-name vs. absolute-time contract preserved and backwards-compat gated on default-0 parameters.

Strengths

captureStage.ts:154-170 — the new frameRange.endFrame - frameRange.startFrame === totalFrames precondition is exactly the symmetry check the comment calls out: totalFrames drives distributeFrames partitioning AND findMissingFrameRanges completion checks, so any caller that desynchronizes them gets a loud error instead of a silently wrong distribution. Right place, right wording.
renderOrchestrator.ts:891-912 — clean handling of the local-vs-absolute split: WorkerTask.startFrame/endFrame go absolute for time math, outputFrameOffset = rangeStart writes back to the local file name findMissingFrameRanges is looking for. The contract that retries land on the same local file as the initial capture is preserved.
renderChunk.ts:540-549 — deliberate scope: leaving shader-cost calibration off the chunk path (rather than half-baking it) is the right call, and the (framesInChunk, undefined, cfg) shape makes the follow-up trivial once PlanJson carries the compiled hints.

Findings

important — no unit coverage for the new contract. The PR adds two new offset parameters that are load-bearing in three different files, and the existing tests (parallelCoordinator.test.ts:4-44, renderOrchestrator.test.ts:735-753) all run with rangeStart=0 / frameRangeStart=undefined. The correctness story rests on a quietly load-bearing invariant — worker output files named frame_(i - outputFrameOffset) must align with findMissingFrameRanges's 0-indexed walk over [0, totalFrames) — and nothing pins it. The Temporal end-to-end verified one trajectory; a regression elsewhere that unsets outputFrameOffset on the retry path would still pass this PR but break chunks on every host. Worth pinning at least: distributeFrames(100, 4, dir, 50) produces outputFrameOffset=50 and startFrame/endFrame shifted, and buildMissingFrameRetryBatches([{0,5},{10,15}], 2, dir, 0, 50) produces absolute-shifted ranges with outputFrameOffset=50.

important — skipping captureCostMultiplier for chunks is a known cost-tail. calculateOptimalWorkers(framesInChunk, undefined, cfg) at renderChunk.ts:540 skips the reduction resolveRenderWorkerCount applies in-process. For shader-heavy compositions this means chunks fan out at N workers, hit compositor contention as CDP timeouts, then halve workers via adaptive retry — each failed attempt costs ~chunk-duration. Acknowledged as follow-up #2 in the body, but worth opening as a tracked ticket so it doesn't bit-rot — the texture-launch benchmark you cited (chunk1 53s / chunk2 16s) is already that workload class.

nit — non-integer inputs aren't rejected. frameRange validation in captureStage.ts:148-170 covers finiteness, non-negative, and the size-equals-totalFrames invariant, but doesn't require startFrame/endFrame to be integers. A caller passing { startFrame: 1.5, endFrame: framesInChunk + 1.5 } would produce off-by-fractional outputFrameOffset and silently wrong file names. The current call site in renderChunk.ts reads from slice so this is theoretical, but the type is exported and worth hardening — Number.isInteger on both ends, same place.

nit — captureFrameToBuffer argument is for diagnostics only. At parallelCoordinator.ts:195, captureFrameToBuffer(session, fileFrameIdx, time) passes the LOCAL index to a function whose frameIndex parameter feeds only captureFrameErrorDiagnostics. The comment one line up says the streaming path uses the absolute index for the encoder — true at onFrameBuffer(i, buffer) — but the diagnostics-only fileFrameIdx choice ends up with error JSON labeled with the local index, which is slightly harder to correlate with composition logs. Trivial; flag if you tweak this for any other reason.
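The offset invariant from the first finding could be pinned with something along these lines. This is a standalone sketch, not the project's code: `findMissingLocalRanges` and `toAbsoluteRetryTasks` are hypothetical simplifications of `findMissingFrameRanges` and `buildMissingFrameRetryBatches`, which take more parameters (batching, work directory) than shown, and the end-exclusive ranges are an assumption.

```typescript
// Local, 0-indexed range of frame files; end-exclusive (assumed).
interface LocalRange {
  startFrame: number;
  endFrame: number;
}

// Retry task in absolute composition indices plus the file-name offset.
interface RetryTaskSketch extends LocalRange {
  outputFrameOffset: number;
}

// Walk local file indices [0, totalFrames) and group consecutive gaps.
function findMissingLocalRanges(present: Set<number>, totalFrames: number): LocalRange[] {
  const missing: LocalRange[] = [];
  let runStart = -1;
  for (let i = 0; i <= totalFrames; i++) {
    const isMissing = i < totalFrames && !present.has(i);
    if (isMissing && runStart < 0) runStart = i;
    if (!isMissing && runStart >= 0) {
      missing.push({ startFrame: runStart, endFrame: i });
      runStart = -1;
    }
  }
  return missing;
}

// Shift local gaps back to absolute indices; keep the offset so a retried
// frame i is written to local file index i - rangeStart, the slot that was missing.
function toAbsoluteRetryTasks(localMissing: LocalRange[], rangeStart = 0): RetryTaskSketch[] {
  return localMissing.map((r) => ({
    startFrame: r.startFrame + rangeStart,
    endFrame: r.endFrame + rangeStart,
    outputFrameOffset: rangeStart,
  }));
}
```

With local gaps `[0, 5)` and `[10, 15)` and `rangeStart = 50`, the retry tasks cover absolute frames `[50, 55)` and `[60, 65)` with `outputFrameOffset = 50`, so each retried frame lands back on the local file index the gap walk reported.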

Verdict: APPROVE
Reasoning: Correctness story holds — the local-file-name vs. absolute-time-math split is symmetric across the new parameters, the new precondition catches the obvious caller-misuse case, and the in-process contract is preserved by default-0 parameters. The two important findings are test-coverage and a known cost-tail, neither blocking the perf win this PR ships.

Review by Vai

miguel-heygen

Approved — clean extraction of the offset plumbing that accidentally landed in #903. Vai's review is thorough, agree with the two important follow-ups (unit coverage for offset contract, captureCostMultiplier passthrough). No blockers from my side.

jrusso1020 added 3 commits May 16, 2026 19:20

vanceingalls approved these changes May 16, 2026


miguel-heygen approved these changes May 16, 2026


jrusso1020 merged commit 22363a1 into main May 16, 2026
44 checks passed

jrusso1020 deleted the 05-16-feat-parallel-capture-frame-range branch May 16, 2026 23:29
