perf(distributed): parallelize chunk capture across multiple workers#906
Conversation
The distributed `renderChunk` primitive hardcoded `workerCount: 1` and
`captureStage` explicitly forbade `workerCount > 1` when `frameRange` was
set, with the comment:
"Distributed chunk workers fan out at the activity layer; reduce
workerCount to 1 when passing frameRange."
The assumption was that orchestration-layer fan-out (Temporal / Lambda /
K8s Jobs / SSH) saturates the available CPU on its own. In practice
adopters that deploy chunks onto multi-core hosts (8-24 vCPU is the
standard producer-worker pod sizing) end up pinning only ~3-4 cores per
chunk while the rest sit idle: chunk-level fan-out at the orchestration
layer gives each pod one chunk at a time, and the chunk render itself
was single-threaded.
Validated against a real 1080p / 30fps / 22-second shader-heavy
composition on a 22-vCPU Temporal pod: each chunk rendered at
165-273ms per frame (vs 94-98ms for the in-process streaming render
which runs `workerCount=2` by default). The slowest chunk gates total
wall-clock under parallel chunk fan-out, so the 2-3x per-frame gap
compounds and `distributed` was net-slower than `in-process` on every
composition smaller than ~5min of texture-class content. Lifting the
restriction is a measured ~2x per-chunk speedup with no contract
change at the framesDir or encoder layer.
Wire-up:
* `WorkerTask.outputFrameOffset` — optional offset subtracted from the
absolute frame index when computing the captured file's name.
Default 0 (the in-process contract; file name == absolute index).
Distributed chunks set this to the chunk's startFrame so file names
land 0-indexed within the chunk's range, matching the sequential
chunk-capture contract and the encoder's expectation that frames
are read sequentially without an `-start_number` override.
* `distributeFrames(totalFrames, workerCount, workDir, rangeStart=0)` —
offsets both `startFrame`/`endFrame` (used for per-frame time math
on the page's virtual clock) by `rangeStart`, and threads
`outputFrameOffset = rangeStart` onto each task it emits. With the
default `rangeStart=0` it is a no-op for in-process renders.
* `executeWorkerTask` — uses `i - (task.outputFrameOffset ?? 0)` for
the captured file name, leaving the per-frame TIME computation
`(i * fps.den) / fps.num` untouched so the page's virtual clock is
unchanged.
* `executeDiskCaptureWithAdaptiveRetry({ frameRangeStart? })` — accepts
the chunk's absolute startFrame and forwards it to `distributeFrames`
and `buildMissingFrameRetryBatches`. Default `undefined` preserves
the in-process contract.
* `buildMissingFrameRetryBatches(ranges, ..., rangeStart=0)` —
`findMissingFrameRanges` walks LOCAL 0-indexed file names; the retry
batch translates the local missing-range pair back to ABSOLUTE
composition indices for `WorkerTask.startFrame/endFrame` and sets
`outputFrameOffset = rangeStart` so the retried capture writes back
to the same local file name.
* `captureStage` — drops the assert; passes
`frameRangeStart: frameRange?.startFrame` to the parallel branch so
workers land on absolute composition frame indices for time math
while file names stay 0-indexed within the chunk range. Docstring
updated to reflect that the parallel branch is now supported.
* `renderChunk` — `workerCount: 1` → `workerCount: 2`. The pre-warmed
`probeSession` is consumed only by the sequential branch; the
parallel branch closes it during stage entry and creates its own
worker sessions. Documented as a follow-up: skip probeSession
creation when `workerCount > 1` to recover the ~3-5s warmup cost.
Backwards compatibility: every change is gated on a parameter that
defaults to the prior behavior. In-process callers (`executeRenderJob`)
pass no `frameRangeStart`, so `rangeStart === 0`, `outputFrameOffset`
defaults to 0, and the file-name math collapses to the prior `i` value.
The framesDir contract (`frame_0..frame_(totalFrames-1)`) and the
WorkerTask interface are extended, not replaced.
Tests: 24 pass / 0 fail across the distributed test suite (renderChunk,
plan, assemble, planFormatBanlist, planSizeCap, publicExports). 7 pass /
0 fail in `parallelCoordinator.test.ts`. The renderOrchestrator suite
has one pre-existing Windows-only failure
(`writeCompiledArtifacts — external assets on Windows drive-letter
paths`) unrelated to this change; the other 56 tests pass.
Refs: distributed-vs-inprocess benchmark thread at
heygen-com/experiment-framework#36950
…rkers Match the in-process renderer's worker selection instead of hardcoding 2. `calculateOptimalWorkers(framesInChunk, undefined, cfg)` is the same call `resolveRenderWorkerCount` makes under the hood, minus the capture-cost calibration reduction (which would require plumbing the chunk's compiled metadata through — left as a follow-up). For a typical 22-vCPU producer-worker pod with `cfg.concurrency: "auto"` this resolves to ~6 workers for a 240-frame chunk (capped by `defaultSafeMaxWorkers() = max(6, min(16, floor(cpuCount/8)))`), matching what `executeRenderJob` (the in-process path) already does. The prior hardcoded `workerCount: 2` was a safe-minimum starting point that undersized chunks vs prod's auto behavior. Tests: 12/12 pass in `renderChunk.test.ts` (unchanged — the test suite mocks the inner runCaptureStage call so workerCount selection is opaque to it).
Review pass on the parallel-capture frame-range change. Four targeted
cleanups identified by code-quality and efficiency review agents:
1. Add the missing `frameRange.endFrame - frameRange.startFrame === totalFrames`
assert. The parallel branch forwards `totalFrames` separately from
`frameRangeStart`; a caller passing mismatched values would have got a
silently wrong distribution. The sequential branch already implicitly
relied on this via its `rangeFrames = rangeEnd - rangeStart` arithmetic.
2. Collapse three near-duplicate docstrings (on `WorkerTask.outputFrameOffset`,
`executeDiskCaptureWithAdaptiveRetry.frameRangeStart`, and `runCaptureStage`'s
`frameRange`) so only the WorkerTask field carries the full contract. The
other two cross-reference it.
3. Drop the WHAT-narrating comments inside `executeWorkerTask`'s per-frame
loop. The variable names (`fileFrameIdx = i - outputOffset`) already say
what the line does; the only remaining comment flags the non-obvious
contract that the streaming callback gets the absolute index.
4. Trim the 30-line `chunkWorkerCount` block in `renderChunk` to one paragraph
explaining the one non-obvious thing (why we use `calculateOptimalWorkers`
directly instead of `resolveRenderWorkerCount`). The probeSession-wasted-on-
parallel acknowledgement stays as a 3-line follow-up flag — investigated
skipping it in this pass, but the SwiftShader probe is safety-critical and
has no per-worker equivalent, so deferred to a separate change with proper
per-worker assertion plumbing.
Tests + format + lint clean:
* `bun test parallelCoordinator.test.ts` — 7/7
* `bun test distributed/{renderChunk,plan}.test.ts` — 24/24
* `bunx oxfmt` + `bunx oxlint` — clean
vanceingalls
left a comment
There was a problem hiding this comment.
One-line summary: lifts the workerCount: 1 hardcode on distributed chunks by plumbing outputFrameOffset / frameRangeStart through the parallel-capture path so chunk workers can fan out the same way the in-process renderer does, with the file-name vs. absolute-time contract preserved and backwards-compat gated on default-0 parameters.
Strengths
captureStage.ts:154-170— the newframeRange.endFrame - frameRange.startFrame === totalFramesprecondition is exactly the symmetry check the comment calls out:totalFramesdrivesdistributeFramespartitioning ANDfindMissingFrameRangescompletion checks, so any caller that desynchronizes them gets a loud error instead of a silently wrong distribution. Right place, right wording.renderOrchestrator.ts:891-912— clean handling of the local-vs-absolute split:WorkerTask.startFrame/endFramego absolute for time math,outputFrameOffset = rangeStartwrites back to the local file namefindMissingFrameRangesis looking for. The contract that retries land on the same local file as the initial capture is preserved.renderChunk.ts:540-549— deliberate scope: leaving shader-cost calibration off the chunk path (rather than half-baking it) is the right call, and the(framesInChunk, undefined, cfg)shape makes the follow-up trivial oncePlanJsoncarries the compiled hints.
Findings
important — no unit coverage for the new contract. The PR adds two new offset parameters that are load-bearing in three different files, and the existing tests (parallelCoordinator.test.ts:4-44, renderOrchestrator.test.ts:735-753) all run with rangeStart=0 / frameRangeStart=undefined. The correctness story rests on a quietly load-bearing invariant — worker output files named frame_(i - outputFrameOffset) must align with findMissingFrameRanges's 0-indexed walk over [0, totalFrames) — and nothing pins it. The Temporal end-to-end verified one trajectory; a regression elsewhere that unsets outputFrameOffset on the retry path would still pass this PR but break chunks on every host. Worth pinning at least: distributeFrames(100, 4, dir, 50) produces outputFrameOffset=50 and startFrame/endFrame shifted, and buildMissingFrameRetryBatches([{0,5},{10,15}], 2, dir, 0, 50) produces absolute-shifted ranges with outputFrameOffset=50.
important — skipping captureCostMultiplier for chunks is a known cost-tail. calculateOptimalWorkers(framesInChunk, undefined, cfg) at renderChunk.ts:540 skips the reduction resolveRenderWorkerCount applies in-process. For shader-heavy compositions this means chunks fan out at N workers, hit compositor contention as CDP timeouts, then halve workers via adaptive retry — each failed attempt costs ~chunk-duration. Acknowledged as follow-up #2 in the body, but worth opening as a tracked ticket so it doesn't bit-rot — the texture-launch benchmark you cited (chunk1 53s / chunk2 16s) is already that workload class.
nit — non-integer inputs aren't rejected. frameRange validation in captureStage.ts:148-170 covers finiteness, non-negative, and the size-equals-totalFrames invariant, but doesn't require startFrame/endFrame to be integers. A caller passing { startFrame: 1.5, endFrame: framesInChunk + 1.5 } would produce off-by-fractional outputFrameOffset and silently wrong file names. The current call site in renderChunk.ts reads from slice so this is theoretical, but the type is exported and worth hardening — Number.isInteger on both ends, same place.
nit — captureFrameToBuffer argument is for diagnostics only. At parallelCoordinator.ts:195, captureFrameToBuffer(session, fileFrameIdx, time) passes the LOCAL index to a function whose frameIndex parameter feeds only captureFrameErrorDiagnostics. The comment one line up says the streaming path uses the absolute index for the encoder — true at onFrameBuffer(i, buffer) — but the diagnostics-only fileFrameIdx choice ends up with error JSON labeled with the local index, which is slightly harder to correlate with composition logs. Trivial; flag if you tweak this for any other reason.
Verdict: APPROVE
Reasoning: Correctness story holds — the local-file-name vs. absolute-time-math split is symmetric across the new parameters, the new precondition catches the obvious caller-misuse case, and the in-process contract is preserved by default-0 parameters. The two important findings are test-coverage and a known cost-tail, neither blocking the perf win this PR ships.
Review by Vai
miguel-heygen
left a comment
There was a problem hiding this comment.
Approved — clean extraction of the offset plumbing that accidentally landed in #903. Vai's review is thorough, agree with the two important follow-ups (unit coverage for offset contract, captureCostMultiplier passthrough). No blockers from my side.
Summary
The distributed
renderChunkprimitive hardcodedworkerCount: 1andcaptureStageexplicitly forbadeworkerCount > 1whenframeRangewas set, with the comment:The assumption was that orchestration-layer fan-out (Temporal / Lambda / K8s Jobs / SSH) saturates available CPU on its own. In practice, adopters that deploy chunks onto multi-core hosts (8–24 vCPU is the standard producer-worker pod sizing) end up pinning only ~3-4 cores per chunk while the rest sit idle: chunk-level fan-out at the orchestration layer gives each pod one chunk at a time, but the chunk render itself is single-threaded.
This PR lifts the restriction by plumbing the chunk's frame range through the parallel-capture path. The
frameRangefield already existed onrunCaptureStage's input; it just wasn't honored by the parallel branch. Now both branches produce a byte-equivalent framesDir (contiguous 0-indexedframe_<i>.{ext}within the chunk's range), and chunks pick upcalculateOptimalWorkersauto-sizing the same way the in-process renderer already does.Validation
Measured against a real 1080p / 30fps / 22-second shader-heavy composition on a 22 vCPU Temporal producer-worker pod. Tested before/after the fix with the same composition, same fixture, same pod sizing:
workerCount: 1)chunks=1chunks=4Distributed
chunks=4is now 29% faster than in-process on this real workload — the first wall-clock win for the distributed path on per-frame-heavy compositions ≤1 min.The deployed bundle's runtime confirms the auto-sized count via the new
chunkWorkerCountlog: on this 22 vCPU pod the chunk now picks 6 workers (thedefaultSafeMaxWorkersceiling for that core count) instead of the prior hardcoded 1.Wire-up
WorkerTask.outputFrameOffset— optional offset subtracted from the absolute frame index when computing the captured file's name. Default 0 (in-process contract; file name == absolute index). Distributed chunks set this to the chunk'sstartFrameso file names land 0-indexed within the chunk's range, matching the sequential chunk-capture contract and the encoder's expectation that frames are read sequentially without an-start_numberoverride.distributeFrames(totalFrames, workerCount, workDir, rangeStart=0)— offsets bothstartFrame/endFrame(used for per-frame time math on the page's virtual clock) byrangeStart, and threadsoutputFrameOffset = rangeStartonto each task. WithrangeStart=0it is a no-op for in-process renders.executeWorkerTask— usesi - (task.outputFrameOffset ?? 0)for the captured file name, leaving the per-frame TIME computation(i * fps.den) / fps.numuntouched so the page's virtual clock is unchanged. The streaming callback (onFrameBuffer) still receives the absolute indexiso the streaming encoder sequences frames against the composition's timeline; only the disk file name uses the offset.executeDiskCaptureWithAdaptiveRetry({ frameRangeStart? })— accepts the chunk's absolute startFrame and forwards it todistributeFramesandbuildMissingFrameRetryBatches. Defaultundefinedpreserves the in-process contract.buildMissingFrameRetryBatches(ranges, ..., rangeStart=0)—findMissingFrameRangeswalks LOCAL 0-indexed file names; the retry batch translates the local missing-range pair back to ABSOLUTE composition indices forWorkerTask.startFrame/endFrameand setsoutputFrameOffset = rangeStartso the retried capture writes back to the same local file name.captureStage— drops the assert; passesframeRangeStart: frameRange?.startFrameto the parallel branch so workers land on absolute composition frame indices for time math while file names stay 0-indexed within the chunk range. Docstring updated to reflect that the parallel branch is now supported.renderChunk—workerCount: 1→workerCount: calculateOptimalWorkers(framesInChunk, undefined, cfg). Matches the in-process renderer's worker selection (resolveRenderWorkerCount→calculateOptimalWorkers) minus the capture-cost calibration reduction, which would require plumbing the chunk's compiled metadata through and is left as a follow-up (current code degrades gracefully — heavy shader chunks will still fan out at the configured worker count; the existing adaptive-retry path inexecuteDiskCaptureWithAdaptiveRetryreduces workers if compositor contention surfaces as CDP timeouts).Backwards compatibility
Every change is gated on a parameter that defaults to the prior behavior:
executeRenderJob) pass noframeRangeStart, sorangeStart === 0,outputFrameOffsetdefaults to 0, and the file-name math collapses toi(the prior absolute-index contract).frame_0..frame_(totalFrames-1)) and theWorkerTaskinterface are extended, not replaced.captureStagewas always emitting 0-indexed local file names; the parallel branch now matches.Follow-ups
workerCount > 1— the parallel branch closes the pre-warmed session during stage entry, so the ~3-5s warmup is wasted under auto-sizing. Recovered as a follow-up.hasShaderTransitions,renderModeHints) so chunks pick up the same reduction the in-process path uses for high-compositor-cost compositions.Test plan
bun test packages/engine/src/services/parallelCoordinator.test.ts— 7/7 passbun test packages/producer/src/services/distributed/{renderChunk,plan}.test.ts— 24/24 pass (full distributed suite green — see renderChunk.test.ts)bun test packages/producer/src/services/renderOrchestrator.test.ts— 56/57 (the one fail is a pre-existing Windows-only path-escape test, unrelated to this change)bunx oxlint+bunx oxfmt --checkon changed files — cleanchunks=1,4distributed runs all produce valid mp4s with byte-equivalent frame counts and resolutions. Numbers above.