Skip to content

perf(distributed): parallelize chunk capture across multiple workers#906

Merged
jrusso1020 merged 3 commits into
mainfrom
05-16-feat-parallel-capture-frame-range
May 16, 2026
Merged

perf(distributed): parallelize chunk capture across multiple workers#906
jrusso1020 merged 3 commits into
mainfrom
05-16-feat-parallel-capture-frame-range

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

Summary

The distributed renderChunk primitive hardcoded workerCount: 1 and captureStage explicitly forbade workerCount > 1 when frameRange was set, with the comment:

"Distributed chunk workers fan out at the activity layer; reduce workerCount to 1 when passing frameRange."

The assumption was that orchestration-layer fan-out (Temporal / Lambda / K8s Jobs / SSH) saturates available CPU on its own. In practice, adopters that deploy chunks onto multi-core hosts (8–24 vCPU is the standard producer-worker pod sizing) end up pinning only ~3-4 cores per chunk while the rest sit idle: chunk-level fan-out at the orchestration layer gives each pod one chunk at a time, but the chunk render itself is single-threaded.

This PR lifts the restriction by plumbing the chunk's frame range through the parallel-capture path. The frameRange field already existed on runCaptureStage's input; it just wasn't honored by the parallel branch. Now both branches produce a byte-equivalent framesDir (contiguous 0-indexed frame_<i>.{ext} within the chunk's range), and chunks pick up calculateOptimalWorkers auto-sizing the same way the in-process renderer already does.

Validation

Measured against a real 1080p / 30fps / 22-second shader-heavy composition on a 22 vCPU Temporal producer-worker pod. Tested before/after the fix with the same composition, same fixture, same pod sizing:

Before (workerCount: 1) After (auto) Δ
In-process baseline 83.8s 83.8s
Distributed chunks=1 117.0s 90.6s -23%
Distributed chunks=4 70.3s 59.3s -16% wall
chunk p50 40.3s 35.9s -11%
chunk p95 (gates wall) 63.6s 53.2s -16%
chunk total pod-time 123.8s 110.9s -10%

Distributed chunks=4 is now 29% faster than in-process on this real workload — the first wall-clock win for the distributed path on per-frame-heavy compositions ≤1 min.

The deployed bundle's runtime confirms the auto-sized count via the new chunkWorkerCount log: on this 22 vCPU pod the chunk now picks 6 workers (the defaultSafeMaxWorkers ceiling for that core count) instead of the prior hardcoded 1.

Wire-up

  • WorkerTask.outputFrameOffset — optional offset subtracted from the absolute frame index when computing the captured file's name. Default 0 (in-process contract; file name == absolute index). Distributed chunks set this to the chunk's startFrame so file names land 0-indexed within the chunk's range, matching the sequential chunk-capture contract and the encoder's expectation that frames are read sequentially without an -start_number override.

  • distributeFrames(totalFrames, workerCount, workDir, rangeStart=0) — offsets both startFrame/endFrame (used for per-frame time math on the page's virtual clock) by rangeStart, and threads outputFrameOffset = rangeStart onto each task. With rangeStart=0 it is a no-op for in-process renders.

  • executeWorkerTask — uses i - (task.outputFrameOffset ?? 0) for the captured file name, leaving the per-frame TIME computation (i * fps.den) / fps.num untouched so the page's virtual clock is unchanged. The streaming callback (onFrameBuffer) still receives the absolute index i so the streaming encoder sequences frames against the composition's timeline; only the disk file name uses the offset.

  • executeDiskCaptureWithAdaptiveRetry({ frameRangeStart? }) — accepts the chunk's absolute startFrame and forwards it to distributeFrames and buildMissingFrameRetryBatches. Default undefined preserves the in-process contract.

  • buildMissingFrameRetryBatches(ranges, ..., rangeStart=0)findMissingFrameRanges walks LOCAL 0-indexed file names; the retry batch translates the local missing-range pair back to ABSOLUTE composition indices for WorkerTask.startFrame/endFrame and sets outputFrameOffset = rangeStart so the retried capture writes back to the same local file name.

  • captureStage — drops the assert; passes frameRangeStart: frameRange?.startFrame to the parallel branch so workers land on absolute composition frame indices for time math while file names stay 0-indexed within the chunk range. Docstring updated to reflect that the parallel branch is now supported.

  • renderChunkworkerCount: 1workerCount: calculateOptimalWorkers(framesInChunk, undefined, cfg). Matches the in-process renderer's worker selection (resolveRenderWorkerCountcalculateOptimalWorkers) minus the capture-cost calibration reduction, which would require plumbing the chunk's compiled metadata through and is left as a follow-up (current code degrades gracefully — heavy shader chunks will still fan out at the configured worker count; the existing adaptive-retry path in executeDiskCaptureWithAdaptiveRetry reduces workers if compositor contention surfaces as CDP timeouts).

Backwards compatibility

Every change is gated on a parameter that defaults to the prior behavior:

  • In-process callers (executeRenderJob) pass no frameRangeStart, so rangeStart === 0, outputFrameOffset defaults to 0, and the file-name math collapses to i (the prior absolute-index contract).
  • The framesDir contract (frame_0..frame_(totalFrames-1)) and the WorkerTask interface are extended, not replaced.
  • The sequential chunk-capture branch in captureStage was always emitting 0-indexed local file names; the parallel branch now matches.

Follow-ups

  1. Skip probeSession creation when workerCount > 1 — the parallel branch closes the pre-warmed session during stage entry, so the ~3-5s warmup is wasted under auto-sizing. Recovered as a follow-up.
  2. Wire shader-cost calibration through to chunks — pass the chunk's compiled metadata (hasShaderTransitions, renderModeHints) so chunks pick up the same reduction the in-process path uses for high-compositor-cost compositions.
  3. Chunk-level work balancing — the texture-launch benchmark showed chunk1 took 53s while chunk2 took 16s (the slowest chunk gates wall time). Smarter partitioning that balances per-frame cost — not just frame count — would close more of the gap to the theoretical N×-speedup ceiling.

Test plan

  • bun test packages/engine/src/services/parallelCoordinator.test.ts — 7/7 pass
  • bun test packages/producer/src/services/distributed/{renderChunk,plan}.test.ts — 24/24 pass (full distributed suite green — see renderChunk.test.ts)
  • bun test packages/producer/src/services/renderOrchestrator.test.ts — 56/57 (the one fail is a pre-existing Windows-only path-escape test, unrelated to this change)
  • bunx oxlint + bunx oxfmt --check on changed files — clean
  • Live end-to-end against dev Temporal Cloud: in-process baseline + chunks=1,4 distributed runs all produce valid mp4s with byte-equivalent frame counts and resolutions. Numbers above.

The distributed `renderChunk` primitive hardcoded `workerCount: 1` and
`captureStage` explicitly forbade `workerCount > 1` when `frameRange` was
set, with the comment:

  "Distributed chunk workers fan out at the activity layer; reduce
   workerCount to 1 when passing frameRange."

The assumption was that orchestration-layer fan-out (Temporal / Lambda /
K8s Jobs / SSH) saturates the available CPU on its own. In practice
adopters that deploy chunks onto multi-core hosts (8-24 vCPU is the
standard producer-worker pod sizing) end up pinning only ~3-4 cores per
chunk while the rest sit idle: chunk-level fan-out at the orchestration
layer gives each pod one chunk at a time, and the chunk render itself
was single-threaded.

Validated against a real 1080p / 30fps / 22-second shader-heavy
composition on a 22-vCPU Temporal pod: each chunk rendered at
165-273ms per frame (vs 94-98ms for the in-process streaming render
which runs `workerCount=2` by default). The slowest chunk gates total
wall-clock under parallel chunk fan-out, so the 2-3x per-frame gap
compounds and `distributed` was net-slower than `in-process` on every
composition smaller than ~5min of texture-class content. Lifting the
restriction is a measured ~2x per-chunk speedup with no contract
change at the framesDir or encoder layer.

Wire-up:

  * `WorkerTask.outputFrameOffset` — optional offset subtracted from the
    absolute frame index when computing the captured file's name.
    Default 0 (the in-process contract; file name == absolute index).
    Distributed chunks set this to the chunk's startFrame so file names
    land 0-indexed within the chunk's range, matching the sequential
    chunk-capture contract and the encoder's expectation that frames
    are read sequentially without an `-start_number` override.

  * `distributeFrames(totalFrames, workerCount, workDir, rangeStart=0)` —
    offsets both `startFrame`/`endFrame` (used for per-frame time math
    on the page's virtual clock) by `rangeStart`, and threads
    `outputFrameOffset = rangeStart` onto each task it emits. With the
    default `rangeStart=0` it is a no-op for in-process renders.

  * `executeWorkerTask` — uses `i - (task.outputFrameOffset ?? 0)` for
    the captured file name, leaving the per-frame TIME computation
    `(i * fps.den) / fps.num` untouched so the page's virtual clock is
    unchanged.

  * `executeDiskCaptureWithAdaptiveRetry({ frameRangeStart? })` — accepts
    the chunk's absolute startFrame and forwards it to `distributeFrames`
    and `buildMissingFrameRetryBatches`. Default `undefined` preserves
    the in-process contract.

  * `buildMissingFrameRetryBatches(ranges, ..., rangeStart=0)` —
    `findMissingFrameRanges` walks LOCAL 0-indexed file names; the retry
    batch translates the local missing-range pair back to ABSOLUTE
    composition indices for `WorkerTask.startFrame/endFrame` and sets
    `outputFrameOffset = rangeStart` so the retried capture writes back
    to the same local file name.

  * `captureStage` — drops the assert; passes
    `frameRangeStart: frameRange?.startFrame` to the parallel branch so
    workers land on absolute composition frame indices for time math
    while file names stay 0-indexed within the chunk range. Docstring
    updated to reflect that the parallel branch is now supported.

  * `renderChunk` — `workerCount: 1` → `workerCount: 2`. The pre-warmed
    `probeSession` is consumed only by the sequential branch; the
    parallel branch closes it during stage entry and creates its own
    worker sessions. Documented as a follow-up: skip probeSession
    creation when `workerCount > 1` to recover the ~3-5s warmup cost.

Backwards compatibility: every change is gated on a parameter that
defaults to the prior behavior. In-process callers (`executeRenderJob`)
pass no `frameRangeStart`, so `rangeStart === 0`, `outputFrameOffset`
defaults to 0, and the file-name math collapses to the prior `i` value.
The framesDir contract (`frame_0..frame_(totalFrames-1)`) and the
WorkerTask interface are extended, not replaced.

Tests: 24 pass / 0 fail across the distributed test suite (renderChunk,
plan, assemble, planFormatBanlist, planSizeCap, publicExports). 7 pass /
0 fail in `parallelCoordinator.test.ts`. The renderOrchestrator suite
has one pre-existing Windows-only failure
(`writeCompiledArtifacts — external assets on Windows drive-letter
paths`) unrelated to this change; the other 56 tests pass.

Refs: distributed-vs-inprocess benchmark thread at
heygen-com/experiment-framework#36950
…rkers

Match the in-process renderer's worker selection instead of hardcoding 2.
`calculateOptimalWorkers(framesInChunk, undefined, cfg)` is the same call
`resolveRenderWorkerCount` makes under the hood, minus the capture-cost
calibration reduction (which would require plumbing the chunk's compiled
metadata through — left as a follow-up).

For a typical 22-vCPU producer-worker pod with `cfg.concurrency: "auto"`
this resolves to ~6 workers for a 240-frame chunk (capped by
`defaultSafeMaxWorkers() = max(6, min(16, floor(cpuCount/8)))`), matching
what `executeRenderJob` (the in-process path) already does. The prior
hardcoded `workerCount: 2` was a safe-minimum starting point that
undersized chunks vs prod's auto behavior.

Tests: 12/12 pass in `renderChunk.test.ts` (unchanged — the test suite
mocks the inner runCaptureStage call so workerCount selection is opaque
to it).
Review pass on the parallel-capture frame-range change. Four targeted
cleanups identified by code-quality and efficiency review agents:

1. Add the missing `frameRange.endFrame - frameRange.startFrame === totalFrames`
   assert. The parallel branch forwards `totalFrames` separately from
   `frameRangeStart`; a caller passing mismatched values would have got a
   silently wrong distribution. The sequential branch already implicitly
   relied on this via its `rangeFrames = rangeEnd - rangeStart` arithmetic.

2. Collapse three near-duplicate docstrings (on `WorkerTask.outputFrameOffset`,
   `executeDiskCaptureWithAdaptiveRetry.frameRangeStart`, and `runCaptureStage`'s
   `frameRange`) so only the WorkerTask field carries the full contract. The
   other two cross-reference it.

3. Drop the WHAT-narrating comments inside `executeWorkerTask`'s per-frame
   loop. The variable names (`fileFrameIdx = i - outputOffset`) already say
   what the line does; the only remaining comment flags the non-obvious
   contract that the streaming callback gets the absolute index.

4. Trim the 30-line `chunkWorkerCount` block in `renderChunk` to one paragraph
   explaining the one non-obvious thing (why we use `calculateOptimalWorkers`
   directly instead of `resolveRenderWorkerCount`). The probeSession-wasted-on-
   parallel acknowledgement stays as a 3-line follow-up flag — investigated
   skipping it in this pass, but the SwiftShader probe is safety-critical and
   has no per-worker equivalent, so deferred to a separate change with proper
   per-worker assertion plumbing.

Tests + format + lint clean:
  * `bun test parallelCoordinator.test.ts` — 7/7
  * `bun test distributed/{renderChunk,plan}.test.ts` — 24/24
  * `bunx oxfmt` + `bunx oxlint` — clean
Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One-line summary: lifts the workerCount: 1 hardcode on distributed chunks by plumbing outputFrameOffset / frameRangeStart through the parallel-capture path so chunk workers can fan out the same way the in-process renderer does, with the file-name vs. absolute-time contract preserved and backwards-compat gated on default-0 parameters.

Strengths

  • captureStage.ts:154-170 — the new frameRange.endFrame - frameRange.startFrame === totalFrames precondition is exactly the symmetry check the comment calls out: totalFrames drives distributeFrames partitioning AND findMissingFrameRanges completion checks, so any caller that desynchronizes them gets a loud error instead of a silently wrong distribution. Right place, right wording.
  • renderOrchestrator.ts:891-912 — clean handling of the local-vs-absolute split: WorkerTask.startFrame/endFrame go absolute for time math, outputFrameOffset = rangeStart writes back to the local file name findMissingFrameRanges is looking for. The contract that retries land on the same local file as the initial capture is preserved.
  • renderChunk.ts:540-549 — deliberate scope: leaving shader-cost calibration off the chunk path (rather than half-baking it) is the right call, and the (framesInChunk, undefined, cfg) shape makes the follow-up trivial once PlanJson carries the compiled hints.

Findings

important — no unit coverage for the new contract. The PR adds two new offset parameters that are load-bearing in three different files, and the existing tests (parallelCoordinator.test.ts:4-44, renderOrchestrator.test.ts:735-753) all run with rangeStart=0 / frameRangeStart=undefined. The correctness story rests on a quietly load-bearing invariant — worker output files named frame_(i - outputFrameOffset) must align with findMissingFrameRanges's 0-indexed walk over [0, totalFrames) — and nothing pins it. The Temporal end-to-end verified one trajectory; a regression elsewhere that unsets outputFrameOffset on the retry path would still pass this PR but break chunks on every host. Worth pinning at least: distributeFrames(100, 4, dir, 50) produces outputFrameOffset=50 and startFrame/endFrame shifted, and buildMissingFrameRetryBatches([{0,5},{10,15}], 2, dir, 0, 50) produces absolute-shifted ranges with outputFrameOffset=50.

important — skipping captureCostMultiplier for chunks is a known cost-tail. calculateOptimalWorkers(framesInChunk, undefined, cfg) at renderChunk.ts:540 skips the reduction resolveRenderWorkerCount applies in-process. For shader-heavy compositions this means chunks fan out at N workers, hit compositor contention as CDP timeouts, then halve workers via adaptive retry — each failed attempt costs ~chunk-duration. Acknowledged as follow-up #2 in the body, but worth opening as a tracked ticket so it doesn't bit-rot — the texture-launch benchmark you cited (chunk1 53s / chunk2 16s) is already that workload class.

nit — non-integer inputs aren't rejected. frameRange validation in captureStage.ts:148-170 covers finiteness, non-negative, and the size-equals-totalFrames invariant, but doesn't require startFrame/endFrame to be integers. A caller passing { startFrame: 1.5, endFrame: framesInChunk + 1.5 } would produce off-by-fractional outputFrameOffset and silently wrong file names. The current call site in renderChunk.ts reads from slice so this is theoretical, but the type is exported and worth hardening — Number.isInteger on both ends, same place.

nit — captureFrameToBuffer argument is for diagnostics only. At parallelCoordinator.ts:195, captureFrameToBuffer(session, fileFrameIdx, time) passes the LOCAL index to a function whose frameIndex parameter feeds only captureFrameErrorDiagnostics. The comment one line up says the streaming path uses the absolute index for the encoder — true at onFrameBuffer(i, buffer) — but the diagnostics-only fileFrameIdx choice ends up with error JSON labeled with the local index, which is slightly harder to correlate with composition logs. Trivial; flag if you tweak this for any other reason.

Verdict: APPROVE
Reasoning: Correctness story holds — the local-file-name vs. absolute-time-math split is symmetric across the new parameters, the new precondition catches the obvious caller-misuse case, and the in-process contract is preserved by default-0 parameters. The two important findings are test-coverage and a known cost-tail, neither blocking the perf win this PR ships.

Review by Vai

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved — clean extraction of the offset plumbing that accidentally landed in #903. Vai's review is thorough, agree with the two important follow-ups (unit coverage for offset contract, captureCostMultiplier passthrough). No blockers from my side.

@jrusso1020 jrusso1020 merged commit 22363a1 into main May 16, 2026
44 checks passed
@jrusso1020 jrusso1020 deleted the 05-16-feat-parallel-capture-frame-range branch May 16, 2026 23:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants