
perf(distributed): skip eager probe session when chunkWorkerCount > 1 #916

Merged
jrusso1020 merged 1 commit into main from perf-skip-probe-when-parallel on May 17, 2026

Conversation

@jrusso1020
Collaborator

What

When renderChunk resolves chunkWorkerCount > 1, skip the eager
createCaptureSession + assertSwiftShader + initializeSession pre-warmup
that captureStage's parallel branch immediately closes anyway. Move the
SwiftShader safety probe into executeWorkerTask so each parallel worker
validates its own GPU backend before its first frame.
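
A minimal sketch of that worker-side check (the `executeWorkerTask`, `createCaptureSession`, `assertSwiftShader`, and `readWebGlVendorInfoFromCanvas` names come from this PR; the task/config shapes and the `renderFrames` call are placeholders, not the real signatures):

```ts
// Sketch only: each parallel worker validates its own GPU backend before
// rendering its first frame. Types and renderFrames are illustrative.
import { readWebGlVendorInfoFromCanvas } from "@hyperframes/engine";

async function executeWorkerTask(
  task: { frames: number[] },
  config: { browserGpuMode: "software" | "hardware" },
): Promise<void> {
  const session = await createCaptureSession(config);
  try {
    // Only renders that declared software GL get the SwiftShader check;
    // hardware-GL paths skip the probe entirely.
    if (config.browserGpuMode === "software") {
      await assertSwiftShader(session.page, readWebGlVendorInfoFromCanvas);
    }
    await renderFrames(session, task.frames); // placeholder for the capture loop
  } finally {
    await session.close();
  }
}
```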

Why

The OSS comment at the probeSession: session callsite already flagged
this:

// The parallel branch closes this session and spins up its own
// worker sessions, wasting the ~3-5s of pre-warmed setup. Worth a
// follow-up to skip pre-warmup when the resolved workerCount > 1.
probeSession: session,

In the dev distributed-render benchmark (12 producer-worker pods, v0.6.15
sidecar), texture-launch's chunk_p95 lands at ~25-35s with most of the
non-capture time being per-chunk fixed overhead. ~3-5s of that is the
probe pipeline we throw away — meaningful at the small chunks that real
maxParallelChunks settings produce.

How

  • packages/engine/src/utils/readWebGlVendorInfoFromCanvas.ts — new file.
    Moved from producer/services/distributed/renderChunk.ts; both
    producer and engine now need it. renderChunk.ts keeps a re-export
    via export { readWebGlVendorInfoFromCanvas } from "@hyperframes/engine";
    so @hyperframes/producer/distributed's public surface is unchanged
    (the existing publicExports.test.ts assertion still passes).

  • packages/engine/src/services/parallelCoordinator.ts
    executeWorkerTask now runs
    assertSwiftShader(session.page, readWebGlVendorInfoFromCanvas)
    after createCaptureSession when config.browserGpuMode === "software".
    Each parallel worker self-validates SwiftShader before its first
    frame.

    In-process renders default to browserGpuMode: "software", so they
    also pick up this safety net. Cost is one about:blank navigation +
    one page.evaluate per worker (~100-200ms, concurrent across workers
    so wall-clock impact ≈ slowest worker probe). Hardware-GL paths
    (browserGpuMode !== "software") are untouched.

  • packages/producer/src/services/distributed/renderChunk.ts
    resolve chunkWorkerCount up-front, skip the entire pre-warmup
    branch when chunkWorkerCount > 1, and pass probeSession: null in
    that case (see the sketch after this list). The sequential path
    (chunkWorkerCount === 1) is unchanged: it still pre-warms because
    captureStage's sequential branch reuses the probe.
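
A hedged sketch of that gating in `renderChunk` (`runCaptureStage`, `probeSession`, `createCaptureSession`, `assertSwiftShader`, and `initializeSession` appear in this PR; `resolveChunkWorkerCount`, `stageOptions`, and the `CaptureSession` type are placeholders for illustration):

```ts
// Sketch only: resolve the worker count once, and only pre-warm a probe
// session on the sequential path that actually reuses it.
const chunkWorkerCount = resolveChunkWorkerCount(config); // placeholder helper

let probeSession: CaptureSession | null = null;
if (chunkWorkerCount === 1) {
  // Sequential path: keep the pre-warmed probe; captureStage's sequential
  // branch reuses it directly.
  probeSession = await createCaptureSession(config);
  await assertSwiftShader(probeSession.page, readWebGlVendorInfoFromCanvas);
  await initializeSession(probeSession);
}

// Parallel path (chunkWorkerCount > 1): no pre-warmup here; each worker
// validates its own GPU backend inside executeWorkerTask instead.
await runCaptureStage({ ...stageOptions, probeSession });
```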

Test plan

  • Unit tests — parallelCoordinator.test.ts's existing
    distributeFrames/calculateOptimalWorkers coverage is unaffected;
    the new gating is small enough to verify by reading.
  • Existing tests pass:
    • packages/engine — all 605 tests pass (bun run test).
    • packages/producer distributed unit tests — 47 pass
      (assemble, plan, publicExports, chunkBoundary,
      planFormatBanlist, planSizeCap).
  • Lint + format — bunx oxlint and bunx oxfmt --check clean
    across the four touched files.
  • Docker-based regression renderChunk.test.ts byte-identical-retry
    test — runs the fixture at chunkWorkerCount=1 (5 frames), exercising
    the sequential path which is unchanged. The parallel path will be
    exercised by re-benchmarking on dev after release.
  • Re-benchmark texture-launch on dev after v0.6.16 release. Expect
    chunk_p95 to drop ~3-5s at chunks ∈ {3, 6, 8} (where
    chunkWorkerCount > 1 and the probe is currently wasted). chunks=12
    hits chunkWorkerCount=1 and is unchanged.

Notes for reviewers

  • The flagged // follow-up comment in
    services/distributed/renderChunk.ts is removed.
  • readWebGlVendorInfoFromCanvas is unchanged in behavior — just
    relocated.
  • I considered a per-task assertSoftwareGl?: boolean flag instead of
    gating on cfg.browserGpuMode === "software", but the latter matches
    the existing safety contract semantics cleanly: "if you declared
    software GL, we verify it." Open to flipping if preferred.

`renderChunk` was pre-warming a `createCaptureSession + assertSwiftShader +
initializeSession` pipeline before every chunk render. When the resolved
`chunkWorkerCount > 1`, the parallel branch in `captureStage` immediately
closes that probe and spins up fresh per-worker sessions — wasting the
~3-5s of pre-warmed setup. (The OSS comment at `runCaptureStage(...
probeSession: session ...)` flagged this as a follow-up.)

Move the SwiftShader assertion into `executeWorkerTask` so each parallel
worker validates its own GPU backend with a `chrome://gpu`-style canvas
probe (canvas + `WEBGL_debug_renderer_info` works on both regular Chrome
and `chrome-headless-shell`). Gated on `cfg.browserGpuMode === "software"`
so in-process renders that opt into software GL also pick up the safety
net, while hardware-GL paths are untouched.
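
For context, a sketch of the kind of in-page vendor probe `readWebGlVendorInfoFromCanvas` performs (the real implementation in `packages/engine` may differ; this is just the standard `WEBGL_debug_renderer_info` pattern, run inside `page.evaluate` so it only touches DOM/WebGL APIs):

```ts
// Sketch: read the unmasked WebGL vendor/renderer strings from a throwaway
// canvas. Under software GL the renderer string mentions SwiftShader, which
// is what an assertSwiftShader-style check would look for.
function readWebGlVendorInfoFromCanvas(): { vendor: string; renderer: string } | null {
  const canvas = document.createElement("canvas");
  const gl =
    canvas.getContext("webgl") ??
    (canvas.getContext("experimental-webgl") as WebGLRenderingContext | null);
  if (!gl) return null;

  // WEBGL_debug_renderer_info exposes the unmasked vendor/renderer strings,
  // e.g. an ANGLE/SwiftShader renderer string under software GL.
  const ext = gl.getExtension("WEBGL_debug_renderer_info");
  if (!ext) return null;

  return {
    vendor: String(gl.getParameter(ext.UNMASKED_VENDOR_WEBGL)),
    renderer: String(gl.getParameter(ext.UNMASKED_RENDERER_WEBGL)),
  };
}
```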

In `renderChunk`, compute `chunkWorkerCount` up-front and skip the entire
pre-warmup (createCaptureSession + assert + initialize) when > 1 — the
parallel workers cover it. Sequential path (chunkWorkerCount === 1) is
unchanged: it still pre-warms because `captureStage`'s sequential branch
reuses the probe.

Move `readWebGlVendorInfoFromCanvas` from
`packages/producer/src/services/distributed/renderChunk.ts` to
`packages/engine/src/utils/readWebGlVendorInfoFromCanvas.ts` (both
producer and engine need it now). `renderChunk.ts` re-exports the
function from `@hyperframes/engine` so downstream consumers that import
it from `@hyperframes/producer/distributed` keep working (the
`publicExports.test.ts` assertion is preserved).

Expected impact on the texture-launch fixture (dev, 12 producer-worker
pods, v0.6.15 sidecar; baseline from re-run sweep with `--chunk-size 10`):

  chunks=3   chunkWorkerCount=6 → ~3-5s/chunk saved (~5-10% wall)
  chunks=6   chunkWorkerCount=3 → ~3-5s/chunk saved (~7-12% wall)
  chunks=8   chunkWorkerCount=2 → ~3-5s/chunk saved (~10-13% wall)
  chunks=12  chunkWorkerCount=1 → no change (sequential path reuses probe)
jrusso1020 merged commit f01fccb into main on May 17, 2026
40 checks passed
jrusso1020 deleted the perf-skip-probe-when-parallel branch on May 17, 2026 08:18