Skip to content

feat(lambda): add Lambda handler, ZIP bundling, and BeginFrame probe#878

Open
jrusso1020 wants to merge 1 commit into
mainfrom
05-15-feat_lambda_add_lambda_handler_zip_bundling_and_beginframe_probe
Open

feat(lambda): add Lambda handler, ZIP bundling, and BeginFrame probe#878
jrusso1020 wants to merge 1 commit into
mainfrom
05-15-feat_lambda_add_lambda_handler_zip_bundling_and_beginframe_probe

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 commented May 15, 2026

What

Adds a new packages/aws-lambda/ workspace package — the AWS Lambda
adapter for HyperFrames distributed rendering. One Lambda handler
dispatches on event.Action ∈ {plan, renderChunk, assemble} to the
matching OSS primitive from @hyperframes/producer/distributed, plus
the ZIP build pipeline and a BeginFrame regression-guard probe.

This is PR 6.1 in the 8-PR Phase 6 stack (see
DISTRIBUTED-RENDERING-PLAN.md §11 Phase 6 + §15). Phase 6a (PRs
6.1-6.3) validates the architecture on real AWS; Phase 6b (6.4-6.8)
wraps it in the user-facing CLI + CDK construct + docs.

Why

Phase 5 (low-level CLI) was dropped because the real adopter paths are
either Temporal (HeyGen, internal) or Lambda (OSS users). For the
Lambda path to work, we need a Lambda-deployable artifact that wraps
the merged OSS distributed primitives.

The load-bearing architectural assumption is that
@sparticuz/chromium's chrome-headless-shell build honours CDP
HeadlessExperimental.beginFrame with screenshot: true — the only
rendering path the engine knows about post-Phase-1 cleanup. If it
didn't, the architecture would change (fall back to bundling our own
pinned chrome-headless-shell).

The probe at scripts/probe-beginframe.ts is a permanent regression
guard, not a one-shot verification. It runs the probe inside
public.ecr.aws/lambda/nodejs:22 — the same image real Lambda uses —
and asserts a PNG buffer comes back.

How

Architecture (ZIP deploy, not container image; matches Remotion Lambda's
ergonomics without copying their code):

Step Functions state machine
  ├── Plan Lambda          → s3://{bucket}/{planHash}/plan.tar.gz
  ├── Map(N) RenderChunk   → s3://{bucket}/{planHash}/chunks/NNNN.mp4
  └── Assemble Lambda      → s3://{bucket}/{planHash}/output.mp4

dist/handler.zip contents (one Lambda function, three roles):
  handler.mjs                       — esbuild bundle, dispatches on Action
  bin/ffmpeg + bin/ffprobe          — ffmpeg-static + ffprobe-static
  node_modules/@sparticuz/chromium/ — primary Chrome (decompresses to /tmp)
  node_modules/puppeteer-core/      — required for the engine's launch path
  node_modules/tar/                 — pure-JS tar (Lambda image has no tar binary)
  hyperframe.manifest.json          — runtime IIFE manifest the producer's loader resolves
  hyperframe.runtime.iife.js        — same

Handler glue is intentionally thin: parse event → S3 download → call
OSS primitive (pure function over local paths) → S3 upload → return
small JSON. Result payloads sized for Step Functions history budgets
(<200 bytes per chunk per §2.4).

Chrome source is selectable at build time via --source=sparticuz (default)
or --source=chrome-headless-shell (fallback). Handler reads
HYPERFRAMES_LAMBDA_CHROME_SOURCE env var at boot.

Producer bug fix bundled

While running the eval against the PR 6.3 stack, many-cuts (a fixture
with audio) tripped PLAN_HASH_MISMATCH. Diagnosed locally — repros in
the producer regardless of Lambda. Root cause: plan() cleans up
<planDir>/.plan-work/ AFTER freezePlan() walks the planDir. When
the audio mix downloads source mp3s, they land in
.plan-work/downloads/ and freezePlan hashes them. Chunk workers see
the cleaned planDir (no .plan-work/) so their recomputed hash
differs.

Fix: move the .plan-work/ cleanup to BEFORE freezePlan().

Added regression test in plan.test.ts that asserts
recomputePlanHashFromPlanDir(planDir) === result.planHash for any
plan — catches this whole class of bug going forward.

Sizes (sparticuz source)

  • handler.zip: 124 MB
  • Unzipped: 245 MB (well under the 248 MiB in-house gate, 250 MB Lambda hard cap)
  • ffmpeg 80 MB + ffprobe 62 MB + @sparticuz/chromium 70 MB dominate

ZIP package shape:

  • Lambda Node 22 runtime ships neither tar nor unzip in /usr/bin — we use the tar npm package over node:zlib to handle on-the-wire .tar.gz archives.
  • esbuild ESM's __require shim throws "Dynamic require of path" for any CJS dep that does require('path') at top level (postcss does). Fixed via createRequire banner in the bundle.

Test plan

  • bun run --filter @hyperframes/aws-lambda test — 25 tests pass
  • bun test packages/producer/src/services/distributed/plan.test.ts — 16 tests pass (incl. the new planHash regression test)
  • bunx oxlint + bunx oxfmt --check — clean
  • bun run --cwd packages/aws-lambda build:zip — produces a 124 MB ZIP / 245 MiB unzipped
  • bun run --cwd packages/aws-lambda verify:zip-size — passes
  • Real-AWS eval (via PR 6.3's eval.sh) renders 4 fixtures end-to-end on Lambda; PSNR vs baseline 33-52 dB across all fixtures; speedup 1.44-1.72× at N=4 chunks for compositions ≥ 4s.

🤖 Generated with Claude Code

Copy link
Copy Markdown
Collaborator Author

jrusso1020 commented May 15, 2026

// probe so we don't add a dependency surface that could mask a
// Chrome-side issue.
const htmlPath = join(tmpdir(), `hf-beginframe-probe-${Date.now()}.html`);
await fs.writeFile(htmlPath, PROBE_HTML, "utf-8");
// `du -sb` returns "<bytes>\t<path>". macOS coreutils doesn't ship `-b` —
// build-zip is CI-side anyway, where Linux coreutils is present.
try {
const stdout = execSync(`du -sb ${JSON.stringify(dir)}`, { encoding: "utf-8" });
@jrusso1020 jrusso1020 force-pushed the 05-15-feat_lambda_add_lambda_handler_zip_bundling_and_beginframe_probe branch 2 times, most recently from 5eabdf4 to ef55431 Compare May 16, 2026 01:50
Phase 6 of the distributed rendering plan: AWS Lambda turnkey adoption
(see DISTRIBUTED-RENDERING-PLAN.md §11 Phase 6 + §15).

This PR adds the new packages/aws-lambda/ workspace package that wraps
the OSS plan/renderChunk/assemble primitives in an AWS Lambda handler,
plus a build pipeline that bundles the handler + Chromium runtime +
ffmpeg into a deployable ZIP.

Architecture: ZIP deploy (not Docker image), Chrome via @sparticuz/chromium
with chrome-headless-shell fallback, dispatch on event.Action ∈ {plan,
renderChunk, assemble}.

The load-bearing concern — does @sparticuz/chromium's chrome-headless-shell
build honour CDP HeadlessExperimental.beginFrame? — is pinned by the new
scripts/probe-beginframe.ts regression guard. Probe boots the runtime
inside public.ecr.aws/lambda/nodejs:22, navigates to a static page, and
asserts beginFrame returns a PNG buffer. Verified locally + inside the
Docker container; both pass with hasDamage=true.

Sizes (sparticuz source): unzipped 157 MiB, zipped 99 MiB. Well under
the 240 MiB / 150 MiB in-house gates and the Lambda 250 MiB hard ceiling.

This is part of a stack of 8 PRs (3 in Phase 6a, 5 in Phase 6b); this is
PR 6.1.
@jrusso1020 jrusso1020 force-pushed the 05-15-feat_lambda_add_lambda_handler_zip_bundling_and_beginframe_probe branch from ef55431 to a1d2874 Compare May 16, 2026 02:24
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat(lambda) -- Lambda handler, ZIP bundling, BeginFrame probe

Read the full diff (19 files, +2648/-24). Examined all source files, tests, scripts, Dockerfile, and the plan.ts bugfix. CI notes: build/test/lint/typecheck/format all pass; regression-shard failures are Docker infra flakiness (all 8 shards fail at image setup, not test execution), unrelated to this PR's changes.

Architecture

Clean, well-structured adapter package. The handler dispatch pattern (one Lambda function, three roles via event.Action) is the right call for Step Functions ergonomics. The dependency injection via HandlerDeps keeps the handler fully testable without module-mocking gymnastics. Event unwrapping with depth limits prevents infinite loops on malformed Step Functions payloads. Exhaustiveness checks on Action and Format via never assignments catch future additions at compile time.

plan.ts fix

The .plan-work/ cleanup reorder is correct and the root cause analysis in the commit message is precise: freezePlan walked planDir while .plan-work/downloads/ still contained intermediate audio artifacts, baking transient files into the hash. Chunk workers never see those files, so their recomputed hash diverged. Moving rmSync(workDir) before freezePlan() is the right fix. The regression test (recomputePlanHashFromPlanDir(planDir) === result.planHash) is a solid structural invariant that catches the entire class of "planDir-state-at-freeze-time diverges from planDir-state-at-chunk-time" bugs.

Security and input validation

  • S3 URI parsing (parseS3Uri) validates scheme, requires both bucket and key, and throws on malformed input. Good.
  • Event unwrapping caps depth at 4, preventing unbounded recursion on crafted payloads.
  • The isLambdaAction guard is a strict allowlist; unknown actions are rejected before any I/O.
  • No user-controlled strings are interpolated into shell commands (tar/zip use the tar npm package or spawnSync with argument arrays, not string interpolation).
  • HYPERFRAMES_LAMBDA_CHROME_PATH is validated with existsSync before use.

Test coverage

25 tests for the lambda package covering: Chrome source resolution (env var parsing, case insensitivity, fallback), S3 URI parsing edge cases, tar round-trip with stale-file cleanup, handler dispatch for all three actions + envelope unwrapping + unknown action rejection. The plan.test.ts addition is a targeted regression guard. Coverage is appropriate for thin glue code where the heavy logic lives in the producer primitives.

Minor observations (non-blocking)

  1. _setSparticuzChromiumForTests exported from the public index -- packages/aws-lambda/src/index.ts re-exports the test-only _setSparticuzChromiumForTests seam. The underscore prefix signals "internal" but it still appears in the package's public API surface. Consider excluding it from the barrel export or gating behind a @internal JSDoc + conditional export. Not blocking since this is a 0.0.1 package.

  2. build-zip.ts uses require() inside stageChromeHeadlessShell and walkSize -- The file is ESM (uses import.meta.url) but calls require("node:fs") in two function bodies. These work because the script itself injects createRequire at runtime, but it is inconsistent with the top-of-file import { ... } from "node:fs" pattern. The same readdirSync/statSync are already imported at the top. Could just use the existing imports.

  3. Regression shards -- All 8 regression-shard CI jobs fail at Docker image build, not at test execution. This appears to be pre-existing CI infra flakiness, not caused by this PR.

Approving. The plan.ts fix is correct and well-tested, the Lambda handler is clean thin glue with proper input validation, and the build pipeline is solid with appropriate size gates.

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phase 6a Lambda adapter — solid bones, but ships with a regression-CI blocker, Windows-CI failure, and several real-AWS-shaped gaps. The handler shape is right (thin glue, DI seam for tests, pure-primitive boundary), chromium.ts source-selection is clean, and the plan.ts cleanup-order fix is correct. The integration is far enough along to be worth iterating, but it's not mergeable in the current state.

Calibrated strengths

  • packages/aws-lambda/src/handler.ts:80-101handler is properly a thin dispatcher; the unwrapEvent loop with MAX_ENVELOPE_DEPTH=4 and the _exhaustive: never switch arm are the right shape for a Step Functions entry point.
  • packages/aws-lambda/src/handler.ts:341-369downloadChunkObjects uses a pre-sized array indexed by i rather than push order; assemble's chunk-ordering invariant is preserved under Promise.all. Easy to get wrong, got it right.
  • packages/producer/src/services/distributed/plan.ts:790-808 — the rmSync(workDir) move + comment is correct (workDir = join(planDir, ".plan-work"), freezePlan walks planDir via listPlanFiles recursive walk → including .plan-work/ before cleanup → poisoned planHash). Reasoning in the comment matches the actual hash code in freezePlan.ts:158-181.

Findings

blocker — Dockerfile.test does not COPY packages/aws-lambda/package.json, breaking every regression shard. All 8 regression-shards are failing with bun install --frozen-lockfile: error: lockfile had changes, but lockfile is frozen at Dockerfile.test:76. Lines 64-71 of Dockerfile.test enumerate workspace package.json files individually; the new packages/aws-lambda/ workspace member is in bun.lock but its package.json never makes it into the build context, so bun sees the workspace shape diverging from the lockfile and refuses. Fix: add COPY packages/aws-lambda/package.json packages/aws-lambda/package.json to Dockerfile.test next to the other workspace COPY lines. This is the single root cause of all 8 shard failures — they all fail at the same Dockerfile step, with identical stderr. Verified by reading shard-3's log directly.

blocker — Windows tests fail on module-load of handler.test.ts (ENOENT reading "...packages\\producer\\node_modules\\hono"). chromium.test.ts (9 pass) + s3Transport.test.ts (8 pass) clear, but handler.test.ts (importing @hyperframes/producer/distributed) crashes at module load before any test runs (17 pass / 1 fail / 1 error, exit 1). The Windows runner is required (Tests on windows-latest is in the required-check set per the rollup). Either this is a real Windows incompatibility of the producer's transitive hono dep through the new aws-lambda import edge, or it's a bun-on-Windows workspace-link bug surfacing because aws-lambda is the first package to import @hyperframes/producer/distributed from a sibling workspace. Either way it has to be diagnosed and either fixed or the package opted out of the Windows test matrix with a // not supported on Windows rationale. Right now CI is red on a required check.

blocker — CodeQL javascript:js/shell-command-injected-from-environment is failing on build-zip.ts:411. directorySizeBytes() shells out via execSync(\du -sb ${JSON.stringify(dir)}`). The argument is JSON-quoted (good), but execSyncparses the whole string through/bin/sh, so dircontaining$(…)or backticks is still a shell-eval surface even with the JSON quoting around it (the JSON-quote escapes"but doesn't neutralize$()inside the quoted form —bash -c 'du -sb "$(date)"'runsdate). The trust boundary today is "internal CI dev script over dist/staging", so the practical risk is near zero — but CodeQL is a required check and the rule's own Recommendation is "use a parameterised execution form." Trivial fix: replace execSync(`du -sb ${JSON.stringify(dir)}`)withspawnSync("du", ["-sb", dir], { encoding: "utf-8" })`. Same Linux-only constraint, no shell, CodeQL clears. Apply Rule 10 — don't iterate against the predicate, follow the Recommendation.

important — the new plan.test.ts regression test does not exercise the audio path the PR body says caused the bug. The PR description states many-cuts (audio-bearing) tripped PLAN_HASH_MISMATCH because audio-mix downloaded source mp3s into .plan-work/downloads/, and the hash baked in the bytes. But the new test at plan.test.ts:212-241 uses the test file's existing FIXTURE_HTML (no <audio> elements; composition.audios.length === 0), so runAudioStage short-circuits at audioStage.ts:50 (if (audios.length > 0)) and writes nothing to workDir. After the compile-stage renameSync(compiledDir, finalCompiledDir) at plan.ts:716, .plan-work/ is essentially empty on the no-audio path. On the pre-fix code, freezePlan would walk an empty .plan-work/, see nothing in there to hash, and likely produce the same hash as the post-cleanup state. So this test would NOT have caught the original bug; it pins the contract on a path the bug didn't take. The fix is right; the test as written doesn't lock the regression. Either (a) add an audio-bearing fixture (mp3 in projectDir) so .plan-work/audio-work/ materialises, or (b) seed .plan-work/ with a sentinel file inside the test before calling plan() to simulate the bug surface — option (a) is closer to the wild-caught failure.

important — handler has zero structured logging. No console.log / console.error anywhere in handler.ts — not on event receipt, not per-action, not on error before the throw rethrows. Result payloads carry DurationMs but CloudWatch will see only stack traces on failures (from Lambda's default unhandled-error path). For ops triage on an event-driven Lambda where you can't attach a debugger, you need at minimum: a structured log line at boot with event.Action + the relevant S3 URIs, and an error log before the throw in unwrapEvent and at the catch sites in each handle* function. CloudWatch Logs Insights queries are the only triage tool on Lambda; without grep-able structured lines, ops is blind.

important — RenderChunkEvent.PlanHash is plumbed in the event type (events.ts:55) but never verified in the handler. handleRenderChunk ignores event.PlanHash entirely. The producer's renderChunk does the recompute-and-check at renderChunk.ts:404-413, so this is defense-in-depth, not a correctness hole — but the field exists in the event schema for a reason. If the intent is "trust the producer to enforce it," delete the field from RenderChunkEvent so it doesn't become a confusing contract. If the intent is "verify at the handler boundary before paying the untar cost," add the check before untarDirectory. Right now it's the worst of both: schema bloat with no enforcement.

important — build-time npm install inside staging bypasses bun's resolution + lockfile. build-zip.ts:248-275 writes a one-off package.json into dist/staging listing only puppeteer-core + (optionally) @sparticuz/chromium and runs npm install --no-package-lock inside that staging dir. Since puppeteer-core: ^24.39.1 is a caret range and --no-package-lock is set, the bundle ships whatever npm registry serves at build time, not what bun installed into the workspace. @sparticuz/chromium: 148.0.0 is exact-pinned (good), so the Chrome side is deterministic — but puppeteer-core's caret range means CI builds can ship 24.40+ while tests ran against 24.39, and Lambda binary determinism breaks across consecutive builds. Either (a) read the resolved versions from bun.lock's aws-lambda block and pin them exactly into the staging package.json, or (b) acknowledge the float in the README and add a CI gate that diffs staging/node_modules/<dep>/package.json against the workspace install. Today's behaviour is silently non-reproducible.

important — handler test uses spawnSync("tar", …) but s3Transport uses the npm tar package. handler.test.ts:332 + :359 shell out via spawnSync("tar", ["-czf", …]) to build the fixture tarballs. s3Transport.ts:113-116 uses the npm tar package (correctly — Lambda's Amazon Linux base has no tar binary in /usr/bin). The handler's own untarDirectory is what the test exercises, and that goes through npm tar, so the path-under-test is fine. But the test's fixture-building dep on the system tar binary is a flake surface — Windows ships bsdtar as tar (different invocation), bare Alpine containers don't ship tar at all. Worse, the s3Transport.ts:14 JSDoc says "We shell out to the system tar binary" — that comment is stale (the code uses the npm package). Tighten: use the npm tar package consistently in tests for fixture building, and fix the JSDoc.

nit — verify-zip-size.ts and several PR-body lines conflate MB and MiB. PR body says "245 MiB unzipped (well under the 248 MiB in-house gate, 250 MB Lambda hard cap)" — AWS's actual cap is 250 MiB (per aws lambda get-function-configuration docs), so the in-house 248 MiB is a 2 MiB margin, not a ~5 MB margin as the mixed units imply. Pick one unit (MiB) and stick with it in code, comments, and the README size table.

nit — stageChromeHeadlessShell picks "latest" Chrome cache via string sort. build-zip.ts:392-403 does readdirSync(baseDir).sort().reverse() on Chrome version directories. For 131.0.0, 131.0.1, 99.0.0, lexicographic descending picks 99.0.0 first. Today it works because Chrome versions in the cache are all 3-digit major. Will break the day Chrome ships 141.x and 99.x coexist. Use semver sort or filter for the newest by mtime.

nit — probe-beginframe.ts:65 insecure-tempfile (CodeQL). join(tmpdir(), \hf-beginframe-probe-${Date.now()}.html`)is predictable. CI-side probe, low risk, but trivial swap tomkdtempSync(join(tmpdir(), "hf-beginframe-"))` clears the alert.

nit — comment on s3Transport.ts:14-16 is wrong. Claims "We shell out to the system tar binary" — actual implementation uses the npm tar package. Misleading for the next reader.

Pre-existing reviews

GitHub Advanced Security already filed the two CodeQL inline comments (probe-beginframe.ts:65 insecure tempfile, build-zip.ts:411 shell-from-env). I've covered both above with concrete fixes; the build-zip.ts one is blocker (failing required check), the probe one is nit.

Verdict: REQUEST CHANGES

Reasoning: Three blockers — required-check regression-shards red (Dockerfile.test missing the new package COPY → all 8 shards fail at bun install --frozen-lockfile), required-check Windows tests red (handler.test.ts ENOENT on hono at module load), and required-check CodeQL red (shell-injection via execSync in build-zip.ts). The Lambda handler design and the producer fix are right; ship-ability is gated on CI + the regression-test-doesn't-cover-the-bug gap.

Review by Vai

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants