
feat(lambda): add real-AWS smoke + benchmark script #880

Open

jrusso1020 wants to merge 1 commit into 06-2-feat_lambda_add_sam_template from 06-3-feat_lambda_real_aws_smoke_workflow

Conversation

@jrusso1020 jrusso1020 commented May 15, 2026

What

Adds two local bash scripts under examples/aws-lambda/scripts/ that
deploy the PR 6.2 SAM template to your AWS account and exercise the
Lambda render path against real fixtures.

  • smoke.sh — one fixture, one or more chunk counts, PSNR-compare
    against the in-process baseline, tear down. The "did the architecture
    work end-to-end on real AWS" gate.
  • eval.sh — multiple fixtures, in-process vs Lambda wall-clock
    comparison, PSNR-compare both against the in-process baseline. The
    "is Lambda actually faster and visually equivalent" benchmark.

This is PR 6.3 of the 8-PR Phase 6 stack — the last PR of Phase 6a
(architecture validation on real AWS). Stacked on top of #879 (PR 6.2:
SAM template) and #878 (PR 6.1: handler + ZIP).

Why

PR 6.1 proves BeginFrame works on @sparticuz/chromium inside a
Lambda-like Docker container. PR 6.2 produces a deployable SAM
template. But neither proves that:

  • the deployed Lambda actually launches Chrome in real Lambda's /tmp
  • the Step Functions Map state actually fans out chunk renders
  • the assembled mp4 actually matches the in-process baseline at PSNR
  • per-stage wall-clock and cost numbers match what we'd expect

smoke.sh is the architectural gate. eval.sh answers the broader
"should we use Lambda for this composition?" question by running
several fixtures side-by-side with in-process.

Originally drafted as a GitHub Actions workflow (OIDC + repo secrets).
Walked back to local scripts: the validation gate is maintainer-run
once or twice before a release, not on every PR. Local scripts are
simpler — no OIDC, no role trust policy, no cross-account secrets in
CI — and run against whatever AWS profile you already have configured.

Eval results (from running this stack)

Fixture                      Frames      Local in-process  Lambda (N=4)  Speedup          PSNR (Lambda vs baseline)
mp4-h264-sdr                 60 (2s)     7.1s              12.0s         0.59× (slower)   47.46 dB
many-cuts                    168 (5.6s)  25.6s             17.7s         1.44×            37.08 dB
gsap-letters-render-compat   120 (4s)    20.7s             12.0s         1.72×            52.61 dB
heygen-promo-preview-assets  480 (16s)   (host err)        63.1s         n/a              33.63 dB

Findings:

  • Architecture works end-to-end on real AWS — every fixture produces a valid mp4.
  • Lambda is faster for compositions ≥ ~4 seconds — 1.44-1.72× at N=4 chunks.
  • Tiny compositions (60 frames) are overhead-bound — Lambda is 1.7× slower for 2-second videos. Cold-start + Chrome decompression + S3 round-trip dominate.
  • Visual equivalence: PSNR 33-52 dB across all fixtures. The Docker-baked baseline uses a different ffmpeg + chrome-headless-shell build than Lambda's sparticuz Chromium + ffmpeg-static, so 3-15 dB drift is expected. Lower numbers (33 dB on heygen-promo, 37 dB on many-cuts) come from richer fixtures with sub-comps and audio.

The default eval.sh --psnr-threshold is 40 dB; the OSS regression
harness's Docker-vs-Docker check uses 50 dB because the environments
match.

Two bugs found and addressed during the eval

  1. PLAN_HASH_MISMATCH on audio-bearing fixtures — plan() left the
    .plan-work/downloads/*.mp3 temp directory inside planDir while
    freezePlan() hashed it, then cleaned up afterward. Chunk workers
    recomputed a different hash. Fixed in PR 6.1 by moving the cleanup
    before freezePlan. Regression test added to plan.test.ts.

  2. spawn ffprobe ENOENT in Lambda for fixtures using audio
    pad/trim — ffmpeg-static only ships ffmpeg, not ffprobe.
    Fixed in PR 6.1 by adding ffprobe-static and staging both
    binaries.

Usage

# Smoke — one fixture, full architecture gate
./examples/aws-lambda/scripts/smoke.sh

# Eval — multiple fixtures, in-process vs Lambda
./examples/aws-lambda/scripts/eval.sh \
  --fixtures mp4-h264-sdr,many-cuts,gsap-letters-render-compat \
  --chunk-count 4

AWS_PROFILE=<your-profile> ./examples/aws-lambda/scripts/eval.sh

Both scripts deploy a per-run stack name so concurrent runs don't race;
both unconditionally tear down (unless --keep-stack). Artifacts land
under lambda-smoke-artifacts/ and lambda-eval-artifacts/.
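The per-run-name + teardown pattern is roughly the following; variable names and the bucket handling are assumptions, not quotes from the scripts:

# Unique stack name per invocation so concurrent runs can't collide.
STACK_NAME="hyperframes-lambda-smoke-$(date +%s)-$$"

cleanup_and_exit() {
  local code="${1:-0}"
  if [[ "${KEEP_STACK:-0}" != 1 ]]; then
    # CloudFormation won't delete a non-empty bucket, so empty it first.
    aws s3 rm "s3://$RENDER_BUCKET" --recursive || true
    sam delete --stack-name "$STACK_NAME" --region "$REGION" --no-prompts || true
  fi
  exit "$code"
}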

Test plan

  • bash -n — both scripts pass shellcheck-equivalent
  • bunx oxfmt --check — clean
  • Real-AWS smoke run — all 3 chunk counts succeed
  • Real-AWS eval run — 4 fixtures, results above
  • Bugs found in eval (planHash, ffprobe) fixed in PR 6.1 and verified by re-running the eval

🤖 Generated with Claude Code


Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

Phase 6.3 of the distributed rendering plan
(DISTRIBUTED-RENDERING-PLAN.md §11 Phase 6a). Local bash script that
deploys the PR 6.2 SAM template to your own AWS account, renders a
fixture composition through the Step Functions state machine at several
chunk counts, PSNR-compares each output against the in-process
baseline, and tears the stack down.

This is the gate that proves the architecture works on real Lambda
infrastructure. After it runs green and the numbers are good, Phase 6b
(CLI, CDK, docs) can proceed with confidence.

Script lives at examples/aws-lambda/scripts/smoke.sh. Defaults:
  - fixture:        mp4-h264-sdr
  - chunk_counts:   2,4,8
  - psnr-threshold: 50 dB
  - region:         us-east-1
  - stack-name:     hyperframes-lambda-smoke-<timestamp>

AWS credentials come from the standard resolution chain (env vars,
~/.aws/credentials, SSO, IMDS). Pin a specific profile via --profile or
AWS_PROFILE; the script doesn't ship a default.

Workflow:
  1. Pre-flight: verify aws/sam/bun/ffmpeg/jq/zip on PATH; check
     credentials via sts:GetCallerIdentity.
  2. Build the PR 6.1 ZIP and run verify:zip-size.
  3. sam validate --lint + sam deploy under a per-run stack name.
  4. Zip the fixture's src/ and upload to the render bucket.
  5. For each chunk count, start a Step Functions execution, poll for
     completion (25-min cap), download the output mp4 and the execution
     history JSON, ffmpeg-psnr against the LFS-tracked in-process
     baseline, and append to results.json.
  6. Gate on the PSNR threshold.
  7. Empty the bucket + sam delete (unless --keep-stack).

Outputs land under ./lambda-smoke-artifacts/:
  - results.json (chunkCount x wallClockMs x psnrAvgDb)
  - renders/N<N>-output.mp4
  - renders/N<N>-history.json (full Step Functions execution history)
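The start-and-poll portion of step 5 boils down to something like this (input shape and variable names assumed, not quoted from smoke.sh):

EXEC_ARN=$(aws stepfunctions start-execution \
  --state-machine-arn "$STATE_MACHINE_ARN" \
  --input "$INPUT_JSON" \
  --query executionArn --output text)

STATUS=RUNNING
while [ "$STATUS" = "RUNNING" ]; do
  sleep 10   # the actual script also enforces the 25-minute cap here
  STATUS=$(aws stepfunctions describe-execution \
    --execution-arn "$EXEC_ARN" --query status --output text)
done
[ "$STATUS" = "SUCCEEDED" ] || cleanup_and_exit 1

aws stepfunctions get-execution-history \
  --execution-arn "$EXEC_ARN" --no-paginate > "renders/N${N}-history.json"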

Per-run stack names give concurrency-safe AWS resource isolation: run
multiple smokes in parallel without races; teardown guards against
stale stacks via cleanup_and_exit on every failure path.

Distinction from CI: this is a maintainer-run gate, not part of regular
CI. The architecture's per-PR safety net is the local Docker-based
BeginFrame probe (PR 6.1) and the upcoming Lambda RIE smoke mode (PR
6.6). No GitHub Actions / OIDC / cross-account secrets required.

This is PR 6.3 of the 8-PR Phase 6 stack — the last PR of Phase 6a
(validation). Phase 6b (CLI + CDK + docs) starts once 6.3's benchmark
numbers come back.
@miguel-heygen (Collaborator) left a comment

Review: PR #880 — Smoke test + benchmark scripts

Reviewed: examples/aws-lambda/scripts/smoke.sh, examples/aws-lambda/scripts/eval.sh, README additions.

Critical (90-100)

eval.sh: FAILED variable lost in pipe subshell — gate never fails (confidence: 95)
examples/aws-lambda/scripts/eval.sh, lines ~383-397

The PSNR/audio gate at the bottom of eval.sh pipes into a while loop:

FAILED=0
tail -n +2 "$RESULTS_CSV" | while IFS=, read -r fixture localMs lambdaMs speedup psnr audioStatus audioRms; do
  ...
  FAILED=1
  ...
done
cleanup_and_exit "${FAILED:-0}"

In bash, cmd | while ... done runs the while body in a subshell. Assignments to FAILED inside the loop are lost when the subshell exits. The parent shell's FAILED stays 0 forever, so cleanup_and_exit always exits 0 — the eval gate never fails, even when PSNR is below threshold or audio drifts.

smoke.sh correctly uses process substitution (while ... done < <(jq -c ...)) which avoids the subshell. Apply the same pattern in eval.sh:

FAILED=0
while IFS=, read -r fixture localMs lambdaMs speedup psnr audioStatus audioRms; do
  if awk -v p="$psnr" -v t="$PSNR_THRESHOLD" 'BEGIN{exit !(p<t)}'; then
    echo "FAIL: $fixture lambda-vs-baseline PSNR=$psnr dB below threshold $PSNR_THRESHOLD" >&2
    FAILED=1
  fi
  case "$audioStatus" in
    OK|no-audio-on-either) ;;
    *)
      echo "FAIL: $fixture audio status=$audioStatus (residual RMS=$audioRms dBFS)" >&2
      FAILED=1
      ;;
  esac
done < <(tail -n +2 "$RESULTS_CSV")

cleanup_and_exit "$FAILED"
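A self-contained way to see the difference (not from either script):

FAILED=0
printf 'x\n' | while read -r _; do FAILED=1; done
echo "$FAILED"   # prints 0: the assignment happened in a subshell

FAILED=0
while read -r _; do FAILED=1; done < <(printf 'x\n')
echo "$FAILED"   # prints 1: the loop ran in the current shell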

Important (80-89)

eval.sh: chunkSize missing from Lambda Config JSON (confidence: 82)
examples/aws-lambda/scripts/eval.sh, line ~253

The eval script's INPUT_JSON for Lambda renders specifies maxParallelChunks but omits chunkSize. Without it, the plan step uses whatever default the handler picks. smoke.sh has the same omission but that's less surprising since it tests a single fixture where the default is fine. For a multi-fixture benchmark comparing wall-clocks and speedups, omitting chunkSize means each fixture may get a different chunk granularity, making the speedup numbers not directly comparable. Consider adding --argjson cs "$CHUNK_SIZE" and chunkSize:$cs to the Config, or at least documenting the reliance on the handler's default.
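Roughly, assuming the input document is built with jq (the shape here is a guess, not quoted from eval.sh):

# Hypothetical: pass chunkSize alongside maxParallelChunks when building INPUT_JSON.
INPUT_JSON=$(jq -n \
  --argjson mp "$MAX_PARALLEL_CHUNKS" \
  --argjson cs "$CHUNK_SIZE" \
  '{Config: {maxParallelChunks: $mp, chunkSize: $cs}}')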


Everything else is clean:

  • No hardcoded credentials — AWS credentials come from the standard resolution chain.
  • Pre-flight checks for required tools are thorough (aws, sam, bun, ffmpeg, jq, zip).
  • smoke.sh validates the STS identity before building anything.
  • Both scripts use unique per-run stack names to prevent races.
  • Cleanup on failure is handled via the cleanup_and_exit helper with bucket emptying + sam delete.
  • PSNR methodology (per-frame averaging via stats_file) is correct and handles edge cases.
  • Audio residual RMS in eval.sh mirrors the Rio convention correctly.
  • set -euo pipefail is present in both scripts.
  • --keep-stack flag for debugging is a nice touch.
  • README additions are clear and complete.

Requesting changes for the subshell bug — eval.sh's quality gate silently passes everything today.

@vanceingalls (Collaborator) left a comment

Review of the smoke + benchmark scripts for the lambda adapter. Audited both scripts/eval.sh and scripts/smoke.sh end-to-end; spot-checked examples/aws-lambda/README.md additions and verified the regression harness's argv handling (packages/producer/src/regression-harness.ts:102).

Strengths

  • smoke.sh:732-740 uses while ... done < <(jq ...) (process substitution), which correctly keeps FAILED mutations in the parent shell. Good pattern — see blocker below for where the sibling script diverged from this.
  • smoke.sh:471 + extensive comment block at smoke.sh:461-469 document why the PSNR threshold drops from 50 dB to 40 dB across the heterogeneous ffmpeg/Chromium build pair. That kind of "this number changed because the runtime stack diverged" justification is exactly what future-me wants when an eval drifts.
  • eval.sh:336-346 — the explicit if/then/fi explanation for why the -inf normalization can't be [ A ] || [ B ] && C is good defensive scripting under set -e. Caught a real footgun.
  • eval.sh:178-189 fixture-dir lookup falls back across two paths (distributed/<F> and tests/<F>), making it easier to add fixtures without script edits.

Blockers

  1. eval.sh:382-397 — gating is broken. The script always exits 0 even when fixtures fail PSNR or audio thresholds. The loop runs in a subshell because of the tail | while pipeline, so FAILED=1 mutations are discarded when the subshell exits. The outer ${FAILED:-0} at eval.sh:397 is always 0. The "FAIL: …" lines print to stderr, but cleanup_and_exit receives 0 and the script exits clean. This is the script's headline gate — a maintainer running eval.sh for the "is Lambda actually safe to ship" decision will get a green exit code on a failing run.

    Fix: mirror smoke.sh:732-740's process-substitution form:

    FAILED=0
    while IFS=, read -r fixture localMs lambdaMs speedup psnr audioStatus audioRms; do
      ...
    done < <(tail -n +2 "$RESULTS_CSV")
    cleanup_and_exit "$FAILED"
  2. No trap on EXIT/ERR in either script — stack + bucket leak on unexpected failure paths. Both scripts run set -euo pipefail and rely on calling cleanup_and_exit from explicit failure branches. But any non-routed failure (e.g., aws s3 cp at smoke.sh:622, the jq pipeline at smoke.sh:724-728, ffmpeg failures, SIGINT from an operator) will trip set -e and exit the script with the deployed stack + S3 bucket still running. At ReservedConcurrency=16 and a 16-fixture eval, that's real ongoing cost.

    Fix: trap 'cleanup_and_exit $?' EXIT after cleanup_and_exit is defined. Then drop the explicit cleanup_and_exit N calls in error branches (or change them to plain exit N and let the trap handle the teardown).
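    Minimal shape of that, runnable on its own; the teardown body is a stand-in for the real one:

    #!/usr/bin/env bash
    set -euo pipefail

    cleanup_and_exit() {
      local code="${1:-0}"
      trap - EXIT                     # don't re-enter when this function exits
      echo "empty bucket + sam delete here" >&2   # stand-in for the real teardown
      exit "$code"
    }

    trap 'cleanup_and_exit $?' EXIT   # covers set -e failures and normal exits
    trap 'cleanup_and_exit 130' INT   # covers operator Ctrl-C

    false                             # any unrouted failure now still tears down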

Important

  1. eval.sh:323 audio residual reads 6 dB lower than the true difference. amix defaults to normalize=true, so with 2 inputs each input is scaled by 1/2 before mixing — meaning the reported "residual RMS" is the true residual minus 6 dB. The Rio -50 dBFS gate is effectively a -44 dBFS gate as measured. Set amix=inputs=2:duration=shortest:dropout_transition=0:normalize=0 to disable input normalization and get the actual residual (see the sketch after this list). (Phase inversion via volume=-1 is correct — volume interprets a bare number as a multiplier.)

  2. eval.sh:206-217 — local wall-clock includes bun + tsx + harness scaffolding, not just render time. The Lambda timing (eval.sh:255-269) measures pure Step-Functions execution. The "speedup" column is comparing warm cloud render against cold local tooling, which biases against Lambda on tiny fixtures and in favor of Lambda on larger ones. The PR body's "Lambda faster for compositions ≥4s" claim is downstream of this bias. Either (a) document the bias in the script's header + README, or (b) isolate just the producer's render phase via a harness flag.

  3. No iteration / sample size — single run per (fixture, chunkCount). Lambda cold-start variance can be ±5-10s; one sample isn't statistically meaningful for the speedup table the PR body publishes. For a maintainer benchmark, at least add a --iterations N flag and report median. Acknowledged implicitly in the PR body, but the eval CSV reports one point as if it were ground truth.

  4. eval.sh:222 cp "$BASELINE_MP4" "$LOCAL_MP4" is dead code. LOCAL_MP4 is never read after this — psnr_of is called with LAMBDA_MP4 and BASELINE_MP4 directly. The comment block at eval.sh:198-204 explains the harness discards its tempdir, which makes this copy doubly pointless. Either compute PSNR against the copied file (no-op since it equals the baseline) or delete the copy. Delete is the right call.

  5. README inconsistency (README.md:16 vs smoke.sh:470). README says --psnr-threshold 50 is the default; script default is 40. Same threshold appears twice in the README (line 16, line 23) — pick one and align with the script.

  6. Cost surface undocumented. Per-run cost for an eval pass (16 chunks × 4 fixtures × Lambda price × deploy/teardown S3 ops) — operators have no estimate. Even a one-line "expect ~$X per eval pass at default settings" in the README would prevent surprises. Also note: ReservedConcurrency=16 is hardcoded into both scripts (eval.sh:165, smoke.sh:599) — surface this as a flag so cost-conscious runs can throttle.
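The residual measurement from Important item 1 with input normalization disabled would look something like this; paths and the exact filter wording in eval.sh are assumed, not copied:

ffmpeg -i lambda.mp4 -i baseline.mp4 -vn -filter_complex \
  "[1:a]volume=-1[inv];[0:a][inv]amix=inputs=2:duration=shortest:dropout_transition=0:normalize=0,astats=measure_perchannel=none:measure_overall=RMS_level" \
  -f null - 2>&1 | grep 'RMS level'
# Perfectly matching audio cancels; the overall RMS level is the true residual in dBFS.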

Nits

  1. eval.sh:104 and smoke.sh:489 help text uses sed -n '2,30p' "$0" — fragile to header reflows. Move help into a usage() heredoc (see the sketch after this list).
  2. smoke.sh:702-720 and eval.sh:285-292 share the same PSNR-from-stats-file awk parser. Extract to a shared lib.sh since these scripts will continue to drift otherwise.
  3. PR body's "Test plan" claims bunx oxfmt --check clean, but the body also says bash -n is "shellcheck-equivalent." It's not — bash -n only does a syntax parse, not lint. Run actual shellcheck; the subshell-pipeline FAILED issue above would have been flagged by SC2030/SC2031.
  4. eval.sh:328-330 — fallback RMS level dB : pattern. Some ffmpeg builds emit RMS level: without the dB unit. Worth widening the fallback or pinning the ffmpeg version expectation in the README's prereqs.
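For nit 1, e.g. (flag names illustrative, based on the options the PR body mentions):

usage() {
  cat <<'EOF'
Usage: eval.sh [options]
  --fixtures a,b,c       fixtures to render (comma-separated)
  --chunk-count N        chunk count for the Lambda renders
  --psnr-threshold DB    fail the gate below this PSNR
  --keep-stack           skip teardown for debugging
EOF
}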

Notes (not blocking)

  • mergeStateStatus: UNSTABLE is from optional Perf / Mintlify checks skipping on a docs+scripts diff. Not a real CI signal here.
  • No existing reviews or inline comments on the PR — this is the first one.
  • Scripts won't run in CI; they're maintainer-only by design. That justifies a lighter test-plan bar than CI-gating code would warrant — but the eval-gate-broken blocker is still a real defect in the script's stated contract ("did the architecture work + is it visually equivalent" gate).

Verdict: REQUEST CHANGES
Reasoning: The eval.sh gate is silently broken (blocker 1) — the script's entire purpose is to fail loudly when Lambda output diverges from baseline, and it can't. Combined with the missing trap-on-exit cleanup (blocker 2), the script is shippable as a "look at the numbers" tool but not as the architecture-validation gate the PR body claims it to be.

— Vai
