feat(lambda): add real-AWS smoke + benchmark script #880
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite.
Phase 6.3 of the distributed rendering plan
(DISTRIBUTED-RENDERING-PLAN.md §11 Phase 6a). Local bash script that
deploys the PR 6.2 SAM template to your own AWS account, renders a
fixture composition through the Step Functions state machine at several
chunk counts, PSNR-compares each output against the in-process
baseline, and tears the stack down.
This is the gate that proves the architecture works on real Lambda
infrastructure. After it runs green and the numbers are good, Phase 6b
(CLI, CDK, docs) can proceed with confidence.
Script lives at examples/aws-lambda/scripts/smoke.sh. Defaults:
- fixture: mp4-h264-sdr
- chunk_counts: 2,4,8
- psnr-threshold: 50 dB
- region: us-east-1
- stack-name: hyperframes-lambda-smoke-<timestamp>
AWS credentials come from the standard resolution chain (env vars,
~/.aws/credentials, SSO, IMDS). Pin a specific profile via --profile or
AWS_PROFILE; the script doesn't ship a default.
Workflow:
1. Pre-flight: verify aws/sam/bun/ffmpeg/jq/zip on PATH; check
credentials via sts:GetCallerIdentity.
2. Build the PR 6.1 ZIP and run verify:zip-size.
3. sam validate --lint + sam deploy under a per-run stack name.
4. Zip the fixture's src/ and upload to the render bucket.
5. For each chunk count, start a Step Functions execution, poll for
completion (25-min cap), download the output mp4 and the execution
history JSON, ffmpeg-psnr against the LFS-tracked in-process
baseline, and append to results.json.
6. Gate on the PSNR threshold.
7. Empty the bucket + sam delete (unless --keep-stack).
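The polling in step 5 can be sketched roughly as follows. The describe-execution call is stubbed so the control flow is runnable here; in the real script it would be an `aws stepfunctions describe-execution` query, and all names in this sketch are illustrative, not lifted from smoke.sh:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll-until-done with a 25-minute cap, as described in step 5.
DEADLINE=$(( $(date +%s) + 25 * 60 ))
attempt=0
status=RUNNING
while [ "$status" = "RUNNING" ]; do
  if [ "$(date +%s)" -ge "$DEADLINE" ]; then
    status=TIMED_OUT              # give up: execution exceeded the cap
    break
  fi
  attempt=$(( attempt + 1 ))
  # --- stub standing in for `aws stepfunctions describe-execution`:
  # here the status flips to SUCCEEDED on the third poll so the loop
  # terminates without AWS access.
  if [ "$attempt" -ge 3 ]; then
    status=SUCCEEDED
  fi
  # the real script would sleep a few seconds between polls
done
echo "final status: $status after $attempt polls"
```

On success the script would then download the output mp4 and the execution history JSON before moving to the PSNR comparison.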
Outputs land under ./lambda-smoke-artifacts/:
- results.json (chunkCount x wallClockMs x psnrAvgDb)
- renders/N<N>-output.mp4
- renders/N<N>-history.json (full Step Functions execution history)
Each run deploys under its own stack name, giving concurrency-safe AWS resource isolation: multiple smokes can run in parallel without races, and teardown guards against stale stacks via cleanup_and_exit on every failure path.
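A minimal sketch of such a per-run name. The `hyperframes-lambda-smoke-<timestamp>` shape is the default stated above; the timestamp format and the PID suffix are assumptions added here so that two runs started within the same second cannot collide:

```shell
#!/usr/bin/env bash
# Per-run stack name: timestamp (matches the stated default shape)
# plus PID suffix (an assumption, not from the script) for uniqueness
# across concurrent runs started in the same second.
STACK_NAME="hyperframes-lambda-smoke-$(date +%Y%m%d%H%M%S)-$$"
echo "$STACK_NAME"
```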
Distinction from CI: this is a maintainer-run gate, not part of regular
CI. The architecture's per-PR safety net is the local Docker-based
BeginFrame probe (PR 6.1) and the upcoming Lambda RIE smoke mode (PR
6.6). No GitHub Actions / OIDC / cross-account secrets required.
This is part of the 8-PR Phase 6 stack; PR 6.3 of 8 — the last PR of
Phase 6a (validation). Phase 6b (CLI + CDK + docs) starts once 6.3's
benchmark numbers come back.
miguel-heygen left a comment:
Review: PR #880 — Smoke test + benchmark scripts
Reviewed: examples/aws-lambda/scripts/smoke.sh, examples/aws-lambda/scripts/eval.sh, README additions.
Critical (90-100)
eval.sh: FAILED variable lost in pipe subshell — gate never fails (confidence: 95)
examples/aws-lambda/scripts/eval.sh, lines ~383-397
The PSNR/audio gate at the bottom of eval.sh pipes into a while loop:
FAILED=0
tail -n +2 "$RESULTS_CSV" | while IFS=, read -r fixture localMs lambdaMs speedup psnr audioStatus audioRms; do
...
FAILED=1
...
done
cleanup_and_exit "${FAILED:-0}"

In bash, cmd | while ... done runs the while body in a subshell. Assignments to FAILED inside the loop are lost when the subshell exits. The parent shell's FAILED stays 0 forever, so cleanup_and_exit always exits 0 — the eval gate never fails, even when PSNR is below threshold or audio drifts.
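The pitfall reproduces in a few lines: the piped loop's assignment dies with its subshell, while the process-substitution form mutates the parent shell's variable:

```shell
#!/usr/bin/env bash
# Pipeline form: the while body runs in a subshell, so the parent
# shell's FAILED is never touched.
FAILED=0
printf 'row1\nrow2\n' | while read -r row; do
  FAILED=1          # mutates only the subshell's copy
done
PIPED_RESULT=$FAILED   # still 0: the mutation was lost

# Process-substitution form: the loop runs in the parent shell, so
# the mutation survives past `done`.
FAILED=0
while read -r row; do
  FAILED=1
done < <(printf 'row1\nrow2\n')
echo "piped=$PIPED_RESULT procsub=$FAILED"   # piped=0 procsub=1
```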
smoke.sh correctly uses process substitution (while ... done < <(jq -c ...)) which avoids the subshell. Apply the same pattern in eval.sh:
FAILED=0
while IFS=, read -r fixture localMs lambdaMs speedup psnr audioStatus audioRms; do
if awk -v p="$psnr" -v t="$PSNR_THRESHOLD" 'BEGIN{exit !(p<t)}'; then
echo "FAIL: $fixture lambda-vs-baseline PSNR=$psnr dB below threshold $PSNR_THRESHOLD" >&2
FAILED=1
fi
case "$audioStatus" in
OK|no-audio-on-either) ;;
*)
echo "FAIL: $fixture audio status=$audioStatus (residual RMS=$audioRms dBFS)" >&2
FAILED=1
;;
esac
done < <(tail -n +2 "$RESULTS_CSV")
cleanup_and_exit "$FAILED"

Important (80-89)
eval.sh: chunkSize missing from Lambda Config JSON (confidence: 82)
examples/aws-lambda/scripts/eval.sh, line ~253
The eval script's INPUT_JSON for Lambda renders specifies maxParallelChunks but omits chunkSize. Without it, the plan step uses whatever default the handler picks. smoke.sh has the same omission but that's less surprising since it tests a single fixture where the default is fine. For a multi-fixture benchmark comparing wall-clocks and speedups, omitting chunkSize means each fixture may get a different chunk granularity, making the speedup numbers not directly comparable. Consider adding --argjson cs "$CHUNK_SIZE" and chunkSize:$cs to the Config, or at least documenting the reliance on the handler's default.
Everything else is clean:
- No hardcoded credentials — AWS credentials come from the standard resolution chain.
- Pre-flight checks for required tools are thorough (aws, sam, bun, ffmpeg, jq, zip). smoke.sh validates the STS identity before building anything.
- Both scripts use unique per-run stack names to prevent races.
- Cleanup on failure is handled via the cleanup_and_exit helper with bucket emptying + sam delete.
- PSNR methodology (per-frame averaging via stats_file) is correct and handles edge cases.
- Audio residual RMS in eval.sh mirrors the Rio convention correctly.
- set -euo pipefail is present in both scripts.
- The --keep-stack flag for debugging is a nice touch.
- README additions are clear and complete.
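As a concrete illustration of that per-frame averaging, a stats_file reduces to a single score like this. The `psnr_avg:` field name is what ffmpeg's psnr filter writes per frame, but the sample lines and the awk parser below are illustrative, not lifted from the scripts:

```shell
#!/usr/bin/env bash
# Two fabricated stats_file lines in the psnr filter's per-frame format.
stats='n:1 mse_avg:0.85 psnr_avg:48.84
n:2 mse_avg:0.89 psnr_avg:48.60'

# Average the psnr_avg field across all frames.
avg=$(printf '%s\n' "$stats" | awk '
  { for (i = 1; i <= NF; i++)
      if ($i ~ /^psnr_avg:/) { sum += substr($i, 10); n++ } }
  END { if (n) printf "%.2f", sum / n }')
echo "run PSNR: ${avg} dB"   # 48.72
```

A production parser would also need to handle `psnr_avg:inf` for identical frames, which is one of the edge cases the review credits the scripts with covering.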
Requesting changes for the subshell bug — eval.sh's quality gate silently passes everything today.
vanceingalls left a comment:
Review of the smoke + benchmark scripts for the lambda adapter. Audited both scripts/eval.sh and scripts/smoke.sh end-to-end; spot-checked examples/aws-lambda/README.md additions and verified the regression harness's argv handling (packages/producer/src/regression-harness.ts:102).
Strengths
- smoke.sh:732-740 uses while ... done < <(jq ...) (process substitution), which correctly keeps FAILED mutations in the parent shell. Good pattern — see blocker below for where the sibling script diverged from this.
- smoke.sh:471 plus the extensive comment block at smoke.sh:461-469 document why the PSNR threshold drops from 50 dB to 40 dB across the heterogeneous ffmpeg/Chromium build pair. That kind of "this number changed because the runtime stack diverged" justification is exactly what future-me wants when an eval drifts.
- eval.sh:336-346 — the explicit if/then/fi explanation for why the -inf normalization can't be [ A ] || [ B ] && C is good defensive scripting under set -e. Caught a real footgun.
- eval.sh:178-189 fixture-dir lookup falls back across two paths (distributed/<F> and tests/<F>), making it easier to add fixtures without script edits.
Blockers
- eval.sh:382-397 — gating is broken. The script always exits 0 even when fixtures fail PSNR or audio thresholds. The loop runs in a subshell because of the tail | while pipeline, so FAILED=1 mutations are discarded when the subshell exits. The outer ${FAILED:-0} at eval.sh:397 is always 0. The "FAIL: …" lines print to stderr, but cleanup_and_exit receives 0 and the script exits clean. This is the script's headline gate — a maintainer running eval.sh for the "is Lambda actually safe to ship" decision will get a green exit code on a failing run.

  Fix: mirror smoke.sh:732-740's process-substitution form:

  FAILED=0
  while IFS=, read -r fixture localMs lambdaMs speedup psnr audioStatus audioRms; do
    ...
  done < <(tail -n +2 "$RESULTS_CSV")
  cleanup_and_exit "$FAILED"

- No trap on EXIT/ERR in either script — stack + bucket leak on unexpected failure paths. Both scripts run set -euo pipefail and rely on calling cleanup_and_exit from explicit failure branches. But any non-routed failure (e.g., aws s3 cp at smoke.sh:622, the jq pipeline at smoke.sh:724-728, ffmpeg failures, SIGINT from an operator) will trip set -e and exit the script with the deployed stack + S3 bucket still running. At ReservedConcurrency=16 and a 16-fixture eval, that's real ongoing cost.

  Fix: trap 'cleanup_and_exit $?' EXIT after cleanup_and_exit is defined. Then drop the explicit cleanup_and_exit N calls in error branches (or change them to plain exit N and let the trap handle the teardown).
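A sketch of that trap wiring. The workflow runs in a child bash purely so its failure path can be observed from outside; the teardown body is stubbed with an echo, and the names follow the review's description rather than the actual scripts:

```shell
#!/usr/bin/env bash
# Under `set -e`, any unrouted failure fires the EXIT trap, so the
# teardown runs even on paths that never call cleanup explicitly.
out=$(bash -c '
  set -euo pipefail
  cleanup_and_exit() {
    trap - EXIT                   # avoid re-entering on the exit below
    echo "teardown ran (exit=$1)"
    # real script: empty the bucket + sam delete here
    exit "$1"
  }
  trap "cleanup_and_exit \$?" EXIT
  echo "deploying"
  false                           # simulated unrouted failure
  echo "unreachable"
' 2>&1) || true
echo "$out"
```

The child exits nonzero and still prints the teardown line, which is exactly the guarantee the explicit-call style cannot give.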
Important
- eval.sh:323 audio residual is artificially -6 dB lower than the true difference. amix defaults normalize=true, so with 2 inputs each is scaled by 1/2 before mixing — meaning the reported "residual RMS" is the true residual minus 6 dB. The Rio -50 dBFS gate is effectively a -44 dBFS gate as measured. Set amix=inputs=2:duration=shortest:dropout_transition=0:normalize=0 to disable input normalization and get the actual residual. (Phase inversion via volume=-1 is correct — volume interprets a bare number as a multiplier.)
- eval.sh:206-217 — local wall-clock includes bun + tsx + harness scaffolding, not just render time. The Lambda timing (eval.sh:255-269) measures pure Step-Functions execution. The "speedup" column is comparing warm cloud render against cold local tooling, which biases against Lambda on tiny fixtures and in favor of Lambda on larger ones. The PR body's "Lambda faster for compositions ≥4s" claim is downstream of this bias. Either (a) document the bias in the script's header + README, or (b) isolate just the producer's render phase via a harness flag.
- No iteration / sample size — single run per (fixture, chunkCount). Lambda cold-start variance can be ±5-10s; one sample isn't statistically meaningful for the speedup table the PR body publishes. For a maintainer benchmark, at least add a --iterations N flag and report the median. Acknowledged implicitly in the PR body, but the eval CSV reports one point as if it were ground truth.
- eval.sh:222 cp "$BASELINE_MP4" "$LOCAL_MP4" is dead code. LOCAL_MP4 is never read after this — psnr_of is called with LAMBDA_MP4 and BASELINE_MP4 directly. The comment block at eval.sh:198-204 explains the harness discards its tempdir, which makes this copy doubly pointless. Either compute PSNR against the copied file (a no-op since it equals the baseline) or delete the copy. Delete is the right call.
- README inconsistency (README.md:16 vs smoke.sh:470). README says --psnr-threshold 50 is the default; script default is 40. Same threshold appears twice in the README (line 16, line 23) — pick one and align with the script.
- Cost surface undocumented. Per-run cost for an eval pass (16 chunks × 4 fixtures × Lambda price × deploy/teardown S3 ops) — operators have no estimate. Even a one-line "expect ~$X per eval pass at default settings" in the README would prevent surprises. Also note: ReservedConcurrency=16 is hardcoded into both scripts (eval.sh:165, smoke.sh:599) — surface this as a flag so cost-conscious runs can throttle.
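On the amix item, the -6 dB figure is just the two-input normalization gain and is easy to verify; the filtergraph in the trailing comment reproduces the fix proposed in the review:

```shell
#!/usr/bin/env bash
# With amix's default normalize behavior and two inputs, each input is
# scaled by 1/2 before summing, so a measured "residual RMS" understates
# the true residual by 20*log10(1/2) dB.
offset=$(awk 'BEGIN { printf "%.2f", 20 * log(0.5) / log(10) }')
echo "normalization offset: ${offset} dB"

# Proposed fix from the review — disable input normalization so the
# phase-inverted mix reports the actual residual:
#   amix=inputs=2:duration=shortest:dropout_transition=0:normalize=0
```

That -6.02 dB offset is why a -50 dBFS gate behaves like a -44 dBFS gate as measured.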
Nits
- eval.sh:104 and smoke.sh:489 help text uses sed -n '2,30p' "$0" — fragile to header reflows. Move help into a usage() heredoc.
- smoke.sh:702-720 and eval.sh:285-292 share the same PSNR-from-stats-file awk parser. Extract it to a shared lib.sh since these scripts will continue to drift otherwise.
- PR body's "Test plan" claims bunx oxfmt --check is clean, but the body also says bash -n is "shellcheck-equivalent." It's not — bash -n only does a syntax parse, not lint. Run actual shellcheck; the subshell-pipeline FAILED issue above would have been flagged by SC2030/SC2031.
- eval.sh:328-330 — the fallback "RMS level dB:" pattern: some ffmpeg builds emit "RMS level:" without the dB unit. Worth widening the fallback or pinning the ffmpeg version expectation in the README's prereqs.
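For the first nit, a usage() heredoc is one way to decouple help text from header layout. The flags shown are drawn from the PR body's defaults, but the exact wording is illustrative:

```shell
#!/usr/bin/env bash
# Help text as a quoted heredoc: immune to header reflows, unlike
# extracting comment lines with `sed -n '2,30p' "$0"`.
usage() {
  cat <<'EOF'
Usage: smoke.sh [options]
  --profile NAME         AWS profile (default: ambient credential chain)
  --chunk-counts LIST    comma-separated chunk counts (default: 2,4,8)
  --psnr-threshold DB    fail below this PSNR (see script for default)
  --keep-stack           skip teardown for debugging
EOF
}
usage
```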
Notes (not blocking)
- mergeStateStatus: UNSTABLE is from optional Perf/Mintlify checks skipping on a docs+scripts diff. Not a real CI signal here.
- No existing reviews or inline comments on the PR — this is the first one.
- Scripts won't run in CI; they're maintainer-only by design. That justifies a lighter test-plan bar than CI-gating code would warrant — but the eval-gate-broken blocker is still a real defect in the script's stated contract ("did the architecture work + is it visually equivalent" gate).
Verdict: REQUEST CHANGES
Reasoning: The eval.sh gate is silently broken (blocker 1) — the script's entire purpose is to fail loudly when Lambda output diverges from baseline, and it can't. Combined with the missing trap-on-exit cleanup (blocker 2), the script is shippable as a "look at the numbers" tool but not as the architecture-validation gate the PR body claims it to be.
— Vai

What
Adds two local bash scripts under examples/aws-lambda/scripts/ that deploy the PR 6.2 SAM template to your AWS account and exercise the Lambda render path against real fixtures.
smoke.sh — one fixture, one or more chunk counts, PSNR-compare against the in-process baseline, tear down. The "did the architecture work end-to-end on real AWS" gate.
eval.sh — multiple fixtures, in-process vs Lambda wall-clock comparison, PSNR-compare both against the in-process baseline. The "is Lambda actually faster and visually equivalent" benchmark.
This is PR 6.3 of the 8-PR Phase 6 stack — the last PR of Phase 6a
(architecture validation on real AWS). Stacked on top of #879 (PR 6.2:
SAM template) and #878 (PR 6.1: handler + ZIP).
Why
PR 6.1 proves BeginFrame works on @sparticuz/chromium inside a Lambda-like Docker container. PR 6.2 produces a deployable SAM template. But neither proves that:
- /tmp
smoke.sh is the architectural gate. eval.sh answers the broader "should we use Lambda for this composition?" question by running several fixtures side-by-side with in-process.
Originally drafted as a GitHub Actions workflow (OIDC + repo secrets).
Walked back to local scripts: the validation gate is maintainer-run
once or twice before a release, not on every PR. Local scripts are
simpler — no OIDC, no role trust policy, no cross-account secrets in
CI — and run against whatever AWS profile you already have configured.
Eval results (from running this stack)
Findings:
The default eval.sh --psnr-threshold is 40 dB; the OSS regression harness's Docker-vs-Docker check uses 50 dB because the environments match.
Two bugs found and addressed during the eval
- PLAN_HASH_MISMATCH on audio-bearing fixtures — plan() left the .plan-work/downloads/*.mp3 temp directory inside planDir while freezePlan() hashed it, then cleaned up afterward. Chunk workers recomputed a different hash. Fixed in PR 6.1 by moving the cleanup before freezePlan. Regression test added to plan.test.ts.
- spawn ffprobe ENOENT in Lambda for fixtures using audio pad/trim — ffmpeg-static only ships ffmpeg, not ffprobe. Fixed in PR 6.1 by adding ffprobe-static and staging both binaries.
Usage
Both scripts deploy a per-run stack name so concurrent runs don't race; both unconditionally tear down (unless --keep-stack). Artifacts land under lambda-smoke-artifacts/ and lambda-eval-artifacts/.
Test plan
- bash -n — both scripts pass (shellcheck-equivalent)
- bunx oxfmt --check — clean
🤖 Generated with Claude Code