Skip to content

feat(producer): add Rio-style residual-RMS check to regression harness#882

Open
jrusso1020 wants to merge 1 commit into
mainfrom
feat/producer-audio-residual-rms-regression
Open

feat(producer): add Rio-style residual-RMS check to regression harness#882
jrusso1020 wants to merge 1 commit into
mainfrom
feat/producer-audio-residual-rms-regression

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 commented May 16, 2026

What

Adds Rio-style residual-RMS audio equivalence check to the producer
regression harness. The existing envelope-correlation gate stays
in place; the new check is a complementary, sample-level test of
"does the rendered audio actually match the snapshot, bit-by-bit
after resampling".

Why

Surfaced while validating real-AWS Lambda renders (Phase 6 stack). The
existing audio comparison uses Pearson correlation over RMS envelopes
(2048-sample windows). That measures shape similarity but is
insensitive to level shifts, phase drift, and codec-quantization noise.
Two streams with correlation > 0.95 can still differ audibly.

Rio's audio check (rio/tests/checksum.py::compare_audio_files_ffmpeg)
is sample-level instead: subtract one stream from the other via ffmpeg
amix, run astats, read the residual Overall RMS in dBFS. Identical
streams cancel to silence:

ffmpeg ... -filter_complex \
  "[0:a]aresample=48000,pan=stereo|c0=c0|c1=c1[a0];
   [1:a]aresample=48000,pan=stereo|c0=c0|c1=c1,volume=-1[a1];
   [a0][a1]amix=inputs=2:duration=shortest,astats[out]" \
  -map "[out]" -f null -

Verified on the Lambda render of many-cuts (a 5.6s composition with 4
audio elements): residual RMS = -90.36 dBFS vs the in-process
baseline. Well below Rio's -50 dBFS pass threshold.

How

Three changes:

  1. packages/producer/src/utils/audioRegression.ts — new
    computeAudioResidualRmsDb(rendered, snapshot, maxResidualRmsDb=-50)
    that spawns ffmpeg with Rio's filter graph, parses the Overall RMS level (or per-channel max as fallback), and returns
    { overallDb, ok, rmsLines }. Handles -inf (perfect cancellation),
    per-channel vs overall lines, and missing-audio-stream inputs.

  2. packages/producer/src/utils/audioRegression.test.ts — three
    new vitest cases covering identical sines (-inf), drifted sines
    (440 Hz vs 880 Hz), and missing-audio-stream input. Skips
    automatically when ffmpeg isn't on PATH (matches the existing
    harness invariant).

  3. packages/producer/src/regression-harness.ts — optional
    maxAudioResidualRmsDb field in meta.json validated alongside
    the existing audio knobs. When present, runs the residual check
    after envelope correlation and folds the result into audioPassed.
    When absent (default), skips the residual gate — legacy fixtures
    aren't retroactively broken. New residualRmsDb field on the
    audio_comparison_complete JSON event and on the pretty log line.

Test plan

  • bunx vitest run packages/producer/src/utils/audioRegression.test.ts — 4 tests pass (the existing one + 3 new)
  • bun run --filter @hyperframes/producer typecheck — clean
  • bunx oxlint + bunx oxfmt --check — clean
  • Verified end-to-end against Lambda renders: many-cuts lambda output vs Docker baseline = -90.36 dBFS residual (passes Rio's -50 dBFS gate)

Adoption path

Existing fixtures: no change. The check is opt-in.

New fixtures: add "maxAudioResidualRmsDb": -50 to meta.json. Existing
audio-bearing fixtures can adopt one at a time as their residuals are
verified.

🤖 Generated with Claude Code

The existing audio comparison in the regression harness measures the
Pearson correlation between RMS envelopes of the rendered and snapshot
streams. That catches shape-level drift but is insensitive to level
shifts, phase offsets, or codec-quantization noise — two streams can
correlate >0.9 while differing audibly.

Rio's approach (rio/tests/checksum.py:compare_audio_files_ffmpeg) is
sample-level: subtract the snapshot from the rendered stream, run
`astats`, read the residual Overall RMS in dBFS. Identical streams
cancel to silence (-inf, or sub -90 dBFS for AAC-vs-AAC); anything
>= -50 dBFS is considered drift.

This commit adds the same check as an optional secondary gate:

  - utils/audioRegression.ts: new `computeAudioResidualRmsDb()` that
    spawns ffmpeg with the same filter graph Rio uses (aresample +
    pan + volume=-1 + amix + astats) and returns the parsed Overall
    RMS plus a pass/fail flag.
  - utils/audioRegression.test.ts: 3 new tests covering identical
    streams (-inf result), drifted streams (440Hz vs 880Hz sine),
    and missing-audio-stream input.
  - regression-harness.ts: optional `maxAudioResidualRmsDb` field in
    meta.json. Default is undefined (skip the check) so legacy
    fixtures aren't retroactively gated; new fixtures opt in by
    setting a threshold (e.g. -50). Harness emits `residualRmsDb` in
    the audio_comparison_complete JSON event and the pretty log line.

The existing correlation check stays in place; the new residual check
is independent. They measure complementary properties (shape vs
sample-cancellation) and both should hold for a faithful render.
Copy link
Copy Markdown
Collaborator Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@miguel-heygen
Copy link
Copy Markdown
Collaborator

remove rio mentions btw @jrusso1020

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #882 — Rio-style residual-RMS audio check

Reviewed: packages/producer/src/utils/audioRegression.ts, packages/producer/src/utils/audioRegression.test.ts, packages/producer/src/regression-harness.ts.

No critical or important issues found. Clean implementation.

What I verified

Math correctness of computeAudioResidualRmsDb: The ffmpeg filter graph is correct — resample both streams to 48kHz stereo, phase-invert the snapshot via volume=-1, sum via amix (shortest duration, no dropout), and read astats. This is exactly Rio's approach. The asetpts=N/SR/TB normalization prevents PTS mismatch from causing misalignment. Using amix with volume=-1 instead of the normalize option is the right call for ffmpeg 4.x compatibility.

RMS parsing: pickRms correctly handles both ffmpeg output formats (Overall RMS level dB: -90.32 and Overall RMS level: -inf dB). The regex captures -inf, inf, and numeric values. Mapping both -inf and inf to NEGATIVE_INFINITY is correct for this domain (both represent silence/perfect cancellation). The max mode fallback to per-channel values when the Overall line is absent is a good portability measure.

Boundary conditions:

  • NEGATIVE_INFINITY <= -50 evaluates to true, so perfect cancellation passes. Correct.
  • NaN comparisons return false, so parse failures produce ok: false. Correct.
  • Missing audio streams return { overallDb: NaN, ok: false }. Correct.
  • The maxResidualRmsDb default of -50 matches the Rio convention.

Opt-in mechanism: The maxAudioResidualRmsDb field in meta.json is optional and validated only when present (!== undefined). Existing fixtures are unaffected. The validation correctly requires a finite number when the field exists. The envelope-correlation gate remains in place regardless. Good backward compatibility.

Test coverage: Three ffmpeg-spawning tests cover the key cases (identical streams, drifted streams, missing audio). The skipIf(!HAS_FFMPEG) guard is correct — matches the harness invariant that ffmpeg is only guaranteed in Dockerfile.test and dev boxes.

Harness integration: The residual result is folded into audioPassed (AND'd with the existing correlation gate), logged in both pretty output and structured JSON events. The -inf display formatting (Number.isFinite check) is correct.

Approving — well-structured opt-in addition with solid test coverage.

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-scoped addition. The opt-in maxAudioResidualRmsDb gate is the right shape — legacy fixtures untouched, new fixtures adopt per-fixture as they verify residual values. End-to-end validation on a real Lambda render (many-cuts at -90.36 dBFS) is the kind of evidence I want to see on a regression-harness change. All required CI green.

Strengths

  • audioRegression.ts (filter chain composition): resample → stereo pan → asetpts=N/SR/TBvolume=-1amix duration=shortest matches Rio's intent. Side-stepping amix normalize avoids the ffmpeg-4.x compat trap explicitly called out in the comment.
  • audioRegression.ts (-inf handling): normalization to Number.NEGATIVE_INFINITY is exactly the API I'd want at the callsite — no string-comparison footguns, value <= maxResidualRmsDb Just Works for -Inf.
  • regression-harness.ts:1088-1107: opt-in gating is correctly placed inside the renderedAudio.length > 0 && snapshotAudio.length > 0 branch, so a fixture with maxAudioResidualRmsDb: -50 but a silent/no-audio rendered output skips residual instead of failing with a spurious NaN-driven ok=false. Subtle but right.

Findings

importantaudioRegression.ts:184-186: the "Overall RMS level" regex never matches on modern ffmpeg. Verified locally against ffmpeg 8.0.1: the summary emits Overall on its own line followed by RMS level dB: <value> on the next line — no single-line match for Overall RMS level\s*dB:. So overall is always null and the function falls through to the channel-max path. Result is coincidentally correct (the channel-max regex matches all three RMS level dB: lines — ch1, ch2, and the one under Overall — and Math.max returns the Overall value since Overall RMS = max(ch1,ch2) in practice), but:

  1. The doc comment ("Prefer the Overall line if it appears; otherwise take the max per-channel") describes a code path that never fires on modern ffmpeg. Future readers debugging this will be misled.
  2. The -inf test passes for both paths, so the dead path is invisible to regression.

Fix shape: stateful parse — detect a line literally ending Overall and take the next RMS level dB: line. Or drop the dead Overall regex and pin channel-max as the documented path on modern ffmpeg.

importantregression-harness.ts:1112-1117 + :85-89: TestResult.audio type isn't extended with residualRmsDb. The JSON event at line 1120 emits it (good), but result.audio doesn't carry it. Downstream consequences:

  • regression-harness.ts:692-707 writes audio-failures.json without the residual value. When a fixture fails only because residual exceeded the threshold (correlation fine, lag fine), the failure report shows correlationBelowThreshold: false, lagExceedsLimit: false and no clue that residual was the actual fail reason. Confusing debug surface for the first fixture that hits this gate.
  • The pretty-log "Audio quality: FAILED (correlation: 0.987, threshold: 0.95, residualRMS: -42.10 dBFS)" frames threshold as the correlation threshold even when residual was the failure cause.

Fix shape: add residualRmsDb?: number | null to TestResult.audio, set it at line 1112, include it in the failure-report summary, add a residualAboveThreshold analysis bit.

importantaudioRegression.ts (filter chain, amix duration=shortest): silently truncates the longer input. If rendered is 5s and snapshot is 1s (or vice versa), residual is computed only over the first 1s and any divergence in the tail is dropped. Common regression-harness failure modes (end-trim drift, last-frame off-by-one) are invisible to this gate. Rio's harness compares fixed-duration golden masters where shortest is benign; here rendered vs snapshot can drift in length. Two options:

  1. Probe both inputs' durations first (extractMediaMetadata already used elsewhere in the harness); if they differ by more than ~1 frame, fail the gate up-front before residual is computed.
  2. Use duration=longest + apad on both sides; tail divergence then surfaces as a high-amplitude residual.

Option 1 is cleaner and matches the spirit of "sample-level equivalence."

importantaudioRegression.ts (spawnSync call): proc.error, proc.status, proc.signal are all ignored. If ffmpeg fails before astats runs (codec missing, OOM, killed), rmsLines is empty → { NaN, ok: false }. This is fail-closed by accident — fine for the gate result, but a fixture author chasing a false negative has no signal that ffmpeg itself died. At minimum, when proc.status !== 0, surface a stderr tail into a thrown error or an error?: string field on AudioResidualRms. Harness already requires ffmpeg; spawn failure should be loud.

nitaudioRegression.test.ts:30: HAS_FFMPEG evaluates at module-load time via spawnSync. Other tests in the file (buildRmsEnvelope / compareAudioEnvelopes) pay this cost on every import. A lazy check inside beforeAll would be cheaper.

nitregression-harness.ts:1128-1135: residualSuffix formatting branches on Number.isFiniteNEGATIVE_INFINITY prints as "-inf" (correct), but NaN also prints as "-inf" because Number.isFinite(NaN) === false. A parse-failure result will be displayed as "residualRMS: -inf dBFS" — exactly the opposite of what happened. Split the branch on Number.isNaN separately.

Verdict: APPROVE
Reasoning: Opt-in, additive, CI green, end-to-end validated, with no current-day correctness regression. The four important items are all fixable-in-a-followup quality gaps (dead code path on modern ffmpeg, debug-surface drift on the failure report, length-mismatch blind spot, silent spawn failures) — recommend folding them into a follow-up before the first fixture opts in, otherwise the first real failure will be painful to diagnose.

Review by Vai

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants