Skip to content

chore(engine): observability for sub-composition timeline poll#945

Closed
jrusso1020 wants to merge 2 commits into
mainfrom
chore/regression-poll-observability
Closed

chore(engine): observability for sub-composition timeline poll#945
jrusso1020 wants to merge 2 commits into
mainfrom
chore/regression-poll-observability

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

What

Adds a single structured log line per render in pollSubCompositionTimelines (packages/engine/src/services/frameCapture.ts) capturing:

  • hostIds — every [data-composition-id] host on the page
  • timelineIdsBefore / timelineIdsAfterwindow.__timelines keys around the poll
  • addedDuringPoll — the diff (which timelines registered while we waited)
  • pollMs — wall-clock time spent polling
  • ready — whether all host timelines were present at the deadline
  • rebindFired — whether __hfForceTimelineRebind() was invoked

Why

Regression CI has been flaky on main and PRs since the 2026-05-18 renderer stack for async-loaded map blocks (#928 → main). Different fixtures fail on different runs:

PSNR drops are 18–30 dB across animation-heavy intervals — the fingerprint of render time shifted ~1-2 frames vs. baseline. Hypothesis: the count-based rebind heuristic in pollSubCompositionTimelines (introduced 2c84c9a, amended 4bf2fa5) is racy. On slow CI runners, sub-composition scripts register their timelines between the before-snapshot and the end of the poll → rebind fires → seek timing shifts → PSNR drops. On fast runners, the same fixture registers everything synchronously → no rebind → baseline matches.

This PR doesn't fix the race — it adds the observability needed to confirm the hypothesis before reverting/redesigning. Once correlation is established between failing runs and rebindFired=true, this PR is reverted and a real fix follows.

How

Replaces the timelinesBeforePoll/timelinesAfterPoll count comparison with a set-difference computed from explicit ID lists. Behavior is unchanged: addedDuringPoll.length > 0 is mathematically equivalent to after > before under the invariant that window.__timelines keys are never removed mid-poll.

The log line is unconditional — we want it on every CI shard, regardless of pass/fail.

Test plan

  • Trigger regression workflow (auto, on PR open) and grep CI logs for pollSubCompositionTimelines JSON lines
  • Re-run regression 3-4 times via empty commits or shard rerun
  • Verify correlation: failing shards' fixtures should show rebindFired:true and a non-empty addedDuringPoll; passing shards should show rebindFired:false
  • No baseline regen needed — render output is unchanged

This PR is intentionally landing as draft and is intended to be reverted once the investigation completes.

Captures host IDs, timeline IDs before/after the poll, the diff set, and
whether __hfForceTimelineRebind fired — emitted as a single JSON log line
per render. Lets us correlate flaky regression runs (style-7-prod,
gsap-letters-render-compat, style-3-prod) with whether the count-based
rebind heuristic fired.

Behavior is unchanged: the rebind condition `addedDuringPoll.length > 0`
is equivalent to the prior `timelinesAfterPoll > timelinesBeforePoll`
under the invariant that timeline IDs are never removed mid-poll.

Intended to be reverted once the race condition is confirmed and patched.
Lets every shard run to completion and emit its pollSubCompositionTimelines
JSON log line, so we can correlate rebindFired across all shards on a single
run instead of only seeing data from whichever shard happened to fail first.

Restore fail-fast: true before merging or reverting this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant