Anchor host memory-baseline hard gate to the recent ceiling #5352

Open: habdelra wants to merge 1 commit into main from host-memory-baseline-ceiling-gate

habdelra commented Jun 28, 2026

Background

The Host Memory Baseline check (packages/host/scripts/check-memory-baseline.mjs) compares each test module's post-GC heap boundary delta against a rolling-window baseline. The delta is usedJSHeapSize(moduleEnd) − usedJSHeapSize(moduleStart), with each reading taken after a 3-cycle settle-GC. The check hard-fails (blocking) when the current delta exceeds 2× the window mean or the window mean plus 50 MB.
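The check described above can be sketched roughly as follows. This is a minimal illustration, not the contents of check-memory-baseline.mjs: the function names are hypothetical, 1 MB is assumed to be 1024 × 1024 bytes, and reading the hard gate as "more than 2× the mean, or more than 50 MB above it" is an assumption about how the two thresholds combine.

```javascript
// Assumption: 1 MB = 1024 * 1024 bytes; the real script may convert differently.
const BYTES_PER_MB = 1024 * 1024;

// Boundary delta in MB from two post-GC usedJSHeapSize readings (in bytes).
function boundaryDeltaMb(heapAtModuleStart, heapAtModuleEnd) {
  return (heapAtModuleEnd - heapAtModuleStart) / BYTES_PER_MB;
}

// Arithmetic mean of the rolling-window samples (in MB).
function mean(samplesMb) {
  return samplesMb.reduce((sum, s) => sum + s, 0) / samplesMb.length;
}

// Assumed pre-change hard gate: anchored to the rolling-window mean.
function hardFailsAgainstMean(currentMb, windowSamplesMb) {
  const baseline = mean(windowSamplesMb);
  return currentMb > 2 * baseline || currentMb - baseline > 50;
}
```

Under these assumptions, a 129.4 MB delta measured against the window [106.7, −1.4, −8.5, −8.6, 131.0] (mean about 43.8 MB) trips the mean-anchored gate even though the window already contains a larger sample.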

For a memory-heavy module, that metric is dominated by whether the settle-GC happens to drain a large collectable transient before the boundary measurement. That drain timing is non-deterministic, so the delta swings by more than 100 MB from run to run with no underlying retention. The clearest tell is a baseline window whose samples straddle zero, e.g. Integration | search-entries resource: [106.7, −1.4, −8.5, −8.6, 131.0]. A negative sample means the heap was smaller at module end than at module start; a module that actually retained memory could never produce one.
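The "straddles zero" tell is easy to compute from a window. A small sketch, assuming the samples are plain numbers in MB; `describeWindow` is a hypothetical helper, not something the script is known to contain:

```javascript
// Summarise a rolling window of boundary deltas (MB).
function describeWindow(samplesMb) {
  const min = Math.min(...samplesMb);
  const max = Math.max(...samplesMb);
  const mean = samplesMb.reduce((sum, s) => sum + s, 0) / samplesMb.length;
  return {
    min,
    max,
    mean,
    spread: max - min,
    // True when the window holds both negative and positive deltas.
    straddlesZero: min < 0 && max > 0,
  };
}

const stats = describeWindow([106.7, -1.4, -8.5, -8.6, 131.0]);
console.log(stats.straddlesZero, stats.spread.toFixed(1)); // true 139.6
```

A spread of roughly 140 MB around a mean near 44 MB is what makes a mean-anchored threshold fire on values the module routinely produces.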

Why this module swings without leaking

The flagged module stands up two in-browser realms plus the base realm per test and runs real searches, so each test allocates a large transient graph that GC reclaims. Heap analysis over its run confirms nothing survives:

  • Post-GC used-heap is flat across every test in the module (end-to-end drift < 1 MB, app_instances=0 throughout) — no per-test growth.
  • A heap-snapshot diff from early in the run to module end shows the total node count decreasing, with the instance counts of every candidate retainer flat or down: SearchEntriesResource 4→4, Realm 5→5, Loader 9→9, ApplicationInstance 4→3, StoreService 5→5. A genuine per-test leak would raise each of these counts by one per test.
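The snapshot comparison in the second bullet amounts to checking that no candidate retainer's instance count grew. A sketch using the counts quoted above; the object shape and the `growingRetainers` helper are hypothetical, and the counts are assumed to have been read off the snapshots by hand:

```javascript
// Instance counts from the early snapshot and the module-end snapshot.
const early = {
  SearchEntriesResource: 4,
  Realm: 5,
  Loader: 9,
  ApplicationInstance: 4,
  StoreService: 5,
};
const atModuleEnd = {
  SearchEntriesResource: 4,
  Realm: 5,
  Loader: 9,
  ApplicationInstance: 3,
  StoreService: 5,
};

// Names of classes whose instance count grew between the two snapshots.
function growingRetainers(before, after) {
  return Object.keys(after).filter(
    (name) => after[name] > (before[name] ?? 0),
  );
}

console.log(growingRetainers(early, atModuleEnd)); // []
```

An empty result matches the flat-or-down counts reported for this module.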

So the recurring red CI result comes from the gate misreading boundary GC-timing noise as a regression, not from a heap regression in the module under test.

The change

Anchor the hard (build-blocking) gate to the recent ceiling (the max of the rolling samples) instead of the mean: a run can't hard-fail on a delta the module has already produced in its window, while a value that clears the ceiling by the hard threshold still fails. When a module's variance is low (ceiling ≈ mean) the gate is unchanged. The soft, non-blocking warning stays anchored to the mean, so a genuine upward trend still surfaces early.
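A rough sketch of the two-tier gate this describes, under stated assumptions: the hard threshold is modelled only as an absolute margin of 50 MB over the ceiling (the 2× rule is left out for brevity), the 25 MB soft threshold is a placeholder because the PR does not state it, and `classify` is a hypothetical name rather than a function from the script.

```javascript
// 50 MB is the "+50 MB" hard limit mentioned in the PR description.
const HARD_THRESHOLD_MB = 50;
// Placeholder only: the PR does not state the soft threshold.
const SOFT_THRESHOLD_MB = 25;

function classify(currentMb, samplesMb) {
  const ceiling = Math.max(...samplesMb);
  const mean = samplesMb.reduce((sum, s) => sum + s, 0) / samplesMb.length;

  // Hard gate: measured from the recent ceiling (max sample).
  if (currentMb - ceiling > HARD_THRESHOLD_MB) return 'fail';
  // Soft warning: still measured from the rolling mean.
  if (currentMb - mean > SOFT_THRESHOLD_MB) return 'warn';
  return 'pass';
}

const recentSamples = [106.7, -1.4, -8.5, -8.6, 131.0];
console.log(classify(129.4, recentSamples)); // warn
console.log(classify(205, recentSamples)); // fail
```

When the samples are tightly clustered, the ceiling and the mean are nearly equal, so both tiers measure from almost the same point and the hard gate behaves as it did before.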

Behavior against the current baseline

| Module | Current delta | Ceiling (max recent sample) | Result |
| --- | --- | --- | --- |
| Integration \| search-entries resource | 129.4 MB | 131.0 MB | passes (non-blocking warning) |
| Integration \| search-entries resource | 205 MB | 131.0 MB | hard-fails (clears ceiling +74) |
| low-variance module (samples ≈ 17.6 MB) | 70 MB | 17.7 MB | hard-fails (unchanged) |

🤖 Generated with Claude Code

The hard (build-blocking) gate compared a module's current post-GC heap
boundary delta against the rolling-mean baseline. For memory-heavy modules
that delta is dominated by non-deterministic settle-GC drain timing and swings
100MB+ run-to-run with no retention, so the gate fired on values the module had
already produced. Measure the hard regression from the recent ceiling (max
sample) instead; low-variance modules (ceiling ~= mean) are unchanged, and the
soft warning stays mean-based.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR adjusts the host memory-baseline gating logic so the hard (build-blocking) failure check is anchored to the recent observed ceiling (max of rolling samples) rather than the rolling mean, reducing false CI failures for high-variance modules while keeping the soft warning anchored to the mean.

Changes:

  • Add a baselineCeiling() helper that derives a module’s recent-window max delta (with backward-compatible fallback to legacy delta_mb).
  • Change the hard-failure condition to compare current delta against the computed ceiling (by hardThreshold), while leaving the soft-warning comparison against the rolling mean.
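The helper described in the first bullet might look something like the sketch below. Only the name baselineCeiling() and the legacy delta_mb field come from this review; the `samples` field name and the `null` return for an empty entry are assumptions made for illustration.

```javascript
// Sketch of a ceiling helper with a legacy fallback.
// Assumed entry shapes:
//   { samples: number[] }  rolling-window entry (field name is hypothetical)
//   { delta_mb: number }   legacy single-value entry
function baselineCeiling(entry) {
  if (Array.isArray(entry?.samples) && entry.samples.length > 0) {
    // Recent ceiling: the largest delta in the rolling window.
    return Math.max(...entry.samples);
  }
  if (typeof entry?.delta_mb === 'number') {
    // Legacy baseline: a single value doubles as its own ceiling.
    return entry.delta_mb;
  }
  // No usable baseline for this module.
  return null;
}
```

With a single legacy value the ceiling and the mean coincide, so such entries would be gated exactly as before.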

github-actions Bot commented Jun 28, 2026

Preview deployments

Host Test Results

    1 files      1 suites   2h 26m 29s ⏱️
3 272 tests 3 257 ✅ 15 💤 0 ❌
3 291 runs  3 276 ✅ 15 💤 0 ❌

Results for commit f864e11.

Realm Server Test Results

    1 files      1 suites   10m 13s ⏱️
1 661 tests 1 661 ✅ 0 💤 0 ❌
1 740 runs  1 740 ✅ 0 💤 0 ❌

Results for commit f864e11.

@habdelra habdelra requested a review from a team June 28, 2026 01:13