Anchor host memory-baseline hard gate to the recent ceiling #5352
Open
habdelra wants to merge 1 commit into
The hard (build-blocking) gate compared a module's current post-GC heap boundary delta against the rolling-mean baseline. For memory-heavy modules that delta is dominated by non-deterministic settle-GC drain timing and swings 100MB+ run-to-run with no retention, so the gate fired on values the module had already produced. Measure the hard regression from the recent ceiling (max sample) instead; low-variance modules (ceiling ~= mean) are unchanged, and the soft warning stays mean-based. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
This PR adjusts the host memory-baseline gating logic so the hard (build-blocking) failure check is anchored to the recent observed ceiling (max of rolling samples) rather than the rolling mean, reducing false CI failures for high-variance modules while keeping the soft warning anchored to the mean.
Changes:
- Add a `baselineCeiling()` helper that derives a module’s recent-window max delta (with backward-compatible fallback to legacy `delta_mb`).
- Change the hard-failure condition to compare `current delta` against the computed ceiling (by `hardThreshold`), while leaving the soft-warning comparison against the rolling mean.
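A minimal sketch of what a helper like `baselineCeiling()` could look like, not the PR's actual implementation. The `samples` field name and the `null` return for an empty entry are assumptions made for illustration; only the helper name and the legacy `delta_mb` field come from the summary above.

```javascript
// Hypothetical baseline entry shapes (assumed for this sketch):
//   { samples: number[] }  rolling-window entry
//   { delta_mb: number }   legacy single-value entry
function baselineCeiling(entry) {
  // Keep only finite numeric samples so a malformed value cannot poison Math.max.
  const samples = Array.isArray(entry.samples)
    ? entry.samples.filter((s) => Number.isFinite(s))
    : [];
  if (samples.length > 0) {
    // Recent ceiling: the largest delta seen in the rolling window.
    return Math.max(...samples);
  }
  // Backward-compatible fallback: a legacy entry's single value is its own ceiling.
  return Number.isFinite(entry.delta_mb) ? entry.delta_mb : null;
}

console.log(baselineCeiling({ samples: [106.7, -1.4, -8.5, -8.6, 131.0] })); // 131
console.log(baselineCeiling({ delta_mb: 42 })); // 42
```

Returning `null` when no usable value exists leaves it to the caller to decide how to treat a module with no baseline yet.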
Background
The `Host Memory Baseline` check (`packages/host/scripts/check-memory-baseline.mjs`) compares each test module's post-GC heap boundary delta, `usedJSHeapSize(moduleEnd) − usedJSHeapSize(moduleStart)` with each side measured after a 3-cycle settle-GC, against a rolling-window baseline, and hard-fails (blocking) when the current delta exceeds 2× the window mean or +50 MB.
For a memory-heavy module that metric is dominated by whether the settle-GC happens to drain a large collectable transient before the boundary measurement. That drain timing is non-deterministic, so the delta swings run-to-run by 100 MB+ with no underlying retention. The clearest tell is a baseline window whose samples straddle zero, e.g. `Integration | search-entries resource`: `[106.7, −1.4, −8.5, −8.6, 131.0]`. A negative sample means the heap was smaller at module end than at module start; a module that actually retained memory could never produce one.
Why this module swings without leaking
The flagged module stands up two in-browser realms plus the base realm per test and runs real searches, so each test allocates a large transient graph that GC reclaims. Heap analysis over its run confirms nothing survives:
- … (`app_instances=0` throughout): no per-test growth.
- `SearchEntriesResource 4→4`, `Realm 5→5`, `Loader 9→9`, `ApplicationInstance 4→3`, `StoreService 5→5`. A genuine per-test leak would climb each of these by one per test.
So the recurring red is the gate misreading boundary GC-timing noise as a regression, not a heap regression in the module under test.
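To make the misfire concrete, the sketch below replays the `search-entries resource` window quoted in the Background against a mean-anchored hard gate. It assumes the rule reads as "fail when the delta exceeds 2× the anchor, or exceeds the anchor by more than 50 MB"; how the real script combines the two limits is not shown here, and the function names are illustrative.

```javascript
// Baseline window for the flagged module, in MB (from the Background section).
const samples = [106.7, -1.4, -8.5, -8.6, 131.0];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumed reading of the hard rule: either limit being exceeded trips the gate.
function exceedsHardGate(currentMb, anchorMb) {
  return currentMb > 2 * anchorMb || currentMb > anchorMb + 50;
}

const windowMean = mean(samples);

// Which of the window's own samples would hard-fail if they occurred again?
const misfires = samples.filter((s) => exceedsHardGate(s, windowMean));

console.log(windowMean.toFixed(2)); // 43.84
console.log(misfires); // [ 106.7, 131 ]
```

Two of the five deltas the module has already produced would block the build if they recurred, which is the noise-driven failure described above.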
The change
Anchor the hard (build-blocking) gate to the recent ceiling (the max of the rolling samples) instead of the mean: a run can't hard-fail on a delta the module has already produced in its window, while a value that clears the ceiling by the hard threshold still fails. When a module's variance is low (ceiling ≈ mean) the gate is unchanged. The soft, non-blocking warning stays anchored to the mean, so a genuine upward trend still surfaces early.
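The split between the two anchors can be sketched as follows. This is an illustration under stated assumptions, not the script's code: the function name, the threshold object shape, the "either limit trips" reading of the rule, and the soft threshold values (1.5×, +25 MB) are all hypothetical. Only the hard values (2×, +50 MB) come from the description.

```javascript
// Hypothetical threshold shape; hard values from the PR text, soft values invented.
const thresholds = {
  hard: { ratio: 2, absoluteMb: 50 },
  soft: { ratio: 1.5, absoluteMb: 25 },
};

function exceeds(currentMb, anchorMb, t) {
  return currentMb > anchorMb * t.ratio || currentMb > anchorMb + t.absoluteMb;
}

function evaluateModule(currentMb, samples, t = thresholds) {
  const ceiling = Math.max(...samples); // hard gate anchor: recent max
  const mean = samples.reduce((sum, v) => sum + v, 0) / samples.length; // soft anchor
  if (exceeds(currentMb, ceiling, t.hard)) return 'fail'; // build-blocking
  if (exceeds(currentMb, mean, t.soft)) return 'warn'; // non-blocking
  return 'ok';
}

const noisy = [106.7, -1.4, -8.5, -8.6, 131.0];
console.log(evaluateModule(131.0, noisy)); // warn: already seen in the window
console.log(evaluateModule(190, noisy)); // fail: clears the ceiling by > 50 MB
console.log(evaluateModule(120, [50, 51, 49])); // fail: low variance, ceiling ≈ mean
```

With the noisy window, a repeat of 131.0 MB no longer blocks but still warns, while the low-variance window fails at about the same point it would have under the mean anchor.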
Behavior against the current baseline
`Integration | search-entries resource`
🤖 Generated with Claude Code