feat(core): add --max-cache-ram bounded asset-discovery cache [PER-7795]#2192
Hand-rolled byte-budget LRU cache backing the forthcoming --max-cache-ram flag. Pure/sync (no logger or external calls) so callers can log after .set() returns without risking event-loop yield mid-mutation. Exposes a Map-compatible surface (.get/.set/.has/.delete/.values/.size) plus .calculatedSize and .stats for telemetry. entrySize() computes body-bytes + fixed per-entry overhead, handling both single-resource entries and array-of-resources (root-resource with multiple widths from discovery.js:465). 16 unit specs covering LRU semantics, recency bump, multi-evict, oversized-entry skip, peak-bytes transient high-water, and array-entry sizing. Zero new dependencies.
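A minimal sketch of the byte-budget LRU described above, using Map insertion order as recency order. Names (`ByteLRU`, `maxBytes`, `onEvict`) follow the commit's description but the code is illustrative, not the actual source:

```javascript
class ByteLRU {
  constructor({ maxBytes, onEvict } = {}) {
    this.maxBytes = maxBytes;
    this.onEvict = onEvict;
    this.map = new Map(); // Map iteration order doubles as LRU recency order
    this.calculatedSize = 0;
    this.stats = { hits: 0, misses: 0, evictions: 0, peakBytes: 0, oversizeSkipped: 0 };
  }

  get(key) {
    if (!this.map.has(key)) { this.stats.misses++; return undefined; }
    this.stats.hits++;
    // recency bump: delete + re-insert moves the key to the back of the Map
    let entry = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, entry);
    return entry.value;
  }

  set(key, value, size) {
    // oversized entries are skipped entirely rather than thrashing the cache
    if (size > this.maxBytes) { this.stats.oversizeSkipped++; return false; }
    if (this.map.has(key)) {
      this.calculatedSize -= this.map.get(key).size;
      this.map.delete(key);
    }
    this.map.set(key, { value, size });
    this.calculatedSize += size;
    // transient high-water mark is recorded before eviction brings us back under budget
    this.stats.peakBytes = Math.max(this.stats.peakBytes, this.calculatedSize);
    while (this.calculatedSize > this.maxBytes) {
      let [oldKey, oldEntry] = this.map.entries().next().value; // front = least recently used
      this.map.delete(oldKey);
      this.calculatedSize -= oldEntry.size;
      this.stats.evictions++;
      this.onEvict?.(oldKey, 'evicted', oldEntry.value);
    }
    return true;
  }

  has(key) { return this.map.has(key); }
  get size() { return this.map.size; }
}
```

Pure and synchronous throughout: no await points, so a caller can log after `.set()` returns knowing no eviction is mid-flight.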
New flag/env/percyrc surface for the forthcoming bounded asset-discovery cache. Value is an integer MB (e.g. --max-cache-ram 300 means 300MB). Flows through env PERCY_MAX_CACHE_RAM or percyrc discovery.maxCacheRam. Precedence follows existing Percy convention (flag > env > percyrc). Raw parse is Number; full validation happens at Percy startup once the flag is consumed (subsequent commit).
Add maxCacheRam integer-or-null property under discovery. Value is the cap in MB (so percyrc users write 'maxCacheRam: 300' for 300MB, matching the flag). Null/unset preserves today's unbounded behavior. Schema validation catches non-integer and negative inputs at config load time; additional startup validation (e.g. minimum floor based on MAX_RESOURCE_SIZE) happens in the discovery integration commit.
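The flag > env > percyrc precedence described above can be sketched as a nullish-coalescing chain. The helper name `resolveMaxCacheRam` and its shape are hypothetical, not the actual config plumbing:

```javascript
// Illustrative precedence resolution for the three surfaces.
// Raw parse is Number only; full validation happens later at Percy startup.
function resolveMaxCacheRam({ flagValue, env = process.env, config = {} }) {
  return flagValue
    ?? (env.PERCY_MAX_CACHE_RAM != null ? Number(env.PERCY_MAX_CACHE_RAM) : null)
    ?? config.discovery?.maxCacheRam
    ?? null; // null/unset preserves today's unbounded behavior
}
```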
Replaces the unbounded resource cache Map at discovery.js:408 with a
byte-budget-aware ByteLRU when percy.config.discovery.maxCacheRam is
configured. Without the flag, behaviour is byte-for-byte identical to
today (new Map(), no eviction).
In saveResource, oversized entries (size > cap) are skipped from the
global cache with a debug log but still land in snapshot.resources so
the current snapshot renders correctly. ByteLRU's onEvict callback
emits a debug log for each LRU eviction.
Startup validation (runs once in the discovery queue's start handler):
* Rejects caps below 25MB (MAX_RESOURCE_SIZE floor) with a clear error
* Warns on caps below 50MB (silently-useless territory)
* Info log when --max-cache-ram and --disable-cache are both set
(max-cache-ram becomes a no-op)
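The three startup checks above can be sketched as follows; `validateMaxCacheRam` and the `log` shape are assumed names for illustration, not the actual discovery.js code (a later commit changes the hard error to a warn-and-clamp):

```javascript
const MAX_RESOURCE_SIZE_MB = 25; // single resources up to 25MB must fit in the cache

function validateMaxCacheRam(capMB, { disableCache, log }) {
  if (capMB == null) return; // unset: unbounded, nothing to validate
  if (capMB < MAX_RESOURCE_SIZE_MB) {
    throw new Error(`--max-cache-ram must be at least ${MAX_RESOURCE_SIZE_MB}MB`);
  }
  if (capMB < 50) {
    log.warn(`--max-cache-ram=${capMB}MB is very small; expect heavy eviction`);
  }
  if (disableCache) {
    log.info('--max-cache-ram has no effect when --disable-cache is set');
  }
}
```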
A CACHE_STATS_KEY Symbol is exported alongside RESOURCE_CACHE_KEY to
hold per-run stats the telemetry layer will read in a later commit.
Existing discovery tests remain green.
When the flag is unset (Map mode), track cumulative bytes written to the global resource cache in a side-channel counter. On first crossing of 500MB, emit a one-shot warn-level log pointing the user at the --max-cache-ram flag before the CI runner OOMs.
* warn level (not info) so --quiet users still see it
* one-shot guarded by a per-run stats flag — does not re-fire on shrink/regrow cycles, and is bypassed entirely when the flag is set (opt-in users do not need the discovery hint)
* threshold override via PERCY_CACHE_WARN_THRESHOLD_BYTES for post-ship tuning (undocumented)
This is the primary discovery mechanism for the flag — users find it through normal CI output before they need support.
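The one-shot gate can be sketched as below; the names (`trackUnboundedBytes`, `stats.unsetModeBytes`, `stats.warningFired`) follow the commit's description but the helper itself is illustrative:

```javascript
const DEFAULT_WARN_THRESHOLD = 500 * 1000 * 1000; // 500MB, decimal bytes

function trackUnboundedBytes(stats, bytes, log, threshold = DEFAULT_WARN_THRESHOLD) {
  stats.unsetModeBytes += bytes; // counter always grows, even after the warning
  if (!stats.warningFired && stats.unsetModeBytes >= threshold) {
    stats.warningFired = true; // one-shot: never re-fires this run
    log.warn('Discovery cache exceeded 500MB; consider --max-cache-ram to bound memory');
  }
}
```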
Two CLI-side events travel through the existing sendBuildEvents pipeline (UDP pager → Amplitude). No Percy API changes.
* cache_eviction_started — fires exactly once per run, from ByteLRU's onEvict callback on the first LRU eviction. Payload includes the configured budget, peak bytes at eviction time, and eviction count. Emits an info log alongside telling the user eviction is active.
* cache_summary — fires once per run from Percy.stop()'s finally block. Payload includes budget + hits/misses/evictions/peak_bytes/final_bytes/entry_count/oversize_skipped. Feeds Amplitude for adoption, hit-rate, and sizing calibration metrics post-GA.
Both are fire-and-forget; exceptions are logged at debug and swallowed so telemetry loss cannot fail a Percy run (same pattern as sendBuildLogs at percy.js:707). Both gate on percy.build?.id being set so they cannot emit before the build exists.
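The fire-and-forget pattern above can be sketched like this; `fireCacheEvent` and the `sendEvent` parameter are stand-in names for illustration, not the actual pipeline API:

```javascript
// Emit a telemetry event; failures are logged at debug and swallowed so
// telemetry loss can never fail a Percy run.
async function fireCacheEvent(percy, name, payload, sendEvent) {
  if (!percy.build?.id) return; // never emit before the build exists
  try {
    await sendEvent({ name, build: percy.build.id, ...payload });
  } catch (err) {
    percy.log.debug(`${name} failed: ${err.message}`);
  }
}
```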
Two new describe blocks in discovery.test.js:
'with --max-cache-ram' (5 specs):
* installs a ByteLRU when the flag is set (type + initial stats shape)
* falls back to a plain Map when the flag is unset (backward compat)
* rejects a cap below the 25MB MAX_RESOURCE_SIZE floor with a clear error
* emits an info log when the flag and --disable-cache are both set
* records oversize_skipped and leaves calculatedSize at 0 when a single
entry exceeds the cap
'warning-at-threshold (unset cap)' (1 spec):
* PERCY_CACHE_WARN_THRESHOLD_BYTES override triggers the warn log once;
does not re-fire on a subsequent snapshot run (one-shot gate holds).
Filter: --filter='max-cache-ram|warning-at-threshold' runs 6 of 149 specs
in ~16s for focused iteration. Full suite stays green.
Adds maxCacheRam to the discovery options list in @percy/core README. Covers: value semantics (MB), default (unset/unbounded), eviction behavior, the cap-body-bytes vs RSS relationship users need to know for sizing, the 25 MB floor, and the three equivalent surfaces (flag / env / percyrc).
Follow-up to the initial --max-cache-ram implementation. Previously a
value below the 25 MB MAX_RESOURCE_SIZE floor threw an error at Percy
startup, killing otherwise-healthy builds for a misconfigured cap.
Switch to a warn-level log and continue with the 25 MB minimum:
--max-cache-ram=10MB is below the 25MB minimum (individual resources
up to 25MB would otherwise be dropped). Continuing with the minimum:
25MB.
Also mutates percy.config.discovery.maxCacheRam to the clamped value so
the cache_summary telemetry event reports the effective cap.
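The warn-and-clamp behavior above can be sketched as a small helper; `clampMaxCacheRam` is a hypothetical name for illustration:

```javascript
// Below-floor caps warn and clamp instead of killing the build.
function clampMaxCacheRam(capMB, log, floorMB = 25) {
  if (capMB >= floorMB) return capMB;
  log.warn(
    `--max-cache-ram=${capMB}MB is below the ${floorMB}MB minimum ` +
    `(individual resources up to ${floorMB}MB would otherwise be dropped). ` +
    `Continuing with the minimum: ${floorMB}MB.`);
  return floorMB; // effective cap the run actually uses
}
```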
Updates:
* discovery.js — throw → warn + clamp
* discovery.test.js — integration test asserts warn log + clamped cap
* README.md — docs reflect the clamp behaviour
* cli-command/test/flags.test.js — stale help-text fixture: inserts
the --max-cache-ram line between --disable-cache and --debug
(broken since the flag was added; unrelated to the clamp, bundled
here since both are help-surface / startup-UX fixes)
Verified in-anger with percy snapshot --max-cache-ram 10 against
https://example.com (build #356):
[percy:core] --max-cache-ram=10MB is below the 25MB minimum
(individual resources up to 25MB would otherwise be dropped).
Continuing with the minimum: 25MB.
… egress, effective cap)
Addresses items 1-8 from PR #2192 review.
Must-fix:
#1 discovery.js: always increment unsetModeBytes; only gate the warn-log emission on warningFired. Previously the byte counter froze at the threshold crossing so cache_summary.peak_bytes misreported every Map-mode run that hit the threshold.
#2 byte-lru.js: reorder ByteLRU.set — reject oversize BEFORE touching any existing entry. Fixes a failed oversize re-set silently evicting the prior (valid) value for the same key.
Should-fix:
#3 percy.js: move sendCacheSummary AFTER sendBuildLogs in stop()'s finally. A slow/stalled pager hop on the analytics event can no longer delay the primary log egress.
#4 discovery.js: do not mutate percy.config.discovery.maxCacheRam on clamp. Store effectiveMaxCacheRamMB on CACHE_STATS_KEY; percy.js sendCacheSummary reads it from there. User config stays read-only.
Nits:
#5 discovery.js: read PERCY_CACHE_WARN_THRESHOLD_BYTES once at queue construction instead of on every saveResource.
#6 discovery.test.js: use 'instanceof ByteLRU' (imported) instead of string match on constructor.name.
#7 discovery.js: emit a debug log when the PERCY_CACHE_WARN_THRESHOLD_BYTES override is active, so support can spot misconfigured overrides.
#8 README: note decimal MB (1,000,000 bytes) vs binary MiB.
Coverage fill-in (closes gaps visible in earlier nyc run):
* byte-lru.test.js: .clear(), oversize re-set regression, onEvict reason='too-big' path.
* discovery.test.js:
  - --max-cache-ram between 25 and 50 MB warns without clamping.
  - PERCY_CACHE_WARN_THRESHOLD_BYTES override emits debug log.
  - cache_eviction_started info log fires when LRU evicts.
  - unsetModeBytes keeps growing post-warningFired (regression for #1).
  - sendCacheSummary swallows client rejections without throwing.
  - sendCacheSummary short-circuits when build / cache / stats missing.
Tests: 19 unit specs (byte-lru) + 13 integration specs (discovery). Lint clean. Built dist/ regenerated.
Targets the branches nyc still flagged after the review-fix commit:
byte-lru.js:
* .delete() on a non-existent key returns false (line 66)
* entrySize() handles null entries + entries without content inside an
array (line 97 optional-chain branches)
discovery.js:
* fireCacheEventSafe's .catch debug-log path (line 440) — spy
sendBuildEvents to reject, force eviction, microtask-wait
* saveResource oversize-skip branch (lines 598-607) — serve a 25MB
resource from the test server so the real intercept flow triggers
the oversize path, not just the direct ByteLRU.set test
percy.js:
* sendCacheSummary entry_count '?? 0' fallback (line 409) — call
directly with a defensive-shape cache lacking .size
21 unit specs + additional integration specs; lint clean.
Extends --max-cache-ram with a disk-backed overflow tier so evictions no longer drop resources. When the in-memory ByteLRU evicts, the full resource is written to a per-run temp directory under os.tmpdir() and a slim metadata reference stays in memory. getResource falls through RAM miss → disk tier before the browser ever refetches from origin. On any disk I/O failure we return false/undefined and fall back to the old drop behaviour — the browser refetches exactly as it would without spill, so disk-tier failure is strictly additive.
What lives in byte-lru.js:
- ByteLRU.onEvict(key, reason, value) — adds the evicted value so the discovery wiring can capture it before it is GC'd.
- DiskSpillStore — sync mkdirSync/writeFileSync/readFileSync/rmSync. Counter-based filenames (no URL-derived data flows to path.join). Self-healing index: a read failure purges the stale entry so the next lookup cleanly misses. Best-effort destroy swallows errors.
- createSpillDir — os.tmpdir()/percy-cache-<pid>-<random-hex>.
- lookupCacheResource — pulled out of the inline getResource closure so the RAM-miss-to-disk-hit path is directly unit-testable.
Discovery wiring (discovery.js):
- start handler constructs the DiskSpillStore alongside the ByteLRU.
- onEvict calls diskStore.set(key, value); debug-log differentiates `cache spill:` (success) from `cache evict:` (disk failed or disk not ready).
- saveResource clears any prior disk entry up front so a fresh discovery write wins over a spilled copy — prevents a race where getResource would serve stale content.
- end handler calls diskStore.destroy(); cleanup errors are swallowed by DiskSpillStore so they cannot fail percy.stop().
Telemetry (percy.js sendCacheSummary):
Eight new disk-tier fields on cache_summary: disk_spill_enabled, disk_spilled_count, disk_restored_count, disk_spill_failures, disk_read_failures, disk_peak_bytes, disk_final_bytes, disk_final_entries. cache_eviction_started also carries disk_spill_enabled so the dashboard can distinguish "disabled" from "enabled with zero activity" from "enabled but failing."
Tests (all pass locally):
- 44 unit specs in byte-lru.test.js exercise ByteLRU, entrySize, DiskSpillStore, createSpillDir, and lookupCacheResource end-to-end — including init failure via /dev/null, write failure via mocked EACCES, read self-heal via mocked ENOENT, unlinkSync error tolerance on delete + overwrite, destroy error swallowing, the not-ready short-circuit branches, and array-root width matching.
- 8 integration specs in discovery.test.js cover DiskSpillStore presence, spill-on-eviction, byte-for-byte rehydration, ENOSPC fallback, saveResource clearing stale entries, queue-end teardown (asserts both disk.ready flag and fs.existsSync for race-safety), and graceful handling when the store fails to init.
Zero new npm dependencies (fs/os/path/crypto are built-ins). Node engine unchanged. Clean semgrep run on all changed files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-build verification surfaced an ordering bug: percy.stop() calls discovery.end() before sendCacheSummary, and the 'end' handler destroys the DiskSpillStore and deletes percy[DISK_SPILL_KEY]. sendCacheSummary then read a null diskStore and emitted disk_spill_enabled=false with all eight disk_* fields zeroed, regardless of actual activity. A run that spilled 97 resources and restored 96 from disk shipped zeros to Percy. Snapshot the disk stats onto stats.finalDiskStats in the 'end' handler before destroy() runs. sendCacheSummary prefers the live store and falls back to the snapshot, so both the in-discovery path (existing tests) and the post-discovery path (real builds) populate the telemetry correctly. Two new specs cover (a) sendCacheSummary using the finalDiskStats fallback when DISK_SPILL_KEY is unset, and (b) the discovery 'end' handler copying diskStore.stats onto the snapshot before destroy.
PER-7795 Disk-Spill — Real-Build Verification Report
Branch: …
All runs below used a local …
Test case matrix: …
Specific resource → disk-file evidence (TC3, live): process PID … filenames are the …
Telemetry payload (TC6, temporary probe): with a one-line debug probe added inside …, the payload was: {
"cache_budget_ram_mb": 25,
"hits": 24,
"misses": 220,
"evictions": 97,
"peak_bytes": 25016372,
"final_bytes": 24015860,
"entry_count": 25,
"oversize_skipped": 0,
"disk_spill_enabled": true,
"disk_spilled_count": 97,
"disk_restored_count": 96,
"disk_spill_failures": 0,
"disk_read_failures": 0,
"disk_peak_bytes": 36003060,
"disk_final_bytes": 36003060,
"disk_final_entries": 37
}
All eight …
Bug found and fixed during verification
The first TC6 run produced … Fix (commit …)
Commands reproduced here for reference:
# TC1 — baseline
zsh -ic 'web-t && node packages/cli/bin/run.cjs snapshot \
/tmp/heavy-snapshots.yml --verbose'
# TC2 — aggressive spill (clamped to 25 MB)
zsh -ic 'web-t && node packages/cli/bin/run.cjs snapshot \
--max-cache-ram 1 /tmp/heavy-snapshots.yml --verbose'
# TC4 / TC6 — serial disk restore with full telemetry
zsh -ic 'web-t && node packages/cli/bin/run.cjs snapshot \
-c /tmp/percy-serial.yml --max-cache-ram 1 \
/tmp/slow-multi-snapshot.yml --verbose'
# TC8 — 40 MB moderate cap
zsh -ic 'web-t && node packages/cli/bin/run.cjs snapshot \
-c /tmp/percy-serial.yml --max-cache-ram 40 \
/tmp/slow-multi-snapshot.yml --verbose'
version: 2
discovery:
  concurrency: 1
CACHE_STATS_KEY is always set alongside DISK_SPILL_KEY in the 'start' handler, so the false branch of `if (stats)` was unreachable and showed up as a 99.46% branch coverage gap on discovery.js line 553. Drop the guard; write directly through the stats reference. Behavior is unchanged — coverage lands back at 100/100/100/100.
Four specs forced mkdirSync to fail by passing '/dev/null/cannot-mkdir-here'. On POSIX, /dev/null is a character device so the nested path errors with ENOTDIR. On Windows, /dev/null does not exist and mkdirSync(..., recursive) just creates the directory tree, leaving the store in the ready state and flipping every assertion. Spy on fs.mkdirSync and throw directly — same behavior on every platform, no path tricks.
The discovery 'end' handler is already covered by 'calls diskStore.destroy ... in the queue end handler' test. Extend that existing test to also assert that stats.finalDiskStats is snapshotted before destroy clears the live store, instead of duplicating the percy.start/stop dance in a second spec. Removes a percy.stop() graceful path that doubled the file's end-to-end cost on slower runners (Windows core was hitting 6h job timeouts) without adding meaningful coverage.
…ead code
Five small follow-ups from PR review:
1. discovery 'end' handler now wraps browser.close() in try/finally so the
per-run spill dir is destroyed even when the browser teardown throws,
instead of leaking under os.tmpdir().
2. fireCacheEventSafe (discovery) and sendCacheSummary (percy) shared the
exact { message, cliVersion, clientInfo, extra } shape. Extracted a
single sendCacheTelemetry(message, extra) instance method on Percy and
route both call sites through it, so the pager schema only lives in
one place.
3. DiskSpillStore mkdirSync now uses mode 0o700 — spilled bytes are origin
refetchable so the threat model is small, but on shared-tenant CI hosts
other users on the same box should not be able to read them.
4. ByteLRU.values() was unreferenced anywhere in src/ and only had a
half-implemented iterator (no .return/.throw). Dropped it (and its
test) per YAGNI; if a future caller needs Map-style iteration they
can use the standard protocol on a real Map.
5. Two new unit tests:
- lookupCacheResource on a disk index hit whose readFileSync throws —
covers the combined self-heal + caller-refetch path that the
existing isolated tests miss.
- DiskSpillStore handles back-to-back saves of the same URL without
doubling the index, the byte total, or leaving prior spill files on
disk. Regression guard for any future parallel-discovery work.
Two parity restores from the second-pass review on aeae5da:
1. sendCacheSummary's outer try/catch was lost when the egress moved into sendCacheTelemetry. Nothing in the payload realistically throws today, but the original contract was "nothing in sendCacheSummary can fail percy.stop()", not "nothing in the egress". Re-wrap the whole method.
2. fireCacheEventSafe lost its Promise.resolve().then(...) microtask hop; without it, sendCacheTelemetry's synchronous prelude (header build, payload serialization inside the client) ran inside the onEvict callback. The eviction event fires once per run so this is noise in practice, but the hop keeps eviction-loop timing bit-for-bit identical to the pre-refactor version.
The outer try/catch restored in f92aee6 was uncovered (percy.js:442 showed 99.74% line coverage, blocking the 100% gate). Adds a spec that makes the cache.stats getter throw, exercising the catch and asserting sendCacheSummary still resolves and emits a 'cache_summary build failed' debug log.
The previous spec relied on logger.mock({ level: 'debug' }) to capture
the catch-branch debug log, but logger.mock has to run before percy.start
to stub the right writer — by then percy was already started in the
outer beforeEach, so logger.stderr stayed empty and the assertion
failed on Linux CI (it happened to look fine locally only because of
preceding crashed-Chromium output bleed). Spy on percy.log.debug
directly to assert the call without depending on stream capture order.
The 'end' handler runs the same way under both stop modes, so we still exercise the destroy path + finalDiskStats snapshot. Force-stop skips the flush + saveHostnamesToAutoConfigure pipeline, which is the slow path on Windows runners and the most likely contributor to the core test job blowing past master's 40-minute baseline.
… rejection
Node 14 (still the engine on the Windows CI runner) treats unhandled
promise rejections as fatal in some configurations. sendCacheTelemetry
already catches pager errors, but if its catch arm itself ever throws —
e.g. a stubbed log.debug in a test — the rejection escapes the chain
and can take down the runner. Trailing .catch(() => {}) closes that
window without changing observable behavior.
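The defensive trailing catch can be shown in isolation; `fireSafely` is a hypothetical stand-in for the microtask chain described above:

```javascript
// A rejection escapes a .then/.catch chain only if the catch arm itself
// throws; one extra no-op .catch closes that window.
function fireSafely(task, onError) {
  return Promise.resolve()
    .then(task)
    .catch(onError)   // normal error handling
    .catch(() => {}); // swallow errors thrown by onError itself
}
```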
The new defensive .catch on the discovery.js microtask chain was the last 0.04% gap on the function-coverage line. Drives sendCacheTelemetry to a rejection (spy + rejectWith) and lets two setImmediate ticks settle the chain — verifies the path is reachable without depending on the inner catch arm actually throwing in production. Locally green; coverage gate should clear on the next CI run.
PER-7795 Disk-Spill — Re-Verification After Refactor
Branch: …
This re-run was prompted after the round of review fixes: … The point is to confirm none of those churns regressed disk behavior, plus …
Standalone disk-tier suite — 37/37 PASS
Built …
V1 ByteLRU bounded mode + onEvict (5/5)
V2 DiskSpillStore — 0o700 perms + counter filenames (7/7)
V3 Disk content byte-for-byte (3/3)
V4 Replace + best-effort unlink + bytes accounting (7/7)
V5 Failure paths — write/read/init (8/8)
V6 lookupCacheResource — RAM → disk fallthrough (3/3)
V7 ByteLRU + DiskSpillStore wired together — full eviction roundtrip (4/4)
CI on …
| Job | Result |
|---|---|
| Lint, Typecheck, Build, Semgrep, CodeQL | ✓ pass |
| Linux all 17 packages | ✓ pass |
| Test @percy/core (Linux) | ✓ 764/764 specs PASS, 100% coverage |
| File | Line / Branch / Function / Statement |
|---|---|
| byte-lru.js | 100 / 100 / 100 / 100 |
| discovery.js | 100 / 100 / 100 / 100 |
| percy.js | 100 / 100 / 100 / 100 |
Real-build evidence (carried forward from 2026-04-24)
The April 24 run on 6a1a3d0e already proved the integrated disk-spill
flow against live Percy. None of the post-review commits touched the
read/write paths — only the 'end' cleanup ordering, telemetry egress
helper, and tests — so those builds remain authoritative for the e2e
behavior:
| TC | Build | Evidence |
|---|---|---|
| Baseline | #392 | no spill dir created |
| --max-cache-ram 1 (clamped) | #394 | 7 spills, Cache eviction active info log |
| Live spill-dir inspection | #397 | mid-flight ls showed counter files 1–7 with byte sizes matching the spilled CSS resources; peaked at 37 files / 35 MB; dir gone after stop |
| Disk restore (2 snapshots, concurrency:1) | #401 | 97 spills + 96 cache disk-hit log lines |
| Telemetry payload | #403 | All 8 disk fields populated: spilled=97, restored=96, peak_bytes=36 003 060 |
| 40 MB cap | #405 | 82 spills, 81 disk-hits, peaked at 22 entries / 21 MB |
| Build comparisons | all of the above | every completed comparison has diff_ratio 0.0000 vs baseline |
What changed since 2026-04-24 — and why those changes don't risk this evidence
| Commit | Change | Impact on disk path |
|---|---|---|
| aeae5da4 | try/finally around browser.close() in 'end' handler | Cleanup runs even when browser teardown throws — strict improvement; tested via existing "calls diskStore.destroy" spec that now also asserts stats.finalDiskStats is captured |
| aeae5da4 | mkdirSync mode 0o700 | Multi-tenant safety; verified above (dir mode 700) |
| aeae5da4 | ByteLRU.values() removed | Was unreferenced anywhere in src/ |
| aeae5da4 | Percy.sendCacheTelemetry() extraction | Single egress for both cache_summary (awaited) and cache_eviction_started (fire-and-forget); test parity preserved |
| f92aee60 | sendCacheSummary outer try/catch + fireCacheEventSafe Promise.resolve().then(...) microtask hop | Strict pre-refactor parity — payload-build never fails stop, eviction-loop hot path never blocks on payload serialization |
| e41db794 | .catch(() => {}) on the microtask chain | Defensive against Node 14's unhandled-rejection-as-fatal mode if the inner catch arm itself throws |
None of these commits touched DiskSpillStore.set/get/destroy,
createSpillDir, the onEvict callback wiring, or the cache lookup
fallthrough logic, so the April 24 e2e evidence remains valid.
Confidence
The combination of:
- CI green (764 specs, 100% coverage) on the head commit;
- 37/37 standalone assertions against the real built dist, including SHA roundtrip on real fs;
- Six real Percy builds previously finalized with all-zero diff ratios;
- No source change to disk-spill primitives since that real-build evidence was captured;
…means the disk-spill feature is verified working end-to-end on the
current head. The failing local Mac Chromium runs we saw today are an
unrelated browser-launch infra issue on this machine — they don't
implicate the cache code.
amandeepsingh333
left a comment
Review: --max-cache-ram bounded asset-discovery cache with disk-spill tier

Well-structured PR — clean separation between ByteLRU (RAM tier), DiskSpillStore (disk tier), and the discovery wiring. Error handling is thorough with graceful fallbacks at every I/O boundary. Commit history is bisectable. Test coverage is extensive (44 unit + 8 integration specs).

However, there are two high-severity issues that undermine the disk-spill tier's value, plus a few medium findings.

### Summary

| Severity | Finding |
|----------|---------|
| HIGH | Disk-hit resources never re-promoted to RAM — repeated disk reads on warm entries |
| HIGH | Array-valued root resources (multi-width snapshots) silently fail to spill |
| MEDIUM | entrySize uses content.length — undercounts bytes for multi-byte strings |
| MEDIUM | unsetModeBytes over-counts on duplicate URL saves — premature 500MB warning |
| LOW | Config schema allows maxCacheRam: 0 which silently clamps to 25 |
const disk = percy[DISK_SPILL_KEY];
if (!resource && disk) {
  resource = disk.get(url);
  if (resource) {
HIGH: Disk-hit resources are never re-promoted to RAM
When disk.get(url) returns a resource, it's served to the caller but never re-inserted into the RAM ByteLRU. This means every subsequent access to the same spilled resource hits readFileSync again.
For resources evicted early but still frequently needed (e.g., a shared styles.css used across 50 snapshots), every access is a synchronous disk read. The typical LRU disk-spill pattern promotes disk-tier hits back to RAM:
if (!resource && disk) {
resource = disk.get(url);
if (resource) {
cache.set(url, resource, entrySize(resource));
percy.log.debug(`cache disk-hit (promoted): ${url} ...`);
}
}
Without this, the disk tier degrades from an overflow cache to a read-every-time-from-disk pattern for warm entries.
Fixed in 9ed10d0d — lookupCacheResource now promotes disk hits back to RAM and deletes the disk entry, so a hot URL pays the readFileSync cost once. Test at byte-lru.test.js ("promotes a disk hit back to RAM and frees the disk slot") asserts the second lookup is a RAM hit (no readFileSync calls).
let content = resource?.content;
if (content == null) return false;
if (!Buffer.isBuffer(content)) {
HIGH: Array-valued root resources (multi-width snapshots) silently fail to spill
The cache can hold array-valued entries (array of root resources for different widths). When evicted from ByteLRU, onEvict passes the array to DiskSpillStore.set(key, value). Here, resource?.content on an array returns undefined, so content == null triggers and the method returns false. The root resource is silently dropped.
Multi-width root snapshots can never be spilled to disk — they're always lost on eviction. Either:
- Handle arrays here (serialize elements separately or as JSON), or
- Add an explicit check with a log so the silent failure is visible:
if (Array.isArray(resource)) {
this.log?.debug?.(`disk-spill skip (array resource): ${url}`);
return false;
}
Fixed in 9ed10d0d — DiskSpillStore.set now serializes multi-width root arrays via JSON.stringify with binary content base64-encoded, and get decodes them back. Tests cover the byte-for-byte roundtrip, array-decode self-heal, and JSON.stringify-throws refusal (circular ref).
return resource.reduce((n, r) => n + (r?.content?.length ?? 0) + overhead, 0);
}
return (resource?.content?.length ?? 0) + overhead;
}
MEDIUM: entrySize uses content.length which undercounts bytes for multi-byte string content
content.length on a string returns character count, not byte count. For multi-byte UTF-8 content (CJK, emoji), '\u{1F600}'.length === 2 but Buffer.byteLength('\u{1F600}') === 4.
If content arrives as a string (which is possible — DiskSpillStore.set explicitly handles !Buffer.isBuffer(content)), byte budget accounting will undercount, potentially allowing the cache to exceed the RAM cap.
Consider Buffer.byteLength(content) or assert that content is always a Buffer at this point.
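The undercount is easy to demonstrate directly — string `.length` counts UTF-16 code units while `Buffer.byteLength` counts encoded UTF-8 bytes:

```javascript
const emoji = '\u{1F600}'; // 😀, a surrogate pair in JS strings
console.log(emoji.length);              // 2 code units
console.log(Buffer.byteLength(emoji));  // 4 UTF-8 bytes
console.log('漢字'.length);             // 2 code units
console.log(Buffer.byteLength('漢字')); // 6 UTF-8 bytes
```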
Fixed in 9ed10d0d — entrySize now uses Buffer.byteLength(content) for string content via a shared contentBytes helper. Test asserts entrySize({ content: "\u{1F600}" }, 0) === 4 (4 UTF-8 bytes, not 2 string units) and the same for CJK.
// one-shot warn emission is gated by warningFired.
const stats = percy[CACHE_STATS_KEY];
stats.unsetModeBytes += entrySize(r);
if (!stats.warningFired && stats.unsetModeBytes >= warnThreshold) {
MEDIUM: unsetModeBytes over-counts when the same URL is saved multiple times
When maxCacheRam is unset and the plain Map is used, cache.set(r.url, r) overwrites any prior value, but stats.unsetModeBytes unconditionally adds the new entry size without subtracting the old one. If the same URL is saved across multiple snapshots (common for shared CSS/JS), the counter inflates and may trigger the 500MB warning prematurely.
if (cache.has(r.url)) {
stats.unsetModeBytes -= entrySize(cache.get(r.url));
}
cache.set(r.url, r);
stats.unsetModeBytes += entrySize(r);
Fixed in 9ed10d0d — saveResource now subtracts entrySize(prior) before adding entrySize(r) in the unset-cap branch. Regression guard at discovery.test.js asserts unsetModeBytes is unchanged after a same-URL re-save.
maxCacheRam: {
  type: ['integer', 'null'],
  minimum: 0
},
LOW: Schema allows maxCacheRam: 0
minimum: 0 means maxCacheRam: 0 passes validation. The clamp in discovery.js catches it (bumps to 25MB), but users might expect 0 to mean "unbounded" or "disabled" rather than "25MB cache." Consider minimum: 1 or explicitly treating 0 as null/unbounded.
Fixed in 9ed10d0d — schema minimum bumped from 0 to 1. Zero has no useful semantics (null = unbounded, --disable-cache = off), and rejecting at schema time avoids the silent clamp to 25MB.
try {
  fs.writeFileSync(filepath, content);
} catch (err) {
NIT: Sync writeFileSync on the eviction hot path
When the cache is full, every ByteLRU.set() in the eviction while loop triggers a synchronous disk write here. For builds with thousands of resources and a tight cap, this serializes disk I/O on the Node event loop.
The 25MB per-entry ceiling bounds the worst case for a single write, but aggregate across a batch of evictions can add up. Worth monitoring post-deploy via cache_summary telemetry — if disk_spilled_count is high and builds are slow, async writes with a queue would be the natural next step.
Acknowledged — tracked in the PR rollout plan. disk_spilled_count and disk_peak_bytes already ride the cache_summary event; if the post-deploy dashboard shows high spill counts on slow builds, the natural next step is an async write queue. Keeping sync for now to bound semantics.
…-len, overcounting, schema floor

Five fixes from PR #2192 review by amandeepsingh333:

1. HIGH: Promote disk-tier hits back to RAM in `lookupCacheResource`. A hot URL evicted once should not pay a `readFileSync` round-trip on every subsequent access. After `disk.get` returns, re-insert into ByteLRU and delete the disk entry — the LRU naturally re-evicts the actual coldest resource if room is needed.
2. HIGH: Spill multi-width root arrays via JSON+base64. Arrays of root resources (multi-width DOM snapshots) used to silently fail to spill because `resource.content` is undefined on arrays — they fell through the `if (content == null) return false` guard and the whole array got dropped on eviction. Now the array is serialized elementwise with binary content base64-encoded so the entire shape roundtrips through disk.
3. MEDIUM: `entrySize` uses `Buffer.byteLength` for string content. `content.length` on a string returns JS string units (UTF-16 code units), not bytes, so multi-byte UTF-8 (CJK, emoji) was undercounted and the cache budget could drift past its cap.
4. MEDIUM: Subtract the prior entry's size before overwriting in the unset-cap `saveResource` path. Shared CSS/JS reused across many snapshots used to inflate `unsetModeBytes` by N× because every `saveResource` added `entrySize` without reclaiming the prior entry's footprint, triggering the 500 MB warning prematurely.
5. LOW: Bump the config schema `maxCacheRam` minimum from 0 to 1. Zero has no meaningful semantics — null means unbounded, `--disable-cache` means off — so reject it at schema time rather than silently bumping to 25 MB via the discovery clamp.

Tests:
- `byte-lru.test.js`: UTF-8 byte-length asserts, multi-width array spill roundtrip, array-decode self-heal, `JSON.stringify`-throws refusal, RAM-promotion + disk-cleanup verified via two consecutive lookups with a `readFileSync` spy on the second.
- `discovery.test.js`: new "does not over-count unsetModeBytes when the same URL is saved twice" regression guard for fix 4. The existing "keeps incrementing" test is rewired to use a direct `cache.set` instead of relying on browser discovery to add new URLs.
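To make fixes 3 and 4 concrete, here is a minimal sketch of the corrected accounting. Names like `entrySize`, `ENTRY_OVERHEAD`, and `trackUnsetModeBytes` are illustrative assumptions based on the PR text, not the real code:

```javascript
const ENTRY_OVERHEAD = 512; // fixed per-entry overhead, per the PR description

// Fix 3: count UTF-8 bytes, not UTF-16 code units ('😀'.length is 2, but
// Buffer.byteLength('😀') is 4), so CJK/emoji content is no longer undercounted.
function entrySize(resource) {
  if (Array.isArray(resource)) { // multi-width root-resource arrays
    return resource.reduce((sum, r) => sum + entrySize(r), 0);
  }
  let { content } = resource;
  if (content == null) return ENTRY_OVERHEAD;
  let bytes = Buffer.isBuffer(content)
    ? content.length                      // Buffer length is already bytes
    : Buffer.byteLength(content, 'utf8'); // string: count UTF-8 bytes
  return bytes + ENTRY_OVERHEAD;
}

// Fix 4: in the unset-cap path, reclaim the prior entry's footprint before
// overwriting, so shared CSS/JS saved N times counts once, not N times.
function trackUnsetModeBytes(cache, stats, url, resource) {
  let prior = cache.get(url);
  if (prior) stats.unsetModeBytes -= entrySize(prior);
  cache.set(url, resource);
  stats.unsetModeBytes += entrySize(resource);
}
```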
## Summary

Adds a `--max-cache-ram <MB>` flag (plus `PERCY_MAX_CACHE_RAM` env and `discovery.maxCacheRam` percyrc) to cap the asset-discovery cache memory. When set, a hand-rolled byte-budget LRU evicts least-recently-used resources to stay within the cap. Evicted resources are not dropped — they spill to a per-run disk tier and are transparently rehydrated on cache lookup, so a memory-bounded run behaves like an unbounded one at the cost of a local disk read (cheaper than an origin refetch). When unset, behavior is identical to today plus a one-shot warn-level log at 500 MB pointing users at the flag before they OOM. Two CLI-side telemetry events (`cache_eviction_started`, `cache_summary`) flow through the existing UDP pager → Amplitude pipeline for adoption / hit-rate / disk-tier analytics.

Ships entirely in `percy/cli`. Zero Percy API changes. Zero new npm dependencies (the disk tier uses only Node built-ins: `fs`, `os`, `path`, `crypto`). `@percy/core` Node engine unchanged (>=14).

## Ticket & planning
`send-events` `extra` and `ci` logs do not carry CLI process exit codes.

## Key decisions
- Hand-rolled `ByteLRU` (~60 LOC, zero dep) instead of the `lru-cache` npm package — v7 is unmaintained since 2022, v10+ requires a Node engine bump that would break existing users on Node 14/16. Percy's cache usage is ~5% of any LRU library's surface.
- The disk tier (`DiskSpillStore`) lives in the same `cache/byte-lru.js` module as the RAM tier — one cache module, two tiers. When RAM evicts, the full resource is written synchronously to a per-run temp directory under `os.tmpdir()/percy-cache-<pid>-<rand>/`. A slim metadata reference stays in an in-memory `Map`.
- `getResource()` (via the extracted `lookupCacheResource` helper) falls through RAM miss → disk-index lookup → `readFileSync` → return full resource. The browser never sees a refetch on a disk hit. On any disk I/O failure (init, write, read) we fall back to the old drop-on-evict behavior — the browser refetches from origin exactly as it would without spill. The index is self-healing (a read failure purges the entry). The temp dir is rm'd in the queue `'end'` handler; cleanup errors are swallowed so they cannot fail `percy.stop()`.
- Spill filenames are counter-based (`dir/1`, `dir/2`, …) rather than URL-derived — no user-controlled data flows into `path.join`, eliminating the path-traversal surface semgrep flags on URL-in-filename patterns. Collisions are impossible within a run because the counter is monotonic + the dir is fresh per run.
- `ByteLRU.set` is synchronous and `getResource` is the sync callback CDP network-intercept calls — neither can yield to the event loop mid-operation. Per-entry size is capped at 25 MB upstream in `network.js`, so worst-case disk latency is a single bounded I/O op — still strictly cheaper than an origin refetch.
- Sizing is approximate (`resource.content.length` + 512 B per-entry overhead). The flag caps the cache, not RSS. MB is decimal (1,000,000 B), not binary MiB. Docs note: real-world RSS is typically 1.5–2× cache bytes due to Node's Buffer slab allocator (PoC 4 calibration).
- `--max-cache-ram 300` means 300 MB.
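As an illustration of the MB semantics, the cap derivation might reduce to something like the following. This is a sketch only: `effectiveCapBytes` and the clamp shape are assumed names, with the 25 MB floor and decimal-MB rule taken from the PR text:

```javascript
// Hypothetical cap derivation: decimal megabytes, clamped to the 25 MB
// discovery floor described in the PR; null/unset means unbounded.
const MIN_CACHE_RAM_MB = 25;

function effectiveCapBytes(maxCacheRamMB) {
  if (maxCacheRamMB == null) return Infinity;   // unset: today's unbounded behavior
  let mb = Math.max(maxCacheRamMB, MIN_CACHE_RAM_MB); // sub-floor values are clamped up
  return mb * 1_000_000;                        // decimal MB, not binary MiB
}
```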
Users never write unit suffixes.
- `percy.config.discovery.maxCacheRam` is NOT mutated — the effective cap lives on `percy[CACHE_STATS_KEY].effectiveMaxCacheRamMB` and is what `cache_summary` telemetry reports.
- The 500 MB warning logs at `warn` level so it surfaces under `--quiet`; it is suppressed under `--silent`. Threshold default 500 MB, override via `PERCY_CACHE_WARN_THRESHOLD_BYTES` (read once at queue construction; debug-logged at startup when the override is active).
- A `cache_budget_configured` event was dropped because it would fire before `percy.build.id` exists; its fields live inside `cache_eviction_started` + `cache_summary`, which are guaranteed to fire post-build-creation.
- `cache_summary` is ordered AFTER `sendBuildLogs` in `percy.stop()`'s finally so analytics latency cannot delay the primary log egress.
- `cache_summary` now also carries eight disk-tier fields (`disk_spill_enabled`, `disk_spilled_count`, `disk_restored_count`, `disk_spill_failures`, `disk_read_failures`, `disk_peak_bytes`, `disk_final_bytes`, `disk_final_entries`) so the dashboard can distinguish "disabled", "enabled with no activity", and "enabled with failures".

## Commit structure (bisectable)
Commits 9 and 10 were added during self-review: clamp behaviour + a separately-broken help-text fixture in `cli-command/test/flags.test.js` that was stale since the flag was introduced. Commit 12 is the disk-spill upgrade as a single squashed commit — `DiskSpillStore`, wiring, `lookupCacheResource` extraction, telemetry fields, and all 15 new test specs land together.

## Testing
### Automated

- `packages/core/test/unit/byte-lru.test.js`: new specs cover ByteLRU, entrySize, DiskSpillStore, createSpillDir, and lookupCacheResource — `(key, reason, value)`, `.values()`, `.clear()`, `.delete()` byte accounting, peak-bytes transient high-water, hits/misses/evictions tracking, NaN/negative guards, churn stability.
- DiskSpillStore: init failure (`/dev/null/…`), not-ready short-circuit, not-ready destroy no-op, binary round-trip with non-ASCII bytes, string→Buffer coercion, coercion failure (Symbol content), null/undefined content guards, metadata carry-through (`sha`, `root`, `widths`), counter-based filenames, accounting on replace/delete, peak tracking, write failure via mocked EACCES, read self-heal via mocked ENOENT, `unlinkSync` error tolerance on both delete + overwrite, destroy error swallowing via mocked EBUSY.
- createSpillDir: `percy-cache-` prefix, tmpdir scoping.
- `packages/core/test/discovery.test.js` (`describe` "with --max-cache-ram disk-spill tier"): DiskSpillStore presence only when the cap is set, spill-on-eviction with a `cache spill:` debug log, byte-for-byte rehydration of binary content, disk-write failure fallback to drop with a `cache evict:` debug log (ENOSPC simulated), saveResource clearing stale disk entries, queue-end teardown (asserts both `disk.ready === false` and `fs.existsSync(dir) === false` for race safety), graceful handling of a DiskSpillStore that fails to initialize (via mocked `mkdirSync`).
- Log-message updates: `Skipping cache for resource` → `cache skip (oversize):`; the eviction-active info log now mentions "spilling to disk".
- semgrep: clean under `semgrep --config=auto` and the OSS `path-join-resolve-traversal` rule that previously flagged a `path.join(os.tmpdir(), ...)` false-positive on `createSpillDir`; resolved by dropping the unused `prefix` parameter and hard-coding `percy-cache-`.
- Lint: a `@babel/eslint-parser` config detection issue that also reproduces on `master`; CI will gate.

## Security
The 34 Dependabot alerts on this branch are pre-existing transitive dependencies on `master` (basic-ftp, flatted, ip, lodash, minimatch, rollup, systeminformation, tar, axios, follow-redirects, js-yaml, picomatch, qs, yaml, brace-expansion, @tootallnate/once, tmp, uuid, etc.). This PR introduces zero new npm dependencies. Dependency-bump PRs should be handled separately.

## Pending manual verification
The original 4 builds (#353–#356) exercise the max-cache-ram plumbing but pre-date disk-spill. Re-running the 30 × 1 MB heavy-assets workload from the original verification (the one that produced build #361's `cache evict:` storm) will now show `cache spill:` + `cache disk-hit:` lines instead of drop-and-refetch. Build IDs + MCP verification will be added to this PR body once the new builds are produced locally.

## Post-Deploy Monitoring & Validation
- Telemetry events: `cache_eviction_started`, `cache_summary`. Track presence, frequency, and field distributions — now including `disk_spill_enabled`, `disk_spilled_count`, `disk_restored_count`, `disk_spill_failures`, `disk_read_failures`, `disk_peak_bytes`, `disk_final_bytes`, `disk_final_entries` alongside the original RAM-tier fields.
- Logs: `service.name=cli` traces with the `core:discovery` log namespace. Search for message fragments `cache spill:`, `cache evict:` (now only fires when disk spill failed — its presence is a signal, not routine), `cache disk-hit:`, `Cache eviction active — cap reached, oldest entries spilling to disk`, `Percy cache is using`, `--max-cache-ram=`, `is below the 25MB minimum`, `PERCY_CACHE_WARN_THRESHOLD_BYTES override active`, `disk-spill init failed`, `disk-spill write failed`, `disk-spill read failed`, `disk-spill cleanup failed`.
- `#percy-support`: watch for tickets mentioning `max-cache-ram`, `OOM`, `heap`, `killed`, `/tmp`, `ENOSPC`, `disk full`, `permission denied` after release.
- `event_type IN ('cache_eviction_started','cache_summary') AND cli_version >= <release>` — expect non-zero rows within 24 h of release for early adopters.
- `cache_summary.disk_spill_failures / (disk_spilled_count + disk_spill_failures)` should be near zero. A rising ratio signals a disk-tier regression (permissions, noexec /tmp, full volume).
- `cache_summary.disk_restored_count > 0` alongside `cache_summary.evictions > 0` confirms the round-trip is working in anger, not just the spill side.
- Search `--max-cache-ram` across Zendesk/Intercom — expect no new tickets tied to flag parsing, cap enforcement, sub-25 MB clamp, or disk-tier cleanup (leaked temp dirs).
- Adoption signal: a `cache_summary` with a non-null `cache_budget_ram_mb` within 60 days.
- When `evictions > 0`: `disk_spilled_count` should track `evictions` (1:1 minus a tiny tail of simultaneous-write races). `disk_restored_count / disk_spilled_count` > 0.5 on typical memory-constrained workloads — proves the disk tier earns its keep.
- Rollback signals: support tickets about `--max-cache-ram` flag parsing, cap clamp, disk-tier init, or "my build started failing after setting this"; `cache_summary` events never received in Amplitude (pipeline regression); `cache_summary.disk_spill_failures` > `disk_spilled_count` across a large cohort — the disk tier failing open for most users would make the feature a net loss over the old drop-on-evict behaviour; `cache_summary.peak_bytes` values clustered at the 500 MB default threshold (signals the Map-mode counter has regressed to frozen behavior).

🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via Claude Code + Compound Engineering v2.50.0