
feat(logger): disk-backed log store for bounded CLI memory (PER-7809) #2200

Open
pranavz28 wants to merge 1 commit into master from feat/per-7809-disk-logs-minimal

Conversation

Contributor

@pranavz28 pranavz28 commented Apr 28, 2026

Summary

Replace the unbounded messages = new Set() in @percy/logger with a disk-backed JSONL store and a lazily-built in-memory cache for snapshot-tagged entries. CLI memory stays bounded across long debug-heavy builds (8h × 10k snapshots) without changing the /logs upload payload or per-snapshot log resource bytes.

Jira: PER-7809

Approach

  • JSONL is the source of truth, written through a batching buffer. Every entry is appended to ${TMPDIR}/percy-logs/<pid>-<ts>-<r>.jsonl in batches of 500 entries or every 100 ms, so disk syscalls stay at ~10/sec regardless of log volume.
  • snapshotIndex holds offsets, not entries. ~16 bytes per tagged entry → ~5 MB worst case for an 8-hour 10k-snapshot debug build.
  • cache is built lazily on first snapshotLogs(...) call by reading the disk delta since the previous call. In normal mode the cache holds at most one snapshot's worth (evictSnapshot runs immediately after each snapshot uploads). In defer mode the cache fills at end-of-build and drains as snapshots upload sequentially.
  • query(filter) is synchronous: it flushes the buffer, reads the disk in 64 KB chunks, JSON-parses line by line, and filters. Used twice at end of build (checkForNoSnapshotCommandError, sendBuildLogs).
  • Redaction stays exactly where master has it: redactSecrets lives in @percy/core/src/utils.js and runs only on ci-logs at upload. Disk content matches master's in-memory shape.
  • Graceful degradation. Any disk failure (ENOSPC, RO fs, unwritable tmpdir, or PERCY_LOGS_IN_MEMORY=1) transparently falls back to master's unbounded in-memory Set. Build continues either way.

Parity guarantee

A diff of the /logs HTTP payload and per-snapshot log resource bytes between a build run on master vs this branch should be empty:

  • Same { debug, level, message, meta, timestamp, error } shape.
  • Same insertion order (JSONL append-only, single-threaded JS).
  • Same redaction (only ci-logs at upload time).
  • Same JSON.stringify semantics (entries round-trip through disk before upload, but the shape is preserved for everything that lands in real Percy log calls).

What changed

| File | Change |
| --- | --- |
| packages/logger/src/logger.js | Disk + buffer + lazy snapshot cache + memory-fallback path. Add snapshotLogs, evictSnapshot, reset. Keep messages as a back-compat getter. |
| packages/logger/src/index.js | Re-export new methods on the singleton. |
| packages/core/src/discovery.js | 2 lines: per-snapshot resource uses logger.snapshotLogs(meta) + logger.evictSnapshot(meta). |
| packages/core/src/api.js | 1 line: /test/api/reset calls logger.instance.reset() instead of mutating the (now-derived) messages Set. |
| packages/logger/test/helpers.js | Tests run in unbounded in-memory mode (PERCY_LOGS_IN_MEMORY=1) by default; reset properly tears down disk state. |
| packages/logger/test/logger.test.js | New disk-mode tests (round-trip, buffer flush triggers, snapshot index + eviction, reset, circular meta, in-memory rollback). |
| packages/core/test/api.test.js | 1 line: same messages.clear() → reset() change. |

Test plan

Local

  • @percy/logger tests: 45/45 passing (37 existing + 8 new disk-mode).
  • yarn lint clean.
  • @percy/core tests: running on CI (local environment lacks built @percy/dom bundle for many discovery tests).

End-to-end (parity)

Each scenario will be run twice — once against @percy/cli@1.31.x (latest published) and once against this branch — with logs compared via percy-support-cli, and RSS/disk/runtime monitored during the run:

  1. Baseline web build (~5 snapshots, normal mode).
  2. Baseline + PERCY_DEBUG=* (chatty logs / hot path).
  3. Medium build (~50 snapshots, normal mode) — exercises per-snapshot eviction.
  4. percy upload defer mode (~20 image snapshots) — the deferUploads gotcha.
  5. Medium build + PERCY_DEBUG=* — stress combo.
  6. PERCY_LOGS_IN_MEMORY=1 rollback — confirms master parity.
  7. Build with deliberately failing URL — verifies error/error-stack round-trip.
  8. Two builds running concurrently — verifies unique-filename and best-effort folder clear.

For each: ✅ logs match / ❌ diff lines, RSS-master vs RSS-branch (peak), runtime-master vs runtime-branch, disk file present-during / absent-after.

Known follow-ups (not in this PR)

  • Streaming /logs HTTP upload (would eliminate the end-of-build memory spike).
  • Wire bench-logger.js into CI (perf gates).

🤖 Generated with Claude Opus 4.7 (1M context) via Claude Code

@pranavz28 pranavz28 requested a review from a team as a code owner April 28, 2026 22:57
pranavz28 added a commit that referenced this pull request Apr 29, 2026
P1 fixes:
- Track evicted snapshot keys in `evicted` Set so late entries that land
  on disk after evictSnapshot don't repopulate the cache (defeats the
  bounded-memory promise in defer mode).
- Per-pid subdir under percy-logs/ instead of clearing the shared
  percy-logs/ dir at init — concurrent percy processes (CI matrix,
  parallel workers, npx) no longer clobber each other's JSONLs.
- Remove unused `messages` Set-shaped getter; api.js + percy.test.js use
  query() directly. Also drops the helpers.js force-memory-mode flag so
  the existing 60-case suite now exercises the disk path end-to-end.

P2 fixes:
- Drop the write-only snapshotIndex Map (was a per-entry write with no
  reader).
- Consolidate query() and _readAllFromDisk() into one _readDiskFiltered()
  helper; query() = flush + filter, fallback path = filter all.
- Add SIGINT/SIGTERM/uncaughtException hooks (via _installSignalHooks)
  so Ctrl-C / runner kill don't orphan the JSONL or lose buffered logs.
- Bump filename randomness from randomBytes(2) to randomBytes(8).
- Preserve meta.snapshot when stringify falls back for circular meta so
  the entry still routes through snapshotLogs().

Coverage: 100% on logger.js / index.js / utils.js / timing.js. 62/62 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pranavz28 pranavz28 force-pushed the feat/per-7809-disk-logs-minimal branch 2 times, most recently from 3660a7a to 6bd3c68 on April 30, 2026 at 05:40
Contributor

@aryanku-dev aryanku-dev left a comment


Code Review Summary

Solid PR — well-architected, thoughtful comments, strong test coverage of the disk-backed path including the failure-fallback flows. The use of a JSONL append-only log + lazy snapshot cache is a clean way to bound memory while preserving the upload payload shape.

The rest of this review walks through specific concerns inline. None are outright merge-blockers, but I'd want #1 (retry semantics) and #2 (_refreshCache allocation) addressed before merge — both have real impact on the hot path the PR is trying to fix.

Severity-tagged findings

🟡 P2 — Should fix before merge:

  1. evictSnapshot + cacheCursor combination drops prior-attempt logs on snapshot retry (semantic regression vs master). See discovery.js:221 + logger.js:163.
  2. _refreshCache allocates Buffer.alloc(diskSize - cacheCursor) in one shot — defeats the bounding goal in defer mode (could be hundreds of MB). See logger.js:411.
  3. Defer-mode peak memory: cache is built lazily, but _refreshCache groups all tagged entries on first call; in defer mode that's the entire build. PR description acknowledges this; flagging because the "bounded CLI memory" claim is conditional on normal mode.
  4. snapshotLogs returns the live cache array, not a copy — any mutation by a consumer (current or future) corrupts the cache. logger.js:160.
  5. fs.appendFileSync blocks the event loop on every flush (~10×/sec). On slow CI disks this is measurable. logger.js:365.

🔵 P3 — Nice-to-have:
6. _snapshotKey separator | collides on legitimate test names containing |. logger.js:332.
7. PERCY_LOGS_IN_MEMORY is only read in the constructor — silently ignored if set after first log. logger.js:69.
8. _filterFallback is O(N) per call; matters in memory-fallback mode under high snapshot counts. logger.js:296.
9. query() returns fresh-parsed entries in disk mode vs original entries in memory mode — an identity-vs-value semantic split. Worth a single line of JSDoc warning consumers.
10. Comment on lines 47–48 reads as if all entries repopulate post-eviction; only entries logged after eviction do. Slightly misleading.

Test coverage gaps

  • ⚠️ No test for the retry case (snapshot logs across multiple processSnapshotResources calls for same meta with errors in between). The "retry-friendly" test only validates late entries surface, not that logs from the first attempt are preserved on retry.
  • ⚠️ No test for _refreshCache with very large delta (defer-mode 1k+ tagged entries before first snapshotLogs).
  • ⚠️ No test verifying the returned snapshotLogs array isn't mutated by consumer.

Parity

  • Insertion order: ✅ JSONL append + single-threaded JS preserves order.
  • Redaction: ✅ Only on upload, disk content unchanged.
  • /logs payload shape: ✅ Same envelope.
  • Object identity: ⚠️ Diverges in disk mode (see #9).

🤖 Reviewed via Claude Code multi-agent review pass.

Comment thread: packages/logger/src/logger.js (Outdated)
```js
// snapshotLogs cache: Map<key, entry[]>. Bounded by # of un-evicted snapshot
// keys at any moment; evictSnapshot() drops them. Late entries that arrive
// after eviction repopulate the cache — that is intentional, so retry/
// re-discovery flows that snapshotLogs(meta) again still see them.
```
Contributor


🔵 P3 — comment is slightly misleading.

Reading "Late entries that arrive after eviction repopulate the cache", it sounds like any late access of an evicted key works. In practice only entries logged after the eviction repopulate the cache; entries that were already past cacheCursor and grouped before eviction are not re-read on the next snapshotLogs(meta).

Suggest tightening to:

```js
// Entries logged AFTER eviction repopulate the cache on the next
// snapshotLogs(meta) call. Entries already consumed before eviction
// are not re-fetched (cacheCursor only moves forward).
```

See #1 below for why this matters for retry.

Contributor Author


Resolved in 3dce017c. Updated the comment to:

Entries logged AFTER eviction repopulate the cache through the next _refreshCache delta. Pre-eviction entries are restored on retry via the pendingFullScan re-scan in snapshotLogs (preserves master's messages Set retain-everything semantics for retried snapshots).

```js
instance.fallback ??= new Set();
/* istanbul ignore if: only triggered when env=1 is set after logs have already buffered */
if (instance.writeBuffer.length) instance._drainBufferToMemory();
}
```
Contributor


🔵 P3 — env var only consulted at construction time.

PERCY_LOGS_IN_MEMORY is read once in the singleton constructor; setting it later (or unsetting it) has no effect because subsequent new Logger() calls return the cached instance. Tests work around this by delete logger.constructor.instance before flipping the env.

Not a bug, but worth a one-liner note in the comment block above this branch so future debuggers don't get tripped up. (Or read it on every _record if production needs runtime toggling — probably not.)

Contributor Author


Resolved in 3dce017c. Added a one-liner directly in this comment block:

Note: PERCY_LOGS_IN_MEMORY is only consulted here at construction time; setting or unsetting it later has no effect because subsequent new Logger() returns the cached singleton — tests that need to flip mode mid-process must delete logger.constructor.instance first.

Comment thread: packages/logger/src/logger.js (Outdated)
```js
  return this._filterFallback(key);
}
this._refreshCache();
return this.cache.get(key) || [];
```
Contributor


🟡 P2 — returning the live cache array.

return this.cache.get(key) || [] hands the caller the same array reference held in the cache. Any mutation by the consumer (push, splice, in-place redaction, etc.) corrupts the cache for any later observer of the same key (the eviction window is short but non-zero — discovery.js:220 calls this then evicts on the next line, but a future caller that doesn't immediately evict is exposed).

createLogResource looks safe today, but this is a sharp edge. Defensive copy:

```js
let entries = this.cache.get(key);
return entries ? entries.slice() : [];
```

The slice is cheap relative to the JSON parsing already done.

Contributor Author


Resolved in 3dce017c. snapshotLogs(meta) now returns a shallow copy:

```js
return cached ? [...cached] : [];
```

Caller push/splice no longer corrupts the cache. Test added — 'snapshotLogs returns a fresh array — caller push/splice does not corrupt cache' — that pushes/splices the returned array and asserts the second call still sees the original entries.

Note: the entry object refs are still shared (same as master's Array.from(messages).filter(...) semantics), so an in-place entry.message = X mutation by redactSecrets still works in memory mode for the cli-exec test contract. That's intentional.

```js
let key = this._snapshotKey({ snapshot: meta });
if (!key) return;
this.cache.delete(key);
}
```
Contributor


🟡 P2 — eviction permanently removes entries from the snapshot view (semantic regression vs master, retry-relevant).

In master, messages = new Set() retained every entry; a retried snapshot's log resource included logs from both the failed attempt and the successful one. With this PR:

  1. Attempt 1 logs [A1] for snapshot meta {name: 'A'}.
  2. processSnapshotResources calls snapshotLogs({name:'A'}) → [A1] cached, then immediately evictSnapshot({name:'A'}) (discovery.js:221).
  3. Retry triggers; attempt 2 logs [A2].
  4. snapshotLogs({name:'A'}) returns only [A2]: cacheCursor already advanced past A1, eviction dropped the cache slot, and there's no path back.

The new 'lets late entries after evictSnapshot repopulate the cache (retry-friendly)' test (logger.test.js:635-653) explicitly asserts logsA.length === 1 after a late log, i.e. it codifies the regression rather than catching it.

Options to consider:

  • Move evictSnapshot call to after the snapshot successfully uploads (not after processSnapshotResources).
  • On evict, store a checkpoint marker so a subsequent snapshotLogs call for the same key re-reads disk from before the cursor.
  • If lossy-on-retry is intentional, document it loudly (PR body, JSDoc) — "snapshot log resource will not include logs from prior failed retry attempts."

I'd lean toward the first option — it preserves master parity and the disk has the data already.

Contributor Author


Resolved in 3dce017c — retry parity restored.

Mechanism: evictSnapshot(meta) now drops cache[key] AND adds the key to a pendingFullScan Set. The next snapshotLogs(meta) call detects the mark and triggers a one-shot full-disk rescan via _scanDisk(e => snapshotKey(e.meta) === key). The recovered entries (including A1 from before eviction) are repopulated into cache[key]. Subsequent calls are O(1) cache-hit again.

Sequence with the new code:

  1. log A1 → buffer → flush. diskSize = L1.
  2. snapshotLogs(A) → _refreshCache populates cache[A] = [A1]. Returns [A1].
  3. evictSnapshot(A) → cache.delete(A); pendingFullScan.add(A).
  4. log A2 (retry) → flush. diskSize = L2.
  5. snapshotLogs(A) → _refreshCache advances the cursor past L2 (adds A2 to a fresh cache[A]); pendingFullScan.has(A) is true → _scanDisk for key A scans the full disk → returns [A1, A2] → cache[A] = [A1, A2]. Returns [A1, A2]. ✓

Tests added/updated:

  • 'on retry after evictSnapshot, snapshotLogs returns BOTH attempts (master parity)' — exact regression case
  • 'evictSnapshot drops the cache but disk + retry rescan still surface the entries' — single-attempt eviction-then-resnapshot case

Memory cost: the rescan is one-shot per evict-then-resnapshot cycle; the rescan result is bounded by the number of entries for that single key. The bounded-memory promise during the build is preserved (cache only grows when snapshotLogs is invoked, evictions still drop, late entries past cursor still grouped incrementally).

```js
  if (this._snapshotKey(entry?.meta) === key) result.push(entry);
}
return result;
}
```
Contributor


🔵 P3 — O(N) per call.

In memory-fallback mode (or PERCY_LOGS_IN_MEMORY=1), a 10k-snapshot build with debug logs calls _filterFallback 10k times, each scanning the entire fallback Set. That's quadratic in entry count.

Acceptable for the failure path (you've already lost disk), but if PERCY_LOGS_IN_MEMORY=1 is ever a recommended workaround, this becomes a real perf footgun. A Map<key, entry[]> index built incrementally in _record would mirror the disk-mode cache structure for O(1) lookup.

Contributor Author


Resolved in 3dce017c. Added a lazy Map<key, entry[]> index over fallback (fallbackByKey):

  • _filterFallback(key) builds the index on first call (O(N) once), then reads are O(1)
  • _record (in memory mode) maintains the index incrementally — push entry into fallbackByKey.get(key), creating a bucket if needed
  • _fallbackToMemory and reset clear the index

Tests added that exercise the lazy build, incremental update, untagged-entry skip, and empty-key-miss branches:

  • 'memory-mode fallbackByKey lazy build groups multiple entries per key'
  • 'memory-mode index update — incremental _record path with and without meta'
  • 'memory-mode snapshotLogs returns [] for an unknown key after lazy build'

PERCY_LOGS_IN_MEMORY=1 is now safe as a workaround for 10k-snapshot builds.

```js
  log.meta.snapshot?.testCase === snapshot.meta.snapshot.testCase && log.meta.snapshot?.name === snapshot.meta.snapshot.name
))));
resources.push(createLogResource(logger.snapshotLogs(snapshot.meta.snapshot)));
logger.evictSnapshot(snapshot.meta.snapshot);
```
Contributor


🟡 P2 — eviction immediately after snapshotLogs defeats retry parity (see logger.js:167 comment for the full mechanism).

This call site is where the regression manifests: if processSnapshotResources is called more than once for the same snapshot meta (the retry path through captureSnapshotResources at line 486 wraps this with count: 3), the second invocation sees a cache that's been pruned and a cacheCursor already past the first attempt's logs.

Master's logger.query(...) re-scanned the entire messages Set every time, so retries got the union.

Minimal fix — defer the evict until after the snapshot is known to be uploaded successfully, not inside resource processing. Or wire it into the upload completion callback further downstream. Open to other approaches; the key is snapshotLogs(meta) and evictSnapshot(meta) shouldn't be coupled in a single tick if the meta might still be retried.

Contributor Author


Resolved in 3dce017c — same fix as logger.js:167 (the pendingFullScan rescan in snapshotLogs). discovery.js:221 is unchanged: it still calls snapshotLogs then evictSnapshot immediately.

The captureSnapshotResources retry path (count: 3) goes through processSnapshotResources again with the same meta — the second invocation now hits the pendingFullScan mark from the prior eviction and triggers a one-shot full-disk rescan, recovering the first attempt's logs alongside the second attempt's. So both attempts' logs end up in the per-snapshot log resource (master parity).

Test that pins this behaviour: 'on retry after evictSnapshot, snapshotLogs returns BOTH attempts (master parity)' in logger.test.js.

Comment thread: packages/core/src/api.js
```diff
  // the reset command will reset testing mode and clear any logs
  percy.testing = {};
- logger.instance.messages.clear();
+ logger.instance.reset();
```
Contributor


✅ Good replacement. messages.clear() was a tight coupling to the in-memory representation; reset() is the right abstraction.

Minor: in tests where /test/api/reset is called between cases, the new reset() also tears down the disk file and resets deprecations. That's stronger than master's messages.clear() (which only cleared logs, not deprecations). Worth verifying no test depends on deprecations persisting across cmd === 'reset' — quick grep didn't find any, but a confirmation would be nice.

Contributor Author


Verified, no test depends on deprecations persisting across /test/api/reset.

Greps:

  • packages/core/test/api.test.js:719 and :803 are the only call sites of /test/api/reset. Lines 720/804 onward only assert on percy.testing config and on /test/logs returning empty — nothing checks deprecations after the reset.
  • Across the full repo, no test file inspects logger.instance.deprecations after invoking the reset endpoint.

Change is safe.

```js
  return Array.from(this.fallback).filter(filter);
}
return this._scanDisk(filter);
```
Contributor


🔵 P3 — semantic difference worth a JSDoc note.

query() returns:

  • In memory mode: the original entry objects from the fallback Set. Mutations persist.
  • In disk mode: freshly parsed copies. Mutations are local to the caller.

redactSecrets in percy.js mutates entry.message in place. In master / memory mode, that mutation persists in messages; in disk mode it doesn't (the disk file is unchanged, which is actually the correct behavior for redaction).

No current consumer breaks, but the identity-vs-value split is exactly the kind of thing that bites in 6 months. A one-line JSDoc:

```js
// NOTE: in disk mode, entries are freshly parsed copies — mutating
// returned entries does NOT persist back to the log store. In memory
// mode, returned entries are the live Set members.
```

Contributor Author


Resolved in 3dce017c. JSDoc rewritten:

Returns matching entries. The semantics differ by mode:

  • memory mode: returns the live entry refs from the fallback Set; mutations to entry.message (e.g. redactSecrets in percy.js sendBuildLogs) persist in the Set. This mirrors master's messages contract.
  • disk mode: streams a fresh JSONL pass per call and returns freshly-parsed copies. Mutations are local to the caller and never reach disk — intentional, since disk-backed redaction would require a rewrite.

Production consumers (sendBuildLogs) only depend on the array returned by redactSecrets, not the mutation side-effect, so both modes are correct.

```js
expect(logsA[0].message).toEqual('A2 (late)');
// disk still keeps the entry for the global /logs payload
expect(logger.query(() => true).find(e => e.message === 'A2 (late)')).toBeDefined();
});
```
Contributor


Test gap — this codifies the retry regression rather than catching it.

The assertion expect(logsA.length).toEqual(1) (only A2 (late), not A1 + A2) is the regression I flag in logger.js:167. In master, the in-memory Set retained A1, so a retry path would have surfaced both.

If the lossy-on-retry behavior is intentional, please:

  1. Rename the test to make it explicit: 'evictSnapshot is permanent — late entries form a fresh slice, prior entries are dropped'.
  2. Add a separate negative-assertion test confirming that on snapshot retry (calling snapshotLogs(meta) twice with no eviction in between, simulating a retry that re-runs processSnapshotResources), both attempts' logs are present.

If it's not intentional, the fix is in evictSnapshot (don't drop the cache entry, just mark it consumed and re-build from disk on next access).

Contributor Author


Resolved in 3dce017c — fixed the underlying behaviour rather than codifying the regression.

Old assertion (this PR's previous commit):

```js
expect(logsA.length).toEqual(1);  // only A2 (late) — lossy
```

New assertion:

```js
expect(logsA.map(l => l.message)).toEqual(['A1', 'A2 (retry)']);  // both attempts
```

Test renamed: 'on retry after evictSnapshot, snapshotLogs returns BOTH attempts (master parity)'. The fix mechanism is the pendingFullScan rescan documented in the reply on logger.js:167.

The companion test 'evictSnapshot drops the cache but disk + retry rescan still surface the entries' validates the single-attempt eviction-then-resnapshot case so the rescan path is independently tested.

```js
// the default 10-listener limit and trips MaxListenersExceededWarning
// in downstream test suites. On Ctrl-C / runner kill our JSONL is left
// in os.tmpdir()/percy-logs/<pid>/ which the OS cleans, and the
// pid-scoped subdir prevents concurrent runs from colliding.
```
Contributor


🔵 Acknowledgement — the SIGINT/SIGTERM punt is reasonable but worth a follow-up.

Not blocking on this; the trade-off is documented and the OS cleans /tmp eventually. But on macOS the /tmp cleaner runs every 3 days, and on long-lived dev machines the percy-logs/<pid>/ dirs from killed runs will accumulate. If anyone hits a complaint, consider:

  • A startup sweep that removes any percy-logs/* subdirs whose pid is no longer alive (process.kill(pid, 0) throws ESRCH).
  • Or move the dir under os.tmpdir()/percy/<run-id> and rely on a single boot sweep.

Definitely a follow-up, not in-scope here.

Contributor Author


Acknowledged — leaving as a follow-up per your note. Filed mentally as: "startup sweep over percy-logs/* removes subdirs whose pid is no longer alive (process.kill(pid, 0) throws ESRCH)". That's the cleanest of the two options you listed and avoids the race window of moving the dir per-run.

Will open a follow-up issue if it comes up in real complaints; not blocking this PR.

Replaces @percy/logger's unbounded `messages` Set with a JSONL-backed
hybrid store that keeps resident memory bounded across long builds
(10k snapshots, 8-hour deferUploads runs) while preserving byte-for-byte
parity with master's `/logs` upload payload and per-snapshot log resources.

Design
------
- writes go through a 500-entry / 100ms buffer flushed via
  fs.appendFileSync to ${tmpdir}/percy-logs/<pid>/<ts>-<rand>.jsonl
- snapshotLogs(meta) reads the disk delta into a bounded `cache` keyed by
  snapshot meta; evictSnapshot drops the cache entry once the snapshot's
  upload is complete; late entries are allowed to repopulate (retry-safe)
- query(filter) streams the JSONL once per call (chunked 64KB read);
  in-memory mode preserves master's identity-mutation contract that
  redactSecrets relies on
- disk-init failures, mid-build appendFileSync failures, and the
  PERCY_LOGS_IN_MEMORY=1 rollback all flip to an unbounded in-memory Set
  (master parity) — including replaying entries already on disk so the
  upload still includes them
- circular meta is sanitized via JSON.stringify try/catch, but
  meta.snapshot is preserved so the entry still routes via snapshotLogs
- exit cleanup uses a process-scoped Set on `process[Symbol.for(...)]`
  shared across module copies; supports multiple live instances and
  unlinks every active disk file on `exit` / `beforeExit`
- per-pid subdir prevents concurrent percy processes (CI matrix, parallel
  workers, npx invocations) from clobbering each other's files; cleanup
  best-effort rmdirs the subdir so long-lived runners don't accumulate

Wiring
------
- packages/core/src/discovery.js — uses snapshotLogs/evictSnapshot for
  per-snapshot log resources
- packages/core/src/api.js — /test/logs and the test-mode reset path now
  use logger.query() / logger.instance.reset()
- packages/core/test/helpers/index.js — defaults setupTest to memory mode
  (master parity) so downstream tests using mockfs don't flake against
  the disk path's live volume

Tests
-----
- 68 specs, 100% statements/branches/functions/lines on logger.js
- the existing 37-case logger suite runs under the disk path by default
  (no PERCY_LOGS_IN_MEMORY set in helpers.mock); 25+ new specs in
  describe('disk-backed storage') cover round-trip, snapshotLogs / evict
  retry semantics, fallback after appendFileSync / mkdirSync failures,
  the 100ms timer, the 500-entry size cap, per-pid subdir, the
  Symbol.for latch, multi-instance cleanup, rmdir best-effort, and
  circular-meta snapshot preservation
- cli-exec / cli-snapshot / cli-build / cli-doctor / cli-upload /
  cli-command / cli / core / config / client / env / monitoring /
  webdriver-utils all green locally on Node 14

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pranavz28 pranavz28 force-pushed the feat/per-7809-disk-logs-minimal branch from 6bd3c68 to 3dce017 on April 30, 2026 at 07:54