Skip to content

feat(core): structured abort errors, per-instance network timeout, log redaction (PER-7855 phase 1/3)#2196

Closed
Shivanshu-07 wants to merge 1 commit intomasterfrom
feat/per-7855-cli-qos-phase-1
Closed

feat(core): structured abort errors, per-instance network timeout, log redaction (PER-7855 phase 1/3)#2196
Shivanshu-07 wants to merge 1 commit intomasterfrom
feat/per-7855-cli-qos-phase-1

Conversation

@Shivanshu-07
Copy link
Copy Markdown
Contributor

@Shivanshu-07 Shivanshu-07 commented Apr 27, 2026

Summary

Phase 1 of PER-7855 — proactive CLI hardening (no incident driving it; YAGNI applies). Network refactors plus three small wins, isolated to core/src/network.js + core/src/utils.js for low-risk first-PR.

  • R4Network.TIMEOUT static class field → per-instance networkIdleWaitTimeout. Concurrent pages with different env values no longer overwrite each other.
  • R5 — Export AbortCodes enum (ABORTED, TIMEOUT_NETWORK_IDLE). Aborted-request throw at network.js:138 carries {code, reason} via the existing AbortError class. The consumer at :529 prefers error.code === 'ABORTED'; legacy string-match clauses retained for BC.
  • R6redactSecrets() wraps the warn/debug logs in executeDomainValidation (utils.js:200, 212-213). Upstream errors that echo response bodies no longer leak AWS keys, URL-embedded credentials, etc., to stderr or build logs.
  • R7 — Append actionable hint to network-idle timeout message: `Hint: set PERCY_NETWORK_IDLE_WAIT_TIMEOUT to increase the budget, or allowlist slow domains via the discovery config.`

Implementation Note

The deepened plan called for `_throwTimeoutError` to throw `AbortError`, but that broke a chain of consumers — `error.name === 'AbortError'` is checked at `discovery.js:520`, `percy.js:347`, and `snapshot.js:472`, all of which treat aborts as "snapshot cancelled" rather than as errors. The network-idle timeout uses a plain `Error` with `code`/`reason` properties; only the explicit browser-cancellation path uses `AbortError`. This learning is captured in the plan (docs/plans/2026-04-27-001-feat-per-7855-cli-qos-hardening-plan.md).

Tests

  • 6 new specs:
    • SC6 per-instance timeout (init from env, fallback to 30000ms)
    • R5 AbortCodes shape + AbortError name preservation
    • SC8 redactSecrets fixtures (AWS keys, URL-embedded credentials)
  • Updated existing idle-timeout assertions in `discovery.test.js` for the new hint message; removed `Network.TIMEOUT` reset infrastructure that the static-field refactor obviates.

Test run on this branch: 690 specs, 28 pre-existing failures (21 `Unit / Install Chromium` with `TypeError: Cannot read properties of undefined (reading 'set')`, 5 `runDoctorOnFailure`, 1 `API Server when the server is disabled`, 1 flaky `Snapshot with browsers`). None touch any file in this PR. Confirmed pre-existing on master via stash + run.

Test plan

  • CI green on Linux + macOS + Windows (modulo the pre-existing failures above)
  • Unit / Network tests pass (4 new specs)
  • Unit / Utils > redactSecrets > SC8 fixtures pass (2 new specs)
  • Discovery > idle timeout tests pass with the hint
  • No regression in the consumer at network.js:529 swallow path
  • Manual smoke: trigger a network-idle timeout in a snapshot, confirm the hint appears in stderr

Post-Deploy Monitoring & Validation

  • What to monitor/search
    • Logs: any sudden uptick of `AbortError` instances reaching `discovery.js:520` or `percy.js:347` after release — indicates the structured-error refactor accidentally re-enabled a code path
    • Metrics/Dashboards: `PERCY_NETWORK_IDLE_WAIT_TIMEOUT` related support tickets (should decrease as users hit the new hint)
  • Validation checks
    • Search `#percy-cli` Slack for any "Request was aborted by browser" reports post-merge — the message string is preserved for BC, so this should be unchanged
  • Expected healthy behavior
    • Idle-timeout errors include the env-var hint
    • Domain-validation errors no longer leak credentials
  • Failure signal(s) / rollback trigger
    • Snapshot count drops or snapshots silently disappear (would indicate AbortError name mismatch)
    • Build logs continue to contain unredacted secrets in `Domain validation` errors
  • Validation window & owner
    • Window: 24h post-merge
    • Owner: @shivanshu.si

Origin / Plan


Compound Engineering v2.50.0
🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via Claude Code

…g redaction (PER-7855)

Phase 1 of PER-7855 CLI QoS hardening — network refactors plus small wins:

R4 — Move `Network.TIMEOUT` from a static class field to a per-instance
`networkIdleWaitTimeout`, derived from PERCY_NETWORK_IDLE_WAIT_TIMEOUT in
the constructor. Concurrent pages with different env values no longer
overwrite each other's timeout.

R5 — Export `AbortCodes` enum (`ABORTED`, `TIMEOUT_NETWORK_IDLE`). Throws
from `Network#send` for aborted requests now carry `{code, reason}` via
the existing `AbortError` class. The consumer at `network.js:529` prefers
`error.code === 'ABORTED'`; legacy string-match clauses retained for BC.

R6 — Wrap `redactSecrets()` around the warn/debug logs in
`executeDomainValidation` (`utils.js:200, 212-213`). Upstream errors that
echo response bodies no longer leak AWS keys, URL-embedded credentials,
etc., to stderr or build logs.

R7 — Append actionable hint to network-idle timeout message: "Hint: set
PERCY_NETWORK_IDLE_WAIT_TIMEOUT to increase the budget, or allowlist slow
domains via the discovery config."

Implementation note: the deepened plan called for `_throwTimeoutError`
to throw `AbortError`, but `error.name === 'AbortError'` is checked by
`discovery.js:520`, `percy.js:347`, and `snapshot.js:472` — all of which
treat aborts as "snapshot cancelled" rather than as errors. The
network-idle timeout uses a plain `Error` with `code`/`reason`
properties; only the explicit browser-cancellation path uses
`AbortError`.

Tests added: 6 new specs (SC6 per-instance timeout, R5 AbortCodes
shape, SC8 redactSecrets fixtures for AWS keys + URL-embedded creds).
Existing idle-timeout assertions in `discovery.test.js` updated for
the new hint message and removed the `Network.TIMEOUT` reset infra
that the static-field refactor obviates.

Origin:        docs/brainstorms/2026-04-24-per-7855-cli-qos-hardening-requirements.md
Plan:          docs/plans/2026-04-27-001-feat-per-7855-cli-qos-hardening-plan.md
Phase 2 next:  per-port lockfile (PER-7855)

Co-Authored-By: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com>
@Shivanshu-07
Copy link
Copy Markdown
Contributor Author

Closing in favor of consolidated PR #2199, which contains all three commits (the same content) so review can happen against a single diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant