fix(webapp): auto-recover replication services after stream errors #3613

Open

ericallam wants to merge 5 commits into main from fix/replication-auto-recover-on-stream-error

Conversation

@ericallam
Member

Summary

When the logical-replication stream errored (most commonly after a Postgres failover), the runs and sessions replication services logged the error and left the underlying client stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind.

Fix

Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own.

  • reconnect (default) — re-subscribe with exponential backoff (1s → 60s cap, unlimited attempts). LogicalReplicationClient.subscribe(lastLsn) re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN.
  • exit — process.exit(1) after a short flush window so a host supervisor (Docker restart=always, systemd, k8s) can replace the process.
  • log — preserves the old behaviour.

Per-service strategy + exit knobs are env-driven (RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY + *_EXIT_DELAY_MS, *_EXIT_CODE). Reconnect tuning is shared across both services (REPLICATION_RECONNECT_INITIAL_DELAY_MS, _MAX_DELAY_MS, _MAX_ATTEMPTS; MAX_ATTEMPTS=0 means unlimited).
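
For orientation, here is a minimal sketch of the backoff schedule these knobs imply. The type and function names (ReconnectTuning, nextDelayMs) are illustrative, not the PR's actual code:

type ReconnectTuning = {
  initialDelayMs: number; // REPLICATION_RECONNECT_INITIAL_DELAY_MS, default 1_000
  maxDelayMs: number; // REPLICATION_RECONNECT_MAX_DELAY_MS, default 60_000
  maxAttempts: number; // REPLICATION_RECONNECT_MAX_ATTEMPTS, 0 = unlimited
};

// Returns the delay before the given attempt, or undefined once the
// attempt budget is exhausted (never, when maxAttempts is 0).
function nextDelayMs(attempt: number, tuning: ReconnectTuning): number | undefined {
  if (tuning.maxAttempts > 0 && attempt >= tuning.maxAttempts) {
    return undefined;
  }
  // Exponential backoff: 1s, 2s, 4s, ... capped at the 60s default.
  return Math.min(tuning.initialDelayMs * 2 ** attempt, tuning.maxDelayMs);
}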

Test plan

Integration tests cover all three strategies by simulating a failover with pg_terminate_backend against the WAL sender:

  • reconnect — kill the backend, insert a new row, assert it lands in ClickHouse
  • exit — kill the backend, assert process.exit(1) is called
  • log — kill the backend, insert a new row, assert it does not land in ClickHouse
pnpm --filter webapp test --run runsReplicationService.errorRecovery
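
For context, forcing the failover amounts to terminating the WAL sender backend serving the replication slot. A hedged sketch of that step; the helper name, pg client usage, and slot-name parameter are assumptions, not the tests' actual code:

import { Client } from "pg";

// Kills the WAL sender for the given replication slot, which the
// services observe as a stream error and feed into the recovery strategy.
async function terminateWalSender(connectionString: string, slotName: string) {
  const client = new Client({ connectionString });
  await client.connect();
  await client.query(
    `SELECT pg_terminate_backend(active_pid)
       FROM pg_replication_slots
      WHERE slot_name = $1 AND active_pid IS NOT NULL`,
    [slotName]
  );
  await client.end();
}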

@changeset-bot

changeset-bot Bot commented May 13, 2026

⚠️ No Changeset found

Latest commit: 5944ff6

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR adds configurable error recovery for the runs and sessions replication services. When a logical replication stream fails (e.g., during a database failover), the system can reconnect with exponential backoff, exit to let an external supervisor restart the host, or remain stopped with logging. Environment variables control per-service strategy selection and tuning. The implementation integrates into both services' lifecycle (on error, stream start, and shutdown) and is validated through containerized integration tests that force replication stream failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 9.09%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4)

  • Title check — ✅ Passed: the PR title 'fix(webapp): auto-recover replication services after stream errors' clearly and concisely summarizes the main change, adding automatic error recovery to replication services.
  • Description check — ✅ Passed: the description provides a comprehensive summary, a detailed explanation of the fix with three strategies, environment variable configuration details, and test plan coverage.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.



ericallam force-pushed the fix/replication-auto-recover-on-stream-error branch from 6f8cc24 to 5ba46ff on May 14, 2026 09:33

ericallam force-pushed the fix/replication-auto-recover-on-stream-error branch 2 times, most recently from bc57072 to 969dbdb on May 15, 2026 07:31

ericallam force-pushed the fix/replication-auto-recover-on-stream-error branch from 969dbdb to 964b7e4 on May 15, 2026 15:31

ericallam force-pushed the fix/replication-auto-recover-on-stream-error branch from 964b7e4 to a2eaf3e on May 15, 2026 16:09

ericallam added 5 commits on May 16, 2026 16:57
When the underlying logical-replication client errored (e.g. after a
Postgres failover), the runs and sessions replication services logged
the error and left the stream stopped. The host process kept running,
the WAL backed up, and ClickHouse silently fell behind.

Both services now run a configurable recovery strategy on stream errors,
defaulting to in-process reconnect with exponential backoff so a fresh
self-hosted setup heals on its own:

- "reconnect" (default) re-subscribes via the existing subscribe(lastLsn)
  path with exponential backoff (1s -> 60s cap, unlimited attempts), which
  re-validates the publication, re-acquires the leader lock, and resumes
  from the last acknowledged LSN.
- "exit" calls process.exit after a short flush window so a host's
  supervisor (Docker restart=always, systemd, k8s, etc.) can replace the
  process.
- "log" preserves the historical behaviour.

Per-service strategy + exit knobs are env-driven via
RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus
matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared
across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS /
_MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).
Addresses PR review feedback:

- LogicalReplicationClient.subscribe() can throw before its internal
  "error" listener is wired up (notably when pg client.connect() fails
  mid-failover). The reconnect strategy's catch block only logged, so
  recovery silently stopped. Now also calls scheduleReconnect(err) — the
  pendingReconnect guard makes it idempotent if an error event was also
  emitted.
- Reject negative values for the new replication-recovery env vars and
  cap exit codes at 255.
- Convert the new ReplicationErrorRecovery{Deps,} interfaces to type
  aliases to match the repo's TypeScript style.
- Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn"
  reference (the wrapper-tracked resume LSN is what callers actually pass).
- Restore process.exit after service.shutdown() in the exit-strategy
  test so a delayed exit timer can't terminate the test worker.
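
A sketch of the pendingReconnect guard this commit describes, in a hypothetical shape; the real logic lives in replicationErrorRecovery.server.ts:

// Calling scheduleReconnect from both the subscribe() catch block and the
// "error" listener arms only one timer, which is what makes the double
// call path idempotent.
let pendingReconnect: NodeJS.Timeout | undefined;

function scheduleReconnect(delayMs: number, reconnect: () => Promise<void>) {
  if (pendingReconnect) return; // a reconnect is already scheduled
  pendingReconnect = setTimeout(async () => {
    pendingReconnect = undefined;
    try {
      await reconnect();
    } catch {
      // The next attempt backs off further; capping happens elsewhere.
      scheduleReconnect(delayMs * 2, reconnect);
    }
  }, delayMs);
}
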
LogicalReplicationClient.subscribe() can resolve without throwing or
emitting an "error" event when leader-lock acquisition fails — it just
calls this.stop() and returns. The reconnect callback now checks
isStopped after subscribe() and throws so the recovery handler can
schedule the next attempt instead of silently giving up.
…rough handle()

The previous post-subscribe() isStopped check was always true on the
happy path: subscribe() calls stop() up front (setting _isStopped=true)
and only resets the flag inside the replicationStart event, which fires
asynchronously after subscribe() returns. So the check threw on every
successful reconnect, the catch rescheduled, the next attempt tore down
the just-built client, and the cycle continued — replication briefly
worked between teardowns, which is why the integration test passed.

Replace it with the correct nudge: subscribe to leaderElection and call
the recovery handler on isLeader=false. That's the only subscribe()
exit path that doesn't either throw or emit an "error" event (the other
silent-return paths emit "error" first via createPublication/createSlot
failures).
The previous commit routed leaderElection(false) through handle(), which
under the exit strategy schedules process.exit. In a multi-instance
deployment that turns lost leader election — a normal operational state
— into a restart loop: exit, supervisor restarts, election fails again,
exit, and so on.

Add a dedicated notifyLeaderElectionLost() on ReplicationErrorRecovery
that the reconnect strategy treats as another retry trigger, while
exit and log strategies no-op. Wire the wrapper services through the
new method.
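
A sketch of the split this commit introduces. The type and method names match the commit message; the bodies and the flush-window value are illustrative:

type ReplicationErrorRecovery = {
  handle(err: unknown): void; // stream errors: every strategy reacts
  notifyLeaderElectionLost(): void; // reconnect retries; exit and log no-op
};

// Under the exit strategy, losing leader election is a normal state in a
// multi-instance deployment, so it must not schedule process.exit.
const exitStrategy: ReplicationErrorRecovery = {
  handle: () => setTimeout(() => process.exit(1), 1_000), // flush window is illustrative
  notifyLeaderElectionLost: () => {}, // deliberate no-op
};
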
ericallam force-pushed the fix/replication-auto-recover-on-stream-error branch from 499060e to 5944ff6 on May 16, 2026 15:57
Contributor

devin-ai-integration Bot left a comment


Devin Review found 3 new potential issues.

View 14 additional findings in Devin Review.


Comment on lines +79 to +81
const exitSpy = vi
  .spyOn(process, "exit")
  .mockImplementation(((code?: number) => undefined as never) as typeof process.exit);
Contributor


🟡 Test uses vi.spyOn/mockImplementation in violation of repository testing rules

The exit strategy test uses vi.spyOn(process, "exit").mockImplementation(...) at lines 79-81. The repository's testing guidelines in ai/references/tests.md:78-81 explicitly state "Do not mock anything", "Do not use mocks in tests", "Do not use spies in tests", and "Do not use stubs in tests". CLAUDE.md reinforces this with "Never mock anything - use testcontainers instead." The existing sibling replication tests (test/runsReplicationService.part1.test.ts, test/runsReplicationService.part2.test.ts) follow this convention and use no mocks.

Pragmatic context

Mocking process.exit is arguably necessary here since the alternative (letting the process actually exit) would kill the test runner. There are also ~20 existing violations of this rule elsewhere in the webapp test suite. One option would be to restructure the exit strategy to call an injectable callback instead of process.exit directly, making it testable without mocks.

Prompt for agents
The exit strategy test spies on process.exit and uses mockImplementation, violating the repository rule against mocks/spies in tests (see ai/references/tests.md). To make this testable without mocks, consider refactoring the exit strategy in replicationErrorRecovery.server.ts to accept an injectable exit function (defaulting to process.exit). The test can then pass a no-op or tracking function instead of spying on the global. This would keep the test aligned with the existing replication service test conventions in runsReplicationService.part1.test.ts and part2.test.ts which use no mocks.
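
A sketch of the injectable-exit refactor the prompt describes; the function and type names (makeExitStrategy, ExitFn) are illustrative:

type ExitFn = (code: number) => void;

function makeExitStrategy(opts: { delayMs: number; code: number; exit?: ExitFn }) {
  // Tests can pass a recording function here instead of spying on the global.
  const exit: ExitFn = opts.exit ?? ((code) => process.exit(code));
  return {
    handle(_err: unknown) {
      setTimeout(() => exit(opts.code), opts.delayMs);
    },
  };
}

// In a test: const codes: number[] = [];
// const strategy = makeExitStrategy({ delayMs: 0, code: 1, exit: (c) => codes.push(c) });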

Comment on lines +70 to 78
errorRecovery: strategyFromEnv({
strategy: env.SESSION_REPLICATION_ERROR_STRATEGY,
reconnectInitialDelayMs: env.REPLICATION_RECONNECT_INITIAL_DELAY_MS,
reconnectMaxDelayMs: env.REPLICATION_RECONNECT_MAX_DELAY_MS,
reconnectMaxAttempts: env.REPLICATION_RECONNECT_MAX_ATTEMPTS,
exitDelayMs: env.SESSION_REPLICATION_EXIT_DELAY_MS,
exitCode: env.SESSION_REPLICATION_EXIT_CODE,
}),
});
Contributor


🚩 Sessions replication service is created with error recovery but never started in the instance file

The sessionsReplicationInstance.server.ts creates the SessionsReplicationService with full error recovery configuration but never calls service.start() — unlike runsReplicationInstance.server.ts:83-93 which conditionally starts and registers signal handlers. The sessions instance is reference-held in adminWorker.server.ts:12 (void sessionsReplicationInstance) with a comment claiming it "subscribes to the logical replication slot, wires signal handlers" — but neither happens. The error recovery added by this PR would only become effective if/when start() is called externally. This is a pre-existing issue, not introduced by this PR.


}),
});

if (env.RUN_REPLICATION_ENABLED === "1") {
Contributor


🚩 Initial start() failure from pg connection error bypasses error recovery

If the subscribe() call inside start() throws because client.connect() fails (at internal-packages/replication/src/client.ts:284), the error propagates to the instance's .catch() handler at runsReplicationInstance.server.ts:89-93, which only logs. No error event is emitted by the LogicalReplicationClient for connect() failures, so the error recovery never triggers and the service is left dead. This is a pre-existing gap — the PR's error recovery correctly handles mid-stream failures and leader election losses, but the initial connection failure path remains unprotected. The reconnect strategy DOES handle this case during subsequent reconnect attempts (the catch block at replicationErrorRecovery.server.ts:98-110 re-schedules), so only the very first start() call is affected.

(Refers to lines 83-93)
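
One possible wiring for that gap, sketched under the assumption that the instance file can reach the recovery object (hypothetical; not part of this PR):

declare const service: { start(): Promise<void> };
declare const errorRecovery: { handle(err: unknown): void };

// Previously the .catch only logged; routing the error into the recovery
// handler would let the reconnect strategy cover the very first attempt too.
service.start().catch((err) => errorRecovery.handle(err));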


