daemon: configurable max-age to recycle RPC connections by DeviaVir · Pull Request #225 · Blockstream/electrs

DeviaVir · 2026-06-16T08:17:09Z

Problem

electrs holds long-lived TCP connections to bitcoind for the whole process lifetime, only re-establishing them on error. When electrs reaches the daemon through a fronting load balancer a connection established before a rotation keeps routing to its original backend via the existing TCP/conntrack flow. Even after that backend is rotated out (or a closer/healthier one appears), a connection that is still "working" is never rebalanced, so it can stay pinned to a stale or sub-optimal endpoint indefinitely.

Note: daemon_rpc_addr is resolved to a SocketAddr once at startup (config::str_to_socketaddr). For a ClusterSetIP that address is a stable virtual IP, so the fix here is to recycle the TCP connection (forcing the LB to re-select a backend), not to re-resolve DNS. If true DNS re-resolution is ever needed for a different deployment, that's a follow-up.

Change

Adds a --daemon-rpc-conn-max-age <seconds> option:

Each Connection records when it was established. Before sending a request, if the connection is older than the configured max age it is proactively recycled (reconnect()), giving the load balancer a fresh backend selection.
Defaults to 0 = unlimited / never recycle, so behavior is unchanged unless an operator opts in.
The check also covers the per-thread connections used for parallel RPC (requests_iter), which are the ones that fan out across backends.

Files

src/config.rs — new CLI flag + daemon_conn_max_age: Option<Duration> (None when 0).
src/daemon.rs — Connection tracks established/max_age, adds is_expired(), and call_jsonrpc recycles expired connections; threaded through Daemon.
src/bin/electrs.rs, src/bin/tx-fingerprint-stats.rs — pass the new value.

Testing

cargo check --bins and cargo check --bins --features liquid both pass (only pre-existing warnings).
Not yet exercised at runtime — opening as draft for review of the approach (in particular: is recycling in call_jsonrpc the right spot vs. a background timer, and should the value also accept a human duration like 30m).

Copilot

Pull request overview

Adds an opt-in mechanism to proactively recycle long-lived bitcoind RPC TCP connections, enabling better backend rebalancing when electrs connects through a load balancer (e.g., avoiding “pinned” connections across node rotations).

Changes:

Introduces --daemon-rpc-conn-max-age <seconds> (default 0 = unlimited) and stores it as Option<Duration> in Config.
Tracks connection establishment time in Connection and reconnects in call_jsonrpc() when the configured max age is exceeded.
Threads the new configuration through binaries and test harness initialization.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/common.rs	Initializes the new `Config::daemon_conn_max_age` field and passes it into `Daemon::new()`.
src/daemon.rs	Adds connection age tracking + expiry check and proactive reconnect on JSON-RPC calls; threads config through `Daemon`.
src/config.rs	Adds CLI flag, parses to `Option<Duration>`, and stores it on `Config`.
src/bin/electrs.rs	Passes `daemon_conn_max_age` into `Daemon::new()`.
src/bin/tx-fingerprint-stats.rs	Passes `daemon_conn_max_age` into `Daemon::new()`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

EddieHouston · 2026-06-16T14:26:59Z

lgtm, looking forward to seeing how your testing with the call_jsonrpc at runtime goes.

Long-lived daemon RPC connections stay pinned to a single backend for their whole lifetime. When electrs connects through a load balancer such as a Kubernetes ClusterSetIP (`*.clusterset.local`), a connection established before a node rotation keeps routing to the original backend via the existing TCP/conntrack flow, even after healthier/closer backends become available. The connection is only re-established on error, so a still-working-but-stale endpoint is never rebalanced. Add a `--daemon-rpc-conn-max-age` option (seconds). When a connection exceeds the configured age it is proactively recycled before the next request, re-establishing the TCP connection so the load balancer can re-select a backend. Defaults to 0 = unlimited (never recycle), so behavior is unchanged unless explicitly enabled. The age check is also applied to the per-thread connections used for parallel RPC requests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> daemon: make max-age connection recycling best-effort Proactive recycling previously called the infinite-retry reconnect path while holding the daemon connection mutex, before sending the request. A transient "new connections fail" event at the load balancer could therefore block all requests on that connection instead of continuing to use the existing, still-healthy socket -- turning an LB hiccup into an electrs outage when --daemon-rpc-conn-max-age is enabled. Split tcp_connect() into a single-attempt tcp_connect_once() (primary then fallback, no retry/backoff) and keep the looping tcp_connect() for startup and post-failure reconnects, where there is no usable socket to fall back to. Max-age recycling now uses try_reconnect_once(): on success the connection is swapped, on failure we log and keep the existing connection, retrying recycling on a later request. Real send/recv failures still go through the existing infinite reconnect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> daemon: address Copilot review nits - config: parse --daemon-rpc-conn-max-age via value_t_or_exit! for consistent clap error handling instead of a manual parse + panic!. - daemon: store the actually-connected address (primary or fallback) on Connection and log it when recycling, so diagnostics aren't misleading when connected to the fallback daemon. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> daemon: rate-limit failed recycles, add metric + tests Address review feedback on proactive max-age recycling: - Blocker: a failed recycle attempt kept the existing connection (good) but did not update any timestamp, so is_expired() stayed true and every subsequent request re-attempted the recycle first -- each failed attempt blocking up to DAEMON_CONNECTION_TIMEOUT under the connection mutex. During a sustained "new connections fail" event this turned every fast RPC into a request paying a full connect timeout. Now a failed attempt records last_recycle_attempt and a cooldown (DAEMON_CONN_RECYCLE_COOLDOWN, default 30s) gates retries, so the old socket keeps serving requests at full speed between attempts. - Extract the recycle decision into a pure `recycle_due()` helper and cover it with unit tests (max-age boundary, None, and cooldown). - Add a daemon_rpc_conn_recycled{result="ok|failed"} counter so recycle behavior is observable in prod. - tcp_connect_once no longer warns per-attempt; it returns one descriptive error that callers log, avoiding double log lines on the recycle path. The startup/error loop logs that error + backoff. - Document in --daemon-rpc-conn-max-age help that the reconnect is inline on the request path, so the value should be generous (minutes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

DeviaVir · 2026-06-18T11:34:29Z

lgtm, looking forward to seeing how your testing with the call_jsonrpc at runtime goes.

Verified call_jsonrpc is the right call:

call_jsonrpc vs. background timer → call_jsonrpc is correct. The connection model is decisive. Each Daemon owns one Mutex<Connection>, and for parallel batch requests() each rayon worker thread lazily creates its own thread-local Daemon + connection (DAEMON_INSTANCE thread-local). So at runtime there are 1 + daemon_parallelism independent, single-owner connections living in different threads' TLS.

A background timer thread cannot reach those connections: they're thread-local and created lazily; enumerating/recycling them would require a shared connection registry and cross-thread synchronization, i.e. re-architecting the model. call_jsonrpc is the only place that runs on the owning thread with the lock held, so it's the natural, race-free spot. It also gives the right semantics for free: each connection ages independently, idle/abandoned worker connections are never recycled (no thundering herd, no wasted reconnects), and the cost is one reconnect amortized over a whole max-age window. The only real downside is the triggering request pays the reconnect latency, this is bounded and already documented ("minutes, not seconds"), and the best-effort failure path (keep existing conn + cooldown) means a flapping LB can't stall requests.

DeviaVir self-assigned this Jun 16, 2026

DeviaVir force-pushed the daemon-conn-max-age branch from bf22a45 to cc0942f Compare June 16, 2026 08:28

DeviaVir marked this pull request as ready for review June 16, 2026 08:40

DeviaVir requested review from EddieHouston and Copilot June 16, 2026 08:40

Copilot started reviewing on behalf of DeviaVir June 16, 2026 08:41 View session

Copilot AI reviewed Jun 16, 2026

View reviewed changes

Comment thread src/config.rs Outdated

Comment thread src/daemon.rs Outdated

Randy808 approved these changes Jun 16, 2026

View reviewed changes

DeviaVir force-pushed the daemon-conn-max-age branch from 2b64ee4 to 9acb135 Compare June 18, 2026 09:24

DeviaVir force-pushed the daemon-conn-max-age branch from 9acb135 to 48418ca Compare June 18, 2026 09:25

DeviaVir merged commit c795602 into new-index Jun 18, 2026
6 checks passed

DeviaVir deleted the daemon-conn-max-age branch June 18, 2026 11:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

daemon: configurable max-age to recycle RPC connections#225

daemon: configurable max-age to recycle RPC connections#225
DeviaVir merged 1 commit into
new-indexfrom
daemon-conn-max-age

DeviaVir commented Jun 16, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

EddieHouston commented Jun 16, 2026

Uh oh!

DeviaVir commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

DeviaVir commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Change

Files

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

EddieHouston commented Jun 16, 2026

Uh oh!

DeviaVir commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

DeviaVir commented Jun 16, 2026 •

edited

Loading