daemon: configurable max-age to recycle RPC connections#225

Merged: DeviaVir merged 1 commit into new-index from daemon-conn-max-age on Jun 18, 2026
@DeviaVir DeviaVir commented Jun 16, 2026

Problem

electrs holds long-lived TCP connections to bitcoind for the whole process lifetime and re-establishes them only on error. When electrs reaches the daemon through a fronting load balancer, a connection established before a rotation keeps routing to its original backend via the existing TCP/conntrack flow. Even after that backend is rotated out (or a closer or healthier one appears), a connection that is still "working" is never rebalanced, so it can stay pinned to a stale or sub-optimal endpoint indefinitely.

Note: daemon_rpc_addr is resolved to a SocketAddr once at startup (config::str_to_socketaddr). For a Kubernetes ClusterSetIP (`*.clusterset.local`), that address is a stable virtual IP, so the fix here is to recycle the TCP connection (forcing the LB to re-select a backend), not to re-resolve DNS. If true DNS re-resolution is ever needed for a different deployment, that's a follow-up.

Change

Adds a --daemon-rpc-conn-max-age <seconds> option:

  • Each Connection records when it was established. Before sending a request, if the connection is older than the configured max age it is proactively recycled (reconnect()), giving the load balancer a fresh backend selection.
  • Defaults to 0 = unlimited / never recycle, so behavior is unchanged unless an operator opts in.
  • The check also covers the per-thread connections used for parallel RPC (requests_iter), which are the ones that fan out across backends.
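The age check described in the bullets above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: the struct omits the real TCP stream, `is_expired_at`, `recycle_if_expired`, and the `reconnects` counter are hypothetical helpers added for the demo, and treating "older than" as a strict `>` comparison is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the daemon connection; the real struct also
/// owns the TCP stream, which is omitted here.
struct Connection {
    established: Instant,
    max_age: Option<Duration>, // None = never recycle (the `0` default)
    reconnects: u32,           // demo-only counter, not part of the PR
}

impl Connection {
    fn new(max_age: Option<Duration>) -> Self {
        Connection {
            established: Instant::now(),
            max_age,
            reconnects: 0,
        }
    }

    /// True when the connection is older than `max_age`, measured at `now`.
    fn is_expired_at(&self, now: Instant) -> bool {
        match self.max_age {
            Some(max_age) => now.duration_since(self.established) > max_age,
            None => false,
        }
    }

    fn is_expired(&self) -> bool {
        self.is_expired_at(Instant::now())
    }

    /// Placeholder for re-establishing the socket; here it only resets the
    /// age clock and bumps the demo counter.
    fn reconnect(&mut self) {
        self.established = Instant::now();
        self.reconnects += 1;
    }

    /// Intended to run before each request: recycle first if too old.
    /// Returns whether a recycle happened.
    fn recycle_if_expired(&mut self) -> bool {
        if self.is_expired() {
            self.reconnect();
            true
        } else {
            false
        }
    }
}
```

With `max_age` set to `None`, `recycle_if_expired` never reconnects, which matches the unchanged default behavior.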

Files

  • src/config.rs — new CLI flag + daemon_conn_max_age: Option<Duration> (None when 0).
  • src/daemon.rs — Connection tracks established/max_age, adds is_expired(), and call_jsonrpc recycles expired connections; threaded through Daemon.
  • src/bin/electrs.rs, src/bin/tx-fingerprint-stats.rs — pass the new value.
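The "None when 0" mapping for daemon_conn_max_age might look like the sketch below. The function name `max_age_from_secs` is hypothetical; the PR's actual parsing lives in the clap-based config code and may be structured differently.

```rust
use std::time::Duration;

/// Maps the raw flag value (seconds) to the stored option.
/// `0` means unlimited, which is represented as `None`.
fn max_age_from_secs(secs: u64) -> Option<Duration> {
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}
```

Storing an `Option<Duration>` rather than a raw `0` keeps the "disabled" case explicit at every call site.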

Testing

  • cargo check --bins and cargo check --bins --features liquid both pass (only pre-existing warnings).
  • Not yet exercised at runtime — opening as draft for review of the approach (in particular: is recycling in call_jsonrpc the right spot vs. a background timer, and should the value also accept a human duration like 30m).

@DeviaVir DeviaVir self-assigned this Jun 16, 2026
@DeviaVir DeviaVir force-pushed the daemon-conn-max-age branch from bf22a45 to cc0942f Compare June 16, 2026 08:28
@DeviaVir DeviaVir marked this pull request as ready for review June 16, 2026 08:40
@DeviaVir DeviaVir requested review from EddieHouston and Copilot June 16, 2026 08:40

Copilot AI left a comment

Pull request overview

Adds an opt-in mechanism to proactively recycle long-lived bitcoind RPC TCP connections, enabling better backend rebalancing when electrs connects through a load balancer (e.g., avoiding “pinned” connections across node rotations).

Changes:

  • Introduces --daemon-rpc-conn-max-age <seconds> (default 0 = unlimited) and stores it as Option<Duration> in Config.
  • Tracks connection establishment time in Connection and reconnects in call_jsonrpc() when the configured max age is exceeded.
  • Threads the new configuration through binaries and test harness initialization.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/common.rs | Initializes the new Config::daemon_conn_max_age field and passes it into Daemon::new(). |
| src/daemon.rs | Adds connection age tracking + expiry check and proactive reconnect on JSON-RPC calls; threads config through Daemon. |
| src/config.rs | Adds CLI flag, parses to Option<Duration>, and stores it on Config. |
| src/bin/electrs.rs | Passes daemon_conn_max_age into Daemon::new(). |
| src/bin/tx-fingerprint-stats.rs | Passes daemon_conn_max_age into Daemon::new(). |

Comment thread src/config.rs Outdated
Comment thread src/daemon.rs Outdated
@EddieHouston

Collaborator

lgtm, looking forward to seeing how your testing with the call_jsonrpc at runtime goes.

@DeviaVir DeviaVir force-pushed the daemon-conn-max-age branch from 2b64ee4 to 9acb135 Compare June 18, 2026 09:24
Long-lived daemon RPC connections stay pinned to a single backend for
their whole lifetime. When electrs connects through a load balancer such
as a Kubernetes ClusterSetIP (`*.clusterset.local`), a connection
established before a node rotation keeps routing to the original backend
via the existing TCP/conntrack flow, even after healthier/closer
backends become available. The connection is only re-established on
error, so a still-working-but-stale endpoint is never rebalanced.

Add a `--daemon-rpc-conn-max-age` option (seconds). When a connection
exceeds the configured age it is proactively recycled before the next
request, re-establishing the TCP connection so the load balancer can
re-select a backend. Defaults to 0 = unlimited (never recycle), so
behavior is unchanged unless explicitly enabled. The age check is also
applied to the per-thread connections used for parallel RPC requests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

daemon: make max-age connection recycling best-effort

Proactive recycling previously called the infinite-retry reconnect path
while holding the daemon connection mutex, before sending the request.
A transient "new connections fail" event at the load balancer could
therefore block all requests on that connection instead of continuing to
use the existing, still-healthy socket -- turning an LB hiccup into an
electrs outage when --daemon-rpc-conn-max-age is enabled.

Split tcp_connect() into a single-attempt tcp_connect_once() (primary
then fallback, no retry/backoff) and keep the looping tcp_connect() for
startup and post-failure reconnects, where there is no usable socket to
fall back to. Max-age recycling now uses try_reconnect_once(): on
success the connection is swapped, on failure we log and keep the
existing connection, retrying recycling on a later request. Real
send/recv failures still go through the existing infinite reconnect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
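
The best-effort behavior in the commit message above (single attempt, primary then fallback, keep the old socket on failure) could be sketched like this. All names are hypothetical, and the `connect` closure stands in for the real TCP connect so the logic can be followed without a network; the actual tcp_connect_once() and try_reconnect_once() signatures are not shown in this PR page.

```rust
/// Hypothetical connection handle; `addr` records which endpoint answered
/// (primary or fallback), so logs can name the address actually in use.
#[derive(Debug, PartialEq)]
struct Conn {
    addr: String,
}

/// Single attempt: try the primary, then the optional fallback, with no
/// retry or backoff. Returns one descriptive error if every attempt fails.
fn connect_once<F>(primary: &str, fallback: Option<&str>, connect: &F) -> Result<Conn, String>
where
    F: Fn(&str) -> Result<Conn, String>,
{
    match connect(primary) {
        Ok(conn) => Ok(conn),
        Err(primary_err) => match fallback {
            Some(addr) => connect(addr).map_err(|fallback_err| {
                format!("primary: {}; fallback: {}", primary_err, fallback_err)
            }),
            None => Err(format!("primary: {}", primary_err)),
        },
    }
}

/// Best-effort recycle: swap in the fresh connection on success; on
/// failure, log and keep the existing one. Returns whether it swapped.
fn try_recycle_once<F>(
    current: &mut Conn,
    primary: &str,
    fallback: Option<&str>,
    connect: &F,
) -> bool
where
    F: Fn(&str) -> Result<Conn, String>,
{
    match connect_once(primary, fallback, connect) {
        Ok(fresh) => {
            *current = fresh;
            true
        }
        Err(err) => {
            eprintln!("recycle failed, keeping {}: {}", current.addr, err);
            false
        }
    }
}
```

Because the failure branch leaves `current` untouched, a caller can keep sending requests on the old socket and simply try the recycle again on a later request.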

daemon: address Copilot review nits

- config: parse --daemon-rpc-conn-max-age via value_t_or_exit! for
  consistent clap error handling instead of a manual parse + panic!.
- daemon: store the actually-connected address (primary or fallback) on
  Connection and log it when recycling, so diagnostics aren't misleading
  when connected to the fallback daemon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

daemon: rate-limit failed recycles, add metric + tests

Address review feedback on proactive max-age recycling:

- Blocker: a failed recycle attempt kept the existing connection (good)
  but did not update any timestamp, so is_expired() stayed true and every
  subsequent request re-attempted the recycle first -- each failed
  attempt blocking up to DAEMON_CONNECTION_TIMEOUT under the connection
  mutex. During a sustained "new connections fail" event this turned
  every fast RPC into a request paying a full connect timeout. Now a
  failed attempt records last_recycle_attempt and a cooldown
  (DAEMON_CONN_RECYCLE_COOLDOWN, default 30s) gates retries, so the old
  socket keeps serving requests at full speed between attempts.

- Extract the recycle decision into a pure `recycle_due()` helper and
  cover it with unit tests (max-age boundary, None, and cooldown).

- Add a daemon_rpc_conn_recycled{result="ok|failed"} counter so recycle
  behavior is observable in prod.

- tcp_connect_once no longer warns per-attempt; it returns one
  descriptive error that callers log, avoiding double log lines on the
  recycle path. The startup/error loop logs that error + backoff.

- Document in --daemon-rpc-conn-max-age help that the reconnect is inline
  on the request path, so the value should be generous (minutes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
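
A pure decision helper in the spirit of `recycle_due()` might look like the sketch below. The parameter list, the constant name `RECYCLE_COOLDOWN`, and the exact boundary comparisons are assumptions; only the 30s default and the three tested cases (max-age boundary, None, cooldown) come from the commit message above.

```rust
use std::time::{Duration, Instant};

/// Assumed cooldown between failed recycle attempts (the commit cites 30s).
const RECYCLE_COOLDOWN: Duration = Duration::from_secs(30);

/// Pure decision: should a request arriving at `now` attempt a recycle?
/// Taking `now` as a parameter keeps the function free of clock reads,
/// which makes it straightforward to unit test.
fn recycle_due(
    established: Instant,
    max_age: Option<Duration>,
    last_failed_attempt: Option<Instant>,
    now: Instant,
) -> bool {
    let max_age = match max_age {
        Some(d) => d,
        None => return false, // recycling disabled
    };
    if now.duration_since(established) <= max_age {
        return false; // not yet older than the max age
    }
    match last_failed_attempt {
        // After a failed attempt, wait out the cooldown before retrying.
        Some(failed_at) => now.duration_since(failed_at) >= RECYCLE_COOLDOWN,
        None => true,
    }
}
```

Between failed attempts the function returns `false`, so requests go straight to the existing socket instead of each paying a connect timeout.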
@DeviaVir DeviaVir force-pushed the daemon-conn-max-age branch from 9acb135 to 48418ca Compare June 18, 2026 09:25
@DeviaVir

Author

> lgtm, looking forward to seeing how your testing with the call_jsonrpc at runtime goes.

Verified call_jsonrpc is the right call:

call_jsonrpc vs. background timer → call_jsonrpc is correct. The connection model is decisive. Each Daemon owns one Mutex<Connection>, and for parallel batch requests() each rayon worker thread lazily creates its own thread-local Daemon + connection (DAEMON_INSTANCE thread-local). So at runtime there are 1 + daemon_parallelism independent, single-owner connections living in different threads' TLS.

A background timer thread cannot reach those connections: they're thread-local and created lazily, so enumerating and recycling them would require a shared connection registry and cross-thread synchronization, i.e. re-architecting the model. call_jsonrpc is the only place that runs on the owning thread with the lock held, so it's the natural, race-free spot. It also gives the right semantics for free: each connection ages independently, idle/abandoned worker connections are never recycled (no thundering herd, no wasted reconnects), and the cost is one reconnect amortized over a whole max-age window. The only real downside is that the triggering request pays the reconnect latency, but this is bounded and already documented ("minutes, not seconds"), and the best-effort failure path (keep the existing connection plus a cooldown) means a flapping LB can't stall requests.
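
The thread-local connection model behind this reasoning can be illustrated with a small sketch. It uses plain `std::thread` in place of rayon, and `Conn`, `WORKER_CONN`, `call_on_own_connection`, and `run_workers` are hypothetical names, not the DAEMON_INSTANCE code itself.

```rust
use std::cell::RefCell;
use std::thread;

/// Hypothetical per-thread connection; it only counts requests served.
struct Conn {
    requests: u32,
}

thread_local! {
    // Created lazily on first use by each thread and invisible to every
    // other thread, including any would-be background timer.
    static WORKER_CONN: RefCell<Option<Conn>> = RefCell::new(None);
}

/// Runs one "request" on the calling thread's own connection and returns
/// how many requests that connection has served so far.
fn call_on_own_connection() -> u32 {
    WORKER_CONN.with(|slot| {
        let mut slot = slot.borrow_mut();
        let conn = slot.get_or_insert_with(|| Conn { requests: 0 });
        conn.requests += 1;
        conn.requests
    })
}

/// Spawns `workers` threads that each issue `calls` requests and reports
/// the final per-connection request count seen by each worker.
fn run_workers(workers: usize, calls: u32) -> Vec<u32> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            thread::spawn(move || {
                (0..calls)
                    .map(|_| call_on_own_connection())
                    .last()
                    .unwrap_or(0)
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Each worker sees only its own counter, which is why any per-connection logic such as an age check has to run on the owning thread, inside the request path.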

@DeviaVir DeviaVir merged commit c795602 into new-index Jun 18, 2026
6 checks passed
@DeviaVir DeviaVir deleted the daemon-conn-max-age branch June 18, 2026 11:35
4 participants