daemon: configurable max-age to recycle RPC connections#225
Conversation
bf22a45 to
cc0942f
Compare
There was a problem hiding this comment.
Pull request overview
Adds an opt-in mechanism to proactively recycle long-lived bitcoind RPC TCP connections, enabling better backend rebalancing when electrs connects through a load balancer (e.g., avoiding “pinned” connections across node rotations).
Changes:
- Introduces
--daemon-rpc-conn-max-age <seconds>(default0= unlimited) and stores it asOption<Duration>inConfig. - Tracks connection establishment time in
Connectionand reconnects incall_jsonrpc()when the configured max age is exceeded. - Threads the new configuration through binaries and test harness initialization.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/common.rs | Initializes the new Config::daemon_conn_max_age field and passes it into Daemon::new(). |
| src/daemon.rs | Adds connection age tracking + expiry check and proactive reconnect on JSON-RPC calls; threads config through Daemon. |
| src/config.rs | Adds CLI flag, parses to Option<Duration>, and stores it on Config. |
| src/bin/electrs.rs | Passes daemon_conn_max_age into Daemon::new(). |
| src/bin/tx-fingerprint-stats.rs | Passes daemon_conn_max_age into Daemon::new(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
lgtm, looking forward to seeing how your testing with the call_jsonrpc at runtime goes. |
2b64ee4 to
9acb135
Compare
Long-lived daemon RPC connections stay pinned to a single backend for
their whole lifetime. When electrs connects through a load balancer such
as a Kubernetes ClusterSetIP (`*.clusterset.local`), a connection
established before a node rotation keeps routing to the original backend
via the existing TCP/conntrack flow, even after healthier/closer
backends become available. The connection is only re-established on
error, so a still-working-but-stale endpoint is never rebalanced.
Add a `--daemon-rpc-conn-max-age` option (seconds). When a connection
exceeds the configured age it is proactively recycled before the next
request, re-establishing the TCP connection so the load balancer can
re-select a backend. Defaults to 0 = unlimited (never recycle), so
behavior is unchanged unless explicitly enabled. The age check is also
applied to the per-thread connections used for parallel RPC requests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
daemon: make max-age connection recycling best-effort
Proactive recycling previously called the infinite-retry reconnect path
while holding the daemon connection mutex, before sending the request.
A transient "new connections fail" event at the load balancer could
therefore block all requests on that connection instead of continuing to
use the existing, still-healthy socket -- turning an LB hiccup into an
electrs outage when --daemon-rpc-conn-max-age is enabled.
Split tcp_connect() into a single-attempt tcp_connect_once() (primary
then fallback, no retry/backoff) and keep the looping tcp_connect() for
startup and post-failure reconnects, where there is no usable socket to
fall back to. Max-age recycling now uses try_reconnect_once(): on
success the connection is swapped, on failure we log and keep the
existing connection, retrying recycling on a later request. Real
send/recv failures still go through the existing infinite reconnect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
daemon: address Copilot review nits
- config: parse --daemon-rpc-conn-max-age via value_t_or_exit! for
consistent clap error handling instead of a manual parse + panic!.
- daemon: store the actually-connected address (primary or fallback) on
Connection and log it when recycling, so diagnostics aren't misleading
when connected to the fallback daemon.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
daemon: rate-limit failed recycles, add metric + tests
Address review feedback on proactive max-age recycling:
- Blocker: a failed recycle attempt kept the existing connection (good)
but did not update any timestamp, so is_expired() stayed true and every
subsequent request re-attempted the recycle first -- each failed
attempt blocking up to DAEMON_CONNECTION_TIMEOUT under the connection
mutex. During a sustained "new connections fail" event this turned
every fast RPC into a request paying a full connect timeout. Now a
failed attempt records last_recycle_attempt and a cooldown
(DAEMON_CONN_RECYCLE_COOLDOWN, default 30s) gates retries, so the old
socket keeps serving requests at full speed between attempts.
- Extract the recycle decision into a pure `recycle_due()` helper and
cover it with unit tests (max-age boundary, None, and cooldown).
- Add a daemon_rpc_conn_recycled{result="ok|failed"} counter so recycle
behavior is observable in prod.
- tcp_connect_once no longer warns per-attempt; it returns one
descriptive error that callers log, avoiding double log lines on the
recycle path. The startup/error loop logs that error + backoff.
- Document in --daemon-rpc-conn-max-age help that the reconnect is inline
on the request path, so the value should be generous (minutes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
9acb135 to
48418ca
Compare
Verified call_jsonrpc is the right call:
A background timer thread cannot reach those connections: they're thread-local and created lazily; enumerating/recycling them would require a shared connection registry and cross-thread synchronization, i.e. re-architecting the model. |
Problem
electrs holds long-lived TCP connections to bitcoind for the whole process lifetime, only re-establishing them on error. When electrs reaches the daemon through a fronting load balancer a connection established before a rotation keeps routing to its original backend via the existing TCP/conntrack flow. Even after that backend is rotated out (or a closer/healthier one appears), a connection that is still "working" is never rebalanced, so it can stay pinned to a stale or sub-optimal endpoint indefinitely.
Note:
daemon_rpc_addris resolved to aSocketAddronce at startup (config::str_to_socketaddr). For a ClusterSetIP that address is a stable virtual IP, so the fix here is to recycle the TCP connection (forcing the LB to re-select a backend), not to re-resolve DNS. If true DNS re-resolution is ever needed for a different deployment, that's a follow-up.Change
Adds a
--daemon-rpc-conn-max-age <seconds>option:Connectionrecords when it was established. Before sending a request, if the connection is older than the configured max age it is proactively recycled (reconnect()), giving the load balancer a fresh backend selection.0= unlimited / never recycle, so behavior is unchanged unless an operator opts in.requests_iter), which are the ones that fan out across backends.Files
src/config.rs— new CLI flag +daemon_conn_max_age: Option<Duration>(None when 0).src/daemon.rs—Connectiontracksestablished/max_age, addsis_expired(), andcall_jsonrpcrecycles expired connections; threaded throughDaemon.src/bin/electrs.rs,src/bin/tx-fingerprint-stats.rs— pass the new value.Testing
cargo check --binsandcargo check --bins --features liquidboth pass (only pre-existing warnings).call_jsonrpcthe right spot vs. a background timer, and should the value also accept a human duration like30m).