Implement Exponential Backoff for Transient Sync Errors#588
Conversation
|
👋 Thanks for assigning @tnull as a reviewer! |
tnull
left a comment
There was a problem hiding this comment.
Thanks!
CI is unhappy though, please check that the code compiles before pushing.
| log_error!(logger, "Failed to synchronize chain listeners: {:?}", e); | ||
| tokio::time::sleep(Duration::from_secs(CHAIN_POLLING_INTERVAL_SECS)) | ||
| .await; | ||
| backoff_delay_multiplier += 1; |
There was a problem hiding this comment.
I don't think we should let the backoff delay grow infinitely. Please include a sane upper bound.
There was a problem hiding this comment.
Got it, I am adding a max backoff delay of 300 secs for this
| } else { | ||
| log_error!( | ||
| logger, | ||
| "Failed to synchronize chain listeners: {:?}, e" |
There was a problem hiding this comment.
This doesn't compile. Also, as discussed elsewhere, we at least need to error out. But technically, we should keep retrying currently as otherwise we'd need to add panic here as permanently being unable to connect to the chain source is an unrecoverable error.
There was a problem hiding this comment.
This doesn't compile. Also, as discussed elsewhere, we at least need to error out. But technically, we should keep retrying currently as otherwise we'd need to add panic here as permanently being unable to connect to the chain source is an unrecoverable error.
@tnull As you suggested that panicking isnt best and not being able to error out I stuck with retrying but with the delay of 300 secs. Exponential backoff from CHAIN_POLLING_INTERVAL_SECS was only introduced for transient errors
Yeah, just saw that, would check it out now.. |
tnull
left a comment
There was a problem hiding this comment.
Please also make sure to always provide rationale for the change in the commit message, and format it properly (heading shouldn't be larger than one line). Feel free to refer to https://cbea.ms/git-commit/ on how to write good commit messages.
Previously, chain synchronization failures would retry immediately without any delay, which could lead to tight retry loops and high CPU usage during failures. This change introduces exponential backoff for transient errors, starting at 2 seconds and doubling each time up to a maximum of 300 seconds. Persistent errors also now delay retries by the maximum backoff duration to prevent rapid loops while maintaining eventual recovery. Fixes lightningdevkit#587
Previously, the background reconnection task retried every persisted peer on a fixed 60s interval with no backoff, so an unreachable peer was retried indefinitely at the same cadence — log spam and wasted work. This became more visible after lightningdevkit#895 retained peers across force-closes so that channel_reestablish recovery can run. Track per-peer reconnect state in ConnectionManager: on failure, double the retry interval up to PEER_RECONNECTION_MAX_INTERVAL (30 min); on success (including user-initiated connects), clear the state so a subsequent drop retries promptly. The 60s tokio::time::interval is kept as the wakeup, gated per-peer by next_retry_at, since lightningdevkit#588's inline-sleep form does not generalize to N peers. Backoff state is in-memory and resets on restart — a fresh post-restart attempt is the correct behavior. State is also cleared when a peer is removed from the persisted store. Closes lightningdevkit#918.
This PR attempts to improve retry in the bitcoind RPC synchronization loop from #587 by replacing the existing linear backoff with a proper exponential backoff strategy. We could also add a maximum delay to prevent excessively long waits, but I am not sure if it would really help.