Skip to content

Implement Exponential Backoff for Transient Sync Errors#588

Merged
tnull merged 1 commit into
lightningdevkit:mainfrom
tosynthegeek:main
Jun 30, 2025
Merged

Implement Exponential Backoff for Transient Sync Errors#588
tnull merged 1 commit into
lightningdevkit:mainfrom
tosynthegeek:main

Conversation

@tosynthegeek

Copy link
Copy Markdown
Contributor

This PR attempts to improve retry in the bitcoind RPC synchronization loop from #587 by replacing the existing linear backoff with a proper exponential backoff strategy. We could also add a maximum delay to prevent excessively long waits, but I am not sure if it would really help.

@ldk-reviews-bot

ldk-reviews-bot commented Jun 26, 2025

Copy link
Copy Markdown

👋 Thanks for assigning @tnull as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

CI is unhappy though, please check that the code compiles before pushing.

Comment thread src/chain/mod.rs Outdated
log_error!(logger, "Failed to synchronize chain listeners: {:?}", e);
tokio::time::sleep(Duration::from_secs(CHAIN_POLLING_INTERVAL_SECS))
.await;
backoff_delay_multiplier += 1;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should let the backoff delay grow infinitely. Please include a sane upper bound.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I am adding a max backoff delay of 300 secs for this

Comment thread src/chain/mod.rs Outdated
} else {
log_error!(
logger,
"Failed to synchronize chain listeners: {:?}, e"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't compile. Also, as discussed elsewhere, we at least need to error out. But technically, we should keep retrying currently as otherwise we'd need to add panic here as permanently being unable to connect to the chain source is an unrecoverable error.

@tosynthegeek tosynthegeek Jun 28, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't compile. Also, as discussed elsewhere, we at least need to error out. But technically, we should keep retrying currently as otherwise we'd need to add panic here as permanently being unable to connect to the chain source is an unrecoverable error.

@tnull As you suggested that panicking isnt best and not being able to error out I stuck with retrying but with the delay of 300 secs. Exponential backoff from CHAIN_POLLING_INTERVAL_SECS was only introduced for transient errors

@tosynthegeek

Copy link
Copy Markdown
Contributor Author

Thanks!

CI is unhappy though, please check that the code compiles before pushing.

Yeah, just saw that, would check it out now..

@tnull tnull left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also make sure to always provide rationale for the change in the commit message, and format it properly (heading shouldn't be larger than one line). Feel free to refer to https://cbea.ms/git-commit/ on how to write good commit messages.

Previously, chain synchronization failures would retry immediately
without any delay, which could lead to tight retry loops and high
CPU usage during failures.

This change introduces exponential backoff for transient
errors, starting at 2 seconds and doubling each time up to a maximum
of 300 seconds. Persistent errors also now delay retries by the
maximum backoff duration to prevent rapid loops while maintaining
eventual recovery.

Fixes lightningdevkit#587

@tnull tnull left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@tnull tnull merged commit 775e0db into lightningdevkit:main Jun 30, 2025
20 of 24 checks passed
Jolah1 added a commit to Jolah1/ldk-node that referenced this pull request Jun 10, 2026
Previously, the background reconnection task retried every persisted peer
on a fixed 60s interval with no backoff, so an unreachable peer was retried
indefinitely at the same cadence — log spam and wasted work. This became
more visible after lightningdevkit#895 retained peers across force-closes so that
channel_reestablish recovery can run.

Track per-peer reconnect state in ConnectionManager: on failure, double the
retry interval up to PEER_RECONNECTION_MAX_INTERVAL (30 min); on success
(including user-initiated connects), clear the state so a subsequent drop
retries promptly. The 60s tokio::time::interval is kept as the wakeup,
gated per-peer by next_retry_at, since lightningdevkit#588's inline-sleep form does not
generalize to N peers. Backoff state is in-memory and resets on restart —
a fresh post-restart attempt is the correct behavior. State is also
cleared when a peer is removed from the persisted store.

Closes lightningdevkit#918.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants