contractcourt: retry resolveContract with exponential backoff on transient errors (fixes #10668) #10682
Conversation
…errors

Currently, the resolveContract goroutine permanently exits on any non-shutdown error returned by Resolve(). This means that a transient backend disruption (e.g., a bitcoind restart, ZMQ reconnection) can permanently kill the resolver goroutine, leaving HTLC outputs unwatched and funds at risk. The state machine remains in StateWaitingFullResolution because the contract is never marked resolved in the database, but no goroutine is actively working on the resolution. The contract becomes permanently stuck until the next lnd restart, at which point it will be re-initialized from the database and resolvers re-launched.

This commit introduces a resolveWithRetry helper that wraps the Resolve() call with exponential backoff retry logic. Key design decisions:

- Initial backoff of 1 second, doubling each attempt, capped at 5 minutes to avoid excessive delays for time-sensitive HTLC resolutions.
- Only errResolverShuttingDown causes an immediate exit; all other errors are treated as potentially transient and trigger a retry.
- The quit channel is respected during backoff waits, allowing clean shutdown even mid-backoff.
- Uses errors.Is() for proper error chain unwinding (some resolvers wrap errResolverShuttingDown with fmt.Errorf).

Fixes: lightningnetwork#10668
…sient errors

Fixes lightningnetwork#10668

Previously, resolveContract would permanently exit its goroutine on any non-shutdown error from resolver.Resolve(). This made transient failures (e.g., network timeouts, brief consensus conflicts) fatal for individual channels' contract resolution.

Add exponential backoff retry (5s initial, 5min max cap) that:

- Respects c.quit shutdown signal between retries
- Resets retry counter on successful Resolve()
- Logs retry attempts for observability
Summary

Fixes #10668

Problem

The `resolveContract` goroutine permanently exits when `Resolve()` returns a transient error (e.g., bitcoind restart, ZMQ disconnection). This means HTLC outputs that were being watched may never get resolved, a serious safety issue for funds at risk.

From the issue:

Changes

- Add a `resolveWithRetry()` method that wraps `Resolve()` with exponential backoff (1s initial, 5min max)
- Only `errResolverShuttingDown` causes an immediate exit; all other errors trigger a retry after backoff
- Replace the direct `Resolve()` call in `resolveContract` with `resolveWithRetry()`

Testing

- `go build ./contractcourt/` passes
- … `Resolve()` call