
contractcourt: retry resolveContract with exponential backoff on transient errors (fixes #10668) #10682

Open

ThomsenDrake wants to merge 2 commits into lightningnetwork:master from ThomsenDrake:fix/contractcourt-resolver-retry-10668



@ThomsenDrake

Summary

Fixes #10668

Problem

The resolveContract goroutine permanently exits when Resolve() returns a transient error (e.g., a bitcoind restart or ZMQ disconnection). As a result, HTLC outputs that were being watched may go unresolved until lnd is next restarted, a serious safety issue because funds are at risk.

From the issue:

When a transient error occurs during resolution (e.g., due to a backend restart), the resolver goroutine exits permanently instead of retrying.

Changes

  • Add resolveWithRetry() method that wraps Resolve() with exponential backoff (1s initial, 5min max)
  • Only errResolverShuttingDown is propagated; all other errors trigger a retry after backoff
  • Replace direct Resolve() call in resolveContract with resolveWithRetry()
  • Preserve existing quit channel semantics for clean shutdown

Testing

  • go build ./contractcourt/ passes
  • The logic is a straightforward retry wrapper around the existing Resolve() call
  • No change to the happy path: a Resolve() call that returns no error behaves exactly as before

ThomsenDrake and others added 2 commits March 27, 2026 11:45
…errors

Currently, the resolveContract goroutine permanently exits on any
non-shutdown error returned by Resolve(). This means that a transient
backend disruption (e.g., a bitcoind restart, ZMQ reconnection) can
permanently kill the resolver goroutine, leaving HTLC outputs unwatched
and funds at risk.

The state machine remains in StateWaitingFullResolution because the
contract is never marked resolved in the database, but no goroutine is
actively working on the resolution. The contract becomes permanently
stuck until the next lnd restart, at which point it will be
re-initialized from the database and resolvers re-launched.

This commit introduces a resolveWithRetry helper that wraps the
Resolve() call with exponential backoff retry logic. Key design
decisions:

- Initial backoff of 1 second, doubling each attempt, capped at 5
  minutes to avoid excessive delays for time-sensitive HTLC
  resolutions.
- Only errResolverShuttingDown causes an immediate exit; all other
  errors are treated as potentially transient and trigger a retry.
- The quit channel is respected during backoff waits, allowing clean
  shutdown even mid-backoff.
- Uses errors.Is() for proper error chain unwinding (some resolvers
  wrap errResolverShuttingDown with fmt.Errorf).

Fixes: lightningnetwork#10668
…sient errors

Fixes lightningnetwork#10668

Previously, resolveContract would permanently exit its goroutine on any
non-shutdown error from resolver.Resolve(). This made transient failures
(e.g., network timeouts, brief consensus conflicts) fatal for individual
channels' contract resolution.

Add exponential backoff retry (5s initial, 5min max cap) that:
- Respects c.quit shutdown signal between retries
- Resets retry counter on successful Resolve()
- Logs retry attempts for observability
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where the contract resolution process would terminate prematurely upon encountering transient errors. By implementing an exponential backoff strategy, the system now automatically retries failed resolution attempts, significantly improving the robustness of fund management and ensuring that HTLC outputs are consistently monitored despite temporary infrastructure instability.

Highlights

  • Exponential Backoff Implementation: Introduced a retry mechanism with exponential backoff for contract resolution, replacing the previous behavior where transient errors caused the goroutine to exit permanently.
  • Improved Reliability: Ensures that HTLC outputs remain watched during transient backend issues like bitcoind restarts or ZMQ disconnections, preventing potential fund loss.
  • Graceful Shutdown: Maintained existing quit channel semantics to ensure the resolver still respects shutdown signals even while waiting for a retry.



Development

Successfully merging this pull request may close these issues.

[bug]: contractcourt: resolveContract goroutine permanently exits on transient Resolve() errors
