
server: replace connection manager with per-peer connection workers (#10640)

Draft
bitromortac wants to merge 14 commits into lightningnetwork:master from bitromortac:2602-peer-connection-workers

Conversation

@bitromortac
Collaborator

Fixes #10086.

Persistent peer connection management previously relied on the connection manager plus five server-side maps to track connection establishment state. Keeping these in sync is fragile and led to bugs where connection requests accumulated without bound or reconnection state diverged (see master...bitromortac:lnd:2602-persistent-connreq-accumulation-fixes for a problem analysis).

This PR replaces that with per-peer connWorker goroutines that each own the full dial/retry/backoff lifecycle for a single persistent peer. A worker's existence in the map is the sole indicator of persistence, so there is no external state to keep in sync.

Todo:
I appended two commits documenting issues that will be fixed in a later iteration; I wanted to open the PR now for broader visibility.

Add DialContext alongside existing Dial. The new function accepts a
context.Context and a context-aware dialer (DialContextFunc). The TCP
connect phase respects context cancellation natively via the dialer.
If the context is canceled during the Brontide handshake, a background
goroutine closes the underlying connection, causing handshake
reads/writes to fail immediately.

The existing Dial function is left unchanged — non-persistent paths
still use it.

Refactor nextPeerBackoff from a server method that reads
s.persistentPeersBackoff into a pure function peerBackoff that takes
the current backoff, start time, and configured bounds as arguments.
The server method becomes a thin wrapper. This prepares the function
for reuse by the connection worker without server access.

Define the vocabulary for per-peer connection workers in a new file
conn_worker.go:

- connWorkerCmd enum: cmdConnect, cmdUpdateAddrs, cmdStandDown, cmdStop
- connWorkerMsg struct: carries command with addresses and backoff
- connWorkerCfg struct: injected dependencies (dialer, callbacks, timing)
- connWorker struct: pubKeyStr, perm, cmdChan, backoff, addrs, cfg

No constructor or methods yet — compile-only commit establishing the
types that the run loop (next commit) will use.

Add the runtime for per-peer connection workers. A worker sits idle
until told to connect, then dials each known address in sequence with
a stagger delay, backing off exponentially on failure. Commands that
arrive mid-dial cancel the in-progress attempt immediately via context
cancellation and are processed without goroutine leaks.

This gives the server a single point of control per peer: sending
cmdConnect replaces whatever the worker is doing, cmdStandDown returns
it to idle (used when an inbound connection arrives), and cmdStop
tears it down permanently.

Introduce the server-side plumbing that connects the connWorker
goroutines to the rest of the server. Each persistent peer gets an
entry in the new persistentWorkers map; getOrCreateWorker lazily
creates workers and starts their Run loop under the server WaitGroup,
while stopWorker and sendWorkerCmd give callers a uniform way to
control them.

Wire cmdStandDown into InboundPeerConnected and OutboundPeerConnected
so that workers yield when the connection they are dialing for is
already being handled. Both old (connmgr) and new (worker) paths run
in parallel during the transition.

Replace the map writes (persistentPeers, persistentPeersBackoff,
persistentPeerAddrs) and connectToPersistentPeer/delayInitialReconnect
calls with getOrCreateWorker + cmdConnect.

Startup stagger is preserved: first numInstantInitReconnect peers get
cmdConnect immediately, the rest get a delayed send in a goroutine
(random 0-30s). The old maps are no longer written from this function;
other callers still use them during the transition period.

When the gossip handler receives a node announcement with new
addresses, we need to update the dial targets for any active
connection worker. Previously this required coordinating three
server maps: checking persistentPeers for membership, writing the
new addresses to persistentPeerAddrs, and inspecting
persistentConnReqs to decide whether to trigger a reconnect via
connectToPersistentPeer.

Replace all of that with a single cmdUpdateAddrs to the worker.
The worker updates its address list internally and, if it is
mid-dial, restarts the round with the new addresses. If no worker
exists for the peer (not persistent), sendWorkerCmd returns false
and the update is silently ignored — preserving the same filtering
the old persistentPeers map check provided.

When a persistent peer disconnects, peerTerminationWatcher must
schedule a reconnection after an appropriate backoff. Previously
this involved writing the computed backoff to persistentPeersBackoff,
merging advertised addresses into persistentPeerAddrs, creating a
retry cancel channel in persistentRetryCancels, and spawning an
untracked goroutine that slept for the backoff before calling
connectToPersistentPeer.

Replace all of that with a single cmdConnect carrying the address
list and backoff duration. The worker handles the timed wait
internally in dialLoop, where it also responds to preempting
commands (standDown, stop, address updates) — something the old
time.After goroutine could not do.

The persistentPeers membership check becomes a persistentWorkers
lookup. The backoff is computed via the pure peerBackoff function
reading the worker's current backoff, avoiding the old server map.

When a user calls ConnectToPeer with perm=true, the server must
create a persistent connection that survives disconnects. Previously
this wrote to persistentPeers, initialized persistentPeersBackoff,
appended a ConnReq to persistentConnReqs, and launched a connmgr
goroutine — all under the server lock. Repeated calls accumulated
duplicate ConnReqs for the same peer.

Replace with getOrCreateWorker followed by a single cmdConnect.
The worker is idempotent: sending cmdConnect to an already-dialing
worker cancels the current attempt and restarts with the new
addresses, so repeated calls converge on one active dial rather
than accumulating.

Add TestConnectToPeerAccumulation to verify that 10 rapid perm
calls produce exactly one worker.

Wire the remaining server methods into the worker map so that all
persistent-peer lifecycle operations flow through connWorker:

DisconnectPeer now calls stopWorker alongside the existing ConnReq
cancellation, ensuring the worker goroutine exits when a user
explicitly disconnects a peer.

prunePersistentPeerConnection and bannedPersistentPeerConnection
check the worker's perm flag and stop non-perm workers when the
last channel closes or the peer is banned, mirroring the existing
persistentPeers map logic.

The WatchNewChannel callback uses getOrCreateWorker to ensure a
worker exists when a new channel is opened with a peer. Since the
peer is already connected, no cmdConnect is sent — the worker
starts idle and will handle reconnection if the peer disconnects
later.

createBootstrapIgnorePeers iterates persistentWorkers instead of
persistentPeers so the autopilot ignores peers that have dedicated
workers.

Delete the five persistent-connection maps (persistentPeers,
persistentPeersBackoff, persistentPeerAddrs, persistentConnReqs,
persistentRetryCancels) and all code that operated on them, now that
per-peer connection workers handle the full dial/retry/backoff
lifecycle.

Removed functions: cancelConnReqs, connectToPersistentPeer,
delayInitialReconnect, nextPeerBackoff (server method).
Removed constants: UnassignedConnID, maxPersistentConnReqsPerPeer.
Removed tests: TestConnectToPersistentPeerGoroutineRace,
TestCancelConnReqsUnassignedID, TestOutboundPeerConnectedInboundCleanup
(the bugs they reproduced are structurally impossible with workers).

Cleaned up InboundPeerConnected and OutboundPeerConnected to remove
cancelConnReqs calls and the persistentConnReqs existence check.

When the worker's cmdChan buffer is full, stopWorker silently drops
cmdStop yet unconditionally removes the worker from persistentWorkers.
The test confirms the goroutine continues running after removal —
unreachable by the server until shutdown.

When ConnectToPeer (perm path) does a blocking send on w.cmdChan while
holding s.mu, it deadlocks if the worker is inside onConnection which
also acquires s.mu via OutboundPeerConnected. The test fills the channel
while the worker is stalled in onConnection and confirms the blocking
send cannot complete.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the persistent peer connection management within the server. The previous approach, which relied on multiple synchronized maps, was prone to race conditions and state divergence. The new design introduces dedicated connWorker goroutines for each persistent peer, centralizing their connection lifecycle management, including dialing, retries, and backoff. This change aims to create a more robust and less error-prone system for maintaining connections to critical peers.

Highlights

  • Connection Management Refactor: Replaced the centralized connection manager with a per-peer connWorker goroutine model to handle persistent peer connection, retry, and backoff logic. This consolidates state and improves robustness.
  • Context-Aware Brontide Dialing: Introduced a new DialContext function in the Brontide package, allowing for context cancellation during both the TCP connection and the Brontide handshake phases, preventing resource leaks on cancellation.
  • Improved State Management: Eliminated five loosely coupled maps previously used for tracking connection state, replacing them with a single persistentWorkers map where each entry encapsulates all necessary state for a given peer within its dedicated worker.
  • Deadlock Prevention: Addressed potential deadlock scenarios in connection handling by ensuring that the connWorker's command channel is buffered and that server-side operations interacting with workers do not block indefinitely while holding critical locks.
  • New Architecture Documentation: Added several Zettelkasten-style markdown documents detailing the new connection worker architecture, its execution flow, object context, and the context-cancellable Brontide dial mechanism.


Changelog
  • brontide/conn.go
    • Added context import.
    • Introduced DialContextFunc type for context-aware dialers.
    • Implemented DialContext function to support context cancellation during TCP connection and Brontide handshake.
  • brontide/conn_test.go
    • Added new test file conn_test.go.
    • Included tests for DialContext covering happy path, cancellation before dial, and cancellation during handshake.
  • conn_worker.go
    • Added new file conn_worker.go.
    • Defined connWorkerCmd and connWorkerMsg types for inter-goroutine communication.
    • Defined connWorkerCfg for worker configuration.
    • Implemented connWorker struct to manage a single persistent peer's connection lifecycle.
    • Implemented newConnWorker to create worker instances.
    • Implemented Run as the main event loop for the worker.
    • Implemented dialLoop to manage dial rounds, backoff, and command handling.
    • Defined dialResult and dialOutcome types for dial attempt results.
    • Implemented tryAllAddresses to iterate and dial peer addresses with staggering and context cancellation.
    • Implemented handleMidDial to process commands received during dial attempts.
  • conn_worker_test.go
    • Added new test file conn_worker_test.go.
    • Included helper functions testAddr, dialCall, newMockDialer, testWorkerHarness for testing connWorker.
    • Added tests for connWorker functionality, including dialing all addresses, stand-down, stop, address updates, backoff, idle state, quit signal, and rapid command cycling.
  • server.go
    • Replaced multiple persistentPeers related maps with a single persistentWorkers map.
    • Updated updatePersistentPeerAddrs to send cmdUpdateAddrs to the relevant connWorker.
    • Modified createBootstrapIgnorePeers to reference persistentWorkers instead of persistentPeers.
    • Refactored establishPersistentConnections to create and send cmdConnect messages to connWorker instances, incorporating initial stagger logic.
    • Removed delayInitialReconnect function.
    • Updated prunePersistentPeerConnection to use stopWorker for non-permanent peers.
    • Updated bannedPersistentPeerConnection to use stopWorker.
    • Added getOrCreateWorker helper to manage connWorker instances.
    • Added stopWorker helper to terminate a worker and remove it from the map.
    • Added sendWorkerCmd helper to send commands to a worker's channel.
    • Refactored nextPeerBackoff into a pure function peerBackoff.
    • Updated InboundPeerConnected to send cmdStandDown to the worker when an inbound connection is established.
    • Updated OutboundPeerConnected to send cmdStandDown to the worker if an inbound connection already exists or if the connection is scheduled.
    • Removed cancelConnReqs function and related logic.
    • Updated ConnectToPeer to use getOrCreateWorker and send cmdConnect for permanent connections.
    • Updated DisconnectPeer to use stopWorker.
    • Added a check in computeNextBackoff to handle zero margin values, preventing panics.
  • server_test.go
    • Added mockNet and mockConn for network mocking.
    • Introduced newTestServer and generateTestPubKey helpers for server testing.
    • Added TestConnectToPeerAccumulation to verify single worker creation for multiple ConnectToPeer calls.
    • Added TestPeerBackoff to test the new backoff calculation logic.
    • Added TestGetOrCreateWorker and TestStopWorker for worker management verification.
    • Added TestConnectToPeerDeadlockOnFullChannel to demonstrate and prevent a specific deadlock scenario.
    • Added TestStopWorkerDropsCmdOnFullChannel to test behavior when cmdStop is dropped due to a full channel.
    • Added TestSendWorkerCmdNoWorker to verify handling of non-existent workers.
  • zk/202602181000-Persistent-Connection-Workers.md
    • Added new documentation file explaining the new persistent connection worker architecture.
  • zk/202602181001-connworker-run-loop.md
    • Added new documentation file detailing the connWorker goroutine's run loop and state transitions.
  • zk/202602181002-connworker-object-context-diagram.md
    • Added new documentation file with a Mermaid diagram illustrating the object context and communication flow for connection workers.
  • zk/202602181003-connworker-execution-flow-diagram.md
    • Added new documentation file with a Mermaid diagram showing the detailed execution flow and error paths of a connWorker.
  • zk/202602181004-brontide-dial-context.md
    • Added new documentation file explaining the context-cancellable Brontide dial mechanism.
Activity
  • The author noted that two commits addressing related issues will be appended in a later iteration, indicating a phased approach to the overall solution.
  • The pull request was opened for broader visibility, suggesting an early review cycle for significant architectural changes.


@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors persistent peer connection management, replacing maps with dedicated per-peer connWorker goroutines, centralizing dial/retry/backoff logic, and improving state management. However, it introduces a significant deadlock risk in the main server mutex (s.mu) and a potential goroutine leak. The deadlock can occur when the server holds s.mu and performs a blocking send to a worker's command channel, while the worker is blocked on the same mutex during a callback. Additionally, the stopWorker method may fail to stop a worker if its command channel is full, leading to orphaned goroutines and resource leaks. These critical concurrency issues need to be addressed to ensure the new architecture is robust.

Comment on lines +4849 to 4852
	w.cmdChan <- connWorkerMsg{
		cmd:   cmdConnect,
		addrs: addrs,
	}

Severity: high (security)

A deadlock risk exists because the server's mutex (s.mu) is held while performing a blocking send to the worker's command channel (w.cmdChan). If the worker's channel is full and the worker attempts to acquire s.mu (e.g., in the onConnection callback), both goroutines will block indefinitely, leading to a high-severity Denial of Service (DoS) vulnerability. The mutex should be released before the potentially blocking channel send.

Suggested change:

	// Current:
	w.cmdChan <- connWorkerMsg{
		cmd:   cmdConnect,
		addrs: addrs,
	}

	// Suggested:
	w := s.getOrCreateWorker(targetPub, true, addrs)
	s.mu.Unlock()
	w.cmdChan <- connWorkerMsg{
		cmd:   cmdConnect,
		addrs: addrs,
	}

Collaborator Author


already documented as a repro test, to be addressed in the next iteration

Comment on lines +3749 to +3754
	select {
	case w.cmdChan <- msg:
	default:
		srvrLog.Warnf("Worker cmdChan full for peer %x, "+
			"command %v is not processed", pubStr, msg.cmd)
	}

Severity: high

Similar to stopWorker, this non-blocking send can drop commands if the worker's channel is full. Dropping commands like cmdUpdateAddrs can lead to an inconsistent state where the worker continues to operate with stale information. This could be problematic if, for example, a peer's address changes and the update is missed.

Collaborator Author


already documented as a repro test, to be addressed in the next iteration

Comment on lines +3730 to +3735
	select {
	case w.cmdChan <- connWorkerMsg{cmd: cmdStop}:
	default:
		srvrLog.Warnf("Worker cmdChan full for peer %x, "+
			"stop is not processed", pubStr)
	}

Severity: medium (security)

The stopWorker method's non-blocking send for cmdStop can lead to orphaned worker goroutines. If the worker's command channel is full, the cmdStop command is dropped, but the worker is still removed from the persistentWorkers map at line 3737. This orphans the worker goroutine, causing a resource leak and potential unexpected connection attempts. The cmdStop signal must be reliably delivered to prevent this.

Comment on lines +151 to +158
	// closeAndReturn is a helper to close the connection on handshake
	// failure. It signals the cancel goroutine first to avoid a
	// double-close race.
	closeAndReturn := func(err error) (*Conn, error) {
		close(done)
		b.conn.Close()
		return nil, err
	}

Severity: medium

There's a potential race condition that could lead to conn.Close() being called twice. If the context is canceled, the goroutine at line 143 will call conn.Close(). This will cause the subsequent I/O operation (e.g., conn.Write or io.ReadFull) to fail, which in turn will call closeAndReturn, leading to a second call to b.conn.Close(). While net.Conn.Close() is generally idempotent, using sync.Once would make this more robust and prevent the double close.

For example:

var once sync.Once
closeConn := func() {
    conn.Close()
}

// The goroutine becomes:
go func() {
    select {
    case <-ctx.Done():
        once.Do(closeConn)
    case <-done:
    }
}()

// And `closeAndReturn` becomes:
closeAndReturn := func(err error) (*Conn, error) {
    close(done)
    once.Do(closeConn)
    return nil, err
}

Comment on lines +183 to +187
	// TODO: Should we reset the backoff
	// timer here? Currently we restart the
	// full wait with the same duration. If
	// the addresses changed, dialing sooner
	// may be preferable.

Severity: medium

This TODO raises a valid point. When addresses are updated, it's often desirable to attempt a new connection immediately rather than waiting for the existing backoff timer to expire, especially if the old addresses were failing. Resetting the backoff or using a much shorter one could lead to faster connection establishment.

@saubyk added this to the v0.22.0 milestone Mar 10, 2026
@saubyk added this to lnd v0.22 Mar 30, 2026
@github-project-automation (bot) moved this to Backlog in lnd v0.22 Mar 30, 2026
@saubyk moved this from Backlog to In progress in lnd v0.22 Mar 30, 2026

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

[bug]: PeerConnection accumulation of ConnRequests for single peers not recycled

2 participants