
server: replace connection manager with per-peer connection workers (#10640)

Draft
bitromortac wants to merge 14 commits into lightningnetwork:master from bitromortac:2602-peer-connection-workers

Conversation

@bitromortac
Collaborator

Fixes #10086.

Persistent peer connection management previously relied on the connection manager plus five server-side maps to track connection establishment state. Keeping these in sync is fragile and led to bugs where connection requests accumulated without bound or reconnection state diverged (see master...bitromortac:lnd:2602-persistent-connreq-accumulation-fixes for a problem analysis).

This PR replaces that with per-peer connWorker goroutines that each own the full dial/retry/backoff lifecycle for a single persistent peer. A worker's existence in the map is the sole indicator of persistence, so there is no external state to keep in sync.

Todo:
I appended two commits documenting issues that will be fixed in a later iteration; I wanted to open the PR now for broader visibility.

Add DialContext alongside existing Dial. The new function accepts a
context.Context and a context-aware dialer (DialContextFunc). The TCP
connect phase respects context cancellation natively via the dialer.
If the context is canceled during the Brontide handshake, a background
goroutine closes the underlying connection, causing handshake
reads/writes to fail immediately.

The existing Dial function is left unchanged — non-persistent paths
still use it.

Refactor nextPeerBackoff from a server method that reads
s.persistentPeersBackoff into a pure function peerBackoff that takes
the current backoff, start time, and configured bounds as arguments.
The server method becomes a thin wrapper. This prepares the function
for reuse by the connection worker without server access.

Define the vocabulary for per-peer connection workers in a new file
conn_worker.go:

- connWorkerCmd enum: cmdConnect, cmdUpdateAddrs, cmdStandDown, cmdStop
- connWorkerMsg struct: carries command with addresses and backoff
- connWorkerCfg struct: injected dependencies (dialer, callbacks, timing)
- connWorker struct: pubKeyStr, perm, cmdChan, backoff, addrs, cfg

No constructor or methods yet — compile-only commit establishing the
types that the run loop (next commit) will use.

Add the runtime for per-peer connection workers. A worker sits idle
until told to connect, then dials each known address in sequence with
a stagger delay, backing off exponentially on failure. Commands that
arrive mid-dial cancel the in-progress attempt immediately via context
cancellation and are processed without goroutine leaks.

This gives the server a single point of control per peer: sending
cmdConnect replaces whatever the worker is doing, cmdStandDown returns
it to idle (used when an inbound connection arrives), and cmdStop
tears it down permanently.

Introduce the server-side plumbing that connects the connWorker
goroutines to the rest of the server. Each persistent peer gets an
entry in the new persistentWorkers map; getOrCreateWorker lazily
creates workers and starts their Run loop under the server WaitGroup,
while stopWorker and sendWorkerCmd give callers a uniform way to
control them.

Wire cmdStandDown into InboundPeerConnected and OutboundPeerConnected
so that workers yield when the connection they are dialing for is
already being handled. Both old (connmgr) and new (worker) paths run
in parallel during the transition.

Replace the map writes (persistentPeers, persistentPeersBackoff,
persistentPeerAddrs) and connectToPersistentPeer/delayInitialReconnect
calls with getOrCreateWorker + cmdConnect.

Startup stagger is preserved: first numInstantInitReconnect peers get
cmdConnect immediately, the rest get a delayed send in a goroutine
(random 0-30s). The old maps are no longer written from this function;
other callers still use them during the transition period.

When the gossip handler receives a node announcement with new
addresses, we need to update the dial targets for any active
connection worker. Previously this required coordinating three
server maps: checking persistentPeers for membership, writing the
new addresses to persistentPeerAddrs, and inspecting
persistentConnReqs to decide whether to trigger a reconnect via
connectToPersistentPeer.

Replace all of that with a single cmdUpdateAddrs to the worker.
The worker updates its address list internally and, if it is
mid-dial, restarts the round with the new addresses. If no worker
exists for the peer (not persistent), sendWorkerCmd returns false
and the update is silently ignored — preserving the same filtering
the old persistentPeers map check provided.

When a persistent peer disconnects, peerTerminationWatcher must
schedule a reconnection after an appropriate backoff. Previously
this involved writing the computed backoff to persistentPeersBackoff,
merging advertised addresses into persistentPeerAddrs, creating a
retry cancel channel in persistentRetryCancels, and spawning an
untracked goroutine that slept for the backoff before calling
connectToPersistentPeer.

Replace all of that with a single cmdConnect carrying the address
list and backoff duration. The worker handles the timed wait
internally in dialLoop, where it also responds to preempting
commands (standDown, stop, address updates) — something the old
time.After goroutine could not do.

The persistentPeers membership check becomes a persistentWorkers
lookup. The backoff is computed via the pure peerBackoff function
reading the worker's current backoff, avoiding the old server map.

When a user calls ConnectToPeer with perm=true, the server must
create a persistent connection that survives disconnects. Previously
this wrote to persistentPeers, initialized persistentPeersBackoff,
appended a ConnReq to persistentConnReqs, and launched a connmgr
goroutine — all under the server lock. Repeated calls accumulated
duplicate ConnReqs for the same peer.

Replace with getOrCreateWorker followed by a single cmdConnect.
The worker is idempotent: sending cmdConnect to an already-dialing
worker cancels the current attempt and restarts with the new
addresses, so repeated calls converge on one active dial rather
than accumulating.

Add TestConnectToPeerAccumulation to verify that 10 rapid perm
calls produce exactly one worker.

Wire the remaining server methods into the worker map so that all
persistent-peer lifecycle operations flow through connWorker:

DisconnectPeer now calls stopWorker alongside the existing ConnReq
cancellation, ensuring the worker goroutine exits when a user
explicitly disconnects a peer.

prunePersistentPeerConnection and bannedPersistentPeerConnection
check the worker's perm flag and stop non-perm workers when the
last channel closes or the peer is banned, mirroring the existing
persistentPeers map logic.

The WatchNewChannel callback uses getOrCreateWorker to ensure a
worker exists when a new channel is opened with a peer. Since the
peer is already connected, no cmdConnect is sent — the worker
starts idle and will handle reconnection if the peer disconnects
later.

createBootstrapIgnorePeers iterates persistentWorkers instead of
persistentPeers so the autopilot ignores peers that have dedicated
workers.

Delete the five persistent-connection maps (persistentPeers,
persistentPeersBackoff, persistentPeerAddrs, persistentConnReqs,
persistentRetryCancels) and all code that operated on them, now that
per-peer connection workers handle the full dial/retry/backoff
lifecycle.

Removed functions: cancelConnReqs, connectToPersistentPeer,
delayInitialReconnect, nextPeerBackoff (server method).
Removed constants: UnassignedConnID, maxPersistentConnReqsPerPeer.
Removed tests: TestConnectToPersistentPeerGoroutineRace,
TestCancelConnReqsUnassignedID, TestOutboundPeerConnectedInboundCleanup
(the bugs they reproduced are structurally impossible with workers).

Cleaned up InboundPeerConnected and OutboundPeerConnected to remove
cancelConnReqs calls and the persistentConnReqs existence check.

When the worker's cmdChan buffer is full, stopWorker silently drops
cmdStop yet unconditionally removes the worker from persistentWorkers.
The test confirms the goroutine continues running after removal —
unreachable by the server until shutdown.

When ConnectToPeer (perm path) does a blocking send on w.cmdChan while
holding s.mu, it deadlocks if the worker is inside onConnection which
also acquires s.mu via OutboundPeerConnected. The test fills the channel
while the worker is stalled in onConnection and confirms the blocking
send cannot complete.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the persistent peer connection management within the server. The previous approach, which relied on multiple synchronized maps, was prone to race conditions and state divergence. The new design introduces dedicated connWorker goroutines for each persistent peer, centralizing their connection lifecycle management, including dialing, retries, and backoff. This change aims to create a more robust and less error-prone system for maintaining connections to critical peers.

Highlights

  • Connection Management Refactor: Replaced the centralized connection manager with a per-peer connWorker goroutine model to handle persistent peer connection, retry, and backoff logic. This consolidates state and improves robustness.
  • Context-Aware Brontide Dialing: Introduced a new DialContext function in the Brontide package, allowing for context cancellation during both the TCP connection and the Brontide handshake phases, preventing resource leaks on cancellation.
  • Improved State Management: Eliminated five loosely coupled maps previously used for tracking connection state, replacing them with a single persistentWorkers map where each entry encapsulates all necessary state for a given peer within its dedicated worker.
  • Deadlock Prevention: Addressed potential deadlock scenarios in connection handling by ensuring that the connWorker's command channel is buffered and that server-side operations interacting with workers do not block indefinitely while holding critical locks.
  • New Architecture Documentation: Added several Zettelkasten-style markdown documents detailing the new connection worker architecture, its execution flow, object context, and the context-cancellable Brontide dial mechanism.


Changelog
  • brontide/conn.go
    • Added context import.
    • Introduced DialContextFunc type for context-aware dialers.
    • Implemented DialContext function to support context cancellation during TCP connection and Brontide handshake.
  • brontide/conn_test.go
    • Added new test file conn_test.go.
    • Included tests for DialContext covering happy path, cancellation before dial, and cancellation during handshake.
  • conn_worker.go
    • Added new file conn_worker.go.
    • Defined connWorkerCmd and connWorkerMsg types for inter-goroutine communication.
    • Defined connWorkerCfg for worker configuration.
    • Implemented connWorker struct to manage a single persistent peer's connection lifecycle.
    • Implemented newConnWorker to create worker instances.
    • Implemented Run as the main event loop for the worker.
    • Implemented dialLoop to manage dial rounds, backoff, and command handling.
    • Defined dialResult and dialOutcome types for dial attempt results.
    • Implemented tryAllAddresses to iterate and dial peer addresses with staggering and context cancellation.
    • Implemented handleMidDial to process commands received during dial attempts.
  • conn_worker_test.go
    • Added new test file conn_worker_test.go.
    • Included helper functions testAddr, dialCall, newMockDialer, testWorkerHarness for testing connWorker.
    • Added tests for connWorker functionality, including dialing all addresses, stand-down, stop, address updates, backoff, idle state, quit signal, and rapid command cycling.
  • server.go
    • Replaced multiple persistentPeers related maps with a single persistentWorkers map.
    • Updated updatePersistentPeerAddrs to send cmdUpdateAddrs to the relevant connWorker.
    • Modified createBootstrapIgnorePeers to reference persistentWorkers instead of persistentPeers.
    • Refactored establishPersistentConnections to create and send cmdConnect messages to connWorker instances, incorporating initial stagger logic.
    • Removed delayInitialReconnect function.
    • Updated prunePersistentPeerConnection to use stopWorker for non-permanent peers.
    • Updated bannedPersistentPeerConnection to use stopWorker.
    • Added getOrCreateWorker helper to manage connWorker instances.
    • Added stopWorker helper to terminate a worker and remove it from the map.
    • Added sendWorkerCmd helper to send commands to a worker's channel.
    • Refactored nextPeerBackoff into a pure function peerBackoff.
    • Updated InboundPeerConnected to send cmdStandDown to the worker when an inbound connection is established.
    • Updated OutboundPeerConnected to send cmdStandDown to the worker if an inbound connection already exists or if the connection is scheduled.
    • Removed cancelConnReqs function and related logic.
    • Updated ConnectToPeer to use getOrCreateWorker and send cmdConnect for permanent connections.
    • Updated DisconnectPeer to use stopWorker.
    • Added a check in computeNextBackoff to handle zero margin values, preventing panics.
  • server_test.go
    • Added mockNet and mockConn for network mocking.
    • Introduced newTestServer and generateTestPubKey helpers for server testing.
    • Added TestConnectToPeerAccumulation to verify single worker creation for multiple ConnectToPeer calls.
    • Added TestPeerBackoff to test the new backoff calculation logic.
    • Added TestGetOrCreateWorker and TestStopWorker for worker management verification.
    • Added TestConnectToPeerDeadlockOnFullChannel to demonstrate and prevent a specific deadlock scenario.
    • Added TestStopWorkerDropsCmdOnFullChannel to test behavior when cmdStop is dropped due to a full channel.
    • Added TestSendWorkerCmdNoWorker to verify handling of non-existent workers.
  • zk/202602181000-Persistent-Connection-Workers.md
    • Added new documentation file explaining the new persistent connection worker architecture.
  • zk/202602181001-connworker-run-loop.md
    • Added new documentation file detailing the connWorker goroutine's run loop and state transitions.
  • zk/202602181002-connworker-object-context-diagram.md
    • Added new documentation file with a Mermaid diagram illustrating the object context and communication flow for connection workers.
  • zk/202602181003-connworker-execution-flow-diagram.md
    • Added new documentation file with a Mermaid diagram showing the detailed execution flow and error paths of a connWorker.
  • zk/202602181004-brontide-dial-context.md
    • Added new documentation file explaining the context-cancellable Brontide dial mechanism.
Activity
  • The author noted that two commits addressing related issues will be appended in a later iteration, indicating a phased approach to the overall solution.
  • The pull request was opened for broader visibility, suggesting an early review cycle for significant architectural changes.


@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors persistent peer connection management, replacing maps with dedicated per-peer connWorker goroutines, centralizing dial/retry/backoff logic, and improving state management. However, it introduces a significant deadlock risk in the main server mutex (s.mu) and a potential goroutine leak. The deadlock can occur when the server holds s.mu and performs a blocking send to a worker's command channel, while the worker is blocked on the same mutex during a callback. Additionally, the stopWorker method may fail to stop a worker if its command channel is full, leading to orphaned goroutines and resource leaks. These critical concurrency issues need to be addressed to ensure the new architecture is robust.

Comment on lines +4849 to 4852
	w.cmdChan <- connWorkerMsg{
		cmd:   cmdConnect,
		addrs: addrs,
	}

Severity: high (security)

A deadlock risk exists because the server's mutex (s.mu) is held while performing a blocking send to the worker's command channel (w.cmdChan). If the worker's channel is full and the worker attempts to acquire s.mu (e.g., in the onConnection callback), both goroutines will block indefinitely, leading to a high-severity Denial of Service (DoS) vulnerability. The mutex should be released before the potentially blocking channel send.

Suggested change:

	// Current:
	w.cmdChan <- connWorkerMsg{
		cmd:   cmdConnect,
		addrs: addrs,
	}

	// Suggested:
	w := s.getOrCreateWorker(targetPub, true, addrs)
	s.mu.Unlock()
	w.cmdChan <- connWorkerMsg{
		cmd:   cmdConnect,
		addrs: addrs,
	}

Collaborator Author


already documented as a repro test, to be addressed in the next iteration

Comment on lines +3749 to +3754
	select {
	case w.cmdChan <- msg:
	default:
		srvrLog.Warnf("Worker cmdChan full for peer %x, "+
			"command %v is not processed", pubStr, msg.cmd)
	}

Severity: high

Similar to stopWorker, this non-blocking send can drop commands if the worker's channel is full. Dropping commands like cmdUpdateAddrs can lead to an inconsistent state where the worker continues to operate with stale information. This could be problematic if, for example, a peer's address changes and the update is missed.

Collaborator Author


already documented as a repro test, to be addressed in the next iteration

Comment on lines +3730 to +3735
	select {
	case w.cmdChan <- connWorkerMsg{cmd: cmdStop}:
	default:
		srvrLog.Warnf("Worker cmdChan full for peer %x, "+
			"stop is not processed", pubStr)
	}

Severity: medium (security)

The stopWorker method's non-blocking send for cmdStop can lead to orphaned worker goroutines. If the worker's command channel is full, the cmdStop command is dropped, but the worker is still removed from the persistentWorkers map at line 3737. This orphans the worker goroutine, causing a resource leak and potential unexpected connection attempts. The cmdStop signal must be reliably delivered to prevent this.

Comment on lines +151 to +158
	// closeAndReturn is a helper to close the connection on handshake
	// failure. It signals the cancel goroutine first to avoid a
	// double-close race.
	closeAndReturn := func(err error) (*Conn, error) {
		close(done)
		b.conn.Close()
		return nil, err
	}

Severity: medium

There's a potential race condition that could lead to conn.Close() being called twice. If the context is canceled, the goroutine at line 143 will call conn.Close(). This will cause the subsequent I/O operation (e.g., conn.Write or io.ReadFull) to fail, which in turn will call closeAndReturn, leading to a second call to b.conn.Close(). While net.Conn.Close() is generally idempotent, using sync.Once would make this more robust and prevent the double close.

For example:

var once sync.Once
closeConn := func() {
    conn.Close()
}

// The goroutine becomes:
go func() {
    select {
    case <-ctx.Done():
        once.Do(closeConn)
    case <-done:
    }
}()

// And `closeAndReturn` becomes:
closeAndReturn := func(err error) (*Conn, error) {
    close(done)
    once.Do(closeConn)
    return nil, err
}

Comment on lines +183 to +187
	// TODO: Should we reset the backoff
	// timer here? Currently we restart the
	// full wait with the same duration. If
	// the addresses changed, dialing sooner
	// may be preferable.

Severity: medium

This TODO raises a valid point. When addresses are updated, it's often desirable to attempt a new connection immediately rather than waiting for the existing backoff timer to expire, especially if the old addresses were failing. Resetting the backoff or using a much shorter one could lead to faster connection establishment.

@saubyk added this to the v0.22.0 milestone Mar 10, 2026
@saubyk added this to lnd v0.22 Mar 30, 2026
@github-project-automation (bot) moved this to Backlog in lnd v0.22 Mar 30, 2026
@saubyk moved this from Backlog to In progress in lnd v0.22 Mar 30, 2026

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

[bug]: PeerConnection accumulation of ConnRequests for single peers not recycled

2 participants