server: replace connection manager with per-peer connection workers #10640
bitromortac wants to merge 14 commits into lightningnetwork:master
Conversation
Add DialContext alongside existing Dial. The new function accepts a context.Context and a context-aware dialer (DialContextFunc). The TCP connect phase respects context cancellation natively via the dialer. If the context is canceled during the Brontide handshake, a background goroutine closes the underlying connection, causing handshake reads/writes to fail immediately. The existing Dial function is left unchanged — non-persistent paths still use it.
Refactor nextPeerBackoff from a server method that reads s.persistentPeersBackoff into a pure function peerBackoff that takes the current backoff, start time, and configured bounds as arguments. The server method becomes a thin wrapper. This prepares the function for reuse by the connection worker without server access.
Define the vocabulary for per-peer connection workers in a new file conn_worker.go: - connWorkerCmd enum: cmdConnect, cmdUpdateAddrs, cmdStandDown, cmdStop - connWorkerMsg struct: carries command with addresses and backoff - connWorkerCfg struct: injected dependencies (dialer, callbacks, timing) - connWorker struct: pubKeyStr, perm, cmdChan, backoff, addrs, cfg No constructor or methods yet — compile-only commit establishing the types that the run loop (next commit) will use.
Add the runtime for per-peer connection workers. A worker sits idle until told to connect, then dials each known address in sequence with a stagger delay, backing off exponentially on failure. Commands that arrive mid-dial cancel the in-progress attempt immediately via context cancellation and are processed without goroutine leaks. This gives the server a single point of control per peer: sending cmdConnect replaces whatever the worker is doing, cmdStandDown returns it to idle (used when an inbound connection arrives), and cmdStop tears it down permanently.
Introduce the server-side plumbing that connects the connWorker goroutines to the rest of the server. Each persistent peer gets an entry in the new persistentWorkers map; getOrCreateWorker lazily creates workers and starts their Run loop under the server WaitGroup, while stopWorker and sendWorkerCmd give callers a uniform way to control them. Wire cmdStandDown into InboundPeerConnected and OutboundPeerConnected so that workers yield when the connection they are dialing for is already being handled. Both old (connmgr) and new (worker) paths run in parallel during the transition.
Replace the map writes (persistentPeers, persistentPeersBackoff, persistentPeerAddrs) and connectToPersistentPeer/delayInitialReconnect calls with getOrCreateWorker + cmdConnect. Startup stagger is preserved: first numInstantInitReconnect peers get cmdConnect immediately, the rest get a delayed send in a goroutine (random 0-30s). The old maps are no longer written from this function; other callers still use them during the transition period.
When the gossip handler receives a node announcement with new addresses, we need to update the dial targets for any active connection worker. Previously this required coordinating three server maps: checking persistentPeers for membership, writing the new addresses to persistentPeerAddrs, and inspecting persistentConnReqs to decide whether to trigger a reconnect via connectToPersistentPeer. Replace all of that with a single cmdUpdateAddrs to the worker. The worker updates its address list internally and, if it is mid-dial, restarts the round with the new addresses. If no worker exists for the peer (not persistent), sendWorkerCmd returns false and the update is silently ignored — preserving the same filtering the old persistentPeers map check provided.
When a persistent peer disconnects, peerTerminationWatcher must schedule a reconnection after an appropriate backoff. Previously this involved writing the computed backoff to persistentPeersBackoff, merging advertised addresses into persistentPeerAddrs, creating a retry cancel channel in persistentRetryCancels, and spawning an untracked goroutine that slept for the backoff before calling connectToPersistentPeer. Replace all of that with a single cmdConnect carrying the address list and backoff duration. The worker handles the timed wait internally in dialLoop, where it also responds to preempting commands (standDown, stop, address updates) — something the old time.After goroutine could not do. The persistentPeers membership check becomes a persistentWorkers lookup. The backoff is computed via the pure peerBackoff function reading the worker's current backoff, avoiding the old server map.
When a user calls ConnectToPeer with perm=true, the server must create a persistent connection that survives disconnects. Previously this wrote to persistentPeers, initialized persistentPeersBackoff, appended a ConnReq to persistentConnReqs, and launched a connmgr goroutine — all under the server lock. Repeated calls accumulated duplicate ConnReqs for the same peer. Replace with getOrCreateWorker followed by a single cmdConnect. The worker is idempotent: sending cmdConnect to an already-dialing worker cancels the current attempt and restarts with the new addresses, so repeated calls converge on one active dial rather than accumulating. Add TestConnectToPeerAccumulation to verify that 10 rapid perm calls produce exactly one worker.
Wire the remaining server methods into the worker map so that all persistent-peer lifecycle operations flow through connWorker: DisconnectPeer now calls stopWorker alongside the existing ConnReq cancellation, ensuring the worker goroutine exits when a user explicitly disconnects a peer. prunePersistentPeerConnection and bannedPersistentPeerConnection check the worker's perm flag and stop non-perm workers when the last channel closes or the peer is banned, mirroring the existing persistentPeers map logic. The WatchNewChannel callback uses getOrCreateWorker to ensure a worker exists when a new channel is opened with a peer. Since the peer is already connected, no cmdConnect is sent — the worker starts idle and will handle reconnection if the peer disconnects later. createBootstrapIgnorePeers iterates persistentWorkers instead of persistentPeers so the autopilot ignores peers that have dedicated workers.
Delete the five persistent-connection maps (persistentPeers, persistentPeersBackoff, persistentPeerAddrs, persistentConnReqs, persistentRetryCancels) and all code that operated on them, now that per-peer connection workers handle the full dial/retry/backoff lifecycle. Removed functions: cancelConnReqs, connectToPersistentPeer, delayInitialReconnect, nextPeerBackoff (server method). Removed constants: UnassignedConnID, maxPersistentConnReqsPerPeer. Removed tests: TestConnectToPersistentPeerGoroutineRace, TestCancelConnReqsUnassignedID, TestOutboundPeerConnectedInboundCleanup (the bugs they reproduced are structurally impossible with workers). Cleaned up InboundPeerConnected and OutboundPeerConnected to remove cancelConnReqs calls and the persistentConnReqs existence check.
When the worker's cmdChan buffer is full, stopWorker silently drops cmdStop yet unconditionally removes the worker from persistentWorkers. The test confirms the goroutine continues running after removal — unreachable by the server until shutdown.
When ConnectToPeer (perm path) does a blocking send on w.cmdChan while holding s.mu, it deadlocks if the worker is inside onConnection which also acquires s.mu via OutboundPeerConnected. The test fills the channel while the worker is stalled in onConnection and confirms the blocking send cannot complete.
Summary of Changes (Gemini Code Assist): This pull request significantly overhauls the persistent peer connection management within the server. The previous approach, which relied on multiple synchronized maps, was prone to race conditions and state divergence. The new design introduces dedicated per-peer connection workers.
Code Review
This pull request refactors persistent peer connection management, replacing maps with dedicated per-peer connWorker goroutines, centralizing dial/retry/backoff logic, and improving state management. However, it introduces a significant deadlock risk in the main server mutex (s.mu) and a potential goroutine leak. The deadlock can occur when the server holds s.mu and performs a blocking send to a worker's command channel, while the worker is blocked on the same mutex during a callback. Additionally, the stopWorker method may fail to stop a worker if its command channel is full, leading to orphaned goroutines and resource leaks. These critical concurrency issues need to be addressed to ensure the new architecture is robust.
w.cmdChan <- connWorkerMsg{
	cmd:   cmdConnect,
	addrs: addrs,
}
A deadlock risk exists because the server's mutex (s.mu) is held while performing a blocking send to the worker's command channel (w.cmdChan). If the worker's channel is full and the worker attempts to acquire s.mu (e.g., in the onConnection callback), both goroutines will block indefinitely, leading to a high-severity Denial of Service (DoS) vulnerability. The mutex should be released before the potentially blocking channel send.
Suggested change:

w := s.getOrCreateWorker(targetPub, true, addrs)
s.mu.Unlock()
w.cmdChan <- connWorkerMsg{
	cmd:   cmdConnect,
	addrs: addrs,
}
already documented as a repro test, to be addressed in the next iteration
select {
case w.cmdChan <- msg:
default:
	srvrLog.Warnf("Worker cmdChan full for peer %x, "+
		"command %v is not processed", pubStr, msg.cmd)
}
Similar to stopWorker, this non-blocking send can drop commands if the worker's channel is full. Dropping commands like cmdUpdateAddrs can lead to an inconsistent state where the worker continues to operate with stale information. This could be problematic if, for example, a peer's address changes and the update is missed.
already documented as a repro test, to be addressed in the next iteration
select {
case w.cmdChan <- connWorkerMsg{cmd: cmdStop}:
default:
	srvrLog.Warnf("Worker cmdChan full for peer %x, "+
		"stop is not processed", pubStr)
}
The stopWorker method's non-blocking send for cmdStop can lead to orphaned worker goroutines. If the worker's command channel is full, the cmdStop command is dropped, but the worker is still removed from the persistentWorkers map at line 3737. This orphans the worker goroutine, causing a resource leak and potential unexpected connection attempts. The cmdStop signal must be reliably delivered to prevent this.
// closeAndReturn is a helper to close the connection on handshake
// failure. It signals the cancel goroutine first to avoid a
// double-close race.
closeAndReturn := func(err error) (*Conn, error) {
	close(done)
	b.conn.Close()
	return nil, err
}
There's a potential race condition that could lead to conn.Close() being called twice. If the context is canceled, the goroutine at line 143 will call conn.Close(). This will cause the subsequent I/O operation (e.g., conn.Write or io.ReadFull) to fail, which in turn will call closeAndReturn, leading to a second call to b.conn.Close(). While net.Conn.Close() is generally idempotent, using sync.Once would make this more robust and prevent the double close.
For example:
var once sync.Once
closeConn := func() {
conn.Close()
}
// The goroutine becomes:
go func() {
select {
case <-ctx.Done():
once.Do(closeConn)
case <-done:
}
}()
// And `closeAndReturn` becomes:
closeAndReturn := func(err error) (*Conn, error) {
close(done)
once.Do(closeConn)
return nil, err
}

// TODO: Should we reset the backoff
// timer here? Currently we restart the
// full wait with the same duration. If
// the addresses changed, dialing sooner
// may be preferable.
This TODO raises a valid point. When addresses are updated, it's often desirable to attempt a new connection immediately rather than waiting for the existing backoff timer to expire, especially if the old addresses were failing. Resetting the backoff or using a much shorter one could lead to faster connection establishment.
Fixes #10086.
Persistent peer connection management previously relied on the connection manager plus five server-side maps to track connection establishment state. Keeping these in sync is fragile and leads to bugs where connection requests accumulated without bound or reconnection state diverged (see master...bitromortac:lnd:2602-persistent-connreq-accumulation-fixes for a problem analysis).
This PR replaces that with per-peer connWorker goroutines that each own the full dial/retry/backoff lifecycle for a single persistent peer. A worker's existence in the map is the sole indicator of persistence, so there is no external state to keep in sync.

Todo: I appended two commits with issues that will be fixed in a later iteration; I just wanted to open the PR for broader visibility.