fix(libp2p): bound Keep authentication handshake with SetDeadline#4008
Open
lrsaturnino wants to merge 1 commit into
Open
fix(libp2p): bound Keep authentication handshake with SetDeadline#4008lrsaturnino wants to merge 1 commit into
lrsaturnino wants to merge 1 commit into
Conversation
The post-TLS Keep authentication handshake on inbound and outbound secure connections previously ran with no read/write deadline on the underlying net.Conn. libp2p-TLS's 15s security-negotiation context budget covers only the TLS handshake itself; once TLS completes, cancelling the context is a no-op. An unauthenticated peer that completes TLS but then stalls Act 1/2/3 of the Keep handshake — by sending nothing or trickling partial bytes — therefore holds a go-libp2p resource-manager transient-inbound slot indefinitely. At the default ~48-slot capacity on 8 GiB hosts, saturating bootstrap peers' pools blocks new and restarting peers from joining the mesh, which is particularly damaging during mandatory upgrade restart windows when threshold-signing groups must reform. Introduce a package-level keepHandshakeTimeout (15s, matching go-libp2p's defaultAcceptTimeout) and an applyHandshakeDeadline(ctx, conn) helper that sets min(ctx.Deadline(), now+keepHandshakeTimeout) on the connection and returns a clear function that resets the deadline to zero. SecureInbound and SecureOutbound both invoke the helper immediately after the libp2p-TLS layer returns and defer the clear so successful post-handshake application I/O is unbounded, as before. The helper wraps SetDeadline errors with %w to preserve the error chain for callers that need errors.Is checks. Add pkg/net/libp2p/transport_test.go with seven unit tests covering the stall, partial-byte-trickle, ctx-deadline-honoring, success-clears-deadline, fast-handshake-no-spurious-timeout, symmetric outbound stall, and direct-helper-on-closed-conn paths. Tests use an injected mock encryptionLayer to isolate the Keep handshake from real libp2p-TLS over net.Pipe and override keepHandshakeTimeout via t.Cleanup so test isolation is preserved. No protocol or wire-format changes. No new dependencies.
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The post-TLS Keep authentication handshake on inbound and outbound secure connections runs with no read or write deadline on the underlying
net.Conn. The 15-second context budget supplied by go-libp2p'sUpgrader(viadefaultAcceptTimeoutinp2p/net/upgrader/upgrader.go) is honored only during the TLS handshake itself; once TLS completes, cancelling that context is a no-op because Go'snet.Conndeadlines are independent ofcontext.Context. An unauthenticated peer that completes TLS within the 15s budget but then stalls during Act 1, Act 2, or Act 3 of the Keep handshake — by sending nothing, or by trickling partial bytes — therefore holds a go-libp2p resource-manager transient-inbound slot indefinitely. At the default capacity of approximately 48 transient-inbound slots per host on 8 GiB AutoScale machines, saturating bootstrap peers' pools blocks new and restarting peers from joining the mesh. The impact is most acute during mandatory upgrade restart windows when threshold-signing groups must reform across the network. The authorization-ordering aggravates this: keep-core's firewallIsRecognized()check runs after the handshake completes, so an attacker stalled mid-handshake never reaches that gate, and unstaked or unrecognized peers can hold slots without ever being filtered out.Solution
Introduce a package-level
keepHandshakeTimeoutvariable (15s, matching go-libp2p'sdefaultAcceptTimeout) and anapplyHandshakeDeadline(ctx, conn)helper inpkg/net/libp2p/transport.gothat setsmin(ctx.Deadline(), now+keepHandshakeTimeout)on the connection viaSetDeadlineand returns a clear function that resets the deadline to zero on success. BothSecureInboundandSecureOutboundinvoke the helper immediately after the libp2p-TLS layer returns anddefer clear()so that successful post-handshake application I/O is unbounded, exactly as before.SetDeadline(rather thanSetReadDeadlinealone) is used because Act 2 on the responder and Acts 1 and 3 on the initiator write to the peer; bounding both directions with a single absolute deadline closes both read-stall and write-stall variants under one invariant. The helper wrapsSetDeadlineerrors with%wto preserve the error chain for callers that rely onerrors.Is. The deadline is absolute pernet.Connsemantics, so a partial-byte trickle cannot extend the dwell time past the cap. The variable is declared asvarrather thanconstsolely so unit tests can shorten it undert.Cleanup; production code never mutates it. No protocol-level changes, no wire-format changes, nogo.modchanges, no new dependencies.Tests
The new
pkg/net/libp2p/transport_test.goadds seven unit tests covering: a full inbound stall after TLS (zero bytes for Act 1), partial-byte trickle attacks, post-success-clear (verifying that a successful handshake does not leave a deadline on the connection that would break later application I/O), explicitctx.Deadline()honoring (a 100ms context deadline fires before the 15s package-level timeout), the symmetric outbound stall path, a fast-handshake-no-spurious-timeout regression check, and a direct helper-on-closed-conn assertion that returns the expected error througherrors.Is(err, io.ErrClosedPipe). Tests inject a mockencryptionLayervia the package-private field so the suite isolates the Keep handshake from libp2p-TLS's TCP-socket dependency, and overridekeepHandshakeTimeoutviat.Cleanupto preserve test isolation across the suite. All seven new tests pass undergo test -race -count=1. Full keep-core unit-test suite passes locally (matching CI:go test -timeout 15m ./...,gofmt -l .,go vet, staticcheck with-SA1019, gosec with the CI exclusion set). The existingpkg/net/libp2ptests pass without modification.Residual Risk and Follow-Up
This patch closes the catastrophic indefinite-stall vector but does not close a residual continuous-churn DoS in which an attacker sustains roughly
transientInboundLimit / keepHandshakeTimeout≈ 3 stalled accepts per second, keeping the transient pool saturated for as long as the rate is maintained. The deadline bounds per-connection dwell time, not steady-state occupancy. Closing that residual class requires architectural follow-up — either earlier IP-layer gating viaConnectionGater.InterceptAccept, a split of the security transport into pure-TLS plus a separately-gated post-TLS authentication layer, or simplification onto libp2p-TLS-only peer authentication. That decision is out of scope for this hotfix and should be handled as a separate RFC PR. Also note: this branch is forked fromfeature/decouple-firewall-allowlist(PR #3909) and targets that branch as its merge base because PR #3909 is landing onmainthis week; once #3909 merges, the base of this PR will be re-targeted tomain. PR #3909 does not touchpkg/net/libp2p/transport.goorpkg/net/libp2p/authenticated_connection.go, so no file overlap is expected in either merge direction.