Skip to content

fix(libp2p): bound Keep authentication handshake with SetDeadline#4008

Open
lrsaturnino wants to merge 1 commit into
feature/decouple-firewall-allowlistfrom
fix/libp2p-handshake-timeout-dos
Open

fix(libp2p): bound Keep authentication handshake with SetDeadline#4008
lrsaturnino wants to merge 1 commit into
feature/decouple-firewall-allowlistfrom
fix/libp2p-handshake-timeout-dos

Conversation

@lrsaturnino
Copy link
Copy Markdown
Member

Problem

The post-TLS Keep authentication handshake on inbound and outbound secure connections runs with no read or write deadline on the underlying net.Conn. The 15-second context budget supplied by go-libp2p's Upgrader (via defaultAcceptTimeout in p2p/net/upgrader/upgrader.go) is honored only during the TLS handshake itself; once TLS completes, cancelling that context is a no-op because Go's net.Conn deadlines are independent of context.Context. An unauthenticated peer that completes TLS within the 15s budget but then stalls during Act 1, Act 2, or Act 3 of the Keep handshake — by sending nothing, or by trickling partial bytes — therefore holds a go-libp2p resource-manager transient-inbound slot indefinitely. At the default capacity of approximately 48 transient-inbound slots per host on 8 GiB AutoScale machines, saturating bootstrap peers' pools blocks new and restarting peers from joining the mesh. The impact is most acute during mandatory upgrade restart windows when threshold-signing groups must reform across the network. The authorization-ordering aggravates this: keep-core's firewall IsRecognized() check runs after the handshake completes, so an attacker stalled mid-handshake never reaches that gate, and unstaked or unrecognized peers can hold slots without ever being filtered out.

Solution

Introduce a package-level keepHandshakeTimeout variable (15s, matching go-libp2p's defaultAcceptTimeout) and an applyHandshakeDeadline(ctx, conn) helper in pkg/net/libp2p/transport.go that sets min(ctx.Deadline(), now+keepHandshakeTimeout) on the connection via SetDeadline and returns a clear function that resets the deadline to zero on success. Both SecureInbound and SecureOutbound invoke the helper immediately after the libp2p-TLS layer returns and defer clear() so that successful post-handshake application I/O is unbounded, exactly as before. SetDeadline (rather than SetReadDeadline alone) is used because Act 2 on the responder and Acts 1 and 3 on the initiator write to the peer; bounding both directions with a single absolute deadline closes both read-stall and write-stall variants under one invariant. The helper wraps SetDeadline errors with %w to preserve the error chain for callers that rely on errors.Is. The deadline is absolute per net.Conn semantics, so a partial-byte trickle cannot extend the dwell time past the cap. The variable is declared as var rather than const solely so unit tests can shorten it under t.Cleanup; production code never mutates it. No protocol-level changes, no wire-format changes, no go.mod changes, no new dependencies.

Tests

The new pkg/net/libp2p/transport_test.go adds seven unit tests covering: a full inbound stall after TLS (zero bytes for Act 1), partial-byte trickle attacks, post-success-clear (verifying that a successful handshake does not leave a deadline on the connection that would break later application I/O), explicit ctx.Deadline() honoring (a 100ms context deadline fires before the 15s package-level timeout), the symmetric outbound stall path, a fast-handshake-no-spurious-timeout regression check, and a direct helper-on-closed-conn assertion that returns the expected error through errors.Is(err, io.ErrClosedPipe). Tests inject a mock encryptionLayer via the package-private field so the suite isolates the Keep handshake from libp2p-TLS's TCP-socket dependency, and override keepHandshakeTimeout via t.Cleanup to preserve test isolation across the suite. All seven new tests pass under go test -race -count=1. Full keep-core unit-test suite passes locally (matching CI: go test -timeout 15m ./..., gofmt -l ., go vet, staticcheck with -SA1019, gosec with the CI exclusion set). The existing pkg/net/libp2p tests pass without modification.

Residual Risk and Follow-Up

This patch closes the catastrophic indefinite-stall vector but does not close a residual continuous-churn DoS in which an attacker sustains roughly transientInboundLimit / keepHandshakeTimeout ≈ 3 stalled accepts per second, keeping the transient pool saturated for as long as the rate is maintained. The deadline bounds per-connection dwell time, not steady-state occupancy. Closing that residual class requires architectural follow-up — either earlier IP-layer gating via ConnectionGater.InterceptAccept, a split of the security transport into pure-TLS plus a separately-gated post-TLS authentication layer, or simplification onto libp2p-TLS-only peer authentication. That decision is out of scope for this hotfix and should be handled as a separate RFC PR. Also note: this branch is forked from feature/decouple-firewall-allowlist (PR #3909) and targets that branch as its merge base because PR #3909 is landing on main this week; once #3909 merges, the base of this PR will be re-targeted to main. PR #3909 does not touch pkg/net/libp2p/transport.go or pkg/net/libp2p/authenticated_connection.go, so no file overlap is expected in either merge direction.

The post-TLS Keep authentication handshake on inbound and outbound
secure connections previously ran with no read/write deadline on the
underlying net.Conn. libp2p-TLS's 15s security-negotiation context
budget covers only the TLS handshake itself; once TLS completes,
cancelling the context is a no-op. An unauthenticated peer that
completes TLS but then stalls Act 1/2/3 of the Keep handshake — by
sending nothing or trickling partial bytes — therefore holds a
go-libp2p resource-manager transient-inbound slot indefinitely. At the
default ~48-slot capacity on 8 GiB hosts, saturating bootstrap peers'
pools blocks new and restarting peers from joining the mesh, which is
particularly damaging during mandatory upgrade restart windows when
threshold-signing groups must reform.

Introduce a package-level keepHandshakeTimeout (15s, matching
go-libp2p's defaultAcceptTimeout) and an applyHandshakeDeadline(ctx,
conn) helper that sets min(ctx.Deadline(), now+keepHandshakeTimeout)
on the connection and returns a clear function that resets the
deadline to zero. SecureInbound and SecureOutbound both invoke the
helper immediately after the libp2p-TLS layer returns and defer the
clear so successful post-handshake application I/O is unbounded, as
before. The helper wraps SetDeadline errors with %w to preserve the
error chain for callers that need errors.Is checks.

Add pkg/net/libp2p/transport_test.go with seven unit tests covering
the stall, partial-byte-trickle, ctx-deadline-honoring,
success-clears-deadline, fast-handshake-no-spurious-timeout, symmetric
outbound stall, and direct-helper-on-closed-conn paths. Tests use an
injected mock encryptionLayer to isolate the Keep handshake from real
libp2p-TLS over net.Pipe and override keepHandshakeTimeout via
t.Cleanup so test isolation is preserved.

No protocol or wire-format changes. No new dependencies.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ab5611fb-64f0-4cd4-ae27-ffe832ba27fe

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/libp2p-handshake-timeout-dos

Comment @coderabbitai help to get the list of available commands and usage tips.

@lrsaturnino lrsaturnino marked this pull request as ready for review June 1, 2026 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant