Skip to content

fix(keyexchange): demote same-session PILA log to Debug#174

Merged
TeoSlayer merged 5 commits into
mainfrom
fix/keyexchange-same-session-log-spam
May 29, 2026
Merged

fix(keyexchange): demote same-session PILA log to Debug#174
TeoSlayer merged 5 commits into
mainfrom
fix/keyexchange-same-session-log-spam

Conversation

@TeoSlayer
Copy link
Copy Markdown
Owner

Summary

A fresh daemon connected to list-agents (node 179172) via the production registry/beacon was logging `encrypted tunnel established` at Info every ~8 seconds — 35 false-positive 'established' lines for one peer in 5 minutes. The session was actually live the whole time; the noise came from a gap in the duplicate-coalescing logic.

Field observation (2026-05-29)

After PR #155-era code shipped, build my freshly-compiled daemon (`audit-build-e05aa1f7`) and run `pilotctl handshake list-agents` + `pilotctl send-message list-agents --wait`. Logs show:

```
encrypted tunnel established auth=true peer_node_id=179172 endpoint=34.71.57.205:9001 relay=true
encrypted tunnel established auth=true peer_node_id=179172 endpoint=34.71.57.205:9001 relay=true
encrypted tunnel established auth=true peer_node_id=179172 endpoint=34.71.57.205:9001 relay=true
... (35 events total, ~one every 8 s)
```

Daemon stats: `tunnel_encryption_success: 44`, `pkts_recv: 0`, `peers: 1 (relay)`. So the AEAD path works but the data plane is dropping our PILS replies. Peer (list-agents) interprets that as us being stale and keeps retransmitting its PILA every ~8 s.

Root cause

`pkg/daemon/keyexchange/handle.go`:

```go
duplicate := hadCrypto && !keyChanged && time.Since(oldPC.CreatedAt) < DuplicateHandshakeDebounce
```

`DuplicateHandshakeDebounce = 250 ms`. An 8-second-cadence retransmit lands well past that, so `duplicate=false`. The frame carries the same X25519 ephemeral though — `keyChanged=false` — so nothing is installed (`!hadCrypto || keyChanged` is false). But the code still fires the full side-effect path: Info log, `tunnel.established` bus event, `PostInstallHook`.

Fix

Distinguish two slow-path cases:

  • `duplicate` (within 250 ms): coalesce everything (unchanged).
  • `sameSession := hadCrypto && !keyChanged` (past debounce): demote the Info log to Debug but keep firing the bus event + PostInstallHook — the existing test `TestDuplicatePILAOutsideDebounceFiresHookAgain` pins those for endpoint refresh and that behaviour stays.

Mirrored in `HandleUnauthFrame` (PILK) for symmetry.

Test

New: `TestSameSessionPILASuppressesInfoButFiresHookAndDebug` — full-flow regression:

  1. First PILA → Info `encrypted tunnel established`, hook count = 1.
  2. Sleep past debounce.
  3. Second same-key PILA → hook count = 2 (endpoint refresh preserved), `established` Info-log count stays at 1, Debug `same-session keepalive` present.

Test is not `t.Parallel()` because `slog.Default` is process-global; parallel `SetDefault` calls race.

Existing tests still green:

  • `TestDuplicatePILACoalescedSuppressesLogAndHook` — within-debounce coalescing.
  • `TestDuplicatePILAOutsideDebounceFiresHookAgain` — past-debounce hook fire (endpoint refresh).
  • `TestDuplicatePILAStillRepliesForAsymmetricRecovery` — recovery-reply path.
  • `TestReplyRateLimit*` — 1-second min interval between recovery replies.
  • `TestAsymmetricRecovery*` — InboundDecryptStale-driven reply.

What does NOT change

  • AEAD / X25519 / Ed25519 logic untouched.
  • `SendKeyExchangeToNode` recovery-reply path on `InboundDecryptStale` untouched.
  • Rate-limit gates (`KeyExchangeReplyMinInterval`, `MarkReplyKeyExchangeSent`) untouched.
  • `tunnel.established` bus event still fires past debounce — listeners that gate on it (relay admission, telemetry counters) keep working.

Test plan

  • `go test ./pkg/daemon/keyexchange/...` passes (all 8 existing keyexchange tests + new one).
  • `go build ./...` clean.
  • Rebuild a daemon from this branch, repeat the `pilotctl handshake list-agents` flow, confirm `encrypted tunnel established` fires exactly once per peer.

Observed against list-agents (node 179172) on 2026-05-29: a fresh
daemon sends its first PILA, peer replies and trust is established,
but the relayed data plane drops our PILS replies. Peer retransmits
its PILA every ~8 s as a keepalive. Each arrival carries the SAME
X25519 ephemeral (hadCrypto=true, keyChanged=false) and lands well
outside DuplicateHandshakeDebounce (250 ms), so the existing duplicate
gate doesn't catch it — every retransmit fires another 'encrypted
tunnel established' Info log and another tunnel.established bus event
even though structurally nothing was installed. 35 false-positive
'established' lines per peer per 5 minutes in field measurement.

Fix demotes the log to Debug for the same-session case while keeping
the bus event + postInstall hook firing (existing endpoint-refresh
contract pinned by TestDuplicatePILAOutsideDebounceFiresHookAgain).
Mirrors the demotion in HandleUnauthFrame (PILK) for the same reason.

Adds TestSameSessionPILASuppressesInfoButFiresHookAndDebug that pins:
  - first PILA still produces an Info 'established' line + hook count 1
  - second same-key PILA past debounce: hook count = 2 (endpoint refresh
    preserved), 'established' Info line count stays at 1, Debug log
    'same-session keepalive' present.

Does NOT change crypto/network behaviour. The asymmetric-recovery
reply path (TestAsymmetricRecoveryRepliesOnDuplicatePILAWhenStale)
and reply-rate-limit (TestReplyRateLimit*) gates are independent and
remain intact.
@TeoSlayer TeoSlayer requested a review from Alexgodoroja as a code owner May 29, 2026 18:08
teovl and others added 4 commits May 29, 2026 11:18
…-existing)

The fan-out loop reused a single msg buffer across iterations and did
the per-iteration copy INSIDE the spawned goroutine. The main goroutine's
next msg[0] = byte(i & 0xFF) raced with the previous goroutine's
copy(m2, m). Caught by go test -race in the Architecture gates job
(report: zz_ipc_write_deadline_test.go:75 read vs :72 write).

This was added in 1eff4fa (PILOT-218 write-deadline fix) before this
branch existed; every PR opened since then has been failing the
race-detector check. Hoisting the copy out of the goroutine fixes it
without changing the test's intent (still floods ic.ipcWrite with
ipcSendBuffer+10 distinct messages to fill the kernel send buffer).

Touched alongside the keyexchange log-spam fix because the same PR
job runs both and we can't merge until -race is green.
PR #156 / commit 1eff4fa was titled 'add write deadline to IPC
writeLoop and bypass semaphore for CmdHealth', and its commit message
correctly described the design (10 s per active write, 3 s drain on
Close). The CmdHealth inline-dispatch half landed; the writeLoop
SetWriteDeadline half didn't.

Result: TestWriteLoopExitsOnWriteDeadline has been failing since #156
merged — first with a -race report (a buffer-reuse bug in the test's
fan-out loop, fixed in the prior commit on this branch), then with
'writeLoop did not exit within deadline window' because the deadline
the test waits on doesn't actually exist. Every PR opened since #156
has been silently blocked on Architecture gates for this reason.

Adds:
  - ipcWriteTimeout = 10 s (the active-write deadline)
  - ipcDrainTimeout = 3 s (the Close-drain deadline)
  - SetWriteDeadline calls in writeLoop's both arms, matching the
    contract the test expects.

SetWriteDeadline errors are deliberately swallowed — net.Pipe ignores
the call, and any real socket that doesn't support it is already
broken in ways the next Write will surface.

Semantically a no-op for the happy path (normal clients read fast, the
deadline never trips). For a stalled client it does what PILOT-218
wanted: writeLoop times out → c.Conn.Close() → writeDone closes →
every parked ipcWrite caller returns ErrIPCClosed → semaphore drains
→ daemon is responsive again.
Companion to the prior commit (actually-set the writeLoop deadline).
After writeLoop hits ipcWriteTimeout and closes writeDone, ipcWrite
still had a sendCh-with-room enqueue path that returned nil — the
message would land in the channel, sit there orphaned, and the
caller would think it succeeded. The slow-path select did catch
writeDone, but only after the buffer filled.

Added a non-blocking writeDone check next to the existing c.done
fast-fail, so any ipcWrite after writeLoop exits returns ErrIPCClosed
immediately regardless of sendCh capacity. Pinned by
TestWriteLoopExitsOnWriteDeadline's final assertion.
@TeoSlayer TeoSlayer merged commit 276eb9a into main May 29, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants