
fix(ocap-kernel): peer-incarnation restart detection across receiver state loss#948

Open
sirtimid wants to merge 12 commits into main from sirtimid/fix-issue-944-peer-incarnation-restart

Conversation

Contributor

@sirtimid sirtimid commented May 4, 2026

What

Closes #944: a remote peer restarting with a fresh incarnation (e.g. plugin kernel reload) caused the receiving kernel to silently drop the new connection's seq=1 messages.

Why

The receiver's persisted RemoteHandle.#highestReceivedSeq outlived the in-memory PeerStateManager that was supposed to detect the restart. The handshake then mis-classified a real restart as a first connection, and dedup ate the messages.

How

  • Persist the peer's last-observed incarnation in a new peerIncarnation.{peerId} KV namespace, so the comparison survives receiver state loss.
  • Fire onIncarnationChange on every successful handshake; the kernel compares observed against persisted and returns whether a restart was detected. The transport uses that verdict to close the channel and re-dial, symmetric on both inbound and outbound paths.
  • On detected restart, clear the peer's c-list contributions (forgetEndpointImports) and reject promises it was deciding, atomically with the persisted incarnation update via a savepoint. In-memory state changes and run-queue notifications are deferred until after commit so a kv rollback can't leave the views inconsistent.
  • Add PeerRestartedError, IntentionalCloseError, NetworkStoppedError to @metamask/kernel-errors (with isTerminalSendError) so the kernel-side abort-retransmit predicate stops relying on string matching.
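
The persisted comparison in the first two bullets can be sketched as follows. This is a minimal model, not the kernel's real API: `kv` stands in for the kernel KV store, and only the key layout mirrors the `peerIncarnation.{peerId}` namespace described above.

```typescript
// Minimal model of the persisted-incarnation check. `kv` stands in for
// the kernel KV store; the key layout mirrors peerIncarnation.{peerId}.
type KV = Map<string, string>;

function checkIncarnation(
  kv: KV,
  peerId: string,
  observed: string,
): { restartDetected: boolean } {
  const key = `peerIncarnation.${peerId}`;
  const prior = kv.get(key);
  if (prior === undefined) {
    // First observation: record it; not a restart.
    kv.set(key, observed);
    return { restartDetected: false };
  }
  if (prior === observed) {
    // Same incarnation as last time: nothing to do.
    return { restartDetected: false };
  }
  // Mismatch against a defined prior value: a real peer restart.
  kv.set(key, observed);
  return { restartDetected: true };
}
```

Because the comparison reads a persisted value rather than in-memory state, the verdict survives the receiver losing its PeerStateManager.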

Test plan

  • Unit tests: yarn workspace @metamask/kernel-errors test:dev:quiet, yarn workspace @metamask/ocap-kernel test:dev:quiet, kernel-browser-runtime, kernel-node-runtime
  • yarn build, yarn workspace @metamask/ocap-kernel lint:fix
  • E2E packages/extension/test/e2e/remote-comms.test.ts (timeout budget raised to fit the 40s URL redemption ceiling)
  • E2E packages/kernel-node-runtime/test/e2e/remote-comms.test.ts: regression test for #944 (Restart of one member of a communicating pair but not the other causes remote comms problems)

🤖 Generated with Claude Code


Note

Medium Risk
Touches remote-communications handshake, kernel persistence, and retransmit logic; mistakes could cause dropped messages, excessive reconnects, or incorrect promise rejection. Mitigated by extensive new unit/e2e coverage and fail-closed defaults.

Overview
Fixes a remote-comms restart edge case by persisting the last-seen peer incarnation and consulting it on every handshake, allowing restart detection even after receiver-side in-memory state loss.

Updates the transport and platform-services RPC plumbing so onIncarnationChange(peerId, observedIncarnation) returns a boolean verdict; the transport uses this verdict to drop/close channels and suppress potentially stale outbound messages (with explicit fail-closed behavior on RPC/handler errors).

Hardens kernel restart handling by splitting peer-restart cleanup into persisted vs in-memory phases (savepoint-protected), adding forgetEndpointImports c-list teardown, improving retransmit to send sequentially and abort on terminal send errors, and introducing new sentinel errors (PeerRestartedError, IntentionalCloseError, NetworkStoppedError) plus isTerminalSendError. E2E timeouts/tests are adjusted/added to cover the regression scenario.

Reviewed by Cursor Bugbot for commit 6aae9ea.

@sirtimid sirtimid changed the title from "fix(ocap-kernel): peer-incarnation restart detection across receiver state loss (#944)" to "fix(ocap-kernel): peer-incarnation restart detection across receiver state loss" on May 4, 2026

github-actions Bot commented May 4, 2026

Coverage Report

| Status | Category | Percentage | Change | Covered / Total |
| --- | --- | --- | --- | --- |
| 🔵 | Lines | 71.43% | +0.48% | 8259 / 11561 |
| 🔵 | Statements | 71.26% | +0.47% | 8395 / 11780 |
| 🔵 | Functions | 72.25% | +0.35% | 1992 / 2757 |
| 🔵 | Branches | 65.09% | +0.51% | 3342 / 5134 |
File Coverage (changed files)

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| packages/kernel-browser-runtime/src/PlatformServicesClient.ts | 92.2% (+2.65%) | 87.09% (+11.23%) | 76% (-1.27%) | 93.42% (+3.87%) | 111, 133, 291-302, 424, 446 |
| packages/kernel-browser-runtime/src/PlatformServicesServer.ts | 90.72% (+1.48%) | 82.75% (±0%) | 76% (+5.17%) | 91.66% (+2.42%) | 144, 167, 189, 217, 399-403, 446 |
| packages/kernel-errors/src/constants.ts | 93.75% (±0%) | 100% (±0%) | 80% (±0%) | 90.9% (±0%) | 20 |
| packages/kernel-errors/src/index.ts | 100% (±0%) | 100% (±0%) | 100% (±0%) | 100% (±0%) | |
| packages/kernel-errors/src/errors/IntentionalCloseError.ts | 100% | 100% | 100% | 100% | |
| packages/kernel-errors/src/errors/NetworkStoppedError.ts | 100% | 100% | 100% | 100% | |
| packages/kernel-errors/src/errors/PeerRestartedError.ts | 100% | 100% | 100% | 100% | |
| packages/kernel-errors/src/errors/index.ts | 100% (±0%) | 100% (±0%) | 100% (±0%) | 100% (±0%) | |
| packages/kernel-errors/src/utils/isTerminalSendError.ts | 100% | 100% | 100% | 100% | |
| packages/ocap-kernel/src/remotes/types.ts | 100% (±0%) | 100% (±0%) | 100% (±0%) | 100% (±0%) | |
| packages/ocap-kernel/src/remotes/kernel/RemoteHandle.ts | 95.84% (+6.62%) | 89.47% (+5.55%) | 98.03% (+10.28%) | 95.81% (+6.32%) | 389, 396-401, 447, 526, 569, 579-581, 641-644, 983, 1051-1053, 1104 |
| packages/ocap-kernel/src/remotes/kernel/RemoteManager.ts | 99.05% (+0.15%) | 100% (±0%) | 95.65% (±0%) | 99.05% (+0.15%) | 437-439 |
| packages/ocap-kernel/src/remotes/platform/transport.ts | 82.5% (+0.36%) | 84.34% (+3.39%) | 74.28% (±0%) | 82.5% (+0.36%) | 65, 140, 143, 164, 208-209, 228-229, 279-280, 298-307, 366, 400-418, 442, 526, 569-572, 596, 620-625, 628-629, 641-645, 699, 729, 747-749, 758, 771, 851, 880 |
| packages/ocap-kernel/src/rpc/kernel-remote/remoteIncarnationChange.ts | 75% (+15.00%) | 100% (±0%) | 0% (±0%) | 75% (+15.00%) | 51 |
| packages/ocap-kernel/src/store/methods/remote.ts | 98.43% (-1.57%) | 100% (±0%) | 100% (±0%) | 98.43% (-1.57%) | 94-98 |
| packages/ocap-kernel/src/store/methods/vat.ts | 97.26% (+0.93%) | 83.33% (+4.76%) | 100% (±0%) | 97.24% (+0.95%) | 224, 294, 301-302 |
Generated in workflow #4397 for commit 6aae9ea by the Vitest Coverage Report Action

sirtimid and others added 11 commits May 4, 2026 21:50
Adds an e2e regression test under Incarnation Detection that exercises
the case where a receiver's in-memory PeerStateManager has lost the
peer's last-seen incarnationId (here via receiver restart) while the
RemoteHandle's persisted #highestReceivedSeq survives. With the same
peerId but a fresh incarnationId on the sender, the handshake
mis-classifies the reconnect as a first connection, no
handlePeerRestart fires, and the sender's seq=1 messages are silently
dropped — matching the symptom in issue #944. The test currently fails
with a URL redemption timeout and will pass once the dedup-vs-incarnation
state lifetimes are aligned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eceiver state loss

Resolves issue #944. When a remote peer restarted with the same peerId
but a fresh incarnation (e.g. plugin kernel reload), the receiving
kernel could silently drop the peer's seq=1 messages because its
in-memory PeerStateManager had no record of the prior incarnation to
compare against — but its persisted RemoteHandle still held a non-zero
#highestReceivedSeq from before. The dedup state outlived the
restart-detection state, so the handshake mis-classified a real restart
as a first connection.

Changes:

- Persist the peer's last-observed incarnationId in a new
  `peerIncarnation.{peerId}` KV namespace. Aligns its lifetime with the
  seq dedup state it gates.
- onIncarnationChange now fires on every successful handshake, carrying
  the observed incarnation, and returns a boolean telling the transport
  whether the kernel detected a real restart. The receiver's PSM check
  is no longer the authority — the persisted comparison is.
- RemoteManager#handleIncarnationChange compares observed against the
  persisted value inside a savepoint: on mismatch with a defined prior
  value it calls handlePeerRestart, rejects promises the peer was
  deciding, and persists the new incarnation atomically.
- handlePeerRestart additionally clears the peer's "+"-direction c-list
  entries (the peer's exports — dead after restart) via the new
  `forgetEndpointImports` helper. Without this, fresh incarnations'
  reused eref labels would resolve to already-resolved kernel promises
  from the prior incarnation.
- Transport closes (rather than registers) the freshly-dialed channel
  on outbound restart detection, so concurrent in-flight retransmits
  can't reuse it to write pre-restart payloads.
- RemoteHandle#retransmitPending is sequential so a peer-restart
  detection during the first send short-circuits the rest before any
  stale message hits the wire.
- Browser RPC plumbing forwards the observed incarnation across the
  platform-services boundary and surfaces the kernel's restart verdict
  back to the transport.

The regression test added in the prior commit now passes end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
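
The sequential #retransmitPending behavior described above can be sketched as follows. Function and parameter names here are illustrative stand-ins, not the real RemoteHandle API; `isTerminal` plays the role of isTerminalSendError.

```typescript
// Sketch: retransmit pending messages strictly in order, aborting at the
// first terminal failure so no stale message follows it onto the wire.
async function retransmitPending(
  pending: string[],
  send: (msg: string) => Promise<void>,
  isTerminal: (err: unknown) => boolean,
  giveUp: (err: unknown) => void,
): Promise<number> {
  let sent = 0;
  for (const msg of pending) {
    try {
      await send(msg); // sequential: await before pulling the next message
      sent += 1;
    } catch (err) {
      if (isTerminal(err)) {
        giveUp(err); // peer restarted / network stopped: abort the rest
        break;
      }
      // Transient error: continue; the ACK timeout will retry later.
    }
  }
  return sent;
}
```

The key property is that a restart detected during the first send short-circuits before any later pre-restart payload is written.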
ackTimeoutMs gates the URL redemption timeout
(ackTimeoutMs × (MAX_RETRIES + 1)), so the previous 500ms turned the
2s redemption budget into a flaky one for fresh K2's libp2p dial +
handshake under CI load. Bump ackTimeoutMs to 2s and widen the
reconnect/handshake/write timeouts proportionally; the test still
finishes in ~5–6s but passes reliably across repeated runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #944 follow-up. The first cut of forgetEndpointImports lived in
clist.ts and only invalidated c-list lookups: deleteCListEntry's
recognizable-only decrement leaves object-export refcounts and owner
entries intact, so kernel objects exported by the peer could never be
GC'd, and promise deciders weren't released.

Move the helper to vat.ts (next to cleanupTerminatedVat, where the
analogous bookkeeping for terminated vats lives) and mirror its
export/promise legs:

- Object exports: clear the owner key (peer was owning the object),
  decrement the baseline refcount, queue for GC, then tear down the
  c-list pair.
- Promise exports: tear down the c-list pair, then drop the decider
  refcount if the peer was still recorded as decider.

Imports stay scoped to the peer's "+"-direction entries so our exports
to the peer (alice's root, etc.) keep working when a fresh incarnation
reconnects with the same peer ID.

Caller contract is unchanged: handlePeerRestart already rejects
promises the peer was deciding before invoking this; this function
takes care of the bookkeeping that termination would otherwise leak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
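
A toy model of the direction filter this commit describes, with deliberately simplified store shapes: the real bookkeeping in vat.ts also handles promise deciders and baseline refcount arithmetic, none of which is modeled here.

```typescript
// Illustrative only: tear down the peer's "+"-direction entries (its
// exports to us, dead after restart) while preserving import-direction
// entries (our exports to the peer) for the fresh incarnation.
interface MiniStore {
  clist: Map<string, string>; // `${peerId}:${eref}` -> kref
  owners: Map<string, string>; // kref -> owning endpoint
  maybeFreeKrefs: Set<string>; // queued for a later GC scan
}

function forgetEndpointImports(store: MiniStore, peerId: string): void {
  for (const [key, kref] of [...store.clist]) {
    const [endpoint, eref = ""] = key.split(":");
    // Only the peer's "+"-direction entries are dead after a restart.
    if (endpoint !== peerId || !eref.startsWith("o+")) continue;
    if (store.owners.get(kref) === peerId) {
      store.owners.delete(kref); // the peer no longer owns the object
    }
    store.maybeFreeKrefs.add(kref); // let GC decide if it is now free
    store.clist.delete(key); // tear down the c-list entry
  }
}
```

A migrated owner (an entry whose kref is now owned by another endpoint) keeps its owner mapping, matching the test described in the next commit.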
Adds vat.test.ts coverage for the helper added in the prior commit.
Locks in the bookkeeping the peer-restart cleanup is responsible for
that the e2e regression test would only catch as a slow leak:

- Promise exports decrement the decider refcount only while the peer is
  still recorded as decider (nothing extra otherwise).
- Object exports drop the owner mapping, tear down the c-list pair
  directly, decrement the baseline refcount, and add the kref to
  maybeFreeKrefs.
- Object exports whose owner has migrated to another endpoint keep
  that owner intact while still releasing the peer's c-list pair.
- Import-direction entries (our exports to the peer) are preserved
  unchanged so fresh incarnations can keep using them.
- A mixed batch of entries is processed in one pass without
  disturbing the imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tightens the post-review state of the issue #944 fix:

- OnIncarnationChange: drop the boolean | Promise<boolean> union and the
  now-stale "may return synchronously" hint; the type is always async.
  Make RemoteManager.#handleIncarnationChange async to match.
- transport.ts: introduce PeerRestartedError sentinel and skip
  handleConnectionLoss for it on the outbound throw path. The peer is
  reachable (handshake just succeeded) and handleConnectionLoss would
  clobber an inbound channel a concurrent handshake just registered.
- RemoteManager.ts: drop the production-source "See issue #944."
  reference; the issue link belongs in commit history, not the source.

Tests:
- transport.test.ts: cover the close-then-throw-without-register path
  on outbound restart-detected, plus the OR-with-PSM case.
- RemoteHandle.test.ts: spy-based fake-timer harness (vi.useFakeTimers
  is incompatible with SES) for #retransmitPending — sequential await,
  terminal-error abort + giveUp, transient continue.
- RemoteManager.test.ts: savepoint rollback when handlePeerRestart
  throws, plus updated verdict semantics (true even when no remote
  handle exists — it's the persisted-state comparison, not the reset
  outcome, that the transport cares about).
- PlatformServicesServer.test.ts and PlatformServicesClient.test.ts:
  cover the new remoteIncarnationChange RPC plumbing — forwards args,
  returns kernel verdict, fails closed on RPC error, coerces
  non-boolean / throwing handlers to true.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Error

Adds PeerRestartedError, IntentionalCloseError, and NetworkStoppedError
as BaseError-backed sentinels with full marshal/unmarshal support, plus
the `isTerminalSendError` predicate that the kernel-side RemoteHandle
uses to discriminate retry-worthy from terminal verdicts thrown by the
transport.

These were previously ad-hoc plain Error throws scattered across the
transport, with the predicate bound to brittle string matching on the
error message. Centralizing them removes a class of drift hazards: the
TERMINAL_NAMES set is derived from the class statics, so adding a new
terminal class without registering it is a one-line obvious omission
rather than a needle-in-a-haystack drift.

Detection still uses the `name` field rather than `instanceof` because
errors lose class identity when serialized via the platform-services
JSON-RPC boundary; the `name` is preserved.
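
The pattern reads roughly like this sketch. The class names match the PR's new sentinels, but the static field and registry shapes shown here are assumptions about the design, not the actual kernel-errors source.

```typescript
// Sentinel errors whose identity survives serialization via `name`.
class PeerRestartedError extends Error {
  static readonly errorName = "PeerRestartedError";
  constructor() {
    super("peer restarted");
    this.name = PeerRestartedError.errorName;
  }
}
class IntentionalCloseError extends Error {
  static readonly errorName = "IntentionalCloseError";
  constructor() {
    super("intentional close");
    this.name = IntentionalCloseError.errorName;
  }
}
class NetworkStoppedError extends Error {
  static readonly errorName = "NetworkStoppedError";
  constructor() {
    super("network stopped");
    this.name = NetworkStoppedError.errorName;
  }
}

// Derived from the class statics: adding a terminal class means adding
// it to this one list, with no second string to keep in sync.
const TERMINAL_NAMES: ReadonlySet<string> = new Set(
  [PeerRestartedError, IntentionalCloseError, NetworkStoppedError].map(
    (cls) => cls.errorName,
  ),
);

// `name`-based rather than instanceof: class identity is lost across the
// platform-services JSON-RPC boundary, but the `name` field is preserved.
function isTerminalSendError(value: unknown): boolean {
  return value instanceof Error && TERMINAL_NAMES.has(value.name);
}
```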

Tests: 38 new tests across the three error classes (8 each) and the
predicate (14) — covering canonical message/code, name preservation,
cause/stack options, valid unmarshal, and three failure cases per
class plus full coverage of the predicate's positive and negative
matches including non-Error values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

Follow-up review of the issue #944 fix surfaced four real defects, each
fixed here:

1. RemoteHandle.#sendRemoteCommand swallowed PeerRestartedError and
   NetworkStoppedError. Only `intentional close` triggered the cleanup
   path; the other two terminal verdicts fell through to logger.error
   with the message persisted to remotePending and #nextSendSeq
   advanced, leaving it pending until ACK timeout × MAX_RETRIES. The
   catch now routes all terminal errors through the shared
   isTerminalSendError predicate.

2. RemoteManager.#handleIncarnationChange enqueued kernelQueue
   notifications and ran in-memory mutations inside the savepoint
   window. A kv rollback would not undo either, leaving the kernel
   with run-queue entries against promises the persisted store still
   considers unresolved. Split RemoteHandle.handlePeerRestart into
   `persistPeerRestart` (kv-only, savepoint-safe) and
   `finalizePeerRestart` (in-memory + redemption rejections); the
   manager now snapshots `getPromisesByDecider` before any kv mutation,
   runs `persistPeerRestart` inside the savepoint, and defers
   `finalizePeerRestart` and `resolvePromises` to after release.

3. doInboundHandshake silently ignored kernelDetectedRestart. The
   outbound path closes the freshly-dialed channel and throws
   PeerRestartedError on detected restart; the inbound path only
   logged. A concurrent in-flight outbound send could then write a
   pre-restart payload over a fresh-incarnation channel. Inbound now
   returns false when the kernel detects a restart, and the caller
   closes the channel without registering it — symmetric with the
   outbound guard.

4. deleteRemoteInfo left peerIncarnation.{peerId} stranded. A
   re-established remote with the same peerId would mis-classify its
   first handshake as a restart against the leftover incarnation. Now
   reads the info before deletion and clears the peerIncarnation row;
   corrupt JSON is logged and swallowed so the rest of the cleanup
   still runs.
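
The persisted-vs-in-memory split in (2) can be modeled minimally like this, with a Map snapshot standing in for the store's savepoint/rollback. All names are illustrative; the real manager also snapshots getPromisesByDecider and enqueues run-queue notifications in the deferred phase.

```typescript
// Sketch: kv-only work runs inside the "savepoint"; in-memory work is
// deferred until after commit so a rollback cannot desynchronize them.
type State = Map<string, string>;

function handleIncarnationChange(
  kv: State,
  peerId: string,
  observed: string,
  persistPeerRestart: (kv: State) => void, // kv-only, savepoint-safe
  finalizePeerRestart: () => void, // in-memory state + notifications
): boolean {
  const key = `peerIncarnation.${peerId}`;
  const prior = kv.get(key);
  const restart = prior !== undefined && prior !== observed;
  const snapshot = new Map(kv); // stand-in for SAVEPOINT
  try {
    if (restart) persistPeerRestart(kv);
    kv.set(key, observed); // atomic with the persisted cleanup
  } catch (err) {
    kv.clear(); // stand-in for ROLLBACK TO SAVEPOINT
    for (const [k, v] of snapshot) kv.set(k, v);
    throw err; // nothing in-memory ran, so the views stay consistent
  }
  // Deferred until after "commit".
  if (restart) finalizePeerRestart();
  return restart;
}
```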

Other tightening:
- #rejectAllPending is a no-op when nothing is pending, so the catch
  path is safe even when the kernel-side restart cleanup ran upstream.
- #handleIncarnationChange logs a warning when the restart verdict
  fires but no live RemoteHandle exists (boot race / peer teardown
  scoping issue), instead of silently advancing the persisted
  incarnation.
- Sentinel error matching uses the centralized `isTerminalSendError`
  imported from @metamask/kernel-errors (this commit's prerequisite),
  removing the brittle string-match in the predicate.
- Outbound PeerRestartedError now logs at info severity, not error —
  it is an expected, recoverable event.

Tests:
- New: PeerRestartedError/IntentionalCloseError/NetworkStoppedError
  from initial #sendRemoteCommand reject pending and fire onGiveUp.
- New: handlePeerRestart actually clears c-list state (seeds an
  exportFromEndpoint pair, asserts both halves gone after restart).
- New: PeerRestartedError does not trigger reconnection from the
  outbound catch path.
- New: inbound handshake closes the channel without registration when
  the kernel detects a restart.
- New: deleteRemoteInfo clears the peer-incarnation row.
- Updated: RemoteManager savepoint-rollback test now spies on
  persistPeerRestart and asserts finalizePeerRestart is NOT called
  when the persisted phase throws.
- Updated: #handleIncarnationChange verdict tests assert false on
  first observation and stable incarnation (not just true on restart).

E2E: extension/test/e2e/remote-comms.test.ts had its toBeVisible
budget raised to 50s (suite to 90s) to fit inside the 40s URL
redemption budget that the new awaited handshake RPC pushes the
round-trip into under CI load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eadlock

The drain loop awaited each request handler before pulling the next
message, including responses. That serialization deadlocked any
request handler that fired an outbound RPC back to the other side and
awaited its response — the response could not be processed by the
drain until the request handler returned, but the request handler was
waiting for the response.

This bit the new `onIncarnationChange` callback the transport invokes
during the handshake: kernel-worker sends `sendRemoteMessage` →
offscreen runs the handler → handshake calls `onIncarnationChange` RPC
back to kernel-worker → kernel-worker returns a verdict → response
arrives at offscreen but its drain is still awaiting the original
request handler. Hung until the 40s URL redemption timeout.

Fix: dispatch request handlers in the background (no await inside the
drain). Responses still process synchronously. Apply on both sides
since either drain can hit the same shape.
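
A minimal sketch of the fixed drain shape. The message and handler types are simplified stand-ins for the platform-services stream, not the real plumbing.

```typescript
// Requests are dispatched without awaiting, so a request handler that
// round-trips an RPC can still see its response drained by this loop.
type Msg = { kind: "request" | "response"; id: number };

async function drain(
  queue: AsyncIterable<Msg>,
  handleRequest: (m: Msg) => Promise<void>,
  handleResponse: (m: Msg) => void,
): Promise<void> {
  for await (const msg of queue) {
    if (msg.kind === "response") {
      handleResponse(msg); // responses settle pending RPCs immediately
    } else {
      // Fire-and-forget: awaiting here would deadlock any handler that
      // itself awaits a response drained by this same loop.
      void handleRequest(msg).catch(() => {});
    }
  }
}
```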

The new `onIncarnationChange` round-trip during every handshake makes
this latent issue actually surface in the e2e — the original
asymmetry (kernel-worker drain rarely blocked, offscreen drain
frequently blocked while sendRemoteMessage was in flight) is what
made it hide for so long.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sirtimid sirtimid force-pushed the sirtimid/fix-issue-944-peer-incarnation-restart branch from 7632fb1 to efde2d7 on May 4, 2026 19:51
@sirtimid sirtimid marked this pull request as ready for review May 4, 2026 19:51
@sirtimid sirtimid requested a review from a team as a code owner May 4, 2026 19:51

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit efde2d7.

Comment thread: packages/ocap-kernel/src/remotes/platform/transport.ts
…hake

`doOutboundHandshake` and `doInboundHandshake` already wrap the wire
handshake in try/catch but left the awaited `onIncarnationChange`
callback unguarded. A throw from the callback (e.g. `persistPeerRestart`
hits corrupt c-list data) would propagate past `do*Handshake`'s
contract: the inbound caller relies on a `false` return to close the
channel before registering it, so an unhandled throw would escape to
the outer `onInboundConnection` catch and only close the channel
incidentally. Treat callback throws as handshake failures instead —
return false / `success: false` and log via `outputError`.

In production the platform-services layer already coerces callback
errors to a `true` verdict, so this is defense in depth for unit
tests, startup races, and any future callback wiring that doesn't
catch internally.
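
The guard described here amounts to something like this sketch. Names and the return convention are illustrative; the real handshake does much more than the verdict check.

```typescript
// Treat a throw from the incarnation callback as a handshake failure:
// the caller closes the channel without registering it, the same as a
// kernel-detected restart, instead of letting the error escape.
async function doInboundHandshake(
  onIncarnationChange: (peerId: string, inc: string) => Promise<boolean>,
  peerId: string,
  incarnation: string,
  outputError: (err: unknown) => void,
): Promise<boolean> {
  let kernelDetectedRestart: boolean;
  try {
    kernelDetectedRestart = await onIncarnationChange(peerId, incarnation);
  } catch (err) {
    outputError(err); // surfaced, not silently swallowed
    return false; // fail closed: do not register the channel
  }
  return !kernelDetectedRestart;
}
```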

Reported by Cursor review on PR #948.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@FUDCo FUDCo left a comment


This looks right to me. (Now, of course, I'm dreading the rebase this will inflict on me, but I think it will be worth it :-) )



Development

Successfully merging this pull request may close these issues.

Restart of one member of a communicating pair but not the other causes remote comms problems

2 participants