fix(ocap-kernel): peer-incarnation restart detection across receiver state loss#948
Open
fix(ocap-kernel): peer-incarnation restart detection across receiver state loss#948
Conversation
Contributor
Coverage Report
File Coverage |
Adds an e2e regression test under Incarnation Detection that exercises the case where a receiver's in-memory PeerStateManager has lost the peer's last-seen incarnationId (here via receiver restart) while the RemoteHandle's persisted #highestReceivedSeq survives. With the same peerId but a fresh incarnationId on the sender, the handshake mis-classifies the reconnect as a first connection, no handlePeerRestart fires, and the sender's seq=1 messages are silently dropped — matching the symptom in issue #944. The test currently fails with a URL redemption timeout and will pass once the dedup-vs-incarnation state lifetimes are aligned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eceiver state loss Resolves issue #944. When a remote peer restarted with the same peerId but a fresh incarnation (e.g. plugin kernel reload), the receiving kernel could silently drop the peer's seq=1 messages because its in-memory PeerStateManager had no record of the prior incarnation to compare against — but its persisted RemoteHandle still held a non-zero #highestReceivedSeq from before. The dedup state outlived the restart-detection state, so the handshake mis-classified a real restart as a first connection. Changes: - Persist the peer's last-observed incarnationId in a new `peerIncarnation.{peerId}` KV namespace. Aligns its lifetime with the seq dedup state it gates. - OnIncarnationChange now fires on every successful handshake, carrying the observed incarnation, and returns a boolean telling the transport whether the kernel detected a real restart. The receiver's PSM check is no longer the authority — the persisted comparison is. - RemoteManager#handleIncarnationChange compares observed against the persisted value inside a savepoint: on mismatch with a defined prior value it calls handlePeerRestart, rejects promises the peer was deciding, and persists the new incarnation atomically. - handlePeerRestart additionally clears the peer's "+"-direction c-list entries (the peer's exports — dead after restart) via the new `forgetEndpointImports` helper. Without this, fresh incarnations' reused eref labels would resolve to already-resolved kernel promises from the prior incarnation. - Transport closes (rather than registers) the freshly-dialed channel on outbound restart detection, so concurrent in-flight retransmits can't reuse it to write pre-restart payloads. - RemoteHandle#retransmitPending is sequential so a peer-restart detection during the first send short-circuits the rest before any stale message hits the wire. - Browser RPC plumbing forwards the observed incarnation across the platform-services boundary and surfaces the kernel's restart verdict back to the transport. The regression test added in the prior commit now passes end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ackTimeoutMs gates the URL redemption timeout (ackTimeoutMs × (MAX_RETRIES + 1)), so the previous 500ms turned the 2s redemption budget into a flaky one for fresh K2's libp2p dial + handshake under CI load. Bump ackTimeoutMs to 2s and widen the reconnect/handshake/write timeouts proportionally; the test still finishes in ~5–6s but passes reliably across repeated runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #944 follow-up. The first cut of forgetEndpointImports lived in clist.ts and only invalidated c-list lookups: deleteCListEntry's recognizable-only decrement leaves object-export refcounts and owner entries intact, so kernel objects exported by the peer could never be GC'd, and promise deciders weren't released. Move the helper to vat.ts (next to cleanupTerminatedVat, where the analogous bookkeeping for terminated vats lives) and mirror its export/promise legs: - Object exports: clear the owner key (peer was owning the object), decrement the baseline refcount, queue for GC, then tear down the c-list pair. - Promise exports: tear down the c-list pair, then drop the decider refcount if the peer was still recorded as decider. Imports stay scoped to the peer's "+"-direction entries so our exports to the peer (alice's root, etc.) keep working when a fresh incarnation reconnects with the same peer ID. Caller contract is unchanged: handlePeerRestart already rejects promises the peer was deciding before invoking this; this function takes care of the bookkeeping that termination would otherwise leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds vat.test.ts coverage for the helper added in the prior commit. Locks in the bookkeeping the peer-restart cleanup is responsible for that the e2e regression test would only catch as a slow leak: - Promise exports decrement the decider refcount only while the peer is still recorded as decider (nothing extra otherwise). - Object exports drop the owner mapping, tear down the c-list pair directly, decrement the baseline refcount, and add the kref to maybeFreeKrefs. - Object exports whose owner has migrated to another endpoint keep that owner intact while still releasing the peer's c-list pair. - Import-direction entries (our exports to the peer) are preserved unchanged so fresh incarnations can keep using them. - A mixed batch of entries is processed in one pass without disturbing the imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tightens the post-review state of the issue #944 fix: - OnIncarnationChange: drop the boolean | Promise<boolean> union and the now-stale "may return synchronously" hint; the type is always async. Make RemoteManager.#handleIncarnationChange async to match. - transport.ts: introduce PeerRestartedError sentinel and skip handleConnectionLoss for it on the outbound throw path. The peer is reachable (handshake just succeeded) and handleConnectionLoss would clobber an inbound channel a concurrent handshake just registered. - RemoteManager.ts: drop the production-source "See issue #944." reference; the issue link belongs in commit history, not the source. Tests: - transport.test.ts: cover the close-then-throw-without-register path on outbound restart-detected, plus the OR-with-PSM case. - RemoteHandle.test.ts: spy-based fake-timer harness (vi.useFakeTimers is incompatible with SES) for #retransmitPending — sequential await, terminal-error abort + giveUp, transient continue. - RemoteManager.test.ts: savepoint rollback when handlePeerRestart throws, plus updated verdict semantics (true even when no remote handle exists — it's the persisted-state comparison, not the reset outcome, that the transport cares about). - PlatformServicesServer.test.ts and PlatformServicesClient.test.ts: cover the new remoteIncarnationChange RPC plumbing — forwards args, returns kernel verdict, fails closed on RPC error, coerces non-boolean / throwing handlers to true. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Error Adds PeerRestartedError, IntentionalCloseError, and NetworkStoppedError as BaseError-backed sentinels with full marshal/unmarshal support, plus the `isTerminalSendError` predicate that the kernel-side RemoteHandle uses to discriminate retry-worthy from terminal verdicts thrown by the transport. These were previously ad-hoc plain Error throws scattered across the transport, with the predicate bound to brittle string matching on the error message. Centralizing them removes a class of drift hazards: the TERMINAL_NAMES set is derived from the class statics, so adding a new terminal class without registering it is a one-line obvious omission rather than a needle-in-a-haystack drift. Detection still uses the `name` field rather than `instanceof` because errors lose class identity when serialized via the platform-services JSON-RPC boundary; the `name` is preserved. Tests: 38 new tests across the three error classes (8 each) and the predicate (14) — covering canonical message/code, name preservation, cause/stack options, valid unmarshal, and three failure cases per class plus full coverage of the predicate's positive and negative matches including non-Error values. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path Follow-up review of the issue #944 fix surfaced four real defects, each fixed here: 1. RemoteHandle.#sendRemoteCommand swallowed PeerRestartedError and NetworkStoppedError. Only `intentional close` triggered the cleanup path; the other two terminal verdicts fell through to logger.error with the message persisted to remotePending and #nextSendSeq advanced, leaving it pending until ACK timeout × MAX_RETRIES. The catch now routes all terminal errors through the shared isTerminalSendError predicate. 2. RemoteManager.#handleIncarnationChange enqueued kernelQueue notifications and ran in-memory mutations inside the savepoint window. A kv rollback would not undo either, leaving the kernel with run-queue entries against promises the persisted store still considers unresolved. Split RemoteHandle.handlePeerRestart into `persistPeerRestart` (kv-only, savepoint-safe) and `finalizePeerRestart` (in-memory + redemption rejections); the manager now snapshots `getPromisesByDecider` before any kv mutation, runs `persistPeerRestart` inside the savepoint, and defers `finalizePeerRestart` and `resolvePromises` to after release. 3. doInboundHandshake silently ignored kernelDetectedRestart. The outbound path closes the freshly-dialed channel and throws PeerRestartedError on detected restart; the inbound path only logged. A concurrent in-flight outbound send could then write a pre-restart payload over a fresh-incarnation channel. Inbound now returns false when the kernel detects a restart, and the caller closes the channel without registering it — symmetric with the outbound guard. 4. deleteRemoteInfo left peerIncarnation.{peerId} stranded. A re-established remote with the same peerId would mis-classify its first handshake as a restart against the leftover incarnation. Now reads the info before deletion and clears the peerIncarnation row; corrupt JSON is logged and swallowed so the rest of the cleanup still runs. Other tightening: - #rejectAllPending is a no-op when nothing is pending, so the catch path is safe even when the kernel-side restart cleanup ran upstream. - #handleIncarnationChange logs a warning when the restart verdict fires but no live RemoteHandle exists (boot race / peer teardown scoping issue), instead of silently advancing the persisted incarnation. - Sentinel error matching uses the centralized `isTerminalSendError` imported from @metamask/kernel-errors (this commit's prerequisite), removing the brittle string-match in the predicate. - Outbound PeerRestartedError now logs at info severity, not error — it is an expected, recoverable event. Tests: - New: PeerRestartedError/IntentionalCloseError/NetworkStoppedError from initial #sendRemoteCommand reject pending and fire onGiveUp. - New: handlePeerRestart actually clears c-list state (seeds an exportFromEndpoint pair, asserts both halves gone after restart). - New: PeerRestartedError does not trigger reconnection from the outbound catch path. - New: inbound handshake closes the channel without registration when the kernel detects a restart. - New: deleteRemoteInfo clears the peer-incarnation row. - Updated: RemoteManager savepoint-rollback test now spies on persistPeerRestart and asserts finalizePeerRestart is NOT called when the persisted phase throws. - Updated: #handleIncarnationChange verdict tests assert false on first observation and stable incarnation (not just true on restart). E2E: extension/test/e2e/remote-comms.test.ts had its toBeVisible budget raised to 50s (suite to 90s) to fit inside the 40s URL redemption budget that the new awaited handshake RPC pushes the round-trip into under CI load. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eadlock The drain loop awaited each request handler before pulling the next message, including responses. That serialization deadlocked any request handler that fired an outbound RPC back to the other side and awaited its response — the response could not be processed by the drain until the request handler returned, but the request handler was waiting for the response. This bit the new `onIncarnationChange` callback the transport invokes during the handshake: kernel-worker sends `sendRemoteMessage` → offscreen runs the handler → handshake calls `onIncarnationChange` RPC back to kernel-worker → kernel-worker returns a verdict → response arrives at offscreen but its drain is still awaiting the original request handler. Hung until the 40s URL redemption timeout. Fix: dispatch request handlers in the background (no await inside the drain). Responses still process synchronously. Apply on both sides since either drain can hit the same shape. The new `onIncarnationChange` round-trip during every handshake makes this latent issue actually surface in the e2e — the original asymmetry (kernel-worker drain rarely blocked, offscreen drain frequently blocked while sendRemoteMessage was in flight) is what made it hide for so long. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7632fb1 to
efde2d7
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit efde2d7. Configure here.
…hake `doOutboundHandshake` and `doInboundHandshake` already wrap the wire handshake in try/catch but left the awaited `onIncarnationChange` callback unguarded. A throw from the callback (e.g. `persistPeerRestart` hits corrupt c-list data) would propagate past `do*Handshake`'s contract: the inbound caller relies on a `false` return to close the channel before registering it, so an unhandled throw would escape to the outer `onInboundConnection` catch and only close the channel incidentally. Treat callback throws as handshake failures instead — return false / `success: false` and log via `outputError`. In production the platform-services layer already coerces callback errors to a `true` verdict, so this is defense in depth for unit tests, startup races, and any future callback wiring that doesn't catch internally. Reported by Cursor review on PR #948. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FUDCo
approved these changes
May 6, 2026
Contributor
FUDCo
left a comment
There was a problem hiding this comment.
This looks right to me. (Now, of course, I'm dreading the rebase this will inflict on me, but I think it will be worth it :-) )
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

What
Closes #944: a remote peer restarting with a fresh incarnation (e.g. plugin kernel reload) caused the receiving kernel to silently drop the new connection's seq=1 messages.
Why
The receiver's persisted
RemoteHandle.#highestReceivedSeqoutlived the in-memory PeerStateManager that was supposed to detect the restart. The handshake then mis-classified a real restart as a first connection, and dedup ate the messages.How
peerIncarnation.{peerId}KV namespace, so the comparison survives receiver state loss.onIncarnationChangeon every successful handshake; the kernel compares observed against persisted and returns whether a restart was detected. The transport uses that verdict to close the channel and re-dial, symmetric on both inbound and outbound paths.forgetEndpointImports) and reject promises it was deciding, atomically with the persisted incarnation update via a savepoint. In-memory state changes and run-queue notifications are deferred until after commit so a kv rollback can't leave the views inconsistent.PeerRestartedError,IntentionalCloseError,NetworkStoppedErrorto@metamask/kernel-errors(withisTerminalSendError) so the kernel-side abort-retransmit predicate stops relying on string matching.Test plan
yarn workspace @metamask/kernel-errors test:dev:quiet,yarn workspace @metamask/ocap-kernel test:dev:quiet,kernel-browser-runtime,kernel-node-runtimeyarn build,yarn workspace @metamask/ocap-kernel lint:fixpackages/extension/test/e2e/remote-comms.test.ts(timeout budget raised to fit the 40s URL redemption ceiling)packages/kernel-node-runtime/test/e2e/remote-comms.test.tsRestart of one member of a communicating pair but not the other causes remote comms problems #944 regression🤖 Generated with Claude Code
Note
Medium Risk
Touches remote-communications handshake, kernel persistence, and retransmit logic; mistakes could cause dropped messages, excessive reconnects, or incorrect promise rejection. Mitigated by extensive new unit/e2e coverage and fail-closed defaults.
Overview
Fixes a remote-comms restart edge case by persisting the last-seen peer incarnation and consulting it on every handshake, allowing restart detection even after receiver-side in-memory state loss.
Updates the transport and platform-services RPC plumbing so
onIncarnationChange(peerId, observedIncarnation)returns a boolean verdict; the transport uses this verdict to drop/close channels and suppress potentially stale outbound messages (with explicit fail-closed behavior on RPC/handler errors).Hardens kernel restart handling by splitting peer-restart cleanup into persisted vs in-memory phases (savepoint-protected), adding
forgetEndpointImportsc-list teardown, improving retransmit to send sequentially and abort on terminal send errors, and introducing new sentinel errors (PeerRestartedError,IntentionalCloseError,NetworkStoppedError) plusisTerminalSendError. E2E timeouts/tests are adjusted/added to cover the regression scenario.Reviewed by Cursor Bugbot for commit 6aae9ea. Bugbot is set up for automated code reviews on this repo. Configure here.