Skip to content

fix(rivetkit): restore hibernatable sockets and hydrate serverless starts#4658

Draft
NathanFlurry wants to merge 1 commit intobreak-up/expose-hibernation-metadatafrom
break-up/restore-hibernating-sockets
Draft

fix(rivetkit): restore hibernatable sockets and hydrate serverless starts#4658
NathanFlurry wants to merge 1 commit intobreak-up/expose-hibernation-metadatafrom
break-up/restore-hibernating-sockets

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 14, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

PR Review: fix(rivetkit): restore hibernatable sockets and hydrate serverless starts

Status: DRAFT

Overview

This PR fixes two related issues:

  1. Hibernatable socket restoration - WebSocket connections were not properly re-bound after an actor woke from hibernation, leaving sockets in a limbo state.
  2. Serverless start hydration - Serverless start payloads were not enriched with persisted hibernating WebSocket metadata from KV, so the envoy did not know which sockets to restore.

The approach is sound: introduce explicit HibernatableConnectBinding / HibernatableRunnerWebSocketBinding structs, store them in maps keyed by (gatewayId, requestId), and rebind them on actor wake. The test hardening is a significant improvement over fixed-delay polling.

Issues & Observations

Minor Bugs

Inconsistent error field in logger (dynamic actor restore block)

Every other logger call in this file uses error: as the field name. The new block uses err:, which causes inconsistent structured log output. Rename err to error to match all other call sites.

Binding inserted into map before proxyToActorWs is assigned (#bindDynamicHibernatableRunnerWebSocket)

Between the .set() and the assignment of proxyToActorWs there is an await, so the binding is observable in the map without a proxy socket. The onMessage handler guards against this with a null check and closes the socket — but it would silently close a connection unnecessarily if another coroutine runs during this window. Consider restructuring so the binding is only inserted once fully initialized:

this.#hibernatableRunnerWebSocketBindings.set(key, binding);
// ... await runtime.openWebSocket(...) -- async gap here
binding.proxyToActorWs = proxyToActorWs;

Design Observations

Magic 100 ms wait in waitForHibernatableRegistration

The name implies the wait is sufficient for registration, but it is still a fixed delay. On CI under load this could still be racy. Consider whether this can be replaced with a deterministic signal (e.g., polling a count endpoint) as was done in readAfterSleepCycle. If a fixed delay is unavoidable, a brief comment explaining why 100 ms is sufficient would help future readers.

readAfterSleepCycle default minStartCount encodes a hidden invariant

const minStartCount = options?.minStartCount ?? minSleepCount + 1;

This embeds the invariant that startCount == sleepCount + 1 as a hidden default, then the same invariant is checked again in assertions. A one-line comment documenting the relationship ("each sleep is followed by exactly one wake/start") would help future readers understand why this default is correct.

O(n) scan in #rebindHibernatableConnectSockets / #rebindDynamicHibernatableRunnerWebSockets

Both methods iterate the entire flat map and filter by actorId. For typical actor/connection counts this is fine, but a nested Map<actorId, Map<key, Binding>> would make the per-actor lookup O(1). Not a blocker.

Binding cleanup on forced shutdown

In #disposeAllDynamicRuntimes and the new forced-actor-stop loop in the shutdown path, #hibernatableConnectBindings and #hibernatableRunnerWebSocketBindings are not explicitly cleared. After process shutdown this is harmless, but a clear() call at the end of the shutdown sequence would make the intent explicit.

Test Coverage

The removal of intermediate-state assertions in the raw WebSocket sleep tests (e.g., verifying messageStarted=1, messageFinished=0 before the final sleep) is a reasonable trade-off for fewer flaky tests. The final-state assertions still cover the meaningful outcomes.

The gateway routing URL changes (/api/hello to /request/api/hello) look like they reflect an actual router path change. Confirm the corresponding source change is in this PR or a linked one, since updating tests without the underlying source change could mask a regression.

Positive

  • readAfterSleepCycle with retry logic is a real improvement over single-shot fixed-delay polling and will make the sleep test suite significantly more reliable on slow CI runners.
  • Making hibernation tests .sequential() is the right call — these tests share actor state and must not run concurrently.
  • #disposeDynamicRuntime correctly uses finally to ensure the map entry is removed even if dispose() throws.
  • The serverless hydration path correctly short-circuits when nothing changed, avoiding unnecessary re-encoding.
  • Force-stopping remaining actors on shutdown timeout prevents the process from hanging indefinitely.

@NathanFlurry NathanFlurry force-pushed the break-up/expose-hibernation-metadata branch from 26f98bc to fde1e0b Compare April 15, 2026 02:40
@NathanFlurry NathanFlurry force-pushed the break-up/restore-hibernating-sockets branch from ccc38b5 to 8293235 Compare April 15, 2026 02:40
@NathanFlurry NathanFlurry force-pushed the break-up/expose-hibernation-metadata branch from fde1e0b to 789b9cd Compare April 15, 2026 02:50
@NathanFlurry NathanFlurry force-pushed the break-up/restore-hibernating-sockets branch from 8293235 to 1ba6a3b Compare April 15, 2026 02:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant