Skip to content

fix(pegboard): persist and replay hibernating requests#4656

Draft
NathanFlurry wants to merge 1 commit intobreak-up/keep-error-exposure-consistentfrom
break-up/persist-hibernating-requests
Draft

fix(pegboard): persist and replay hibernating requests#4656
NathanFlurry wants to merge 1 commit intobreak-up/keep-error-exposure-consistentfrom
break-up/persist-hibernating-requests

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 14, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

Code Review: fix(pegboard): persist and replay hibernating requests

Overview

This PR fixes a correctness bug: in-flight requests to sleeping actors were not persisted to the DB, so they were lost on gateway or workflow restart. The fix:

  1. Persists each hibernating request via upsert at the start of the keepalive task.
  2. Populates CommandStartActor.hibernating_requests in the workflow activities directly, replacing the prior ephemeral guard-level population.
  3. Adds !actor.sleeping guard to prevent sleeping actors from being treated as ready.
  4. Adds .max(1) guards to prevent zero-duration intervals/timeouts.

Correctness

Responsibility shift in send_outbound (actor2/runtime.rs)

The removed comment said the field was left empty because a guard populated it ephemerally before reaching the runner. The diff does not show whether that guard-side code was removed. If both paths remain, hibernating_requests could be double-populated. Please confirm the guard-side path is removed or a no-op.

Idempotency race in insert_and_send_commands

The hibernating request list is fetched before the DB transaction. If the activity is retried, the list may differ (e.g. a request was GCd between retries). The persisted CommandStartActor could carry a stale list. Likely acceptable within the retry window, but worth a brief comment.

send_outbound vs insert_and_send_commands divergence (actor2)

send_outbound fetches and sends hibernating requests but does not persist them to the command key store (that only happens in insert_and_send_commands). If the actor restarts after send_outbound fires but before the command is durably stored, the requests may not be replayed. Is this intentional?


Code Quality

Double clone of commands_for_tx

In both actor/runtime.rs and actor2/runtime.rs, commands is cloned to commands_for_tx outside the closure and then cloned again inside. This is correct for the retry semantics of udb().run() but subtle. A brief comment would help future readers.

Logging level

upsert.rs, delete.rs, and list.rs all log at tracing::info!. These fire on every keepalive and actor start. Consider tracing::debug! to avoid log noise in production, unless these are intentional audit-level events.


Style

Log messages are lowercase with structured fields consistent with CLAUDE.md. Gateway/gateway2 and actor/actor2 runtime changes are kept in parity as required.


Unrelated Change

envoy-client/src/connection.rs adds "rivet_target.envoy" to the WebSocket subprotocol list. This appears unrelated to the main fix. A brief explanation in the PR description (or a separate commit) would clarify intent.


Minor

The .max(1) guards on ping interval and hibernation timeout are good defensive coding. A one-line comment explaining why zero is disallowed (prevents busy-loop) would add clarity.


Summary

The core logic is sound. Key questions before graduating from DRAFT:

  1. Is the guard-side hibernating_requests population definitively removed?
  2. Is the send_outbound / insert_and_send_commands divergence in actor2 intentional?
  3. Confirm or document the acceptable race in activity retry idempotency.
  4. Clarify the unrelated envoy-client protocol change.

Generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the break-up/serialize-gateway-actor-keys branch from 9550912 to 3948546 Compare April 15, 2026 02:40
@NathanFlurry NathanFlurry force-pushed the break-up/persist-hibernating-requests branch from 9191b42 to 88f4613 Compare April 15, 2026 02:40
@NathanFlurry NathanFlurry changed the base branch from break-up/serialize-gateway-actor-keys to graphite-base/4656 April 15, 2026 02:50
@NathanFlurry NathanFlurry force-pushed the break-up/persist-hibernating-requests branch from 88f4613 to ae4fe54 Compare April 15, 2026 02:50
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4656 to break-up/keep-error-exposure-consistent April 15, 2026 02:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant