Skip to content

[rqd/cuebot] Add a retry mechanism for FrameCompleteReport #2473

Draft
DiegoTavares wants to merge 3 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:double_booking_03
Draft

[rqd/cuebot] Add a retry mechanism for FrameCompleteReport #2473
DiegoTavares wants to merge 3 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:double_booking_03

Conversation

@DiegoTavares

Copy link
Copy Markdown
Collaborator

RQD silently drops FrameCompleteReports

In rust/crates/rqd/src/system/machine.rs:

  1. monitor_running_frames evicts the finished frame from the cache before the report is confirmed
    (line 405-420): the retain() closure returns false for FrameState::Finished, removing it, then
    calls handle_finished_frames.
  2. handle_finished_frames is fire-and-forget (line 573-588): if send_frame_complete_report returns
    Err, it just error!(...) and moves on. No re-queue, no retry on the next monitor cycle, no
    persistence. The frame is already gone from the cache. The completion is lost forever.
  3. The retry layer cannot save it (report/retry/backoff_policy.rs:74-103): it inspects the HTTP
    status and retries only on transport errors or HTTP 500/502/503. But a gRPC application error —
    which is exactly what Cuebot returns when handleFrameCompleteReport throws
    RqdRetryReportException — arrives as HTTP 200 with a grpc-status trailer. The policy sees HTTP
    200 → Outcome::Return → not retried; send_frame_complete_report then surfaces it as Err
    (report_client.rs:226-231) → dropped.

With this a completed frame under a failing Rqd (caused by memory pressure or cpu eviction for
example) could lead to a completed frame not having its state synchronized to Cuebot.

An orphaned frame (RUNNING with no proc, because its proc was removed
while RQD was still rendering) was reset straight to WAITING by the
maintenance reaper. The dispatcher could then rebook it onto a second
host while the original RQD process was still alive -- a double booking.

Redesign the reaper to kill-then-confirm-dead before clearing:

  • Add RqdClient.isFrameRunning(host, frameId), backed by RQD's
    GetRunningFrameStatus. A NOT_FOUND response means the render has been
    reaped (confirmed dead); any other failure is raised as an
    RqdClientException so callers can tell "confirmed gone" from
    "could not reach".

  • clearOrphanedFrames() now kills each orphan on its last-known host and
    polls until death is confirmed. Only then is the frame reset to
    WAITING (safe auto-retry). If death cannot be confirmed -- the per-pass
    kill budget is exhausted or the host is unreachable -- the frame is
    failed (DEAD) so it needs a manual retry instead of risking a rebook.
    Fail closed: a stuck frame is acceptable, a silent double booking is
    not.

  • The whole pass is bounded by maintenance.orphaned_frame_kill_budget_ms
    (default 30s) so a batch of unreachable hosts cannot stall maintenance.

  • getOrphanedFrames() now returns List (mapped via the
    existing FRAME_DETAIL_MAPPER) so the reaper reads the last host from
    lastResource directly, dropping a second per-frame DB lookup.

Also document why FrameCompleteHandler's orphan-finalize path
intentionally skips the layer min-memory bump (it is proc-specific and
there is no proc on that path).

Tests: add MaintenanceManagerOrphanFrameTests (DB-free) covering the
confirmed-dead -> WAITING, host-unreachable -> DEAD, budget-exhausted ->
DEAD, and never-ran -> WAITING branches; extend
FrameDaoTests.testgetOrphanedFrames to assert lastResource is populated.

LLM Usage disclosure

Claude Opus was used to suggest the fix for this issue

RQD silently drops FrameCompleteReports

In rust/crates/rqd/src/system/machine.rs:

1. monitor_running_frames evicts the finished frame from the cache before the report is confirmed
   (line 405-420): the retain() closure returns false for FrameState::Finished, removing it, then
   calls handle_finished_frames.
2. handle_finished_frames is fire-and-forget (line 573-588): if send_frame_complete_report returns
   Err, it just error!(...) and moves on. No requeue, no retry on the next monitor cycle, no
   persistence. The frame is already gone from the cache. The completion is lost forever.
3. The retry layer cannot save it (report/retry/backoff_policy.rs:74-103): it inspects the HTTP
   status and retries only on transport errors or HTTP 500/502/503. But a gRPC application error —
   which is exactly what Cuebot returns when handleFrameCompleteReport throws
   RqdRetryReportException — arrives as HTTP 200 with a grpc-status trailer. The policy sees HTTP
   200 → Outcome::Return → not retried; send_frame_complete_report then surfaces it as Err
   (report_client.rs:226-231) → dropped.

With this a completed frame under a failing rqd (caused by memory pressure or cpu eviction for
exemple) could lead to a completed frame not having its state synchronized to Cuebot.
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1210e87b-a723-4d85-8c56-c5b14a1e341a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

An orphaned frame (RUNNING with no proc, because its proc was removed
while RQD was still rendering) was reset straight to WAITING by the
maintenance reaper. The dispatcher could then rebook it onto a second
host while the original RQD process was still alive -- a double booking.

Redesign the reaper to kill-then-confirm-dead before clearing:

- Add RqdClient.isFrameRunning(host, frameId), backed by RQD's
  GetRunningFrameStatus. A NOT_FOUND response means the render has been
  reaped (confirmed dead); any other failure is raised as an
  RqdClientException so callers can tell "confirmed gone" from
  "could not reach".

- clearOrphanedFrames() now kills each orphan on its last-known host and
  polls until death is confirmed. Only then is the frame reset to
  WAITING (safe auto-retry). If death cannot be confirmed -- the per-pass
  kill budget is exhausted or the host is unreachable -- the frame is
  failed (DEAD) so it needs a manual retry instead of risking a rebook.
  Fail closed: a stuck frame is acceptable, a silent double booking is
  not.

- The whole pass is bounded by maintenance.orphaned_frame_kill_budget_ms
  (default 30s) so a batch of unreachable hosts cannot stall maintenance.

- getOrphanedFrames() now returns List<FrameDetail> (mapped via the
  existing FRAME_DETAIL_MAPPER) so the reaper reads the last host from
  lastResource directly, dropping a second per-frame DB lookup.

Also document why FrameCompleteHandler's orphan-finalize path
intentionally skips the layer min-memory bump (it is proc-specific and
there is no proc on that path).

Tests: add MaintenanceManagerOrphanFrameTests (DB-free) covering the
confirmed-dead -> WAITING, host-unreachable -> DEAD, budget-exhausted ->
DEAD, and never-ran -> WAITING branches; extend
FrameDaoTests.testgetOrphanedFrames to assert lastResource is populated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Diego Tavares <dtavares@imageworks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant