[rqd/cuebot] Add a retry mechanism for FrameCompleteReport by DiegoTavares · Pull Request #2473 · AcademySoftwareFoundation/OpenCue

DiegoTavares · 2026-06-29T20:17:37Z

RQD silently drops FrameCompleteReports

In rust/crates/rqd/src/system/machine.rs:

monitor_running_frames evicts the finished frame from the cache before the report is confirmed
(line 405-420): the retain() closure returns false for FrameState::Finished, removing it, then
calls handle_finished_frames.
handle_finished_frames is fire-and-forget (line 573-588): if send_frame_complete_report returns
Err, it just error!(...) and moves on. No re-queue, no retry on the next monitor cycle, no
persistence. The frame is already gone from the cache. The completion is lost forever.
The retry layer cannot save it (report/retry/backoff_policy.rs:74-103): it inspects the HTTP
status and retries only on transport errors or HTTP 500/502/503. But a gRPC application error —
which is exactly what Cuebot returns when handleFrameCompleteReport throws
RqdRetryReportException — arrives as HTTP 200 with a grpc-status trailer. The policy sees HTTP
200 → Outcome::Return → not retried; send_frame_complete_report then surfaces it as Err
(report_client.rs:226-231) → dropped.

With this a completed frame under a failing Rqd (caused by memory pressure or cpu eviction for
example) could lead to a completed frame not having its state synchronized to Cuebot.

An orphaned frame (RUNNING with no proc, because its proc was removed
while RQD was still rendering) was reset straight to WAITING by the
maintenance reaper. The dispatcher could then rebook it onto a second
host while the original RQD process was still alive -- a double booking.

Redesign the reaper to kill-then-confirm-dead before clearing:

Add RqdClient.isFrameRunning(host, frameId), backed by RQD's
GetRunningFrameStatus. A NOT_FOUND response means the render has been
reaped (confirmed dead); any other failure is raised as an
RqdClientException so callers can tell "confirmed gone" from
"could not reach".
clearOrphanedFrames() now kills each orphan on its last-known host and
polls until death is confirmed. Only then is the frame reset to
WAITING (safe auto-retry). If death cannot be confirmed -- the per-pass
kill budget is exhausted or the host is unreachable -- the frame is
failed (DEAD) so it needs a manual retry instead of risking a rebook.
Fail closed: a stuck frame is acceptable, a silent double booking is
not.
The whole pass is bounded by maintenance.orphaned_frame_kill_budget_ms
(default 30s) so a batch of unreachable hosts cannot stall maintenance.
getOrphanedFrames() now returns List (mapped via the
existing FRAME_DETAIL_MAPPER) so the reaper reads the last host from
lastResource directly, dropping a second per-frame DB lookup.

Also document why FrameCompleteHandler's orphan-finalize path
intentionally skips the layer min-memory bump (it is proc-specific and
there is no proc on that path).

Tests: add MaintenanceManagerOrphanFrameTests (DB-free) covering the
confirmed-dead -> WAITING, host-unreachable -> DEAD, budget-exhausted ->
DEAD, and never-ran -> WAITING branches; extend
FrameDaoTests.testgetOrphanedFrames to assert lastResource is populated.

LLM Usage disclosure

Claude Opus was used to suggest the fix for this issue

RQD silently drops FrameCompleteReports In rust/crates/rqd/src/system/machine.rs: 1. monitor_running_frames evicts the finished frame from the cache before the report is confirmed (line 405-420): the retain() closure returns false for FrameState::Finished, removing it, then calls handle_finished_frames. 2. handle_finished_frames is fire-and-forget (line 573-588): if send_frame_complete_report returns Err, it just error!(...) and moves on. No requeue, no retry on the next monitor cycle, no persistence. The frame is already gone from the cache. The completion is lost forever. 3. The retry layer cannot save it (report/retry/backoff_policy.rs:74-103): it inspects the HTTP status and retries only on transport errors or HTTP 500/502/503. But a gRPC application error — which is exactly what Cuebot returns when handleFrameCompleteReport throws RqdRetryReportException — arrives as HTTP 200 with a grpc-status trailer. The policy sees HTTP 200 → Outcome::Return → not retried; send_frame_complete_report then surfaces it as Err (report_client.rs:226-231) → dropped. With this a completed frame under a failing rqd (caused by memory pressure or cpu eviction for exemple) could lead to a completed frame not having its state synchronized to Cuebot.

coderabbitai · 2026-06-29T20:17:44Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1210e87b-a723-4d85-8c56-c5b14a1e341a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

An orphaned frame (RUNNING with no proc, because its proc was removed while RQD was still rendering) was reset straight to WAITING by the maintenance reaper. The dispatcher could then rebook it onto a second host while the original RQD process was still alive -- a double booking. Redesign the reaper to kill-then-confirm-dead before clearing: - Add RqdClient.isFrameRunning(host, frameId), backed by RQD's GetRunningFrameStatus. A NOT_FOUND response means the render has been reaped (confirmed dead); any other failure is raised as an RqdClientException so callers can tell "confirmed gone" from "could not reach". - clearOrphanedFrames() now kills each orphan on its last-known host and polls until death is confirmed. Only then is the frame reset to WAITING (safe auto-retry). If death cannot be confirmed -- the per-pass kill budget is exhausted or the host is unreachable -- the frame is failed (DEAD) so it needs a manual retry instead of risking a rebook. Fail closed: a stuck frame is acceptable, a silent double booking is not. - The whole pass is bounded by maintenance.orphaned_frame_kill_budget_ms (default 30s) so a batch of unreachable hosts cannot stall maintenance. - getOrphanedFrames() now returns List<FrameDetail> (mapped via the existing FRAME_DETAIL_MAPPER) so the reaper reads the last host from lastResource directly, dropping a second per-frame DB lookup. Also document why FrameCompleteHandler's orphan-finalize path intentionally skips the layer min-memory bump (it is proc-specific and there is no proc on that path). Tests: add MaintenanceManagerOrphanFrameTests (DB-free) covering the confirmed-dead -> WAITING, host-unreachable -> DEAD, budget-exhausted -> DEAD, and never-ran -> WAITING branches; extend FrameDaoTests.testgetOrphanedFrames to assert lastResource is populated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Diego Tavares <dtavares@imageworks.com>

DiegoTavares force-pushed the double_booking_03 branch from abeec3f to bd98c03 Compare June 30, 2026 22:37

Merge branch 'master' into double_booking_03

1aa4fce

Signed-off-by: Diego Tavares <dtavares@imageworks.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[rqd/cuebot] Add a retry mechanism for FrameCompleteReport #2473

[rqd/cuebot] Add a retry mechanism for FrameCompleteReport #2473
DiegoTavares wants to merge 3 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:double_booking_03

DiegoTavares commented Jun 29, 2026

Uh oh!

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

DiegoTavares commented Jun 29, 2026

LLM Usage disclosure

Uh oh!

coderabbitai Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading