[rqd/cuebot] Add a retry mechanism for FrameCompleteReport #2473
Draft
DiegoTavares wants to merge 3 commits into
Draft
[rqd/cuebot] Add a retry mechanism for FrameCompleteReport #2473DiegoTavares wants to merge 3 commits into
DiegoTavares wants to merge 3 commits into
Conversation
RQD silently drops FrameCompleteReports In rust/crates/rqd/src/system/machine.rs: 1. monitor_running_frames evicts the finished frame from the cache before the report is confirmed (line 405-420): the retain() closure returns false for FrameState::Finished, removing it, then calls handle_finished_frames. 2. handle_finished_frames is fire-and-forget (line 573-588): if send_frame_complete_report returns Err, it just error!(...) and moves on. No requeue, no retry on the next monitor cycle, no persistence. The frame is already gone from the cache. The completion is lost forever. 3. The retry layer cannot save it (report/retry/backoff_policy.rs:74-103): it inspects the HTTP status and retries only on transport errors or HTTP 500/502/503. But a gRPC application error — which is exactly what Cuebot returns when handleFrameCompleteReport throws RqdRetryReportException — arrives as HTTP 200 with a grpc-status trailer. The policy sees HTTP 200 → Outcome::Return → not retried; send_frame_complete_report then surfaces it as Err (report_client.rs:226-231) → dropped. With this a completed frame under a failing rqd (caused by memory pressure or cpu eviction for exemple) could lead to a completed frame not having its state synchronized to Cuebot.
Contributor
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
An orphaned frame (RUNNING with no proc, because its proc was removed while RQD was still rendering) was reset straight to WAITING by the maintenance reaper. The dispatcher could then rebook it onto a second host while the original RQD process was still alive -- a double booking. Redesign the reaper to kill-then-confirm-dead before clearing: - Add RqdClient.isFrameRunning(host, frameId), backed by RQD's GetRunningFrameStatus. A NOT_FOUND response means the render has been reaped (confirmed dead); any other failure is raised as an RqdClientException so callers can tell "confirmed gone" from "could not reach". - clearOrphanedFrames() now kills each orphan on its last-known host and polls until death is confirmed. Only then is the frame reset to WAITING (safe auto-retry). If death cannot be confirmed -- the per-pass kill budget is exhausted or the host is unreachable -- the frame is failed (DEAD) so it needs a manual retry instead of risking a rebook. Fail closed: a stuck frame is acceptable, a silent double booking is not. - The whole pass is bounded by maintenance.orphaned_frame_kill_budget_ms (default 30s) so a batch of unreachable hosts cannot stall maintenance. - getOrphanedFrames() now returns List<FrameDetail> (mapped via the existing FRAME_DETAIL_MAPPER) so the reaper reads the last host from lastResource directly, dropping a second per-frame DB lookup. Also document why FrameCompleteHandler's orphan-finalize path intentionally skips the layer min-memory bump (it is proc-specific and there is no proc on that path). Tests: add MaintenanceManagerOrphanFrameTests (DB-free) covering the confirmed-dead -> WAITING, host-unreachable -> DEAD, budget-exhausted -> DEAD, and never-ran -> WAITING branches; extend FrameDaoTests.testgetOrphanedFrames to assert lastResource is populated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
abeec3f to
bd98c03
Compare
Signed-off-by: Diego Tavares <dtavares@imageworks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
RQD silently drops FrameCompleteReports
In rust/crates/rqd/src/system/machine.rs:
(line 405-420): the retain() closure returns false for FrameState::Finished, removing it, then
calls handle_finished_frames.
Err, it just error!(...) and moves on. No re-queue, no retry on the next monitor cycle, no
persistence. The frame is already gone from the cache. The completion is lost forever.
status and retries only on transport errors or HTTP 500/502/503. But a gRPC application error —
which is exactly what Cuebot returns when handleFrameCompleteReport throws
RqdRetryReportException — arrives as HTTP 200 with a grpc-status trailer. The policy sees HTTP
200 → Outcome::Return → not retried; send_frame_complete_report then surfaces it as Err
(report_client.rs:226-231) → dropped.
With this a completed frame under a failing Rqd (caused by memory pressure or cpu eviction for
example) could lead to a completed frame not having its state synchronized to Cuebot.
An orphaned frame (RUNNING with no proc, because its proc was removed
while RQD was still rendering) was reset straight to WAITING by the
maintenance reaper. The dispatcher could then rebook it onto a second
host while the original RQD process was still alive -- a double booking.
Redesign the reaper to kill-then-confirm-dead before clearing:
Add RqdClient.isFrameRunning(host, frameId), backed by RQD's
GetRunningFrameStatus. A NOT_FOUND response means the render has been
reaped (confirmed dead); any other failure is raised as an
RqdClientException so callers can tell "confirmed gone" from
"could not reach".
clearOrphanedFrames() now kills each orphan on its last-known host and
polls until death is confirmed. Only then is the frame reset to
WAITING (safe auto-retry). If death cannot be confirmed -- the per-pass
kill budget is exhausted or the host is unreachable -- the frame is
failed (DEAD) so it needs a manual retry instead of risking a rebook.
Fail closed: a stuck frame is acceptable, a silent double booking is
not.
The whole pass is bounded by maintenance.orphaned_frame_kill_budget_ms
(default 30s) so a batch of unreachable hosts cannot stall maintenance.
getOrphanedFrames() now returns List (mapped via the
existing FRAME_DETAIL_MAPPER) so the reaper reads the last host from
lastResource directly, dropping a second per-frame DB lookup.
Also document why FrameCompleteHandler's orphan-finalize path
intentionally skips the layer min-memory bump (it is proc-specific and
there is no proc on that path).
Tests: add MaintenanceManagerOrphanFrameTests (DB-free) covering the
confirmed-dead -> WAITING, host-unreachable -> DEAD, budget-exhausted ->
DEAD, and never-ran -> WAITING branches; extend
FrameDaoTests.testgetOrphanedFrames to assert lastResource is populated.
LLM Usage disclosure
Claude Opus was used to suggest the fix for this issue