improvement(scheduler): drain due schedules in chunks instead of a single capped claim #4578
Conversation
Replaces the fixed `MAX_CRON_CLAIMS` (200) with a chunked drain loop: claim `WORKFLOW_CHUNK_SIZE` + `JOB_CHUNK_SIZE` per iteration, process via `Promise.allSettled`, repeat until both claim queries return empty or `MAX_TICK_DURATION_MS` elapses. Throughput is no longer bounded by a static per-tick ceiling; it scales until the DB or trigger.dev is the limit. The per-iteration chunk size still bounds the row-lock set and fan-out concurrency. Extracts `processScheduleItem` and `processJobItem` so the loop body stays readable. Existing claim semantics (`FOR UPDATE SKIP LOCKED`, `lastQueuedAt` as the claim signal, staleness reclaim) are unchanged.
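The drain loop described above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: the chunk values, the `ClaimFn` shape, and the processor signatures are all assumptions for illustration.

```typescript
// Sketch of the chunked drain loop. Claim functions and processors are
// injected so the loop itself stays free of DB details.
type ClaimFn<T> = (limit: number) => Promise<T[]>

// Illustrative values; the PR's real constants may differ.
const WORKFLOW_CHUNK_SIZE = 50
const JOB_CHUNK_SIZE = 50
const MAX_TICK_DURATION_MS = 3 * 60 * 1000

async function drainDueSchedules<W, J>(
  claimWorkflows: ClaimFn<W>,
  claimJobs: ClaimFn<J>,
  processWorkflow: (item: W) => Promise<void>,
  processJob: (item: J) => Promise<void>,
): Promise<number> {
  const tickStart = Date.now()
  let executedCount = 0

  while (Date.now() - tickStart < MAX_TICK_DURATION_MS) {
    // Claim one chunk from each queue; the per-iteration chunk size is
    // what bounds the row-lock set and fan-out concurrency.
    const [workflows, jobs] = await Promise.all([
      claimWorkflows(WORKFLOW_CHUNK_SIZE),
      claimJobs(JOB_CHUNK_SIZE),
    ])

    if (workflows.length === 0 && jobs.length === 0) break

    // One failed item must not abort the rest of the chunk.
    await Promise.allSettled([
      ...workflows.map(processWorkflow),
      ...jobs.map(processJob),
    ])
    executedCount += workflows.length + jobs.length
  }
  return executedCount
}
```

Successive cron invocations can safely run this concurrently because the claim queries use `FOR UPDATE SKIP LOCKED`, so two ticks never claim the same row.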
PR Summary (Cursor Bugbot): Medium Risk. Overview: refactors per-item handling into `processScheduleItem` and `processJobItem`. Reviewed by Cursor Bugbot for commit 1fb247b.
Greptile Summary: This PR replaces the previous fixed `MAX_CRON_CLAIMS` (200) cap with a chunked drain loop.
Confidence Score: 4/5. Safe to merge: the drain loop correctly replaces the static cap, claim semantics are unchanged, and concurrent invocations remain safe via `SKIP LOCKED`. The refactoring is logically sound and all existing tests pass. The two flagged items are optimization opportunities rather than correctness problems: the loop issues unnecessary DB calls once one queue is exhausted, and the `workflowUtils` non-null assertion relies on an implicit invariant that holds today but is not enforced by the type system. Important Files Changed: `apps/sim/app/api/schedules/execute/route.ts`, specifically the while-loop claim strategy and the `workflowUtils!` assertion.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[GET /api/schedules/execute] --> B[verifyCronAuth]
B -->|authError| Z1[Return 401]
B -->|ok| C[getJobQueue]
C --> D[tickStart = Date.now]
D --> E{Date.now - tickStart < MAX_TICK_DURATION_MS 3 min?}
E -->|No / timeout| LOG[Log summary & return 200]
E -->|Yes| F[Promise.all: claimWorkflowSchedules + claimJobSchedules]
F --> G{both empty?}
G -->|Yes| LOG
G -->|No| H[Lazy-load workflowUtils if needed]
H --> I[Promise.allSettled: processScheduleItem x N + processJobItem x M]
I --> E
subgraph processScheduleItem
S1[buildScheduleExecutionJobId] --> S2{existingJob pending/processing?}
S2 -->|Yes| S_skip[return early]
S2 -->|stale finished| S_rel[releaseScheduleLock → return]
S2 -->|null| S3[getWorkflowById → enqueue job]
S3 --> S4{just-enqueued job already finished?}
S4 -->|Yes| S_rel2[releaseScheduleLock → return]
S4 -->|No| S5{shouldExecuteInline?}
S5 -->|Yes| S6[startJob → executeScheduleJob → completeJob]
S5 -->|No| S_done[return]
S6 -->|error| S7[markJobFailed + releaseScheduleLock]
end
subgraph processJobItem
J1[build payload] --> J2[executeJobInline]
J2 -->|error| J3[releaseScheduleLock]
end
```
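The `processScheduleItem` branch of the flowchart can be sketched as plain control flow. The helper names follow the flowchart, but every signature here (the `Deps` shape, the job-id format) is an assumption for illustration, not the repo's actual API.

```typescript
interface JobRecord {
  status: 'pending' | 'processing' | 'finished'
}

// Injected dependencies so the sketch is self-contained and testable.
interface Deps {
  findJob(jobId: string): Promise<JobRecord | null>
  enqueueJob(jobId: string): Promise<JobRecord>
  markJobFailed(jobId: string): Promise<void>
  releaseScheduleLock(scheduleId: string): Promise<void>
  shouldExecuteInline(): boolean
  executeInline(jobId: string): Promise<void>
}

async function processScheduleItem(scheduleId: string, deps: Deps): Promise<void> {
  // Stand-in for buildScheduleExecutionJobId; the real derivation is in the PR.
  const jobId = `schedule-exec:${scheduleId}`

  const existing = await deps.findJob(jobId)
  if (existing) {
    if (existing.status === 'pending' || existing.status === 'processing') {
      return // another invocation already owns this run
    }
    // Stale finished job under our claim: release so a later tick can re-claim.
    await deps.releaseScheduleLock(scheduleId)
    return
  }

  const job = await deps.enqueueJob(jobId)
  if (job.status === 'finished') {
    // Raced with a concurrent worker that already completed the run.
    await deps.releaseScheduleLock(scheduleId)
    return
  }

  if (!deps.shouldExecuteInline()) return
  try {
    await deps.executeInline(jobId) // startJob → executeScheduleJob → completeJob
  } catch {
    await deps.markJobFailed(jobId)
    await deps.releaseScheduleLock(scheduleId)
  }
}
```

The early-return on a pending/processing job is what makes overlapping ticks idempotent: the deterministic job id doubles as a dedup key.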
Reviews (1): Last reviewed commit: "improvement(scheduler): drain in chunks ..."
… workflowUtils non-null assertion. Addresses Greptile review on PR #4578:

- Track per-queue exhaustion when a claim returns fewer than `CHUNK_SIZE` rows; subsequent iterations skip the claim query for that queue. Saves one DB round-trip per iteration once one queue drains while the other is still working.
- Narrow `workflowUtils` to a local const inside the loop body so the schedule-processing branch only runs when the import has completed. Removes the misleading non-null assertion.
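The per-queue exhaustion tracking can be sketched like this. Names and the chunk value are illustrative, and the real loop also enforces the `MAX_TICK_DURATION_MS` budget, which is omitted here:

```typescript
const CHUNK_SIZE = 50 // illustrative, not the PR's value

async function drainWithExhaustionTracking(
  claimWorkflows: (limit: number) => Promise<unknown[]>,
  claimJobs: (limit: number) => Promise<unknown[]>,
): Promise<{ workflowClaims: number; jobClaims: number }> {
  let workflowsExhausted = false
  let jobsExhausted = false
  const stats = { workflowClaims: 0, jobClaims: 0 }

  while (!workflowsExhausted || !jobsExhausted) {
    const [workflows, jobs] = await Promise.all([
      // Skip the DB round-trip for a queue that already came up short.
      workflowsExhausted ? Promise.resolve([]) : claimWorkflows(CHUNK_SIZE),
      jobsExhausted ? Promise.resolve([]) : claimJobs(CHUNK_SIZE),
    ])
    if (!workflowsExhausted) stats.workflowClaims++
    if (!jobsExhausted) stats.jobClaims++

    // A short chunk means that queue is drained for this tick.
    if (workflows.length < CHUNK_SIZE) workflowsExhausted = true
    if (jobs.length < CHUNK_SIZE) jobsExhausted = true

    // ...process workflows and jobs here (elided)...
  }
  return stats
}
```

Counting claim queries makes the saving concrete: once one queue returns a short chunk, the other queue keeps draining without paying for the exhausted queue's query.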
Summary
- Replaces the fixed `MAX_CRON_CLAIMS` (200) with a chunked drain loop: claim `WORKFLOW_CHUNK_SIZE` + `JOB_CHUNK_SIZE` per iteration, process via `Promise.allSettled`, repeat until both claim queries return empty or `MAX_TICK_DURATION_MS` (3 min) elapses.
- Extracts `processScheduleItem` and `processJobItem` so the loop body stays readable.
- Existing claim semantics (`FOR UPDATE SKIP LOCKED`, `lastQueuedAt` as the claim signal, staleness-window reclaim) are unchanged. Successive cron invocations may overlap during a real drain; this is already safe because the lock model prevents double-claim.
- Follow-up to #4567 (the 20 → 200 bump): #4567 was a quick raise to the static cap; this PR removes the cap as a concept.
Type of Change
Testing
- `bunx vitest run app/api/schedules/execute/route.test.ts`: 6/6 pass
- `bun run lint`: clean
- `bun run check:api-validation:strict`: passes
- Tests use `mockResolvedValueOnce` for the first iteration's claim queries; the second iteration sees the default empty mocks and the loop breaks, matching the expected `executedCount`.

Checklist
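The one-shot mock behaviour the tests lean on can be imitated without vitest. `stubResolvedOnce` below is a hypothetical stand-in for `vi.fn().mockResolvedValueOnce(...)` with a default fallback; it only illustrates why the loop's second iteration sees empty claims and breaks.

```typescript
// First call resolves with the queued chunk; every later call resolves
// with the fallback (here: an empty claim result).
function stubResolvedOnce<T>(once: T[], fallback: T[]): () => Promise<T[]> {
  let used = false
  return async () => {
    if (!used) {
      used = true
      return once
    }
    return fallback
  }
}
```

Wiring both claim queries through such a stub means iteration one processes the queued rows and iteration two observes two empty claims, which is exactly the loop's exit condition.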