
improvement(scheduler): drain due schedules in chunks instead of a single capped claim#4578

Merged
TheodoreSpeaks merged 2 commits into staging from improvement/scheduler-chunked-pagination on May 14, 2026

Conversation

@TheodoreSpeaks (Collaborator)

Summary

  • Replaces the fixed MAX_CRON_CLAIMS (200) with a chunked drain loop: claim WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE per iteration, process via Promise.allSettled, repeat until both claim queries return empty or MAX_TICK_DURATION_MS (3 min) elapses.
  • Per-tick throughput is no longer bounded by a static ceiling; it scales until the database or trigger.dev becomes the limiting factor. The per-iteration chunk size (100/100) still bounds the row-lock set and fan-out concurrency for each chunk.
  • Extracts processScheduleItem and processJobItem so the loop body stays readable.
  • Claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt as the claim signal, staleness-window reclaim) are unchanged. Successive cron invocations may overlap during a real drain — already safe because the lock model prevents double-claim.

Follow-up to #4567 (the 20 → 200 bump): that PR was a quick raise of the static cap; this one removes the cap as a concept.
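The loop itself is not shown in this thread, but its shape can be sketched as follows. This is a minimal sketch, not the route's actual code: `claimWorkflowChunk`, `claimJobChunk`, and `processItem` are hypothetical stand-ins for the real SKIP LOCKED claim queries and the `processScheduleItem`/`processJobItem` helpers.

```typescript
// Hypothetical sketch of the chunked drain loop described above.
const WORKFLOW_CHUNK_SIZE = 100;
const JOB_CHUNK_SIZE = 100;
const MAX_TICK_DURATION_MS = 3 * 60 * 1000; // 3-minute wall-clock cap

async function drainDueSchedules(
  claimWorkflowChunk: (limit: number) => Promise<string[]>,
  claimJobChunk: (limit: number) => Promise<string[]>,
  processItem: (id: string) => Promise<void>,
): Promise<number> {
  const tickStart = Date.now();
  let executedCount = 0;
  while (Date.now() - tickStart < MAX_TICK_DURATION_MS) {
    // Both claim queries run concurrently each iteration.
    const [workflows, jobs] = await Promise.all([
      claimWorkflowChunk(WORKFLOW_CHUNK_SIZE),
      claimJobChunk(JOB_CHUNK_SIZE),
    ]);
    if (workflows.length === 0 && jobs.length === 0) break; // drained
    // allSettled: one failing item must not abort the rest of the chunk.
    await Promise.allSettled([...workflows, ...jobs].map(processItem));
    executedCount += workflows.length + jobs.length;
  }
  return executedCount;
}
```

The time check only runs between chunks, so a tick can overrun the cap by at most one chunk's processing time; the next cron invocation picks up the remainder safely under SKIP LOCKED.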

Type of Change

  • Improvement

Testing

  • bunx vitest run app/api/schedules/execute/route.test.ts — 6/6 pass
  • bun run lint — clean
  • bun run check:api-validation:strict — passes
  • Manual reasoning: tests use mockResolvedValueOnce for the first iteration's claim queries; second iteration sees default empty mocks and the loop breaks, matching the expected executedCount.
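That mock shape can be reproduced without vitest. `mockResolvedOnce` below is a hypothetical hand-rolled equivalent of `mockResolvedValueOnce` with a default fallback, shown only to illustrate why the loop terminates in the tests:

```typescript
// Analogue of vitest's mockResolvedValueOnce + default mock: the first
// call resolves with the queued value, every later call resolves with
// the fallback — which is what lets the drain loop's second iteration
// see empty claims and break.
function mockResolvedOnce<T>(once: T, fallback: T): () => Promise<T> {
  let consumed = false;
  return async () => {
    if (!consumed) {
      consumed = true;
      return once;
    }
    return fallback;
  };
}
```

With both claim mocks built this way, iteration one returns a chunk, iteration two returns two empty arrays, and the loop exits with the expected `executedCount`.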

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Replaces the fixed MAX_CRON_CLAIMS (200) with a chunked drain loop:
claim WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE per iteration, process via
Promise.allSettled, repeat until both claim queries return empty or
MAX_TICK_DURATION_MS elapses. Throughput is no longer bounded by a
static per-tick ceiling; it scales until DB or trigger.dev is the
limit. Per-iteration chunk size still bounds row-lock set and fan-out
concurrency.

Extracts processScheduleItem and processJobItem so the loop body stays
readable. Existing claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt
as the claim signal, staleness reclaim) are unchanged.
@vercel

vercel Bot commented May 13, 2026

The latest updates on your projects.

1 Skipped Deployment
docs — Skipped — May 14, 2026 1:44am (UTC)


@cursor

cursor Bot commented May 13, 2026

PR Summary

Medium Risk
Changes the cron execution handler from a single capped claim to a time-bounded draining loop, which can increase per-tick DB/queue load and concurrency. While claim semantics are unchanged, the new iterative behavior could surface performance or locking edge cases under heavy backlog.

Overview
Removes the single-run MAX_CRON_CLAIMS cap and replaces it with a chunked drain loop that repeatedly claims up to WORKFLOW_CHUNK_SIZE + JOB_CHUNK_SIZE items per iteration and processes each chunk via Promise.allSettled until no more work is found or MAX_TICK_DURATION_MS elapses.

Refactors per-item handling into processScheduleItem and processJobItem, keeping existing job de-dupe/stale-claim release behavior while improving logging/metrics to report per-iteration and total processed counts.

Reviewed by Cursor Bugbot for commit 1fb247b.

@greptile-apps (Contributor)

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR replaces the previous fixed MAX_CRON_CLAIMS ceiling (200 rows per invocation) with a chunked drain loop that repeatedly claims WORKFLOW_CHUNK_SIZE (100) workflow schedules and JOB_CHUNK_SIZE (100) job schedules per iteration, continuing until both queues are empty or MAX_TICK_DURATION_MS (3 minutes) elapses. The processing helpers processScheduleItem and processJobItem are extracted to keep the loop body readable, and all claim semantics (FOR UPDATE SKIP LOCKED, lastQueuedAt as the claim signal, staleness-window reclaim) are preserved.

  • Chunked drain loop: both claim queries now run concurrently via Promise.all each iteration, and results are processed with Promise.allSettled; successive cron invocations may overlap safely because SKIP LOCKED prevents double-claim.
  • Lazy workflowUtils import: the dynamic import is hoisted to a module-level let and loaded at most once per invocation, regardless of iteration count.
  • No cap on per-tick throughput: the loop drains until empty or the 3-minute wall-clock limit, so backlogs of any size are handled without touching the constants.
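The hoisted lazy import can be approximated generically. `lazyOnce` is a hypothetical memoizer sketched under the assumption described in the bullet above, not the route's actual code:

```typescript
// Memoize an async loader so it runs at most once per invocation,
// regardless of how many loop iterations ask for it — the same shape
// as hoisting a dynamic import() of workflowUtils to a module-level
// variable. (lazyOnce is an illustrative name, not from the PR.)
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;
  return () => {
    if (cached === null) cached = load();
    return cached;
  };
}
```

Caching the promise rather than the resolved value also deduplicates concurrent callers that race on the first load: both await the same in-flight promise.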

Confidence Score: 4/5

Safe to merge — the drain loop correctly replaces the static cap, claim semantics are unchanged, and concurrent invocations remain safe via SKIP LOCKED.

The refactoring is logically sound and all existing tests pass. The two flagged items are optimization opportunities rather than correctness problems: the loop issues unnecessary DB calls once one queue is exhausted, and the workflowUtils non-null assertion relies on an implicit invariant that holds today but is not enforced by the type system.

The flagged spans are in apps/sim/app/api/schedules/execute/route.ts — specifically the while-loop claim strategy and the workflowUtils! assertion.

Important Files Changed

apps/sim/app/api/schedules/execute/route.ts — Replaces the static MAX_CRON_CLAIMS cap with a chunked drain loop (WORKFLOW_CHUNK_SIZE=100, JOB_CHUNK_SIZE=100) bounded by MAX_TICK_DURATION_MS (3 min); extracts processScheduleItem and processJobItem helpers. Core claim semantics and error handling are preserved. Two style-level concerns: unnecessary DB round-trips when one queue drains before the other, and a non-null assertion on workflowUtils that relies on an implicit invariant TypeScript cannot verify.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GET /api/schedules/execute] --> B[verifyCronAuth]
    B -->|authError| Z1[Return 401]
    B -->|ok| C[getJobQueue]
    C --> D[tickStart = Date.now]
    D --> E{Date.now - tickStart < MAX_TICK_DURATION_MS 3 min?}
    E -->|No / timeout| LOG[Log summary & return 200]
    E -->|Yes| F[Promise.all: claimWorkflowSchedules + claimJobSchedules]
    F --> G{both empty?}
    G -->|Yes| LOG
    G -->|No| H[Lazy-load workflowUtils if needed]
    H --> I[Promise.allSettled: processScheduleItem x N + processJobItem x M]
    I --> E

    subgraph processScheduleItem
        S1[buildScheduleExecutionJobId] --> S2{existingJob pending/processing?}
        S2 -->|Yes| S_skip[return early]
        S2 -->|stale finished| S_rel[releaseScheduleLock → return]
        S2 -->|null| S3[getWorkflowById → enqueue job]
        S3 --> S4{just-enqueued job already finished?}
        S4 -->|Yes| S_rel2[releaseScheduleLock → return]
        S4 -->|No| S5{shouldExecuteInline?}
        S5 -->|Yes| S6[startJob → executeScheduleJob → completeJob]
        S5 -->|No| S_done[return]
        S6 -->|error| S7[markJobFailed + releaseScheduleLock]
    end

    subgraph processJobItem
        J1[build payload] --> J2[executeJobInline]
        J2 -->|error| J3[releaseScheduleLock]
    end

Reviews (1): Last reviewed commit: "improvement(scheduler): drain in chunks ..."

Comment thread apps/sim/app/api/schedules/execute/route.ts
Comment thread apps/sim/app/api/schedules/execute/route.ts Outdated
… workflowUtils non-null assertion

Addresses Greptile review on PR #4578:
- track per-queue exhaustion when a claim returns fewer than CHUNK_SIZE
  rows; subsequent iterations skip the claim query for that queue. Saves
  one DB round-trip per iteration once one queue drains while the other
  is still working.
- narrow workflowUtils to a local const inside the loop body so the
  schedule processing branch only runs when the import has completed.
  Removes the misleading non-null assertion.
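The exhaustion tracking this commit describes can be sketched like so. All names are hypothetical and the 3-minute time cap is omitted for brevity; the key idea is that a claim returning fewer rows than its chunk size marks that queue drained, so later iterations skip its query:

```typescript
// Sketch of per-queue exhaustion tracking: once a queue returns fewer
// rows than CHUNK_SIZE, skip its claim query on later iterations
// instead of issuing an empty-handed DB round-trip every loop.
const CHUNK_SIZE = 100;

async function drainTracked(
  claimWorkflows: (limit: number) => Promise<string[]>,
  claimJobs: (limit: number) => Promise<string[]>,
  process: (id: string) => Promise<void>,
): Promise<number> {
  let wfDrained = false;
  let jobDrained = false;
  let total = 0;
  while (!wfDrained || !jobDrained) {
    const [wf, jobs] = await Promise.all([
      wfDrained ? Promise.resolve<string[]>([]) : claimWorkflows(CHUNK_SIZE),
      jobDrained ? Promise.resolve<string[]>([]) : claimJobs(CHUNK_SIZE),
    ]);
    // A short chunk means that queue has no more due rows right now.
    if (wf.length < CHUNK_SIZE) wfDrained = true;
    if (jobs.length < CHUNK_SIZE) jobDrained = true;
    if (wf.length === 0 && jobs.length === 0) break;
    await Promise.allSettled([...wf, ...jobs].map(process));
    total += wf.length + jobs.length;
  }
  return total;
}
```

Note the trade-off: rows that become due while the tick is still running are left for the next cron invocation, since a drained queue is never re-polled within the same tick.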
@TheodoreSpeaks TheodoreSpeaks merged commit 642231f into staging May 14, 2026
14 checks passed
@TheodoreSpeaks TheodoreSpeaks deleted the improvement/scheduler-chunked-pagination branch May 14, 2026 18:21
