feat(webapp,run-engine): mollifier drainer replay + stale sweep + cancelled-run engine API #3754

Open
d-cs wants to merge 14 commits into mollifier-phase-3-trigger from mollifier-phase-3-replay

Conversation

d-cs (Collaborator) commented May 26, 2026

Summary

The replay side of the mollifier:

  • DrainerHandler: reads buffered snapshots and replays them through engine.trigger to materialise PG rows.
  • RunEngine.createCancelledRun: new public method the handler uses to write CANCELED rows directly from snapshots (bypass queue + waitpoint, emit runCancelled). Tolerates the cjson empty-table tags edge case found during validation.
  • Drainer fairness: org → env rotation so a heavy env doesn't starve light ones in the same org.
  • Stale-entry sweep + telemetry + alertable gauge so a stuck/offline drainer surfaces in alerts.
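A minimal sketch of the org → env rotation from the fairness bullet, assuming an in-memory model rather than the Redis-backed buffer. `BufferedRuns`, `drainFairly`, and the `budget` parameter are hypothetical names for illustration and are not the drainer's actual API.

```typescript
// Hypothetical in-memory model: orgId -> envId -> queued run IDs (oldest first).
type BufferedRuns = Map<string, Map<string, string[]>>;

// Pops up to `budget` entries. Each round takes at most one entry per org,
// and each org rotates through its envs between rounds, so a heavy env
// cannot monopolise its org's share of the drain budget.
function drainFairly(buffer: BufferedRuns, budget: number): string[] {
  const drained: string[] = [];
  const envCursor = new Map<string, number>(); // per-org env rotation index
  let progressed = true;

  while (drained.length < budget && progressed) {
    progressed = false;
    for (const [orgId, envs] of buffer) {
      if (drained.length >= budget) break;
      const envIds = [...envs.keys()];
      const start = envCursor.get(orgId) ?? 0;
      // Find the next non-empty env, starting at this org's cursor.
      for (let i = 0; i < envIds.length; i++) {
        const idx = (start + i) % envIds.length;
        const queue = envs.get(envIds[idx]!) ?? [];
        if (queue.length > 0) {
          drained.push(queue.shift()!);
          envCursor.set(orgId, (idx + 1) % envIds.length);
          progressed = true;
          break;
        }
      }
    }
  }
  return drained;
}
```

With one org holding a heavy env (`h1`..`h4`) and a light env (`l1`), plus a second org holding `b1`, a budget of 4 drains `h1, b1, l1, h2`: the light env is served before the heavy env gets a second turn.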

Both the drainer and the sweep are off by default; nothing fires unless the corresponding flag is enabled (TRIGGER_MOLLIFIER_DRAINER_ENABLED, TRIGGER_MOLLIFIER_STALE_SWEEP_ENABLED).

Stacked on the trigger-time decisions PR.

Test plan

  • `pnpm run typecheck --filter webapp` passes
  • `pnpm run test --filter webapp test/mollifierDrainerHandler.test.ts` passes
  • `pnpm run test --filter webapp test/mollifierStaleSweep.test.ts` passes
  • `pnpm run test --filter @internal/run-engine src/engine/tests/createCancelledRun.test.ts` passes
  • `pnpm run test --filter @trigger.dev/redis-worker packages/redis-worker/src/mollifier/drainer.test.ts` passes

changeset-bot Bot commented May 26, 2026

⚠️ No Changeset found

Latest commit: d451918

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai Bot (Contributor) commented May 26, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cb1adb69-8b9b-4cf7-ba46-4998cdfa532c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment thread apps/webapp/app/v3/mollifier/mollifierDrainerHandler.server.ts
Comment thread apps/webapp/app/v3/mollifier/mollifierTelemetry.server.ts
Comment thread apps/webapp/test/mollifierStaleSweep.test.ts Outdated
Comment thread internal-packages/run-engine/src/engine/index.ts
@d-cs d-cs self-assigned this May 26, 2026
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch from 626a8dc to af7368e Compare May 26, 2026 11:12
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from 31f4726 to b05929b Compare May 26, 2026 11:12
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch from 5a7bc19 to baa6f17 Compare May 26, 2026 13:24
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from b05929b to b89da52 Compare May 26, 2026 13:24
devin-ai-integration[bot]

This comment was marked as resolved.

@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from 74fdf6d to c6fa61f Compare May 26, 2026 16:20
d-cs added a commit that referenced this pull request May 26, 2026
Two bugs flagged by Devin on PR #3754:

1. entry.server.tsx reverted to `void sessionsReplicationInstance;`,
   which esbuild tree-shakes under `"sideEffects": false`. Restored
   the globalThis assignment + warning comment from #3738 (incident
   TRI-9864). Without this the sessions→ClickHouse logical replication
   slot stops being consumed at boot.

2. createFailedTaskRun unconditionally emitted `runFailed`, which
   the `completeFailedRunEvent` listener uses to write a span
   completion into ClickHouse. But TriggerFailedTaskService.call()
   already wraps createFailedTaskRun inside
   `repository.traceEvent({ incomplete: false, isError: true })`
   which writes its own completion row for the same (traceId, spanId).
   Two completions racing on the same span row is a real
   observability bug.

   Added an `emitRunFailedEvent: boolean = true` opt-out. The
   TriggerFailedTaskService.call() path now passes `false` and
   enqueues `PerformTaskRunAlertsService` directly after the trace
   event closes so the alerts side of `runFailed` is preserved.
   `callWithoutTraceEvents` and the mollifier drainer's terminal-
   failure path keep the default emit (they have no outer trace
   event managing the span).

   Regression test pins the opt-out: `emitRunFailedEvent: false`
   writes the PG row but does NOT fire the bus event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
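
The opt-out described in item 2 can be sketched as a defaulted boolean that gates the event emit while the row write stays unconditional. This is an illustration only: `TinyBus`, `FailedRunRow`, and `createFailedRunSketch` are hypothetical stand-ins, and the real `createFailedTaskRun` signature is not shown in this PR text.

```typescript
type RunFailedListener = (runId: string) => void;

// Hypothetical stand-in for the engine's event bus.
class TinyBus {
  private listeners: RunFailedListener[] = [];
  on(listener: RunFailedListener): void {
    this.listeners.push(listener);
  }
  emit(runId: string): void {
    for (const listener of this.listeners) listener(runId);
  }
}

interface FailedRunRow {
  id: string;
  status: "SYSTEM_FAILURE";
}

// Always writes the row; emits `runFailed` unless the caller opts out.
// The default of `true` preserves behaviour for callers that have no
// outer trace event managing the span.
function createFailedRunSketch(
  rows: FailedRunRow[], // stands in for the PG table
  bus: TinyBus,
  params: { id: string; emitRunFailedEvent?: boolean }
): FailedRunRow {
  const { id, emitRunFailedEvent = true } = params;
  const row: FailedRunRow = { id, status: "SYSTEM_FAILURE" };
  rows.push(row);
  if (emitRunFailedEvent) bus.emit(id);
  return row;
}
```

A caller that already closes the span itself would pass `emitRunFailedEvent: false` and handle alerting separately, as the commit message describes.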
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch from 01f3958 to 449a0bc Compare May 27, 2026 12:04
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from 242ba73 to 6a8404d Compare May 27, 2026 12:04
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch from 449a0bc to ffe51b8 Compare May 27, 2026 12:15
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from 6a8404d to bc9f4e2 Compare May 27, 2026 12:15
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch from ffe51b8 to 7ddb17d Compare May 27, 2026 12:21
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch 2 times, most recently from 637e8c0 to 65219db Compare May 27, 2026 12:58
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch 2 times, most recently from 4229f9a to 4f31074 Compare May 27, 2026 14:06
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from 65219db to ccdcd9c Compare May 27, 2026 14:06
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch from 4f31074 to e56b937 Compare May 27, 2026 15:07
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from ccdcd9c to 5f50940 Compare May 27, 2026 15:07
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch from cae33fa to 16bfff0 Compare May 27, 2026 16:50
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch 2 times, most recently from 014313e to 5521698 Compare May 28, 2026 08:49
@d-cs d-cs force-pushed the mollifier-phase-3-trigger branch from f126737 to 36cc024 Compare May 28, 2026 09:41
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from 5521698 to 51c982f Compare May 28, 2026 09:41
@d-cs d-cs marked this pull request as ready for review May 28, 2026 09:56
devin-ai-integration[bot]

This comment was marked as resolved.

d-cs added a commit that referenced this pull request May 28, 2026
…on terminal failure

Two Devin review findings on PR #3754, both real and unresolved:

1. Sharded stale sweep's counts hash never cleared for fully-drained
   envs — gauge stayed permanently elevated, false-alerting the
   recommended `> 0 for 5m` rule.

   Root cause: when an env's last buffered entry is popped, the buffer's
   atomic Lua removes the env from `mollifier:org-envs:${orgId}` (and
   removes the org from `mollifier:orgs` if it has no other envs). The
   sweep's inner loop walks `buffer.listEnvsForOrg(orgId)`, so the env
   disappears from the iteration entirely — `setEnvStaleCount(envId, 0)`
   (which HDELs the field) is never called, and the counts hash retains
   the env's last-known stale count forever.

   Fix (Devin's Approach 2): cycle-bounded reconciliation. Add a Redis
   SET `mollifier:stale_sweep:visited` that the sweep SADDs into for
   every env it touches. When the cursor wraps (cycle complete),
   `reconcileVisited()` does `HKEYS counts → SMEMBERS visited → HDEL the
   difference → DEL visited`. Pipelined; orphans clear within at most
   one full cursor cycle of the env going quiet, which matches the
   sharding contract's existing one-cycle freshness window.

   Test: "evicts fully-drained envs from the counts hash at cycle wrap"
   — accepts an entry, sweep flags it stale, pops the entry (env
   vanishes from listEnvsForOrg), runs another sweep that triggers
   wrap, asserts the env is HDEL'd from both the snapshot and the
   underlying counts hash.

2. Drainer handler's terminal SYSTEM_FAILURE write dropped the
   snapshot's `batch` field. If the buffered run was part of a batch,
   the failure row wasn't associated with the batch and the batch
   parent's completion tracking could hang indefinitely waiting on a
   child that landed but isn't visible to the batch.

   Fix: extract `snapshot.batch` with structural type guards and pass
   it through to `createFailedTaskRun`. Same defensive pattern as the
   other snapshot fields in this code path (the snapshot is typed
   `Record<string, unknown>` because it came from cjson-decoded buffer
   payload).

   Test: "propagates the batch association into createFailedTaskRun" —
   asserts the call site receives `{ id, index }` from the snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
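
The reconciliation step in item 1 reduces to a set difference: fields present in the counts hash but absent from the visited set are orphans to delete. A minimal sketch of that core, assuming the Redis reads (`HKEYS`, `SMEMBERS`) have already been done; `orphanedEnvs` is a hypothetical helper name, not the `reconcileVisited()` implementation.

```typescript
// Returns the hash fields to HDEL: envs that still hold a stale count but
// were not visited during the cycle that just completed.
function orphanedEnvs(countKeys: string[], visited: string[]): string[] {
  const seen = new Set(visited);
  return countKeys.filter((envId) => !seen.has(envId));
}
```

Envs that were visited keep whatever value the sweep wrote for them during the cycle; only envs that vanished from iteration are evicted.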
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from cc42721 to 70611c2 Compare May 28, 2026 13:04
devin-ai-integration[bot]

This comment was marked as resolved.

d-cs added a commit that referenced this pull request May 28, 2026
…Redis

Devin review on PR #3754: `stop()` previously called `deps.state.close()`
immediately after `clearInterval`, but `tick()` only checks `stopped` at
entry. A tick that was already past that guard would keep making
`state.*` calls against an ioredis client that `stop()` had already
`quit()`ed — those calls would throw, the tick's own try/catch would
swallow them as `mollifier.stale_sweep.failed` warnings, and every
graceful shutdown would emit spurious noise.

Track the current tick promise as `currentTick`. `stop()` awaits it (if
present) before invoking `state.close()`, so the tick's last state call
lands BEFORE the Redis client is quit. The tick's own try/catch handles
the (unexpected) case where it rejects; the await in `stop()` is solely
for ordering.

Also drop the `instanceof MollifierStaleSweepState` guard around
`state.close()` — `close()` is part of the `StaleSweepStateStore`
contract, so unconditional invocation is correct. Test fake states
implement `close()` as a no-op.

Test: `stop() waits for an in-flight tick to finish before closing the
state` — gates a fake state's `readCursor` on a Deferred, kicks off the
interval, waits for the tick to start, then races `stop()` against the
gate. Asserts the stop promise stays unresolved while the tick is
mid-flight and that the tick's final state operation lands before
`close()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
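
The ordering fix can be sketched as follows: the loop remembers the in-flight tick promise, and `stop()` awaits it before closing the state store. This is a simplified illustration under stated assumptions: `SweepLoopSketch` and `runTick` are hypothetical, the interval timer is omitted, and `StateStoreLike` models only two methods of the real `StaleSweepStateStore` contract.

```typescript
interface StateStoreLike {
  readCursor(): Promise<number>;
  close(): Promise<void>;
}

class SweepLoopSketch {
  private stopped = false;
  private currentTick: Promise<void> | null = null;

  constructor(private readonly state: StateStoreLike) {}

  // In a real loop this would be driven by setInterval; here it is called directly.
  runTick(): Promise<void> {
    if (this.stopped) return Promise.resolve();
    const tick = this.tick().finally(() => {
      if (this.currentTick === tick) this.currentTick = null;
    });
    this.currentTick = tick;
    return tick;
  }

  private async tick(): Promise<void> {
    try {
      await this.state.readCursor();
    } catch {
      // A real implementation would log a warning here; the tick never rejects.
    }
  }

  async stop(): Promise<void> {
    this.stopped = true;
    // Await purely for ordering: the tick's last state call must land
    // before the underlying client is closed.
    if (this.currentTick) await this.currentTick;
    await this.state.close();
  }
}
```

Because `tick()` swallows its own errors, the `await` in `stop()` cannot throw on the tick's behalf; it only delays `close()` until the tick settles.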
devin-ai-integration[bot]

This comment was marked as resolved.

d-cs and others added 13 commits May 28, 2026 14:43
…celled-run engine API

The replay side of the mollifier:
- DrainerHandler that reads buffered snapshots and replays them
  through engine.trigger to materialise PG rows.
- RunEngine.createCancelledRun: new public method the handler uses to
  write CANCELED rows directly from snapshots (bypass queue +
  waitpoint, emit runCancelled). Tolerates cjson empty-table tags.
- Drainer fairness: org → env rotation so a heavy env doesn't starve
  light ones in the same org.
- Stale-entry sweep + telemetry + alertable gauge for stuck drainers.

Both drainer and sweep default-off; nothing fires unless flagged on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
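
The cjson edge case arises because Lua has a single table type, so Redis's cjson encodes an empty table as `{}` rather than `[]`, and an empty tags list can come back as an object. A minimal sketch of a tolerant reader, assuming the snapshot carries `tags` as an untyped value; `normaliseTags` is a hypothetical helper, not the code `createCancelledRun` actually uses.

```typescript
// Accepts whatever the decoded snapshot holds under `tags` and always
// returns a string array. A non-array value (including the `{}` that an
// empty Lua table encodes to) is treated as "no tags".
function normaliseTags(raw: unknown): string[] {
  if (!Array.isArray(raw)) return [];
  return raw.filter((tag): tag is string => typeof tag === "string");
}
```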
- `isRetryablePgError`: also accept `errorCode === "P1001"` so
  `PrismaClientInitializationError` (which surfaces P1001 on a
  different field than `PrismaClientKnownRequestError`) retries.
- Drop `envId` from OTel metric labels on
  `mollifier.realtime_subscriptions.buffered`,
  `mollifier.stale_entries`, and the
  `mollifier.stale_entries.current` gauge. `envId` is a banned
  high-cardinality attribute; the structured warn log alongside each
  counter tick still carries envId for forensic drill-down.
- Stale-sweep test name + comments now match the assertion shape
  (all three entries stale, not "two stale + one fresh").
- `RunEngine.createCancelledRun` P2002 path now requires the existing
  row's status to be CANCELED; a non-canceled conflict throws rather
  than silently reporting success, so the caller can route to
  `engine.cancelRun()` or skip.
- Regression test pins the new conflict guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
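
The first bullet hinges on Prisma exposing the code on different fields: `PrismaClientKnownRequestError` uses `code`, while `PrismaClientInitializationError` uses `errorCode`. A minimal sketch of a check that reads both, assuming P1001 (database server unreachable) is the only code of interest; the real `isRetryablePgError` may accept other codes that this PR text does not list.

```typescript
// Structural check so it works on any thrown value without importing Prisma.
function isRetryablePgErrorSketch(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, errorCode } = err as { code?: unknown; errorCode?: unknown };
  return code === "P1001" || errorCode === "P1001";
}
```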
…leton

Importing the production drainer wiring transitively loads
`~/v3/runEngine.server`, whose top-level `singleton(...)` eagerly
constructs a RunEngine. The constructor spins up Prisma + Redis
workers that try to connect to localhost — in CI (no PG, no Redis)
that produces an unhandled `PrismaClientInitializationError` which
fails the run even though every assertion passes. Mock the runEngine
and prisma modules so the unit test exercises only the bootstrap's
error classification, not a live engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container startup + the sweep loop can exceed Vitest's 5s default on
CI runners (passes in ~1.7-2s locally). Matches the explicit
`{ timeout: 20_000 }` other mollifier redisTests carry across the
project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs flagged by Devin on PR #3754:

1. entry.server.tsx reverted to `void sessionsReplicationInstance;`,
   which esbuild tree-shakes under `"sideEffects": false`. Restored
   the globalThis assignment + warning comment from #3738 (incident
   TRI-9864). Without this the sessions→ClickHouse logical replication
   slot stops being consumed at boot.

2. createFailedTaskRun unconditionally emitted `runFailed`, which
   the `completeFailedRunEvent` listener uses to write a span
   completion into ClickHouse. But TriggerFailedTaskService.call()
   already wraps createFailedTaskRun inside
   `repository.traceEvent({ incomplete: false, isError: true })`
   which writes its own completion row for the same (traceId, spanId).
   Two completions racing on the same span row is a real
   observability bug.

   Added an `emitRunFailedEvent: boolean = true` opt-out. The
   TriggerFailedTaskService.call() path now passes `false` and
   enqueues `PerformTaskRunAlertsService` directly after the trace
   event closes so the alerts side of `runFailed` is preserved.
   `callWithoutTraceEvents` and the mollifier drainer's terminal-
   failure path keep the default emit (they have no outer trace
   event managing the span).

   Regression test pins the opt-out: `emitRunFailedEvent: false`
   writes the PG row but does NOT fire the bus event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename the catch-all mollifier.md and trim it to the drainer replay
handler, stale sweep, telemetry gauge, and run-engine cancelled/failed
APIs; later read/mutation/dashboard work is documented in its own PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tage in mollifier drainer

The mollifier drainer's cancel bifurcation called engine.createCancelledRun
without handling its documented conflict contract: when the normal trigger
replay path races ahead and materialises a live (non-CANCELED) row, the engine
throws a conflict so the caller can "decide between engine.cancelRun() and
skipping". The handler did neither — the conflict propagated, isRetryablePgError
returned false, and the drainer buffer.fail()'d the entry, silently losing the
cancellation while the run kept executing. Now route conflicts to
engine.cancelRun() so the cancel actually wins.

Separately, when engine.trigger fails non-retryably and the SYSTEM_FAILURE
fallback write then fails because PG is transiently unreachable, rethrowing the
original non-retryable error made the drainer buffer.fail() the entry — losing
the run with no PG row ever landing, and dropping the write error entirely.
Rethrow the retryable write error instead so the drainer requeues; the failure
row lands once PG recovers. Non-retryable write failures still rethrow the
original error as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
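
Both decisions in this commit can be sketched as small pure functions. The sketch is synchronous for brevity and rests on assumptions: `EngineLike` is a hypothetical reduction of the engine surface, and the `conflict` marker checked by `isCancelConflict` is invented for illustration, since the PR text does not say how the engine signals a conflict.

```typescript
interface EngineLike {
  createCancelledRun(runId: string): void;
  cancelRun(runId: string): void;
}

// Hypothetical conflict detection; the real error shape is not shown here.
function isCancelConflict(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { conflict?: unknown }).conflict === true;
}

// Tries the direct CANCELED write first. If a live row already exists,
// falls back to cancelling that row so the cancellation is not lost.
function materialiseCancel(engine: EngineLike, runId: string): "created" | "cancelled-live-run" {
  try {
    engine.createCancelledRun(runId);
    return "created";
  } catch (err) {
    if (!isCancelConflict(err)) throw err;
    engine.cancelRun(runId);
    return "cancelled-live-run";
  }
}

// After a non-retryable trigger failure, the fallback failure-row write may
// itself fail. A retryable write error takes precedence so the entry is
// requeued; otherwise the original trigger error is rethrown.
function pickErrorToRethrow(
  triggerError: unknown,
  writeError: unknown,
  isRetryable: (err: unknown) => boolean
): unknown {
  return isRetryable(writeError) ? writeError : triggerError;
}
```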
Remove plan-tracking shorthand (Q# bifurcation, Phase C1/Q4) from replay-layer mollifier comments and test names; reword to plain English. Comment/test-name only; no behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…labels

The mollifier queue moved from ZSET to LIST in an earlier refactor, but two
comments still described it as a ZSET (`mollifierStaleSweep.server.ts:9` and
`env.server.ts:1104` — both narrating the periodic stale sweep). Update to
LIST.

Also clean four residual internal plan-doc labels left over from prior cleanup
passes:

- `createCancelledRun` docstring (`engine/index.ts`) referenced "Q4 mollifier-
  cancel design" and "F4 bypass" — both dead nomenclature now that the gate's
  C1/C3/F4 labels have been rewritten. Restate the waitpoint-skip rationale
  in plain English: the mollifier gate refuses to buffer triggerAndWait
  children, so a cancelled buffered run never has a waiting parent to unblock.
- `createCancelledRun.test.ts` empty-tags regression dropped "Found while
  running the Phase F challenge suite." — the comment describes the bug
  itself, which is self-contained.
- `mollifierStaleSweep.test.ts` "scans across multiple orgs" rephrased away
  from "Phase-3 design has org-level fairness"; the prose now states the
  invariant directly.

Comment/docstring-only; no behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous sweep was unbounded along two dimensions: every tick walked
every org and every env (via buffer.listOrgs + listEnvsForOrg). At the
sweep's default per-env entry cap of 1000, an incident-scale fan-out
gave O(orgs * envs * 1000) Redis round-trips per tick — running far
longer than the 5-minute interval and triggering the inFlight guard
to drop every subsequent tick until the slow pass finished.

Shard the work via a durable cursor:

- New file `mollifierStaleSweepState.server.ts` owns three Redis keys
  (`mollifier:stale_sweep:{cursor,org_list,counts}`), all under the
  mollifier namespace but separated from the buffer's own state. The
  state class has its own Redis client; the buffer's existing
  `MollifierBuffer` API surface is untouched.
- On `cursor === 0` the org list is rebuilt by snapshotting
  `buffer.listOrgs()` into the frozen LIST — the cycle's frozen view.
- Each tick consumes up to `maxOrgsPerPass` orgs (default 100),
  processes them, and advances the cursor. When the cursor reaches the
  end of the LIST it wraps to 0; the next tick rebuilds and starts the
  next cycle.
- The per-env counts HASH is the source of truth for the gauge
  snapshot. Visiting an env with zero stale entries HDEL's its hash
  field — gauge clears immediately on revisit. Envs not revisited this
  tick keep their last-known value (durability across ticks AND across
  webapp restarts), accepting a worst-case lag of one full cursor cycle
  before a no-longer-stale env clears.

Snapshot contract change: only envs with non-zero stale counts appear
in the reported `Map`. The telemetry layer (`mollifierTelemetry.server.ts`
`reportStaleEntrySnapshot`) sums values, so absence is equivalent to
zero for the gauge — the alert behaviour is unchanged.

Tests:
- New: "shards work across ticks: cursor advances by maxOrgsPerPass and
  wraps after a full cycle" — drives a 5-org fixture with cap=2,
  asserts the cursor's three-tick progression and wrap.
- New: "clears an env from the durable snapshot on revisit when it has
  entries but none currently stale" — same entry flips
  stale→not-stale by varying the sweep's `now`, asserts HDEL on
  revisit.
- Existing tests updated to inject `state`; one assertion shape
  rewritten ("snapshot omits envs that have entries but none stale")
  to match the new HDEL semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
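
The cursor arithmetic behind the sharding can be sketched as a pure function over the frozen org list. This is an illustration under stated assumptions: `planPass` and `SweepPass` are hypothetical names, and the Redis reads and writes for the cursor and list are left out.

```typescript
interface SweepPass {
  orgs: string[]; // orgs to process in this tick
  nextCursor: number; // value to persist for the next tick
  cycleComplete: boolean; // true when this tick reached the end of the list
}

// Takes up to `maxOrgsPerPass` orgs starting at `cursor`. When the slice
// reaches the end of the frozen list, the cursor wraps to 0 so the next
// tick rebuilds the list and starts a new cycle.
function planPass(frozenOrgs: string[], cursor: number, maxOrgsPerPass: number): SweepPass {
  const orgs = frozenOrgs.slice(cursor, cursor + maxOrgsPerPass);
  const end = cursor + orgs.length;
  const cycleComplete = end >= frozenOrgs.length;
  return { orgs, nextCursor: cycleComplete ? 0 : end, cycleComplete };
}
```

For a 5-org list with a cap of 2, three consecutive ticks yield cursors 2, 4, and then 0, matching the three-tick progression the new test asserts. An empty list completes immediately with the cursor left at 0.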
Closes five gaps in the sharded sweep's test coverage with 9 new
tests:

- State durability across process restart: state1 populates the
  cursor + counts hash, closes its Redis client (simulated webapp
  restart), state2 constructs against the same Redis and picks up
  exactly where state1 left off. This is the headline benefit of
  storing sweep state in Redis instead of process memory; without
  this test it could silently regress.
- Cycle wrap rebuilds the org list: a third org joins between cycles
  and is visible only in the next cycle's snapshot. Pins the
  rebuildOrgList-on-cursor=0 contract.
- Empty buffer (no orgs) advances cleanly with zero work, empty
  snapshot, cursor stays at 0 instead of tripping the wrap math.
- Buffer-null branch's clearAll: previously asserted only "snapshot
  is empty"; now also asserts the durable state was actually wiped
  (cursor=0, counts hash empty) so a re-enable doesn't resume on a
  stale cursor.
- MollifierStaleSweepState direct unit tests (5 tests): readCursor
  default + corrupt-value tolerance, writeCursor round-trip,
  rebuildOrgList replace-not-append semantics, setEnvStaleCount HSET
  vs HDEL, clearAll DELs all three keys.

Suite total: 7 existing + 9 new = 16 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on terminal failure

Two Devin review findings on PR #3754, both real and unresolved:

1. Sharded stale sweep's counts hash never cleared for fully-drained
   envs — gauge stayed permanently elevated, false-alerting the
   recommended `> 0 for 5m` rule.

   Root cause: when an env's last buffered entry is popped, the buffer's
   atomic Lua removes the env from `mollifier:org-envs:${orgId}` (and
   removes the org from `mollifier:orgs` if it has no other envs). The
   sweep's inner loop walks `buffer.listEnvsForOrg(orgId)`, so the env
   disappears from the iteration entirely — `setEnvStaleCount(envId, 0)`
   (which HDELs the field) is never called, and the counts hash retains
   the env's last-known stale count forever.

   Fix (Devin's Approach 2): cycle-bounded reconciliation. Add a Redis
   SET `mollifier:stale_sweep:visited` that the sweep SADDs into for
   every env it touches. When the cursor wraps (cycle complete),
   `reconcileVisited()` does `HKEYS counts → SMEMBERS visited → HDEL the
   difference → DEL visited`. Pipelined; orphans clear within at most
   one full cursor cycle of the env going quiet, which matches the
   sharding contract's existing one-cycle freshness window.

   Test: "evicts fully-drained envs from the counts hash at cycle wrap"
   — accepts an entry, sweep flags it stale, pops the entry (env
   vanishes from listEnvsForOrg), runs another sweep that triggers
   wrap, asserts the env is HDEL'd from both the snapshot and the
   underlying counts hash.

2. Drainer handler's terminal SYSTEM_FAILURE write dropped the
   snapshot's `batch` field. If the buffered run was part of a batch,
   the failure row wasn't associated with the batch and the batch
   parent's completion tracking could hang indefinitely waiting on a
   child that landed but isn't visible to the batch.

   Fix: extract `snapshot.batch` with structural type guards and pass
   it through to `createFailedTaskRun`. Same defensive pattern as the
   other snapshot fields in this code path (the snapshot is typed
   `Record<string, unknown>` because it came from cjson-decoded buffer
   payload).

   Test: "propagates the batch association into createFailedTaskRun" —
   asserts the call site receives `{ id, index }` from the snapshot.
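
The cycle-bounded reconciliation from item 1 can be sketched roughly as follows. This is a minimal in-memory illustration, not the PR's implementation: `FakeSweepState` and `recordVisit` are hypothetical names, and a `Map` and a `Set` stand in for the Redis counts hash and the `mollifier:stale_sweep:visited` set.

```typescript
// Hypothetical in-memory stand-in for the sweep's Redis state (assumption:
// the real store issues HSET/HDEL/SADD/SMEMBERS/DEL against Redis instead).
class FakeSweepState {
  counts = new Map<string, number>(); // stands in for the counts hash
  visited = new Set<string>(); // stands in for mollifier:stale_sweep:visited

  // Called for every env the sweep touches during the current cycle.
  recordVisit(envId: string, staleCount: number): void {
    this.visited.add(envId); // SADD visited envId
    if (staleCount > 0) {
      this.counts.set(envId, staleCount); // HSET counts envId staleCount
    } else {
      this.counts.delete(envId); // HDEL counts envId
    }
  }

  // Called when the cursor wraps: drop every counts field whose env was not
  // visited this cycle, then reset the visited set for the next cycle.
  reconcileVisited(): string[] {
    const orphans = Array.from(this.counts.keys()).filter(
      (envId) => !this.visited.has(envId)
    );
    for (const envId of orphans) {
      this.counts.delete(envId); // HDEL the difference
    }
    this.visited.clear(); // DEL visited
    return orphans;
  }
}

const state = new FakeSweepState();

// Cycle 1: both envs have stale entries and are visited.
state.recordVisit("env_a", 2);
state.recordVisit("env_b", 1);
state.reconcileVisited();

// Cycle 2: env_a was fully drained, so it no longer appears in the
// iteration and only env_b is visited.
state.recordVisit("env_b", 1);
const evicted = state.reconcileVisited();
```

After the second wrap, `env_a` is gone from `counts` even though nothing ever called the per-env reset for it, which is the orphan case the fix targets.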

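For item 2, a structural type guard over the untyped snapshot might look like the sketch below. `extractBatch` and `BatchRef` are hypothetical names chosen for illustration; only the `{ id, index }` shape and the `Record<string, unknown>` snapshot type come from the commit message.

```typescript
// Hypothetical shape of the batch association carried on the snapshot.
type BatchRef = { id: string; index: number };

// Returns the batch association only when the cjson-decoded value has the
// expected structure; anything else is treated as "no batch".
function extractBatch(snapshot: Record<string, unknown>): BatchRef | undefined {
  const batch = snapshot["batch"];
  if (typeof batch !== "object" || batch === null) {
    return undefined;
  }
  const { id, index } = batch as Record<string, unknown>;
  if (typeof id !== "string" || typeof index !== "number") {
    return undefined;
  }
  return { id, index };
}

const withBatch = extractBatch({ batch: { id: "batch_1", index: 3 } });
const withoutBatch = extractBatch({ friendlyId: "run_1" });
const malformed = extractBatch({ batch: { id: 42 } });
```

The result can then be passed through to the failure write unchanged, with `undefined` meaning the run was not part of a batch.
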
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Redis

Devin review on PR #3754: `stop()` previously called `deps.state.close()`
immediately after `clearInterval`, but `tick()` only checks `stopped` at
entry. A tick that was already past that guard would keep making
`state.*` calls against an ioredis client that `stop()` had already
`quit()`ed — those calls would throw, the tick's own try/catch would
swallow them as `mollifier.stale_sweep.failed` warnings, and every
graceful shutdown would emit spurious noise.

Track the current tick promise as `currentTick`. `stop()` awaits it (if
present) before invoking `state.close()`, so the tick's last state call
lands BEFORE the Redis client is quit. The tick's own try/catch handles
the (unexpected) case where it rejects; the await in `stop()` is solely
for ordering.

Also drop the `instanceof MollifierStaleSweepState` guard around
`state.close()` — `close()` is part of the `StaleSweepStateStore`
contract, so unconditional invocation is correct. Test fake states
implement `close()` as a no-op.

Test: `stop() waits for an in-flight tick to finish before closing the
state` — gates a fake state's `readCursor` on a Deferred, kicks off the
interval, waits for the tick to start, then races `stop()` against the
gate. Asserts the stop promise stays unresolved while the tick is
mid-flight and that the tick's final state operation lands before
`close()`.
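
The ordering described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's code: the `StaleSweeper` class, its `runTick` helper, and the skip-if-in-flight behaviour are hypothetical, and the state store is reduced to `readCursor` and `close`.

```typescript
interface StaleSweepStateStore {
  readCursor(): Promise<number>;
  close(): Promise<void>;
}

class StaleSweeper {
  private stopped = false;
  private timer: ReturnType<typeof setInterval> | undefined;
  private currentTick: Promise<void> | undefined;

  constructor(
    private readonly state: StaleSweepStateStore,
    private readonly intervalMs: number
  ) {}

  start(): void {
    this.timer = setInterval(() => {
      void this.runTick();
    }, this.intervalMs);
  }

  // Starts a tick and remembers its promise. Reusing an in-flight tick
  // instead of overlapping is an assumption made for this sketch.
  runTick(): Promise<void> {
    if (this.currentTick) return this.currentTick;
    const tick = this.tick().finally(() => {
      this.currentTick = undefined;
    });
    this.currentTick = tick;
    return tick;
  }

  private async tick(): Promise<void> {
    if (this.stopped) return; // the guard is only checked at entry
    try {
      await this.state.readCursor();
    } catch (error) {
      console.warn("mollifier.stale_sweep.failed", error);
    }
  }

  async stop(): Promise<void> {
    this.stopped = true;
    if (this.timer) clearInterval(this.timer);
    // Await purely for ordering: the tick's last state call must land
    // before the underlying client is closed.
    if (this.currentTick) await this.currentTick;
    await this.state.close();
  }
}

// Usage: gate the fake state's readCursor so the tick is mid-flight when
// stop() is called, then release the gate.
const events: string[] = [];
let release: () => void = () => {};
const gate = new Promise<void>((resolve) => {
  release = resolve;
});
const fakeState: StaleSweepStateStore = {
  async readCursor() {
    await gate;
    events.push("readCursor");
    return 0;
  },
  async close() {
    events.push("close");
  },
};

const sweeper = new StaleSweeper(fakeState, 1000);
void sweeper.runTick();
const stopPromise = sweeper.stop();
release();
```

Because `stop()` awaits `currentTick` before `close()`, the gated `readCursor` is recorded ahead of `close` once `stopPromise` settles.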

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-cs d-cs force-pushed the mollifier-phase-3-replay branch from bbf59b5 to 4a1bcfe Compare May 28, 2026 13:44
devin-ai-integration[bot]

This comment was marked as resolved.

Three unresolved Devin threads, all addressed in this commit. Committed
locally only — not pushed.

1. `callWithoutTraceEvents` was inheriting the new `emitRunFailedEvent`
   default of `true`, so its `createFailedTaskRun` call would fire the
   `runFailed` bus emit and the listener would write a ClickHouse
   completion event row with empty `traceId`/`spanId` — orphan row,
   directly contradicting the method's "without trace events" contract.
   Pass `emitRunFailedEvent: false` and enqueue
   `PerformTaskRunAlertsService` directly, mirroring the `call()`
   pattern so customers' ERROR channels still see the failure.

2. The cjson empty-tags defense lived only on `createCancelledRun`, not
   on `engine.trigger`. When the mollifier buffer's mutate-side Lua
   re-serialises a payload (e.g. `append_tags` on a buffered run that
   never had tags), an empty Lua table encodes as `{}` and decodes
   back to a JS object rather than an array. An object has no
   `length`, so the previous `tags.length === 0` check evaluates to
   false and the object goes straight to Prisma's `String[]` column.
   Mirror the `Array.isArray && tags.length > 0 ? tags : undefined`
   guard `createCancelledRun` already uses, so the trigger replay path
   holds the same contract as the existing tested case.

3. `runCancelled` handler's `cancelRunEvent` lookup fails for
   buffered-only runs (no primary trace event exists, since the
   mollifier gate skipped `repository.traceEvent` for the
   not-yet-materialised run). The handler's `tryCatch` swallowed the
   error, but the systematic `[runCancelled] Failed to cancel run
   event` log fired on every cancelled buffered run.
   Add `emitRunCancelledEvent: boolean = true` to `createCancelledRun`
   (symmetric with the existing `emitRunFailedEvent` flag on
   `createFailedTaskRun`); drainer handler passes `false`. CANCELED PG
   row still writes; only the trace-event mirror is skipped.

Tests:
- `RunEngine.createCancelledRun > emitRunCancelledEvent: false
  suppresses the bus emit but still writes the CANCELED PG row` —
  pins the new flag's semantics.
- `createDrainerHandler > calls createCancelledRun with
  emitRunCancelledEvent: false (suppresses orphan trace-event log
  noise)` — pins the call site's contract.
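
To illustrate the empty-tags guard from item 2, here is a small sketch comparing the old and new checks. `normaliseTags` and `oldCheck` are hypothetical helper names used only for this example; the guard expression itself is the one quoted in the commit message.

```typescript
// Old behaviour (simplified): only an empty array is treated as "no tags".
// A cjson-decoded `{}` has no `length`, so it slips through unchanged.
function oldCheck(tags: unknown): unknown {
  return (tags as { length?: number }).length === 0 ? undefined : tags;
}

// New behaviour: anything that is not a non-empty array becomes undefined.
function normaliseTags(tags: unknown): string[] | undefined {
  return Array.isArray(tags) && tags.length > 0 ? (tags as string[]) : undefined;
}

const decodedEmptyTable: unknown = JSON.parse("{}"); // what `{}` decodes to in JS

const leaked = oldCheck(decodedEmptyTable); // still the object
const guarded = normaliseTags(decodedEmptyTable); // undefined
const kept = normaliseTags(["alpha", "beta"]); // passed through as-is
const emptyArray = normaliseTags([]); // undefined
```

With the guard in place, only a real, non-empty array of tags ever reaches the `String[]` column.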

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.

Comment thread apps/webapp/app/v3/mollifier/mollifierDrainerHandler.server.ts