
fix(reward): scoring pipeline dirty-scan and skip-path fixes #1847

Open
chiefmojo wants to merge 6 commits into MemTensor:main from chiefmojo:pr/scoring-dirty-scan-skip-fixes


@chiefmojo

Summary

Six targeted fixes to the dirty-episode recovery and scoring skip path. These close infinite rescore loops that appeared in companion infrastructure after the v2.0.5 regression fixes (PR #1784) landed.

1. Paginate dirty-closed episode scan

collectDirtyClosedEpisodes() replaces the fixed list({ limit: 500 }) call and walks all pages, so installations with more than 500 closed episodes have every dirty one recovered on startup instead of having the scan silently truncated at the first 500 rows.
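
A minimal sketch of the pagination loop, under stated assumptions: the `offset` paging parameter, the `Episode` shape, the `isDirty` callback, and the in-memory `store` are hypothetical stand-ins, not the project's real API. Only the 500-row ceiling and the function name come from this PR.

```typescript
// Hypothetical shapes: the real episode store and its list() signature may differ.
interface Episode {
  id: string;
  rewardDirty: boolean;
}

interface ListOptions {
  limit: number;
  offset: number; // assumed paging parameter, not confirmed by the PR
}

type ListFn = (opts: ListOptions) => Episode[];

const PAGE_SIZE = 500; // the ceiling this PR attributes to clampLimit()

// Walk every page until a short (or empty) page signals the end of the result set.
function collectDirtyClosedEpisodes(
  list: ListFn,
  isDirty: (ep: Episode) => boolean,
): Episode[] {
  const dirty: Episode[] = [];
  for (let offset = 0; ; offset += PAGE_SIZE) {
    const page = list({ limit: PAGE_SIZE, offset });
    for (const ep of page) {
      if (isDirty(ep)) dirty.push(ep);
    }
    if (page.length < PAGE_SIZE) break;
  }
  return dirty;
}

// In-memory stand-in: 750 closed episodes, with the dirty ones all past row 500.
const store: Episode[] = Array.from({ length: 750 }, (_, i) => ({
  id: `ep-${i}`,
  rewardDirty: i >= 600,
}));

// list() that clamps the limit, so asking for 1000 rows still yields 500.
const clampedList: ListFn = ({ limit, offset }) =>
  store.slice(offset, offset + Math.min(limit, PAGE_SIZE));

const singlePage = clampedList({ limit: 1000, offset: 0 }).filter((ep) => ep.rewardDirty);
const allPages = collectDirtyClosedEpisodes(clampedList, (ep) => ep.rewardDirty);
console.log(singlePage.length, allPages.length); // prints "0 150"
```

The single clamped call sees none of the dirty episodes in this toy data, while the paginated walk finds all of them, which is the failure mode this fix targets.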

2. Set r_task=0 on skipped episodes

Previously, when reward.ts skipped an episode (too few exchanges, trivial content), it set reward.skipped: true but returned without writing r_task, so the episode stayed in the dirty-scan queue and was re-queued on every startup pass. Fix: write r_task=0 on the skip exit.

3. Clear rewardDirty flag on skip path

rewardDirty was only cleared on successful scoring. Skipped episodes kept the dirty flag and re-entered the scan on the next pass. Fix: clear rewardDirty on the skip path alongside setting r_task=0.
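
Fixes 2 and 3 together can be sketched roughly as follows. This is a simplified illustration, not the code in reward.ts: `EpisodeRecord`, `isRewardDirty`, `markSkippedBeforeFix`, and `markSkipped` are hypothetical names, and the dirty check models only the two conditions discussed here (flag set, or r_task never written).

```typescript
// Hypothetical, simplified episode record; field names follow the PR text.
interface EpisodeRecord {
  id: string;
  rTask: number | null;
  meta: { rewardDirty?: boolean; reward?: { skipped: boolean } };
}

// Simplified dirty check: the flag is set, or r_task was never written.
function isRewardDirty(ep: EpisodeRecord): boolean {
  return ep.meta.rewardDirty === true || ep.rTask === null;
}

// Skip exit before the fix: only records that the episode was skipped.
function markSkippedBeforeFix(ep: EpisodeRecord): EpisodeRecord {
  return { ...ep, meta: { ...ep.meta, reward: { skipped: true } } };
}

// Skip exit after the fix: record the skip, write r_task=0, drop the dirty flag.
function markSkipped(ep: EpisodeRecord): EpisodeRecord {
  const meta = { ...ep.meta, reward: { skipped: true } };
  delete meta.rewardDirty;
  return { ...ep, rTask: 0, meta };
}

const trivialEpisode: EpisodeRecord = {
  id: "ep-1",
  rTask: null,
  meta: { rewardDirty: true },
};

const dirtyBeforeFix = isRewardDirty(markSkippedBeforeFix(trivialEpisode));
const dirtyAfterFix = isRewardDirty(markSkipped(trivialEpisode));
console.log(dirtyBeforeFix, dirtyAfterFix); // prints "true false"
```

Either condition alone is enough to keep an episode in the scan, which is why both the r_task write and the flag clear are needed on the skip exit.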

4. Drain reward after dirty-closed recovery in lightweight mode

recoverDirtyClosedEpisodes() emits episode.finalized and calls flush(). In lightweight mode flush() returns before draining the reward subscriber. Fix: explicitly call rewardRunner.run() for each recovered episode that remains dirty after flush.
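
The control flow of this fix might look like the sketch below. It is written synchronously for brevity (the real calls are presumably asynchronous), and `RecoveryDeps`, `drainRewardAfterRecovery`, and `isStillDirty` are hypothetical names introduced for illustration. The second flush() follows the commit message, which says it drains downstream consumers.

```typescript
// Hypothetical dependency bundle; the real signatures may differ and are likely async.
interface RecoveryDeps {
  flush: () => void; // in lightweight mode, returns before the reward drain
  rewardRunner: { run: (episodeId: string) => void };
  isStillDirty: (episodeId: string) => boolean;
}

// After flush(), explicitly score every recovered episode that is still dirty.
function drainRewardAfterRecovery(recoveredIds: string[], deps: RecoveryDeps): string[] {
  deps.flush(); // drains the capture pass only
  const rescored: string[] = [];
  for (const id of recoveredIds) {
    if (deps.isStillDirty(id)) {
      deps.rewardRunner.run(id);
      rescored.push(id);
    }
  }
  deps.flush(); // second pass for downstream consumers
  return rescored;
}

// Toy wiring: flush() does no reward work, mimicking lightweight mode.
const dirtyIds = new Set<string>(["ep-a", "ep-b"]);
let flushCalls = 0;

const rescoredIds = drainRewardAfterRecovery(["ep-a", "ep-b", "ep-c"], {
  flush: () => {
    flushCalls += 1;
  },
  rewardRunner: {
    run: (id) => {
      dirtyIds.delete(id); // stand-in for a scoring run that clears the dirty state
    },
  },
  isStillDirty: (id) => dirtyIds.has(id),
});

console.log(rescoredIds, dirtyIds.size, flushCalls);
```

Episodes that flush() already cleaned (here `ep-c`) are left alone, so the explicit run only covers what lightweight mode skipped.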

5. Clear rewardDirty before recovery scoring

Recovery stamped closeReason and recoveryReason but did not clear rewardDirty. If the watchdog fires mid-scoring and leaves rTask null, the next startup re-picks the episode indefinitely. Fix: add rewardDirty: undefined to the updateMeta call before recovery scoring begins.
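
A rough sketch of why the ordering matters, assuming an `updateMeta` that merges a patch and drops keys set to undefined. The `db` map, this `updateMeta` implementation, and `recoverEpisode` are hypothetical stand-ins, and the thrown error merely simulates the watchdog cutting scoring short.

```typescript
type Meta = Record<string, unknown>;

// Hypothetical updateMeta: merges a patch and drops keys whose value is undefined.
function updateMeta(meta: Meta, patch: Meta): Meta {
  const merged: Meta = { ...meta, ...patch };
  for (const key of Object.keys(merged)) {
    if (merged[key] === undefined) delete merged[key];
  }
  return merged;
}

// In-memory stand-in for persisted episode metadata.
const db = new Map<string, Meta>();

function recoverEpisode(id: string, score: () => void): void {
  const current = db.get(id) ?? {};
  // Persist the stamps and clear the flag before scoring starts.
  db.set(
    id,
    updateMeta(current, {
      closeReason: "finalized",
      recoveryReason: "dirty_reward_rescore",
      rewardDirty: undefined,
    }),
  );
  score(); // may never return if the watchdog fires
}

db.set("ep-big", { rewardDirty: true });

try {
  recoverEpisode("ep-big", () => {
    throw new Error("simulated watchdog interrupt");
  });
} catch {
  // Scoring never completed; only the metadata written beforehand survives.
}

const metaAfterInterrupt = db.get("ep-big") ?? {};
console.log("rewardDirty" in metaAfterInterrupt); // prints "false"
```

Because the flag is cleared before the long-running step, an interrupted run leaves the persisted metadata without rewardDirty.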

6. Prevent open-episode crash-respawn loop on watchdog interrupt

episodeRewardIsDirty() matched any episode with closeReason="finalized" and no rTask, including those already stamped with recoveryReason="dirty_reward_rescore" by recovery. Adds a guard: episodes already in a recovery path are excluded from the dirty scan.
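
The guard can be illustrated with a reduced predicate. Only the finalized-without-rTask condition is modelled here; the real episodeRewardIsDirty() evaluates other conditions as well, and `EpisodeState` and `finalizedWithoutScoreIsDirty` are hypothetical names.

```typescript
// Hypothetical, reduced view of an episode; field names follow the PR text.
interface EpisodeState {
  closeReason?: string;
  recoveryReason?: string;
  rTask: number | null;
}

const DIRTY_REWARD_RESCORE = "dirty_reward_rescore";

// Models only the "finalized but never scored" condition plus the new guard.
function finalizedWithoutScoreIsDirty(ep: EpisodeState): boolean {
  if (ep.closeReason !== "finalized" || ep.rTask !== null) return false;
  // Guard: an episode already routed through recovery is not re-picked.
  return ep.recoveryReason !== DIRTY_REWARD_RESCORE;
}

const neverScored: EpisodeState = { closeReason: "finalized", rTask: null };
const interruptedRecovery: EpisodeState = {
  closeReason: "finalized",
  recoveryReason: DIRTY_REWARD_RESCORE,
  rTask: null,
};
const alreadyScored: EpisodeState = { closeReason: "finalized", rTask: 0.4 };

console.log(
  finalizedWithoutScoreIsDirty(neverScored), // true: still needs a first scoring pass
  finalizedWithoutScoreIsDirty(interruptedRecovery), // false: excluded by the guard
  finalizedWithoutScoreIsDirty(alreadyScored), // false: rTask is present
);
```

An episode that was never routed through recovery still qualifies, so a first scoring attempt is not suppressed by the guard.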

Test plan

  • Fresh DB: startup recovery with 0 dirty episodes — no change in behavior
  • DB with >500 closed dirty episodes: all pages are walked, all are recovered
  • Episode with too few exchanges: skipped once, r_task=0 written, not re-queued on next startup
  • Lightweight mode: recovery episodes are fully scored after flush()
  • Watchdog interrupt simulation: episode stamped with recoveryReason="dirty_reward_rescore" is not re-picked on next startup

🤖 Generated with Claude Code

chiefmojo and others added 6 commits May 31, 2026 17:52
The triviality-gate skip path wrote reward.skipped=true but never
cleared rewardDirty from meta_json. Since episodeRewardIsDirty()
checks the rewardDirty object flag before the skip gate, skipped
episodes with the flag set would re-enter the dirty scan on every
bridge restart, scoring and re-skipping indefinitely.

Normal scoring path already had rewardDirty: undefined — this
mirrors that pattern in the skip branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The skip path wrote reward.skipped=true but never called setRTask(),
leaving r_task=NULL. For abandoned episodes episodeRewardIsDirty()
falls through to the r_task==null check and returns true, causing
those episodes to re-enter the dirty scan on every bridge start,
get re-skipped, and loop indefinitely.

Setting r_task=0 before updateMeta means the null-r_task branch in
episodeRewardIsDirty() no longer fires, permanently clearing the
episode from the dirty scan after its first skip.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clampLimit() in _helpers.ts caps list() at 500 regardless of the
limit argument passed. The prior limit:1000 change (87165daf) was
a no-op — both values hit the same ceiling, leaving episodes beyond
rank 500 permanently invisible to the dirty scan.

Replace both scan sites (startup + periodic) with collectDirtyClosedEpisodes(),
which paginates in 500-row pages until exhausted. All closed episodes
are now covered regardless of total count.

This was also the root cause of the "dirty-17" mystery: those episodes
were at ranks 536-924, outside the 500-row window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mode

recoverDirtyClosedEpisodes relied on flush() → reward.drain() to fire
R_human scoring after the capture pass. flush() returns early in
lightweight mode (the default), so the reward subscriber's 30 s timer
was cancelled by shutdown() before it fired — leaving traceCount
permanently mismatched and the episode dirty on every restart.

Fix: after flush() drains the capture pass, explicitly call
rewardRunner.run() for any episode that episodeRewardIsDirty() still
considers dirty — mirroring the pattern already used by
recoverOpenEpisodesAsSessionEnd. A second flush() then drains
downstream (L2 / L3 / skills).

Regression test: dirty-reward recovery does not insert orphan traces
— seeded episode with traceCount=1 and 2 trace IDs (one having a tool
call whose endedAt differs from the trace ts, which produces an orphan
step in runReflect). Verifies that:
  1. trace_ids_json stays at 2 after recovery (orphan insert guard).
  2. traceCount is updated to 2 after the first recovery pass.
  3. A second restart does not re-score the episode (loop stopped).

Also fixes the pre-existing test "rescoring closed episodes when traces
were appended after the last reward" which failed for the same reason.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rash-respawn loop

recoverDirtyClosedEpisodes() emits episode.finalized and awaits flush()
which runs per-step capture reflection (potentially hundreds of LLM calls).
If the daemon init watchdog fires (120 s) before flush() completes, the
rewardDirty flag is never cleared by reward.ts — so the episode appears
dirty on every subsequent startup and triggers the same scoring attempt,
creating an infinite crash-respawn loop that hammers the configured LLM
at ~5,500 calls/hour.

Fix: clear rewardDirty in updateMeta before starting recovery. reward.ts
already sets rewardDirty: undefined on successful scoring (idempotent);
if the watchdog fires mid-scoring the flag is already gone, so the next
startup finds the episode clean and init completes in milliseconds.

Root cause of the incident: PR MemTensor#8's 120 s init watchdog (correct) combined
with a large episode (254 traces, 238 per-step reflection calls, ~160 s)
that had rewardDirty set from a follow_up reopen. The episode was never
able to finish scoring within the watchdog window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…terrupt

Both recoverOpenEpisodesAsSessionEnd and recoverDirtyClosedEpisodes stamp
recoveryReason=DIRTY_REWARD_RESCORE before emitting episode.finalized.
The condition-4 guard in episodeRewardIsDirty now excludes episodes with
this reason, so a watchdog-killed scoring run (rTask=null, closeReason=
finalized) no longer re-triggers rescoring on every subsequent startup.

Root cause: PR MemTensor#8's initWatchdog (120s default) interrupted scoring for
episodes with 80+ steps (~130s). The episode remained rTask=null with
closeReason=finalized — matching condition 4 exactly — and looped at
~30 restarts/hour consuming ~5,400 Qwen calls/hour.

Fixes MemTensor#11

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>