fix(reward): scoring pipeline dirty-scan and skip-path fixes #1847
Open
chiefmojo wants to merge 6 commits into
The triviality-gate skip path wrote reward.skipped=true but never cleared rewardDirty from meta_json. Since episodeRewardIsDirty() checks the rewardDirty object flag before the skip gate, skipped episodes with the flag set would re-enter the dirty scan on every bridge restart, scoring and re-skipping indefinitely. Normal scoring path already had rewardDirty: undefined — this mirrors that pattern in the skip branch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The skip path wrote reward.skipped=true but never called setRTask(), leaving r_task=NULL. For abandoned episodes episodeRewardIsDirty() falls through to the r_task==null check and returns true, causing those episodes to re-enter the dirty scan on every bridge start, get re-skipped, and loop indefinitely. Setting r_task=0 before updateMeta means the null-r_task branch in episodeRewardIsDirty() no longer fires, permanently clearing the episode from the dirty scan after its first skip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
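The two skip-path fixes above can be pictured together. This is a minimal sketch, assuming a simplified in-memory episode: the `SkipEpisode` type, `isDirtyAfterSkip`, `skipBeforeFix`, and `skipAfterFix` are hypothetical stand-ins for illustration, not the actual code in reward.ts.

```typescript
// Hypothetical, simplified episode shape; the real schema may differ.
interface SkipEpisode {
  id: string;
  rTask: number | null;
  meta: {
    rewardDirty?: boolean | undefined;
    reward?: { skipped: boolean };
  };
}

// Simplified stand-in for episodeRewardIsDirty(): the rewardDirty flag is
// checked first, then the null r_task fallback, as the commit messages describe.
function isDirtyAfterSkip(ep: SkipEpisode): boolean {
  if (ep.meta.rewardDirty) return true;
  if (ep.rTask === null) return true;
  return false;
}

// Skip path before the fixes: only records that the episode was skipped.
function skipBeforeFix(ep: SkipEpisode): void {
  ep.meta = { ...ep.meta, reward: { skipped: true } };
}

// Skip path after the fixes: also writes r_task=0 and clears rewardDirty.
function skipAfterFix(ep: SkipEpisode): void {
  ep.rTask = 0;
  ep.meta = { ...ep.meta, reward: { skipped: true }, rewardDirty: undefined };
}
```

In this sketch either leftover (the flag or the null r_task) is enough to keep the episode in the scan, which is why both writes are needed on the skip branch.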
clampLimit() in _helpers.ts caps list() at 500 regardless of the limit argument passed. The prior limit:1000 change (87165daf) was a no-op — both values hit the same ceiling, leaving episodes beyond rank 500 permanently invisible to the dirty scan. Replace both scan sites (startup + periodic) with collectDirtyClosedEpisodes(), which paginates in 500-row pages until exhausted. All closed episodes are now covered regardless of total count. This was also the root cause of the "dirty-17" mystery: those episodes were at ranks 536-924, outside the 500-row window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
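The effect of the 500-row ceiling and the paginated replacement can be sketched as follows. This assumes offset-based paging over an in-memory table; `Row`, the `offset` option, `collectDirtyRows`, and the body of `clampLimit` shown here are illustrative assumptions, not the real signatures in _helpers.ts.

```typescript
const MAX_PAGE = 500;

// Stand-in for clampLimit(): any requested limit is capped at 500.
function clampLimit(limit: number): number {
  return Math.min(Math.max(1, Math.floor(limit)), MAX_PAGE);
}

interface Row {
  rank: number;
  dirty: boolean;
}

// Hypothetical list(): offset pagination with the clamped limit applied.
function list(rows: Row[], opts: { limit: number; offset: number }): Row[] {
  const limit = clampLimit(opts.limit);
  return rows.slice(opts.offset, opts.offset + limit);
}

// Walk 500-row pages until a short (or empty) page signals exhaustion.
function collectDirtyRows(rows: Row[]): Row[] {
  const dirty: Row[] = [];
  for (let offset = 0; ; offset += MAX_PAGE) {
    const page = list(rows, { limit: MAX_PAGE, offset });
    dirty.push(...page.filter((r) => r.dirty));
    if (page.length < MAX_PAGE) break;
  }
  return dirty;
}
```

With 1,000 rows whose dirty entries sit at ranks 536-924, a single `list()` call with `limit: 1000` still returns only the first 500 rows and sees none of them, while the paginated walk finds all 389.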
…mode

recoverDirtyClosedEpisodes relied on flush() → reward.drain() to fire R_human scoring after the capture pass. flush() returns early in lightweight mode (the default), so the reward subscriber's 30 s timer was cancelled by shutdown() before it fired, leaving traceCount permanently mismatched and the episode dirty on every restart.

Fix: after flush() drains the capture pass, explicitly call rewardRunner.run() for any episode that episodeRewardIsDirty() still considers dirty, mirroring the pattern already used by recoverOpenEpisodesAsSessionEnd. A second flush() then drains downstream (L2 / L3 / skills).

Regression test: "dirty-reward recovery does not insert orphan traces". It seeds an episode with traceCount=1 and 2 trace IDs (one having a tool call whose endedAt differs from the trace ts, which produces an orphan step in runReflect). It verifies that:
1. trace_ids_json stays at 2 after recovery (orphan insert guard).
2. traceCount is updated to 2 after the first recovery pass.
3. A second restart does not re-score the episode (loop stopped).

Also fixes the pre-existing test "rescoring closed episodes when traces were appended after the last reward" which failed for the same reason.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
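The recovery order described in this commit (flush, explicit reward run for anything still dirty, second flush) can be sketched synchronously. Everything here is a simplified assumption: `RecoveredEpisode`, `hasStaleTraceCount`, `SketchRewardRunner`, and `sketchFlush` are hypothetical names, and the real functions are asynchronous and do considerably more work.

```typescript
interface RecoveredEpisode {
  id: string;
  traceCount: number;
  traceIds: string[];
}

// Stand-in dirty check: the recorded count lags the actual trace IDs.
function hasStaleTraceCount(ep: RecoveredEpisode): boolean {
  return ep.traceCount !== ep.traceIds.length;
}

// Hypothetical reward runner that records which episodes it scored.
class SketchRewardRunner {
  readonly scored: string[] = [];
  run(ep: RecoveredEpisode): void {
    ep.traceCount = ep.traceIds.length;
    this.scored.push(ep.id);
  }
}

// Stand-in flush(): in lightweight mode it returns without draining reward.
function sketchFlush(eps: RecoveredEpisode[], runner: SketchRewardRunner, lightweight: boolean): void {
  if (lightweight) return;
  for (const ep of eps) {
    if (hasStaleTraceCount(ep)) runner.run(ep);
  }
}

function recoverDirtyClosedSketch(
  eps: RecoveredEpisode[],
  runner: SketchRewardRunner,
  lightweight: boolean,
): void {
  sketchFlush(eps, runner, lightweight); // capture pass
  for (const ep of eps) {
    if (hasStaleTraceCount(ep)) runner.run(ep); // explicit run for anything still dirty
  }
  sketchFlush(eps, runner, lightweight); // drain downstream work
}
```

Running the sketch twice on an episode seeded with traceCount=1 and two trace IDs mirrors the regression test: the first pass scores it once and updates traceCount to 2, and the second pass finds nothing to do.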
…rash-respawn loop recoverDirtyClosedEpisodes() emits episode.finalized and awaits flush() which runs per-step capture reflection (potentially hundreds of LLM calls). If the daemon init watchdog fires (120 s) before flush() completes, the rewardDirty flag is never cleared by reward.ts — so the episode appears dirty on every subsequent startup and triggers the same scoring attempt, creating an infinite crash-respawn loop that hammers the configured LLM at ~5 500 calls/hour. Fix: clear rewardDirty in updateMeta before starting recovery. reward.ts already sets rewardDirty: undefined on successful scoring (idempotent); if the watchdog fires mid-scoring the flag is already gone, so the next startup finds the episode clean and init completes in milliseconds. Root cause of the incident: PR MemTensor#8's 120 s init watchdog (correct) combined with a large episode (254 traces, 238 per-step reflection calls, ~160 s) that had rewardDirty set from a follow_up reopen. The episode was never able to finish scoring within the watchdog window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
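Why clearing the flag before scoring breaks the loop can be shown with a small simulation. This is a sketch under stated assumptions: `WatchedEpisode`, `recoverOnce`, and `scoringAttempts` are hypothetical, and the watchdog kill is modelled as a scoring callback that returns false rather than as a real process exit.

```typescript
interface WatchedEpisode {
  rewardDirty?: boolean | undefined;
}

// Returns true when scoring finished, false when the simulated watchdog
// killed the process before scoring completed.
type Score = (ep: WatchedEpisode) => boolean;

function recoverOnce(ep: WatchedEpisode, score: Score, clearFirst: boolean): void {
  if (clearFirst) ep.rewardDirty = undefined; // the fix: clear before scoring starts
  const finished = score(ep);
  if (!finished) return; // process died; nothing after this point ran
  ep.rewardDirty = undefined; // success path clears the flag (idempotent)
}

// Count how many startups attempt scoring when every attempt is interrupted.
function scoringAttempts(clearFirst: boolean, restarts: number): number {
  const ep: WatchedEpisode = { rewardDirty: true };
  const alwaysInterrupted: Score = () => false;
  let attempts = 0;
  for (let i = 0; i < restarts; i++) {
    if (!ep.rewardDirty) continue; // episode looks clean; init completes quickly
    attempts++;
    recoverOnce(ep, alwaysInterrupted, clearFirst);
  }
  return attempts;
}
```

In this model, five restarts produce five scoring attempts when the flag is only cleared on success, and a single attempt when it is cleared up front. The trade-off visible in the sketch is that an interrupted episode is left unscored rather than retried.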
…terrupt Both recoverOpenEpisodesAsSessionEnd and recoverDirtyClosedEpisodes stamp recoveryReason=DIRTY_REWARD_RESCORE before emitting episode.finalized. The condition-4 guard in episodeRewardIsDirty now excludes episodes with this reason, so a watchdog-killed scoring run (rTask=null, closeReason= finalized) no longer re-triggers rescoring on every subsequent startup. Root cause: PR MemTensor#8's initWatchdog (120s default) interrupted scoring for episodes with 80+ steps (~130s). The episode remained rTask=null with closeReason=finalized — matching condition 4 exactly — and looped at ~30 restarts/hour consuming ~5,400 Qwen calls/hour. Fixes MemTensor#11 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
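The guard described above can be sketched in isolation. Only "condition 4" is modelled; `FinalizedEpisode` and `condition4IsDirty` are hypothetical names, the field layout is assumed, and the other conditions checked by episodeRewardIsDirty() are omitted.

```typescript
// Value taken from the PR description; the constant name follows the commit message.
const DIRTY_REWARD_RESCORE = "dirty_reward_rescore";

// Hypothetical, simplified episode shape for illustration only.
interface FinalizedEpisode {
  rTask: number | null;
  meta: {
    closeReason?: string;
    recoveryReason?: string;
  };
}

// Condition 4 with the new guard: a finalized episode without r_task counts
// as dirty unless recovery has already stamped it for rescoring.
function condition4IsDirty(ep: FinalizedEpisode): boolean {
  const finalizedUnscored =
    ep.meta.closeReason === "finalized" && ep.rTask === null;
  const alreadyInRecovery = ep.meta.recoveryReason === DIRTY_REWARD_RESCORE;
  return finalizedUnscored && !alreadyInRecovery;
}
```

An episode left with rTask=null by a watchdog-killed scoring run still matches the first half of the condition, but the stamp written before episode.finalized was emitted keeps it out of the next scan.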
Summary
Six targeted fixes to the dirty-episode recovery and scoring skip path. These close infinite rescore loops that appeared in companion infrastructure after the v2.0.5 regression fixes (PR #1784) landed.
1. Paginate dirty-closed episode scan
collectDirtyClosedEpisodes() replaces the fixed list({ limit: 500 }) call. Walks all pages so installations with >500 closed dirty episodes are fully recovered on startup, not silently truncated.

2. Set r_task=0 on skipped episodes
When reward.ts skips an episode (too few exchanges, trivial content), it sets reward.skipped: true but returns without writing r_task. The episode stays in the dirty-scan queue and is re-queued on every startup pass. Fix: write r_task=0 on skip exit.

3. Clear rewardDirty flag on skip path
rewardDirty was only cleared on successful scoring. Skipped episodes kept the dirty flag and re-entered the scan on the next pass. Fix: clear rewardDirty on the skip path alongside setting r_task=0.

4. Drain reward after dirty-closed recovery in lightweight mode
recoverDirtyClosedEpisodes() emits episode.finalized and calls flush(). In lightweight mode flush() returns before draining the reward subscriber. Fix: explicitly call rewardRunner.run() for each recovered episode that remains dirty after flush.

5. Clear rewardDirty before recovery scoring
Recovery stamped closeReason and recoveryReason but did not clear rewardDirty. If the watchdog fires mid-scoring and leaves rTask null, the next startup re-picks the episode indefinitely. Fix: add rewardDirty: undefined to the updateMeta call before recovery scoring begins.

6. Prevent open-episode crash-respawn loop on watchdog interrupt
episodeRewardIsDirty() matched any episode with closeReason="finalized" and no rTask, including those already stamped with recoveryReason="dirty_reward_rescore" by recovery. Adds a guard: episodes already in a recovery path are excluded from the dirty scan.

Test plan
r_task=0 written, not re-queued on next startup
flush()
recoveryReason="dirty_reward_rescore" is not re-picked on next startup

🤖 Generated with Claude Code