fix(consolidation): resolve orphaned-task and scheduling bugs#1767
Closed
savanne-kham wants to merge 2 commits into
Closed
fix(consolidation): resolve orphaned-task and scheduling bugs#1767savanne-kham wants to merge 2 commits into
savanne-kham wants to merge 2 commits into
Conversation
b230d4a to
e01139e
Compare
Fix a cluster of consolidation scheduling, recovery, and observability gaps that made interrupted consolidation hard to recover from. Root cause: old processing rows can have claimed_at=NULL. Stale detection did not recognize them, and bank-level serialization treated any processing consolidation row as active. That could leave a bank looking busy indefinitely even after the worker was alive again, so pending consolidation jobs for the same bank were skipped on every poll cycle. async_operations ───────────────────────────────────────────────────────────── consolidation processing claimed_at=NULL → orphaned blocker consolidation pending → skipped forever consolidation pending → skipped forever 1. Pass 2 orphan recovery (poller.py): reset processing tasks from any worker that are stuck >2h, including rows with NULL claimed_at. 2. Bank serialization guard (ops_postgresql.py, ops_oracle.py): only exclude banks with recent, non-NULL claimed_at. Orphaned or stale processing rows no longer permanently block new consolidation claims. 3. Operation visibility (memory_engine.py): write items_count and observation counters into result_metadata on completion. 4. pending_consolidation accuracy (memory_engine.py): exclude memories with consolidation_failed_at from the pending count. 5. Cancel stale processing ops (memory_engine.py): allow cancellation of processing ops claimed >5 min ago. Previously only pending ops could be cancelled without manual SQL. 6. Consolidation diagnostics (consolidator.py): log a WARNING when source_fact_ids don't match any memory UUID in the current batch. Real-world impact (live instance with 20k+ memory units): consolidation ran overnight until the LLM provider quota was rate-limited, then the interrupted processing row kept the bank locked instead of recovering cleanly. 530 observations, frozen for hours → 814 obs (+284 in 20 min) 11,900+ pending (inflated) → accurate counter 347 stale operations → Pass 2 recovery on restart Cancel stuck ops: API 409 → DELETE after 5-min grace period
A processing op with NULL claimed_at was never claimed by a worker, so cancelling it is safe (no worker to interrupt). Changed the test from expecting 409 to expecting 200, with full cancelled-status verification.
Collaborator
|
this can't happen in newer versions and we already fixed with a db migration - please fill a reproducer or explain how it could happen, closing for now |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix a cluster of consolidation scheduling, recovery, and observability gaps that made interrupted consolidation hard to recover from. The central issue is a
claimed_at IS NULLrecovery gap: oldprocessingrows can have no claim timestamp, stale detection does not recognize them, and the bank serialization guard can keep treating the bank as busy indefinitely.Root cause
Hindsight uses
async_operationsto coordinate background memory work such as retain, batch retain, and consolidation. The memories themselves live indocuments/memory_units; this table matters here because a single stuckprocessingqueue row can make the consolidation scheduler believe the whole bank is still busy.The blocker is bank-level serialization: Hindsight only allows one active consolidation per bank. Before this fix, any
processingconsolidation row made the bank look busy, even if it was orphaned withclaimed_at IS NULL. That putbank-xyzinbusy_bank_ids, so the pending consolidation rows for the same bank were skipped on every poll cycle.The flow is:
Result: a single orphaned row in
async_operationsis enough to permanently lock a bank out of consolidation. Aprocessingrow withclaimed_at IS NULLis not recognized as stale, so it can survive recovery and keep blocking the bank even when the worker is alive again. The bank serialization guard then treats that orphaned row as an active consolidation, leaving the bank frozen with no warning logged, for hours or indefinitely.The fix adds two guards: Pass 2 resets orphaned processing tasks at startup, and the bank guard now ignores rows with
claimed_at IS NULLorclaimed_atolder than 2 hours. This is safe for consolidation becauseasync_operationsis only the job queue. The durable source of truth ismemory_units; when the operation is retried, the worker reselects still-unconsolidated units from the bank.Real-world impact
These bugs were discovered on a live Hindsight instance with 20k+ memory units. In the incident, consolidation ran overnight until the LLM provider quota was rate-limited; after interruption/restart, the orphaned
processingrow kept the bank locked, so new consolidation requests deduplicated or stayed pending instead of recovering.hermesbank; 21,249 wereexperience/worldunitsGoUsageLimitErrorlog linesclaimable=1 ... assigned=0repeated 2,051 times[STUCK?]298 timesexperience/worldunits in DBWhat changed
worker/poller.py): startup now runs a second recovery pass for abandonedprocessingrows, including rows withclaimed_at IS NULL; batch parent operations stay on their existing recovery path.db/ops_postgresql.py,db/ops_oracle.py): a bank is considered busy only when it has a recently claimed consolidation task; orphanedNULLor >2h-old claims no longer block pending work.memory_engine.py): completed consolidation jobs now writeitems_countand observation counters intoresult_metadata.memory_engine.py): failed memory units are no longer counted as still pending consolidation.memory_engine.py): staleprocessingoperations can be cancelled through the API after the grace period instead of requiring manual SQL.consolidation/consolidator.py): create/update outputs whosesource_fact_idsmatch no memory in the batch now emit a warning with batch context.Tests (
tests/test_operation_status.py)_insert_operation()now accepts an optionalclaimed_atparameter, enabling tests to simulate operations at various claim ages.Unit test scenarios
claimed_at> 10 min ago → cancel succeeds (200), status =cancelledtest_cancel_allows_stale_processing_operationsclaimed_at IS NULL→ cancel succeeds (200), status =cancelled(never claimed, safe)test_cancel_allows_processing_with_null_claimed_atclaimed_at< 1 min ago → cancel rejected (409, within 5-min grace)test_cancel_rejects_terminal_and_recent_processing_operationscompleted,failed) → cancel rejected (409)cancelled(row preserved)test_cancel_sets_cancelled_statuspendingtest_retry_cancelled_operationtest_retry_rejects_non_retriable_statusesprocessingstatus (not collapsed topending)test_list_operations_returns_processing_status?status=processingreturns only processing opstest_list_operations_filter_by_processing?status=pendingexcludes processing opstest_list_operations_filter_by_pending_excludes_processingprocessingstatus correctlytest_get_operation_returns_processing_statustest_all_statuses_returned_correctlyTest plan
ruff format --checkclean on all touched filesclaimed_at=NULLand deadworker_id, restart Hindsight, verify second scan recovers itcancelledRelated
claimed_atnot reclaimed after worker crash (closed — matchedworker_idonly)worker_idonly)Checklist