fix(consolidation): resolve orphaned-task and scheduling bugs by savanne-kham · Pull Request #1767 · vectorize-io/hindsight

savanne-kham · 2026-05-26T16:03:24Z

Summary

Fix a cluster of consolidation scheduling, recovery, and observability gaps that made interrupted consolidation hard to recover from. The central issue is a claimed_at IS NULL recovery gap: old processing rows can have no claim timestamp, stale detection does not recognize them, and the bank serialization guard can keep treating the bank as busy indefinitely.

Root cause

Hindsight uses async_operations to coordinate background memory work such as retain, batch retain, and consolidation. The memories themselves live in documents / memory_units; this table matters here because a single stuck processing queue row can make the consolidation scheduler believe the whole bank is still busy.

-- async_operations, simplified to columns relevant to this bug
┌──────────┬────────────────┬────────────┬───────────┬────────────┬──────────────────────┐
│ bank_id  │ operation_type │ status     │ worker_id │ claimed_at │ effect               │
├──────────┼────────────────┼────────────┼───────────┼────────────┼──────────────────────┤
│ bank-xyz │ retain         │ completed  │ worker-1  │ 14:22      │ normal recent retain │
│ bank-xyz │ retain         │ completed  │ worker-1  │ 14:22      │ normal recent retain │
│ bank-xyz │ batch_retain   │ completed  │           │            │ parent aggregator    │
│ bank-xyz │ consolidation  │ completed  │ worker-1  │ 13:58      │ last successful run  │
│ bank-xyz │ consolidation  │ processing │ worker-1  │ NULL       │ orphaned blocker ✗   │
│ bank-xyz │ consolidation  │ pending    │           │            │ claimable, blocked ✗ │
│ bank-xyz │ consolidation  │ pending    │           │            │ claimable, blocked ✗ │
│ bank-xyz │ consolidation  │ pending    │           │            │ claimable, blocked ✗ │
└──────────┴────────────────┴────────────┴───────────┴────────────┴──────────────────────┘

The blocker is bank-level serialization: Hindsight only allows one active consolidation per bank. Before this fix, any processing consolidation row made the bank look busy, even if it was orphaned with claimed_at IS NULL. That put bank-xyz in busy_bank_ids, so the pending consolidation rows for the same bank were skipped on every poll cycle.

The flow is:

  Before fix                                      After fix
  ──────────                                      ─────────
  │                                               │
  │  async_operations has one stuck job row       │  Worker startup recovery
  │                                               │
  │    bank_id=bank-xyz                           │    Pass 1: reset this worker's
  │    operation_type=consolidation               │      own processing rows
  │    status=processing                          │
  │    claimed_at=NULL                            │    Pass 2: reset any processing
  │                                               │      row stuck > 2 hours,
  │  ─────────────────────────────                │      including claimed_at=NULL ✓
  │                                               │
  │  New consolidation jobs arrive                │  Bank serialization guard
  │                                               │
  │    bank_id=bank-xyz                           │    only treats recently claimed
  │    status=pending                             │    processing rows as active ✓
  │                                               │
  │  Bank serialization guard                     │
  │                                               │
  │    sees status=processing                     │
  │    assumes bank is busy                       │
  │    skips pending rows forever ✗               │
  │                                               │
  ▼                                               ▼
  claimable=3, assigned=0                         stuck job recovered or ignored;
                                                  pending rows can be claimed

Result: a single orphaned row in async_operations is enough to permanently lock a bank out of consolidation. A processing row with claimed_at IS NULL is not recognized as stale, so it can survive recovery and keep blocking the bank even when the worker is alive again. The bank serialization guard then treats that orphaned row as an active consolidation, leaving the bank frozen with no warning logged, for hours or indefinitely.

The fix adds two guards: Pass 2 resets orphaned processing tasks at startup, and the bank guard now ignores rows with claimed_at IS NULL or claimed_at older than 2 hours. This is safe for consolidation because async_operations is only the job queue. The durable source of truth is memory_units; when the operation is retried, the worker reselects still-unconsolidated units from the bank.

Real-world impact

These bugs were discovered on a live Hindsight instance with 20k+ memory units. In the incident, consolidation ran overnight until the LLM provider quota was rate-limited; after interruption/restart, the orphaned processing row kept the bank locked, so new consolidation requests deduplicated or stayed pending instead of recovering.

Evidence from the live incident	Operational impact
22,103 memory units in the `hermes` bank; 21,249 were `experience` / `world` units	Large consolidation backlog
1,458 consolidation HTTP 429 log lines; 9,945 `GoUsageLimitError` log lines	Provider quota was exhausted during consolidation
`claimable=1 ... assigned=0` repeated 2,051 times	Work looked ready but stayed unassigned
Active consolidation reported as `[STUCK?]` 298 times	Processing row needed recovery / cancellation
1,803 failed `experience` / `world` units in DB	Pending counter was inflated by failed work

What changed

Recover orphaned processing tasks (worker/poller.py): startup now runs a second recovery pass for abandoned processing rows, including rows with claimed_at IS NULL; batch parent operations stay on their existing recovery path.
Fix bank serialization (db/ops_postgresql.py, db/ops_oracle.py): a bank is considered busy only when it has a recently claimed consolidation task; orphaned NULL or >2h-old claims no longer block pending work.
Make operation progress visible (memory_engine.py): completed consolidation jobs now write items_count and observation counters into result_metadata.
Fix pending consolidation counts (memory_engine.py): failed memory units are no longer counted as still pending consolidation.
Allow safe cancellation of stuck processing ops (memory_engine.py): stale processing operations can be cancelled through the API after the grace period instead of requiring manual SQL.
Log rejected LLM source IDs (consolidation/consolidator.py): create/update outputs whose source_fact_ids match no memory in the batch now emit a warning with batch context.

Tests (`tests/test_operation_status.py`)

_insert_operation() now accepts an optional claimed_at parameter, enabling tests to simulate operations at various claim ages.

cd hindsight-api-slim
uv sync --extra test --extra local-ml --extra embedded-db
uv run pytest tests/test_operation_status.py -q
# 11 passed

Unit test scenarios

Scenario	Test
`claimed_at` > 10 min ago → cancel succeeds (200), status = `cancelled`	`test_cancel_allows_stale_processing_operations`
`claimed_at IS NULL` → cancel succeeds (200), status = `cancelled` (never claimed, safe)	`test_cancel_allows_processing_with_null_claimed_at`
`claimed_at` < 1 min ago → cancel rejected (409, within 5-min grace)	`test_cancel_rejects_terminal_and_recent_processing_operations`
Terminal statuses (`completed`, `failed`) → cancel rejected (409)	(same test as above, first loop)
Cancel pending operation → status becomes `cancelled` (row preserved)	`test_cancel_sets_cancelled_status`
Retry cancelled operation → status becomes `pending`	`test_retry_cancelled_operation`
Retry pending/processing/completed → rejected (409)	`test_retry_rejects_non_retriable_statuses`
List endpoint preserves `processing` status (not collapsed to `pending`)	`test_list_operations_returns_processing_status`
Filter `?status=processing` returns only processing ops	`test_list_operations_filter_by_processing`
Filter `?status=pending` excludes processing ops	`test_list_operations_filter_by_pending_excludes_processing`
Get endpoint returns `processing` status correctly	`test_get_operation_returns_processing_status`
All five statuses round-trip through list and get	`test_all_statuses_returned_correctly`

Test plan

Ruff + ruff format --check clean on all touched files
Type annotations consistent with existing codebase style

Unit tests:

cd hindsight-api-slim
HINDSIGHT_API_LLM_PROVIDER=none uv run pytest tests/test_operation_status.py -q
# 11 passed

Orphan recovery: insert processing task with claimed_at=NULL and dead worker_id, restart Hindsight, verify second scan recovers it
Cancel stale processing operation: trigger consolidation, wait 6 min for grace period, cancel via API, verify status = cancelled
items_count in result_metadata: after consolidation completes, verify counters present and non-zero

Stale claimed_at not reclaimed after worker crash — operations stuck in 'processing' forever #991 Stale claimed_at not reclaimed after worker crash (closed — matched worker_id only)
Zombie worker tasks permanently block consolidation after container restart #1719 Zombie worker tasks permanently block consolidation after container restart (closed — matched worker_id only)
Async operations need poison quarantine / degraded queue health #1671 Async operations need poison quarantine / degraded queue health (open — partially addressed by better recovery, cancellation, and visibility)

Checklist

🐛 Bug fix
🔒 No secrets disclosed
📝 Conventional Commits
🔄 Rebased on upstream/main
🗄️ PostgreSQL + Oracle paths updated symmetrically
Ruff format + lint clean on touched files

Fix a cluster of consolidation scheduling, recovery, and observability gaps that made interrupted consolidation hard to recover from. Root cause: old processing rows can have claimed_at=NULL. Stale detection did not recognize them, and bank-level serialization treated any processing consolidation row as active. That could leave a bank looking busy indefinitely even after the worker was alive again, so pending consolidation jobs for the same bank were skipped on every poll cycle. async_operations ───────────────────────────────────────────────────────────── consolidation processing claimed_at=NULL → orphaned blocker consolidation pending → skipped forever consolidation pending → skipped forever 1. Pass 2 orphan recovery (poller.py): reset processing tasks from any worker that are stuck >2h, including rows with NULL claimed_at. 2. Bank serialization guard (ops_postgresql.py, ops_oracle.py): only exclude banks with recent, non-NULL claimed_at. Orphaned or stale processing rows no longer permanently block new consolidation claims. 3. Operation visibility (memory_engine.py): write items_count and observation counters into result_metadata on completion. 4. pending_consolidation accuracy (memory_engine.py): exclude memories with consolidation_failed_at from the pending count. 5. Cancel stale processing ops (memory_engine.py): allow cancellation of processing ops claimed >5 min ago. Previously only pending ops could be cancelled without manual SQL. 6. Consolidation diagnostics (consolidator.py): log a WARNING when source_fact_ids don't match any memory UUID in the current batch. Real-world impact (live instance with 20k+ memory units): consolidation ran overnight until the LLM provider quota was rate-limited, then the interrupted processing row kept the bank locked instead of recovering cleanly. 530 observations, frozen for hours → 814 obs (+284 in 20 min) 11,900+ pending (inflated) → accurate counter 347 stale operations → Pass 2 recovery on restart Cancel stuck ops: API 409 → DELETE after 5-min grace period

A processing op with NULL claimed_at was never claimed by a worker, so cancelling it is safe (no worker to interrupt). Changed the test from expecting 409 to expecting 200, with full cancelled-status verification.

nicoloboschi · 2026-05-27T14:34:12Z

this can't happen in newer versions and we already fixed with a db migration - please fill a reproducer or explain how it could happen, closing for now

savanne-kham force-pushed the fix/consolidation-scheduling branch from b230d4a to e01139e Compare May 26, 2026 18:43

savanne-kham added 2 commits May 26, 2026 20:44

nicoloboschi closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(consolidation): resolve orphaned-task and scheduling bugs#1767

fix(consolidation): resolve orphaned-task and scheduling bugs#1767
savanne-kham wants to merge 2 commits into
vectorize-io:mainfrom
savanne-kham:fix/consolidation-scheduling

savanne-kham commented May 26, 2026 •

edited

Loading

Uh oh!

nicoloboschi commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

savanne-kham commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Real-world impact

What changed

Tests (tests/test_operation_status.py)

Test plan

Related

Checklist

Uh oh!

nicoloboschi commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

savanne-kham commented May 26, 2026 •

edited

Loading

Tests (`tests/test_operation_status.py`)