Skip to content

fix(consolidation): resolve orphaned-task and scheduling bugs#1767

Closed
savanne-kham wants to merge 2 commits into
vectorize-io:mainfrom
savanne-kham:fix/consolidation-scheduling
Closed

fix(consolidation): resolve orphaned-task and scheduling bugs#1767
savanne-kham wants to merge 2 commits into
vectorize-io:mainfrom
savanne-kham:fix/consolidation-scheduling

Conversation

@savanne-kham
Copy link
Copy Markdown

@savanne-kham savanne-kham commented May 26, 2026

Summary

Fix a cluster of consolidation scheduling, recovery, and observability gaps that made interrupted consolidation hard to recover from. The central issue is a claimed_at IS NULL recovery gap: old processing rows can have no claim timestamp, stale detection does not recognize them, and the bank serialization guard can keep treating the bank as busy indefinitely.

Root cause

Hindsight uses async_operations to coordinate background memory work such as retain, batch retain, and consolidation. The memories themselves live in documents / memory_units; this table matters here because a single stuck processing queue row can make the consolidation scheduler believe the whole bank is still busy.

-- async_operations, simplified to columns relevant to this bug
┌──────────┬────────────────┬────────────┬───────────┬────────────┬──────────────────────┐
│ bank_id  │ operation_type │ status     │ worker_id │ claimed_at │ effect               │
├──────────┼────────────────┼────────────┼───────────┼────────────┼──────────────────────┤
│ bank-xyz │ retain         │ completed  │ worker-1  │ 14:22      │ normal recent retain │
│ bank-xyz │ retain         │ completed  │ worker-1  │ 14:22      │ normal recent retain │
│ bank-xyz │ batch_retain   │ completed  │           │            │ parent aggregator    │
│ bank-xyz │ consolidation  │ completed  │ worker-1  │ 13:58      │ last successful run  │
│ bank-xyz │ consolidation  │ processing │ worker-1  │ NULL       │ orphaned blocker ✗   │
│ bank-xyz │ consolidation  │ pending    │           │            │ claimable, blocked ✗ │
│ bank-xyz │ consolidation  │ pending    │           │            │ claimable, blocked ✗ │
│ bank-xyz │ consolidation  │ pending    │           │            │ claimable, blocked ✗ │
└──────────┴────────────────┴────────────┴───────────┴────────────┴──────────────────────┘

The blocker is bank-level serialization: Hindsight only allows one active consolidation per bank. Before this fix, any processing consolidation row made the bank look busy, even if it was orphaned with claimed_at IS NULL. That put bank-xyz in busy_bank_ids, so the pending consolidation rows for the same bank were skipped on every poll cycle.

The flow is:

  Before fix                                      After fix
  ──────────                                      ─────────
  │                                               │
  │  async_operations has one stuck job row       │  Worker startup recovery
  │                                               │
  │    bank_id=bank-xyz                           │    Pass 1: reset this worker's
  │    operation_type=consolidation               │      own processing rows
  │    status=processing                          │
  │    claimed_at=NULL                            │    Pass 2: reset any processing
  │                                               │      row stuck > 2 hours,
  │  ─────────────────────────────                │      including claimed_at=NULL ✓
  │                                               │
  │  New consolidation jobs arrive                │  Bank serialization guard
  │                                               │
  │    bank_id=bank-xyz                           │    only treats recently claimed
  │    status=pending                             │    processing rows as active ✓
  │                                               │
  │  Bank serialization guard                     │
  │                                               │
  │    sees status=processing                     │
  │    assumes bank is busy                       │
  │    skips pending rows forever ✗               │
  │                                               │
  ▼                                               ▼
  claimable=3, assigned=0                         stuck job recovered or ignored;
                                                  pending rows can be claimed

Result: a single orphaned row in async_operations is enough to permanently lock a bank out of consolidation. A processing row with claimed_at IS NULL is not recognized as stale, so it can survive recovery and keep blocking the bank even when the worker is alive again. The bank serialization guard then treats that orphaned row as an active consolidation, leaving the bank frozen with no warning logged, for hours or indefinitely.

The fix adds two guards: Pass 2 resets orphaned processing tasks at startup, and the bank guard now ignores rows with claimed_at IS NULL or claimed_at older than 2 hours. This is safe for consolidation because async_operations is only the job queue. The durable source of truth is memory_units; when the operation is retried, the worker reselects still-unconsolidated units from the bank.

Real-world impact

These bugs were discovered on a live Hindsight instance with 20k+ memory units. In the incident, consolidation ran overnight until the LLM provider quota was rate-limited; after interruption/restart, the orphaned processing row kept the bank locked, so new consolidation requests deduplicated or stayed pending instead of recovering.

Evidence from the live incident Operational impact
22,103 memory units in the hermes bank; 21,249 were experience / world units Large consolidation backlog
1,458 consolidation HTTP 429 log lines; 9,945 GoUsageLimitError log lines Provider quota was exhausted during consolidation
claimable=1 ... assigned=0 repeated 2,051 times Work looked ready but stayed unassigned
Active consolidation reported as [STUCK?] 298 times Processing row needed recovery / cancellation
1,803 failed experience / world units in DB Pending counter was inflated by failed work

What changed

  • Recover orphaned processing tasks (worker/poller.py): startup now runs a second recovery pass for abandoned processing rows, including rows with claimed_at IS NULL; batch parent operations stay on their existing recovery path.
  • Fix bank serialization (db/ops_postgresql.py, db/ops_oracle.py): a bank is considered busy only when it has a recently claimed consolidation task; orphaned NULL or >2h-old claims no longer block pending work.
  • Make operation progress visible (memory_engine.py): completed consolidation jobs now write items_count and observation counters into result_metadata.
  • Fix pending consolidation counts (memory_engine.py): failed memory units are no longer counted as still pending consolidation.
  • Allow safe cancellation of stuck processing ops (memory_engine.py): stale processing operations can be cancelled through the API after the grace period instead of requiring manual SQL.
  • Log rejected LLM source IDs (consolidation/consolidator.py): create/update outputs whose source_fact_ids match no memory in the batch now emit a warning with batch context.

Tests (tests/test_operation_status.py)

_insert_operation() now accepts an optional claimed_at parameter, enabling tests to simulate operations at various claim ages.

cd hindsight-api-slim
uv sync --extra test --extra local-ml --extra embedded-db
uv run pytest tests/test_operation_status.py -q
# 11 passed
Unit test scenarios
Scenario Test
claimed_at > 10 min ago → cancel succeeds (200), status = cancelled test_cancel_allows_stale_processing_operations
claimed_at IS NULL → cancel succeeds (200), status = cancelled (never claimed, safe) test_cancel_allows_processing_with_null_claimed_at
claimed_at < 1 min ago → cancel rejected (409, within 5-min grace) test_cancel_rejects_terminal_and_recent_processing_operations
Terminal statuses (completed, failed) → cancel rejected (409) (same test as above, first loop)
Cancel pending operation → status becomes cancelled (row preserved) test_cancel_sets_cancelled_status
Retry cancelled operation → status becomes pending test_retry_cancelled_operation
Retry pending/processing/completed → rejected (409) test_retry_rejects_non_retriable_statuses
List endpoint preserves processing status (not collapsed to pending) test_list_operations_returns_processing_status
Filter ?status=processing returns only processing ops test_list_operations_filter_by_processing
Filter ?status=pending excludes processing ops test_list_operations_filter_by_pending_excludes_processing
Get endpoint returns processing status correctly test_get_operation_returns_processing_status
All five statuses round-trip through list and get test_all_statuses_returned_correctly

Test plan

  • Ruff + ruff format --check clean on all touched files
  • Type annotations consistent with existing codebase style
  • Unit tests:
    cd hindsight-api-slim
    HINDSIGHT_API_LLM_PROVIDER=none uv run pytest tests/test_operation_status.py -q
    # 11 passed
  • Orphan recovery: insert processing task with claimed_at=NULL and dead worker_id, restart Hindsight, verify second scan recovers it
  • Cancel stale processing operation: trigger consolidation, wait 6 min for grace period, cancel via API, verify status = cancelled
  • items_count in result_metadata: after consolidation completes, verify counters present and non-zero

Related

Checklist

  • 🐛 Bug fix
  • 🔒 No secrets disclosed
  • 📝 Conventional Commits
  • 🔄 Rebased on upstream/main
  • 🗄️ PostgreSQL + Oracle paths updated symmetrically
  • Ruff format + lint clean on touched files

@savanne-kham savanne-kham force-pushed the fix/consolidation-scheduling branch from b230d4a to e01139e Compare May 26, 2026 18:43
Fix a cluster of consolidation scheduling, recovery, and observability gaps
that made interrupted consolidation hard to recover from.

Root cause: old processing rows can have claimed_at=NULL. Stale detection
did not recognize them, and bank-level serialization treated any processing
consolidation row as active. That could leave a bank looking busy indefinitely
even after the worker was alive again, so pending consolidation jobs for the
same bank were skipped on every poll cycle.

  async_operations
  ─────────────────────────────────────────────────────────────
  consolidation processing claimed_at=NULL  → orphaned blocker
  consolidation pending                      → skipped forever
  consolidation pending                      → skipped forever

1. Pass 2 orphan recovery (poller.py): reset processing tasks from any
   worker that are stuck >2h, including rows with NULL claimed_at.

2. Bank serialization guard (ops_postgresql.py, ops_oracle.py): only
   exclude banks with recent, non-NULL claimed_at. Orphaned or stale
   processing rows no longer permanently block new consolidation claims.

3. Operation visibility (memory_engine.py): write items_count and
   observation counters into result_metadata on completion.

4. pending_consolidation accuracy (memory_engine.py): exclude memories
   with consolidation_failed_at from the pending count.

5. Cancel stale processing ops (memory_engine.py): allow cancellation of
   processing ops claimed >5 min ago. Previously only pending ops could be
   cancelled without manual SQL.

6. Consolidation diagnostics (consolidator.py): log a WARNING when
   source_fact_ids don't match any memory UUID in the current batch.

Real-world impact (live instance with 20k+ memory units): consolidation ran
overnight until the LLM provider quota was rate-limited, then the interrupted
processing row kept the bank locked instead of recovering cleanly.

  530 observations, frozen for hours → 814 obs (+284 in 20 min)
  11,900+ pending (inflated)      → accurate counter
  347 stale operations            → Pass 2 recovery on restart
  Cancel stuck ops: API 409       → DELETE after 5-min grace period
A processing op with NULL claimed_at was never claimed by a worker, so
cancelling it is safe (no worker to interrupt). Changed the test from
expecting 409 to expecting 200, with full cancelled-status verification.
@nicoloboschi
Copy link
Copy Markdown
Collaborator

this can't happen in newer versions and we already fixed with a db migration - please fill a reproducer or explain how it could happen, closing for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants