Skip to content

fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend#1668

Open
isac322 wants to merge 3 commits into
vectorize-io:mainfrom
isac322:fix/vchord-cosine-opclass
Open

fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend#1668
isac322 wants to merge 3 commits into
vectorize-io:mainfrom
isac322:fix/vchord-cosine-opclass

Conversation

@isac322
Copy link
Copy Markdown
Contributor

@isac322 isac322 commented May 20, 2026

Closes #1667.

This PR fixes two related vchord-only correctness bugs that together explain the high CPU on vchord deployments reported in #1667:

  1. Opclass mismatch (the hot one). _vector_index.py mapped vchord to vchordrq (embedding vector_l2_ops), but every ANN query in the engine uses <=> (cosine distance), which vector_l2_ops does not match. The planner ignored the index entirely and fell back to sequential scan + per-row cosine computation, so the production primary sat at ~870m CPU during retain.
  2. Tuning GUC mismatch. SET LOCAL hnsw.ef_search = 60 (and SET hnsw.ef_search = 200 at pool init) only exists in pgvector. On a vchord deployment both statements silently no-op'd, which meant the "recall-vs-latency" trade-off the code was designed around had never actually been applied to vchord backends.

What changes

Switch the vchord opclass to vector_cosine_ops:

  • _vector_index.py:_INDEX_USING_CLAUSES["vchord"] flipped to vector_cosine_ops.
  • The four historical migrations that create vchord indexes inline get the same flip so fresh installs end up on cosine ops from the first migration onward: 5a366d414dce, a4b5c6d7e8f9, d5e6f7a8b9c0, n9i0j1k2l3m4.
  • A new online migration b8c9d0e1f2a3_vchord_cosine_opclass rebuilds any existing vector_l2_ops vchordrq index with vector_cosine_ops via CREATE INDEX CONCURRENTLY + drop + rename. No-op when the configured extension is not vchord or when no L2-ops vchordrq indexes are found.

Dispatch ANN tuning GUCs by backend:

  • New helper _vector_index.ann_search_tuning_settings(ext, kind) returns (guc_name, value) pairs per backend. kind="low_latency" powers the retain-side ANN probe (was hnsw.ef_search = 60), kind="high_recall" powers the pool init (was hnsw.ef_search = 200).
  • Defaults: pgvector keeps hnsw.ef_search = 60 / 200; vchord uses vchordrq.probes = 10 / 30. pgvectorscale, pg_diskann, scann return no statements (no equivalent knob exposed today).
  • engine/retain/link_utils.py and engine/memory_engine.py switched to the dispatcher. Callers still own SET vs SET LOCAL since the two sites have different lifetime requirements.

Why

See #1667 for the full diagnosis. On the production cluster the same ANN query goes from 143 ms with <=> against vector_l2_ops (sequential scan, Buffers: shared hit=71531) to 25 ms with the same query body against a vector_cosine_ops index (Index Scan using idx_mu_emb_expr_<hash>, Buffers: shared hit=770 read=520) — ~5.6× per-query, and because the retain pipeline wraps this in CROSS JOIN LATERAL over _ann_seeds the penalty multiplies by seed count.

VectorChord's docs recommend vector_cosine_ops for cosine-similarity datasets; the engine's embeddings are L2-normalized in practice, so the existing <=> SQL is what we want. pgvector / pgvectorscale / pg_diskann were already on cosine ops, so this also restores cross-backend consistency.

Test plan

  • uv run ruff check on the changed files — pass
  • uv run ty check hindsight_api/ — pass
  • uv run pytest tests/test_vector_index.py tests/test_link_utils.py tests/test_migration_shape.py — 119 passed (new regression assertions: test_index_using_clause_vchord_uses_cosine_ops, test_ann_search_tuning_settings_{pgvector,vchord,…}, and the existing test_uses_set_local_for_ann_tuning parametrized over pgvector/vchord)
  • Manual on a vchord docker-compose stack (tensorchord/vchord-suite:pg18-latest): ran the migration cleanly, confirmed pg_indexes shows vector_cosine_ops after both the fresh-install path and a downgrade → upgrade roundtrip from a synthesized vector_l2_ops "broken" state, and confirmed EXPLAIN picks the vchordrq index for the <=> query. CPU return-to-baseline observation deferred to staging/prod rollout. See PR comment for full transcript.

Migration impact

For vchord deployments the new migration rebuilds each vchordrq index online with CREATE INDEX CONCURRENTLY. Disk usage doubles for each index during its rebuild; queries continue to run on the old index until the swap finishes. Downgrade rebuilds back to vector_l2_ops using the same online flow. After the migration completes, vchord pools start applying vchordrq.probes instead of the silently-ignored hnsw.ef_search.

@isac322
Copy link
Copy Markdown
Contributor Author

isac322 commented May 20, 2026

The verify-generated-files failure on this PR is pre-existing on main and unrelated to the changes here. The job reports drift on:

  • hindsight-api-slim/hindsight_api/engine/providers/openai_compatible_llm.py
  • skills/hindsight-docs/references/developer/models.md
  • skills/hindsight-docs/references/faq.md

…which are LLM-provider and docs-generation outputs. This PR only touches _vector_index.py, two engine files (engine/retain/link_utils.py, engine/memory_engine.py), five Alembic migrations under hindsight_api/alembic/versions/, and tests/test_vector_index.py + tests/test_link_utils.py — none of which are inputs to those generators.

Happy to rebase once main is regenerated, or to include the regenerated files in this PR if that's preferable. Let me know which direction you'd prefer.

Copy link
Copy Markdown

@elliotclee elliotclee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noticing this issue and submitting the patch. Just one small change requested below...



def _configured_vector_extension() -> str:
return os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a function (such as this one) in the main hindsight codebase that all places call to find out the configured vector extension. I don't think an alembic migration file is the right place to host such a function, and there are a few places in this patchset that should be using a function like this but are doing the getenv() call directly instead.

This matters because the method of accessing the configuration setting could concievably change in the future, but more importantly because the default value needs to be configurable from one spot.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c4011bc. New configured_vector_extension() lives in hindsight_api/_vector_index.py and both runtime call sites (link_utils.py, memory_engine.py) and this migration now call it instead of doing os.getenv() directly.

One thing I wanted to flag before merging: hindsight has a deliberate test (test_alembic_vector_migrations_freeze_vector_sql_locally) that explicitly forbids the existing vchord-related migrations from importing hindsight_api._vector_index. The intent reads as "each historical migration's vector SQL must be frozen so it can't drift later if _vector_index.py is changed". The four legacy migrations (5a366d414dce, a4b5c6d7e8f9, d5e6f7a8b9c0, n9i0j1k2l3m4) each keep their own inline _vector_index_using_clause() helper for the same reason.

For this PR I narrowed that frozen-state list to those four legacy migrations and let the new b8c9d0e1f2a3 opt into the shared helper, since that matches what you asked for. But I can imagine three ways to draw the line:

  1. Current PR — only the new migration uses the shared helper; legacy 4 stay frozen. ("Don't change history, do use the shared helper going forward.")
  2. Strict frozen-state — revert the new migration to its own inline helper too, so all five vchord migrations look identical. The runtime call-site dedup (link_utils.py, memory_engine.py) still stands.
  3. Drop the frozen-state policy entirely — port the legacy four to the shared helper too and delete the freeze test.

If (2) is what you'd prefer (i.e. you'd like the new migration to stay frozen too), happy to swap. (3) is a bigger policy call that probably belongs in a separate PR if it's worth doing at all. Which direction would you like?

# value tuned for top-50 semantic link creation (lower recall but much
# lower latency than the recall-side default). SET LOCAL auto-reverts
# at commit, so we don't pollute the pool for subsequent queries.
ext = os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c4011bc — see reply on the first thread for the trade-off discussion around the frozen-state test.

# returns the right one for the configured extension, tuned for
# the higher recall the per-fact_type semantic queries in
# retrieve_semantic_bm25_combined() need.
ext = os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c4011bc — see reply on the first thread for the trade-off discussion around the frozen-state test.

isac322 added a commit to isac322/hindsight that referenced this pull request May 21, 2026
Per review on vectorize-io#1668: the env-var lookup that decides which vector backend
is configured was duplicated in three places (the new migration plus the
two runtime call sites in engine/retain/link_utils.py and
engine/memory_engine.py). Centralize the read + validation in
hindsight_api._vector_index.configured_vector_extension() so the default
value and the access mechanism live in one spot.

The new migration b8c9d0e1f2a3_vchord_cosine_opclass now imports the
shared helper instead of inlining its own. The four legacy vchord
migrations stay frozen (they keep their inline helpers); the frozen-state
test is narrowed to that legacy set so future vchord migrations can opt
into the shared helper on a per-migration basis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@elliotclee elliotclee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Copy Markdown
Collaborator

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice diagnosis and a well-scoped fix — the EXPLAIN evidence in the description (143ms seq scan → 25ms index scan) makes the root cause unambiguous, and the per-backend dispatcher is a cleaner abstraction than the previous hnsw.ef_search-everywhere assumption. Particularly liked:

  • Flipping the four historical migrations to vector_cosine_ops so fresh vchord installs don't bootstrap-then-rebuild.
  • configured_vector_extension() centralizing the env lookup instead of scattering os.getenv calls.
  • Parametrized test_uses_set_local_for_ann_tuning over pgvector/vchord, and the explicit empty-tuple assertions for backends without an equivalent knob.
  • Renaming changed_migrationsfrozen_migrations in test_alembic_vector_migrations_freeze_vector_sql_locally with a comment carve-out for the new migration — that's the right call.

Requesting changes on a few items before merge:

Must address

1. CREATE INDEX CONCURRENTLY failure-recovery in b8c9d0e1f2a3_vchord_cosine_opclass.py.
If _rebuild_vchordrq_indexes errors midway (disk pressure, lock conflict, etc.), Postgres leaves the partial index as INVALID. On re-run:

  • CREATE INDEX CONCURRENTLY IF NOT EXISTS skips the INVALID temp index (it exists).
  • DROP INDEX IF EXISTS drops the old original.
  • ALTER INDEX … RENAME promotes the INVALID temp index into the canonical name.

Result: a broken index masquerading as the original, and queries silently fall back to seq scan again — exactly the bug we're fixing. Suggest either (a) DROP INDEX IF EXISTS … __opclass_swap at the top of each loop iteration, or (b) querying pg_index.indisvalid after the CONCURRENTLY create and raising if false.

2. Schema-prefix inconsistency. The new migration uses context.config.get_main_option(\"target_schema\") or \"public\", while every other PG migration in the tree uses _pg_schema_prefix() which returns \"\" when unset. Align with the existing convention so behavior is the same when target_schema is not configured.

3. Manual vchord verification is unchecked. The test plan's last bullet (run on a real vchord cluster, verify pg_indexes shows vector_cosine_ops, confirm EXPLAIN picks the index, observe CPU return to baseline) is still [ ]. Given this is the whole point of the PR, please complete it before merge and post the result here — automated tests don't exercise the migration against a real vchord backend.

Nice to have

4. GUC values lack a citation. vchordrq.probes = 10 / 30 look heuristic compared to the pgvector 60 / 200 numbers that presumably have history. If the chosen values land on the wrong knee of the recall/latency curve, we'd be substituting one silent miscalibration for another. Either a quick benchmark note in the PR or a follow-up tuning task would be enough.

5. indexdef.replace(idx_name, temp_name, 1) relies on the index name being the first textual occurrence in pg_indexes.indexdef — true in practice for the canonical reconstructed form, but worth a one-line comment noting the assumption.

6. _init_connection exception swallowing. Now that the dispatcher returns an empty tuple for backends without a knob, the broad except Exception is doing less work than it used to. Tightening to asyncpg.exceptions.PostgresError would avoid masking real bugs.

Overall: solid fix, scoped tightly to vchord, with the rest of the backends provably unchanged. The CONCURRENTLY recovery gap is the only thing I'd block on.

isac322 and others added 2 commits May 26, 2026 09:29
…r backend

Closes vectorize-io#1667.

vchordrq operator classes are bound 1:1 to operators: vector_l2_ops only
matches `<->`, while every Hindsight ANN query uses `<=>` (cosine distance).
The previous vchord mapping used vector_l2_ops, so the planner ignored the
index entirely and fell back to a sequential scan + per-row cosine
computation. Separately, `SET LOCAL hnsw.ef_search = 60` (retain) and
`SET hnsw.ef_search = 200` (pool init) only exist in pgvector and silently
no-op'd under vchord, so the recall-vs-latency trade-off had never been
applied to vchord deployments at all.

This switches the vchord opclass to vector_cosine_ops (matching the
engine's `<=>` queries), updates the four historical migrations that
create vchord indexes inline so fresh installs land on cosine ops, and
adds an online migration that rebuilds any existing L2-ops vchordrq
indexes via CREATE INDEX CONCURRENTLY + drop + rename. Also introduces an
ann_search_tuning_settings dispatcher so link_utils and the pool init
pick the right GUC per backend (hnsw.ef_search for pgvector,
vchordrq.probes for vchord).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review on vectorize-io#1668: the env-var lookup that decides which vector backend
is configured was duplicated in three places (the new migration plus the
two runtime call sites in engine/retain/link_utils.py and
engine/memory_engine.py). Centralize the read + validation in
hindsight_api._vector_index.configured_vector_extension() so the default
value and the access mechanism live in one spot.

The new migration b8c9d0e1f2a3_vchord_cosine_opclass now imports the
shared helper instead of inlining its own. The four legacy vchord
migrations stay frozen (they keep their inline helpers); the frozen-state
test is narrowed to that legacy set so future vchord migrations can opt
into the shared helper on a per-migration basis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@isac322 isac322 force-pushed the fix/vchord-cosine-opclass branch from c4011bc to 79ee0a3 Compare May 26, 2026 00:29
- Wrap DROP canonical + RENAME temp in a server-side DO block so the swap
  is atomic; a crash between the two would otherwise leave the temp index
  as a valid orphan and the canonical name missing, with no recovery path
  on retry.
- Drop the temp index at the top of each rebuild loop and assert
  pg_index.indisvalid after CREATE INDEX CONCURRENTLY, so a leftover
  INVALID index from a prior failed run can't be promoted into the
  canonical name.
- Align the migration with the _pg_schema_prefix() convention used by
  other PG migrations, and normalize empty-string target_schema to NULL
  so COALESCE falls back to current_schema() instead of filtering on ''.
- Narrow _init_connection's except Exception to asyncpg.PostgresError so
  real pool/connection bugs surface instead of being silently logged.
- Document the vchordrq.probes 10/30 starting defaults and the
  indexdef.replace first-occurrence assumption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@isac322
Copy link
Copy Markdown
Contributor Author

isac322 commented May 26, 2026

Thanks for the thorough review — addressed all of it in commit 73dfcf8. Summary:

Must address

1. CREATE INDEX CONCURRENTLY recovery (b8c9d0e1f2a3_vchord_cosine_opclass.py)

Two layers of defense against the INVALID-leftover scenario:

  • DROP INDEX IF EXISTS … __opclass_swap at the top of each loop iteration, so a leftover INVALID temp index from a prior failed run can't be skipped by CREATE … IF NOT EXISTS and then promoted into the canonical name.
  • After CREATE, query pg_index.indisvalid and raise RuntimeError if false — guards against the current run's CONCURRENTLY finishing INVALID (e.g. constraint violation during the second build scan).

While wiring this up I noticed a related gap: a crash between the canonical DROP and the RENAME would leave the temp as a valid orphan and the canonical name missing, with no recovery on retry (next run's WHERE indexdef ILIKE '%vector_l2_ops%' filter wouldn't match anything). Wrapped the DROP + RENAME in a server-side PL/pgSQL DO block so PG runs them in a single implicit transaction — both succeed or both roll back.

2. Schema-prefix convention

Switched to the _pg_schema_prefix() helper used by the other PG migrations. The pg_indexes filter now uses WHERE schemaname = COALESCE(:target_schema, current_schema()), and I normalize the Alembic option with or None so an explicit empty-string value also falls back to current_schema() rather than filtering on schemaname = ''.

3. Manual vchord verification

Done — see the follow-up comment with the docker-compose transcript. Both fresh-install and downgrade→upgrade-from-l2_ops paths confirmed, plus an EXPLAIN proving the index is now picked for <=>.

Nice to have

4. vchordrq.probes provenance

Expanded the comment on _ANN_TUNING_* in _vector_index.py to explain that the pgvector 60/200 pair predates the dispatcher (internal benchmarks tied to our embedding count + recall floor) and that vchord 10/30 are conservative starting defaults pending a workload-specific sweep — vchordrq's recall curve shape differs from HNSW's so the pgvector numbers don't translate directly.

5. indexdef.replace first-occurrence assumption

Added a one-line comment noting that pg_get_indexdef() emits the canonical form CREATE INDEX <name> ON …, so both substitutions rely on <name> being the first textual occurrence.

6. _init_connection exception

Narrowed to asyncpg.exceptions.PostgresError. Worth noting for context: in practice this defensive net is rarely tripped, because PG accepts qualifier.name = value SETs with arbitrary qualifiers — so SET vchordrq.probes = 10 doesn't error on a pgvector-only cluster, it just silently no-ops. The narrowed except is the right hygiene either way.

@isac322
Copy link
Copy Markdown
Contributor Author

isac322 commented May 26, 2026

Manual verification on a vchord docker-compose stack

Ran the migration locally against tensorchord/vchord-suite:pg18-latest using docker/docker-compose/vchord/docker-compose.yaml, applying migrations from our local API code (not the published image). Both the fresh-install path and the broken-state recovery path the migration is designed for behave as expected.

Fresh install (clean DB → upgrade heads)

pg_indexes after applying every migration through b8c9d0e1f2a3:

          indexname          |                                                   indexdef
-----------------------------+---------------------------------------------------------------------------------------------------------------
 idx_mental_models_embedding | CREATE INDEX idx_mental_models_embedding ON public.mental_models USING vchordrq (embedding vector_cosine_ops)
 idx_memory_units_embedding  | CREATE INDEX idx_memory_units_embedding ON public.memory_units USING vchordrq (embedding vector_cosine_ops)

Both vchordrq indexes land on vector_cosine_ops from the first migration onward, as intended by the historical-migration flips.

Broken-state recovery (the production scenario from #1667)

  1. alembic downgrade b8c9d0e1f2a3 → 86f7a033d372_pg_downgrade rebuilt both indexes as vector_l2_ops, reproducing the production "broken" state.
  2. alembic upgrade heads — the new migration rebuilt both indexes back as vector_cosine_ops, no manual intervention needed.
  3. pg_index.indisvalid = true for both indexes after the upgrade.
  4. No *__opclass_swap orphan indexes left behind — confirming the DO-block-wrapped DROP + RENAME is atomic across the swap.

EXPLAIN — operator class binding

After inserting 200 synthetic memory_units rows and ANALYZE-ing, the cosine ANN query picks the vchordrq index:

->  Index Scan using idx_memory_units_embedding on memory_units  (cost=0.20..214.70 rows=200 width=32)
      Order By: (embedding <=> (InitPlan 2).col1)
      Filter: ((embedding IS NOT NULL) AND (bank_id = 'test-bank'::text) AND (fact_type = 'world'::text))

Previously, with vector_l2_ops, the planner ignored the index and fell back to a sequential scan — which is exactly the original bug.

Caveat

CPU return-to-baseline can only be observed against the real production workload; that part is deferred to staging/prod rollout.

@isac322 isac322 requested a review from nicoloboschi May 26, 2026 05:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vchord backend: ANN queries cannot use vchordrq indexes due to opclass/operator mismatch (vector_l2_ops vs <=>)

3 participants