fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend by isac322 · Pull Request #1668 · vectorize-io/hindsight

isac322 · 2026-05-20T08:23:57Z

Closes #1667.

This PR fixes two related vchord-only correctness bugs that together explain the high CPU on vchord deployments reported in #1667:

Opclass mismatch (the hot one). _vector_index.py mapped vchord to vchordrq (embedding vector_l2_ops), but every ANN query in the engine uses <=> (cosine distance), which vector_l2_ops does not match. The planner ignored the index entirely and fell back to sequential scan + per-row cosine computation, so the production primary sat at ~870m CPU during retain.
Tuning GUC mismatch. SET LOCAL hnsw.ef_search = 60 (and SET hnsw.ef_search = 200 at pool init) only exists in pgvector. On a vchord deployment both statements silently no-op'd, which meant the "recall-vs-latency" trade-off the code was designed around had never actually been applied to vchord backends.

What changes

Switch the vchord opclass to vector_cosine_ops:

_vector_index.py:_INDEX_USING_CLAUSES["vchord"] flipped to vector_cosine_ops.
The four historical migrations that create vchord indexes inline get the same flip so fresh installs end up on cosine ops from the first migration onward: 5a366d414dce, a4b5c6d7e8f9, d5e6f7a8b9c0, n9i0j1k2l3m4.
A new online migration b8c9d0e1f2a3_vchord_cosine_opclass rebuilds any existing vector_l2_ops vchordrq index with vector_cosine_ops via CREATE INDEX CONCURRENTLY + drop + rename. No-op when the configured extension is not vchord or when no L2-ops vchordrq indexes are found.

Dispatch ANN tuning GUCs by backend:

New helper _vector_index.ann_search_tuning_settings(ext, kind) returns (guc_name, value) pairs per backend. kind="low_latency" powers the retain-side ANN probe (was hnsw.ef_search = 60), kind="high_recall" powers the pool init (was hnsw.ef_search = 200).
Defaults: pgvector keeps hnsw.ef_search = 60 / 200; vchord uses vchordrq.probes = 10 / 30. pgvectorscale, pg_diskann, scann return no statements (no equivalent knob exposed today).
engine/retain/link_utils.py and engine/memory_engine.py switched to the dispatcher. Callers still own SET vs SET LOCAL since the two sites have different lifetime requirements.

Why

See #1667 for the full diagnosis. On the production cluster the same ANN query goes from 143 ms with <=> against vector_l2_ops (sequential scan, Buffers: shared hit=71531) to 25 ms with the same query body against a vector_cosine_ops index (Index Scan using idx_mu_emb_expr_<hash>, Buffers: shared hit=770 read=520) — ~5.6× per-query, and because the retain pipeline wraps this in CROSS JOIN LATERAL over _ann_seeds the penalty multiplies by seed count.

VectorChord's docs recommend vector_cosine_ops for cosine-similarity datasets; the engine's embeddings are L2-normalized in practice, so the existing <=> SQL is what we want. pgvector / pgvectorscale / pg_diskann were already on cosine ops, so this also restores cross-backend consistency.

Test plan

uv run ruff check on the changed files — pass
uv run ty check hindsight_api/ — pass
uv run pytest tests/test_vector_index.py tests/test_link_utils.py tests/test_migration_shape.py — 119 passed (new regression assertions: test_index_using_clause_vchord_uses_cosine_ops, test_ann_search_tuning_settings_{pgvector,vchord,…}, and the existing test_uses_set_local_for_ann_tuning parametrized over pgvector/vchord)
Manual on a vchord docker-compose stack (tensorchord/vchord-suite:pg18-latest): ran the migration cleanly, confirmed pg_indexes shows vector_cosine_ops after both the fresh-install path and a downgrade → upgrade roundtrip from a synthesized vector_l2_ops "broken" state, and confirmed EXPLAIN picks the vchordrq index for the <=> query. CPU return-to-baseline observation deferred to staging/prod rollout. See PR comment for full transcript.

Migration impact

For vchord deployments the new migration rebuilds each vchordrq index online with CREATE INDEX CONCURRENTLY. Disk usage doubles for each index during its rebuild; queries continue to run on the old index until the swap finishes. Downgrade rebuilds back to vector_l2_ops using the same online flow. After the migration completes, vchord pools start applying vchordrq.probes instead of the silently-ignored hnsw.ef_search.

isac322 · 2026-05-20T08:38:37Z

The verify-generated-files failure on this PR is pre-existing on main and unrelated to the changes here. The job reports drift on:

hindsight-api-slim/hindsight_api/engine/providers/openai_compatible_llm.py
skills/hindsight-docs/references/developer/models.md
skills/hindsight-docs/references/faq.md

…which are LLM-provider and docs-generation outputs. This PR only touches _vector_index.py, two engine files (engine/retain/link_utils.py, engine/memory_engine.py), five Alembic migrations under hindsight_api/alembic/versions/, and tests/test_vector_index.py + tests/test_link_utils.py — none of which are inputs to those generators.

Happy to rebase once main is regenerated, or to include the regenerated files in this PR if that's preferable. Let me know which direction you'd prefer.

elliotclee

Thank you for noticing this issue and submitting the patch. Just one small change requested below...

elliotclee · 2026-05-20T12:23:56Z

+
+
+def _configured_vector_extension() -> str:
+    return os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()


There should be a function (such as this one) in the main hindsight codebase that all places call to find out the configured vector extension. I don't think an alembic migration file is the right place to host such a function, and there are a few places in this patchset that should be using a function like this but are doing the getenv() call directly instead.

This matters because the method of accessing the configuration setting could concievably change in the future, but more importantly because the default value needs to be configurable from one spot.

Done in c4011bc. New configured_vector_extension() lives in hindsight_api/_vector_index.py and both runtime call sites (link_utils.py, memory_engine.py) and this migration now call it instead of doing os.getenv() directly.

One thing I wanted to flag before merging: hindsight has a deliberate test (test_alembic_vector_migrations_freeze_vector_sql_locally) that explicitly forbids the existing vchord-related migrations from importing hindsight_api._vector_index. The intent reads as "each historical migration's vector SQL must be frozen so it can't drift later if _vector_index.py is changed". The four legacy migrations (5a366d414dce, a4b5c6d7e8f9, d5e6f7a8b9c0, n9i0j1k2l3m4) each keep their own inline _vector_index_using_clause() helper for the same reason.

For this PR I narrowed that frozen-state list to those four legacy migrations and let the new b8c9d0e1f2a3 opt into the shared helper, since that matches what you asked for. But I can imagine three ways to draw the line:

Current PR — only the new migration uses the shared helper; legacy 4 stay frozen. ("Don't change history, do use the shared helper going forward.")

Strict frozen-state — revert the new migration to its own inline helper too, so all five vchord migrations look identical. The runtime call-site dedup (link_utils.py, memory_engine.py) still stands.

Drop the frozen-state policy entirely — port the legacy four to the shared helper too and delete the freeze test.

If (2) is what you'd prefer (i.e. you'd like the new migration to stay frozen too), happy to swap. (3) is a bigger policy call that probably belongs in a separate PR if it's worth doing at all. Which direction would you like?

elliotclee · 2026-05-20T12:24:54Z

+        # value tuned for top-50 semantic link creation (lower recall but much
+        # lower latency than the recall-side default). SET LOCAL auto-reverts
+        # at commit, so we don't pollute the pool for subsequent queries.
+        ext = os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()


See previous comment.

Addressed in c4011bc — see reply on the first thread for the trade-off discussion around the frozen-state test.

elliotclee · 2026-05-20T12:25:07Z

+            # returns the right one for the configured extension, tuned for
+            # the higher recall the per-fact_type semantic queries in
+            # retrieve_semantic_bm25_combined() need.
+            ext = os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()


See previous comment.

Addressed in c4011bc — see reply on the first thread for the trade-off discussion around the frozen-state test.

Per review on vectorize-io#1668: the env-var lookup that decides which vector backend is configured was duplicated in three places (the new migration plus the two runtime call sites in engine/retain/link_utils.py and engine/memory_engine.py). Centralize the read + validation in hindsight_api._vector_index.configured_vector_extension() so the default value and the access mechanism live in one spot. The new migration b8c9d0e1f2a3_vchord_cosine_opclass now imports the shared helper instead of inlining its own. The four legacy vchord migrations stay frozen (they keep their inline helpers); the frozen-state test is narrowed to that legacy set so future vchord migrations can opt into the shared helper on a per-migration basis. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

elliotclee

Looks good to me!

nicoloboschi

Nice diagnosis and a well-scoped fix — the EXPLAIN evidence in the description (143ms seq scan → 25ms index scan) makes the root cause unambiguous, and the per-backend dispatcher is a cleaner abstraction than the previous hnsw.ef_search-everywhere assumption. Particularly liked:

Flipping the four historical migrations to vector_cosine_ops so fresh vchord installs don't bootstrap-then-rebuild.
configured_vector_extension() centralizing the env lookup instead of scattering os.getenv calls.
Parametrized test_uses_set_local_for_ann_tuning over pgvector/vchord, and the explicit empty-tuple assertions for backends without an equivalent knob.
Renaming changed_migrations → frozen_migrations in test_alembic_vector_migrations_freeze_vector_sql_locally with a comment carve-out for the new migration — that's the right call.

Requesting changes on a few items before merge:

Must address

1. CREATE INDEX CONCURRENTLY failure-recovery in b8c9d0e1f2a3_vchord_cosine_opclass.py.
If _rebuild_vchordrq_indexes errors midway (disk pressure, lock conflict, etc.), Postgres leaves the partial index as INVALID. On re-run:

CREATE INDEX CONCURRENTLY IF NOT EXISTS skips the INVALID temp index (it exists).
DROP INDEX IF EXISTS drops the old original.
ALTER INDEX … RENAME promotes the INVALID temp index into the canonical name.

Result: a broken index masquerading as the original, and queries silently fall back to seq scan again — exactly the bug we're fixing. Suggest either (a) DROP INDEX IF EXISTS … __opclass_swap at the top of each loop iteration, or (b) querying pg_index.indisvalid after the CONCURRENTLY create and raising if false.

2. Schema-prefix inconsistency. The new migration uses context.config.get_main_option(\"target_schema\") or \"public\", while every other PG migration in the tree uses _pg_schema_prefix() which returns \"\" when unset. Align with the existing convention so behavior is the same when target_schema is not configured.

3. Manual vchord verification is unchecked. The test plan's last bullet (run on a real vchord cluster, verify pg_indexes shows vector_cosine_ops, confirm EXPLAIN picks the index, observe CPU return to baseline) is still [ ]. Given this is the whole point of the PR, please complete it before merge and post the result here — automated tests don't exercise the migration against a real vchord backend.

Nice to have

4. GUC values lack a citation. vchordrq.probes = 10 / 30 look heuristic compared to the pgvector 60 / 200 numbers that presumably have history. If the chosen values land on the wrong knee of the recall/latency curve, we'd be substituting one silent miscalibration for another. Either a quick benchmark note in the PR or a follow-up tuning task would be enough.

5. indexdef.replace(idx_name, temp_name, 1) relies on the index name being the first textual occurrence in pg_indexes.indexdef — true in practice for the canonical reconstructed form, but worth a one-line comment noting the assumption.

6. _init_connection exception swallowing. Now that the dispatcher returns an empty tuple for backends without a knob, the broad except Exception is doing less work than it used to. Tightening to asyncpg.exceptions.PostgresError would avoid masking real bugs.

Overall: solid fix, scoped tightly to vchord, with the rest of the backends provably unchanged. The CONCURRENTLY recovery gap is the only thing I'd block on.

…r backend Closes vectorize-io#1667. vchordrq operator classes are bound 1:1 to operators: vector_l2_ops only matches `<->`, while every Hindsight ANN query uses `<=>` (cosine distance). The previous vchord mapping used vector_l2_ops, so the planner ignored the index entirely and fell back to a sequential scan + per-row cosine computation. Separately, `SET LOCAL hnsw.ef_search = 60` (retain) and `SET hnsw.ef_search = 200` (pool init) only exist in pgvector and silently no-op'd under vchord, so the recall-vs-latency trade-off had never been applied to vchord deployments at all. This switches the vchord opclass to vector_cosine_ops (matching the engine's `<=>` queries), updates the four historical migrations that create vchord indexes inline so fresh installs land on cosine ops, and adds an online migration that rebuilds any existing L2-ops vchordrq indexes via CREATE INDEX CONCURRENTLY + drop + rename. Also introduces an ann_search_tuning_settings dispatcher so link_utils and the pool init pick the right GUC per backend (hnsw.ef_search for pgvector, vchordrq.probes for vchord). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per review on vectorize-io#1668: the env-var lookup that decides which vector backend is configured was duplicated in three places (the new migration plus the two runtime call sites in engine/retain/link_utils.py and engine/memory_engine.py). Centralize the read + validation in hindsight_api._vector_index.configured_vector_extension() so the default value and the access mechanism live in one spot. The new migration b8c9d0e1f2a3_vchord_cosine_opclass now imports the shared helper instead of inlining its own. The four legacy vchord migrations stay frozen (they keep their inline helpers); the frozen-state test is narrowed to that legacy set so future vchord migrations can opt into the shared helper on a per-migration basis. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Wrap DROP canonical + RENAME temp in a server-side DO block so the swap is atomic; a crash between the two would otherwise leave the temp index as a valid orphan and the canonical name missing, with no recovery path on retry. - Drop the temp index at the top of each rebuild loop and assert pg_index.indisvalid after CREATE INDEX CONCURRENTLY, so a leftover INVALID index from a prior failed run can't be promoted into the canonical name. - Align the migration with the _pg_schema_prefix() convention used by other PG migrations, and normalize empty-string target_schema to NULL so COALESCE falls back to current_schema() instead of filtering on ''. - Narrow _init_connection's except Exception to asyncpg.PostgresError so real pool/connection bugs surface instead of being silently logged. - Document the vchordrq.probes 10/30 starting defaults and the indexdef.replace first-occurrence assumption. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

isac322 · 2026-05-26T01:50:06Z

Thanks for the thorough review — addressed all of it in commit 73dfcf8. Summary:

Must address

1. CREATE INDEX CONCURRENTLY recovery (b8c9d0e1f2a3_vchord_cosine_opclass.py)

Two layers of defense against the INVALID-leftover scenario:

DROP INDEX IF EXISTS … __opclass_swap at the top of each loop iteration, so a leftover INVALID temp index from a prior failed run can't be skipped by CREATE … IF NOT EXISTS and then promoted into the canonical name.
After CREATE, query pg_index.indisvalid and raise RuntimeError if false — guards against the current run's CONCURRENTLY finishing INVALID (e.g. constraint violation during the second build scan).

While wiring this up I noticed a related gap: a crash between the canonical DROP and the RENAME would leave the temp as a valid orphan and the canonical name missing, with no recovery on retry (next run's WHERE indexdef ILIKE '%vector_l2_ops%' filter wouldn't match anything). Wrapped the DROP + RENAME in a server-side PL/pgSQL DO block so PG runs them in a single implicit transaction — both succeed or both roll back.

2. Schema-prefix convention

Switched to the _pg_schema_prefix() helper used by the other PG migrations. The pg_indexes filter now uses WHERE schemaname = COALESCE(:target_schema, current_schema()), and I normalize the Alembic option with or None so an explicit empty-string value also falls back to current_schema() rather than filtering on schemaname = ''.

3. Manual vchord verification

Done — see the follow-up comment with the docker-compose transcript. Both fresh-install and downgrade→upgrade-from-l2_ops paths confirmed, plus an EXPLAIN proving the index is now picked for <=>.

Nice to have

4. vchordrq.probes provenance

Expanded the comment on _ANN_TUNING_* in _vector_index.py to explain that the pgvector 60/200 pair predates the dispatcher (internal benchmarks tied to our embedding count + recall floor) and that vchord 10/30 are conservative starting defaults pending a workload-specific sweep — vchordrq's recall curve shape differs from HNSW's so the pgvector numbers don't translate directly.

5. indexdef.replace first-occurrence assumption

Added a one-line comment noting that pg_get_indexdef() emits the canonical form CREATE INDEX <name> ON …, so both substitutions rely on <name> being the first textual occurrence.

6. _init_connection exception

Narrowed to asyncpg.exceptions.PostgresError. Worth noting for context: in practice this defensive net is rarely tripped, because PG accepts qualifier.name = value SETs with arbitrary qualifiers — so SET vchordrq.probes = 10 doesn't error on a pgvector-only cluster, it just silently no-ops. The narrowed except is the right hygiene either way.

isac322 · 2026-05-26T01:50:11Z

Manual verification on a vchord docker-compose stack

Ran the migration locally against tensorchord/vchord-suite:pg18-latest using docker/docker-compose/vchord/docker-compose.yaml, applying migrations from our local API code (not the published image). Both the fresh-install path and the broken-state recovery path the migration is designed for behave as expected.

Fresh install (clean DB → upgrade heads)

pg_indexes after applying every migration through b8c9d0e1f2a3:

          indexname          |                                                   indexdef
-----------------------------+---------------------------------------------------------------------------------------------------------------
 idx_mental_models_embedding | CREATE INDEX idx_mental_models_embedding ON public.mental_models USING vchordrq (embedding vector_cosine_ops)
 idx_memory_units_embedding  | CREATE INDEX idx_memory_units_embedding ON public.memory_units USING vchordrq (embedding vector_cosine_ops)

Both vchordrq indexes land on vector_cosine_ops from the first migration onward, as intended by the historical-migration flips.

Broken-state recovery (the production scenario from #1667)

alembic downgrade b8c9d0e1f2a3 → 86f7a033d372 — _pg_downgrade rebuilt both indexes as vector_l2_ops, reproducing the production "broken" state.
alembic upgrade heads — the new migration rebuilt both indexes back as vector_cosine_ops, no manual intervention needed.
pg_index.indisvalid = true for both indexes after the upgrade.
No *__opclass_swap orphan indexes left behind — confirming the DO-block-wrapped DROP + RENAME is atomic across the swap.

EXPLAIN — operator class binding

After inserting 200 synthetic memory_units rows and ANALYZE-ing, the cosine ANN query picks the vchordrq index:

->  Index Scan using idx_memory_units_embedding on memory_units  (cost=0.20..214.70 rows=200 width=32)
      Order By: (embedding <=> (InitPlan 2).col1)
      Filter: ((embedding IS NOT NULL) AND (bank_id = 'test-bank'::text) AND (fact_type = 'world'::text))

Previously, with vector_l2_ops, the planner ignored the index and fell back to a sequential scan — which is exactly the original bug.

Caveat

CPU return-to-baseline can only be observed against the real production workload; that part is deferred to staging/prod rollout.

elliotclee reviewed May 20, 2026

View reviewed changes

elliotclee approved these changes May 21, 2026

View reviewed changes

benfrank241 mentioned this pull request May 22, 2026

vchord backend: ANN queries cannot use vchordrq indexes due to opclass/operator mismatch (vector_l2_ops vs <=>) #1667

Open

nicoloboschi requested changes May 25, 2026

View reviewed changes

isac322 and others added 2 commits May 26, 2026 09:29

isac322 force-pushed the fix/vchord-cosine-opclass branch from c4011bc to 79ee0a3 Compare May 26, 2026 00:29

isac322 requested a review from nicoloboschi May 26, 2026 05:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend#1668

fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend#1668
isac322 wants to merge 3 commits into
vectorize-io:mainfrom
isac322:fix/vchord-cosine-opclass

isac322 commented May 20, 2026 •

edited

Loading

Uh oh!

isac322 commented May 20, 2026

Uh oh!

elliotclee left a comment

Uh oh!

elliotclee May 20, 2026

Uh oh!

isac322 May 21, 2026

Uh oh!

elliotclee May 20, 2026

Uh oh!

isac322 May 21, 2026

Uh oh!

elliotclee May 20, 2026

Uh oh!

isac322 May 21, 2026

Uh oh!

elliotclee left a comment

Uh oh!

nicoloboschi left a comment

Uh oh!

isac322 commented May 26, 2026

Uh oh!

isac322 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants



		def _configured_vector_extension() -> str:
		return os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()

Conversation

isac322 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes

Why

Test plan

Migration impact

Uh oh!

isac322 commented May 20, 2026

Uh oh!

elliotclee left a comment

Choose a reason for hiding this comment

Uh oh!

elliotclee May 20, 2026

Choose a reason for hiding this comment

Uh oh!

isac322 May 21, 2026

Choose a reason for hiding this comment

Uh oh!

elliotclee May 20, 2026

Choose a reason for hiding this comment

Uh oh!

isac322 May 21, 2026

Choose a reason for hiding this comment

Uh oh!

elliotclee May 20, 2026

Choose a reason for hiding this comment

Uh oh!

isac322 May 21, 2026

Choose a reason for hiding this comment

Uh oh!

elliotclee left a comment

Choose a reason for hiding this comment

Uh oh!

nicoloboschi left a comment

Choose a reason for hiding this comment

Must address

Nice to have

Uh oh!

isac322 commented May 26, 2026

Must address

Nice to have

Uh oh!

isac322 commented May 26, 2026

Manual verification on a vchord docker-compose stack

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

isac322 commented May 20, 2026 •

edited

Loading