Skip to content

fix: validate embedding dimensions before pgvector writes#1670

Open
ai-ag2026 wants to merge 2 commits into
vectorize-io:mainfrom
ai-ag2026:fix/validate-retain-embedding-dimensions
Open

fix: validate embedding dimensions before pgvector writes#1670
ai-ag2026 wants to merge 2 commits into
vectorize-io:mainfrom
ai-ag2026:fix/validate-retain-embedding-dimensions

Conversation

@ai-ag2026
Copy link
Copy Markdown
Contributor

@ai-ag2026 ai-ag2026 commented May 20, 2026

Summary

  • validate embedding vectors before they can reach pgvector writes
  • reject empty vectors and vectors whose length does not match the backend's declared dimension
  • cover both retain and consolidation observation creation paths with regression tests

Root cause

Retain embedding generation already guaranteed that the number of returned vectors matched the number of input texts, but it did not validate per-vector dimensions. If an embedding backend returned [] or another wrong-sized vector, the invalid value could flow downstream until PostgreSQL/pgvector raised different vector dimensions 384 and 0.

The same shared embedding utility is also used by consolidation when creating observations, so the validation boundary protects both retain writes and consolidation-created observations.

Related reconnaissance

Test plan

  • RED on current origin/main: uv run pytest tests/test_consolidation_embedding_validation.py::test_create_observation_rejects_zero_length_embedding_before_insert -q -o addopts= failed because the zero-length vector reached the DB insert path.
  • uv run pytest tests/test_consolidation_embedding_validation.py::test_create_observation_rejects_zero_length_embedding_before_insert tests/test_retain_orchestrator_mapping.py -q -o addopts= → 11 passed
  • uv run ruff check tests/test_consolidation_embedding_validation.py hindsight_api/engine/retain/embedding_utils.py tests/test_retain_orchestrator_mapping.py → passed
  • Added-line secret/private marker scan → 0 findings

@ai-ag2026 ai-ag2026 changed the title fix: validate retain embedding dimensions fix: validate embedding dimensions before pgvector writes May 20, 2026
Copy link
Copy Markdown
Collaborator

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of must-fix items before this lands:

1. The new RuntimeError is swallowed in generate_embedding(). The single-text path still has except Exception as e: raise Exception(f"Failed to generate embedding: {str(e)}") wrapping the body, and _validate_embedding_vector is called inside that try. So a precise RuntimeError("embedding 0 has dimension 0; expected 384") becomes a plain Exception("Failed to generate embedding: embedding 0 …") — callers can't except RuntimeError to distinguish validation failures from backend transport errors. The batch path is fine because the list comprehension runs after the try/except. Easiest fix: move validation outside the try, or re-raise with raise / raise … from e without reclassifying.

2. No test coverage for generate_embedding() (single-text). All three new unit tests hit generate_embeddings_batch. The function you just modified has zero coverage of the new validation branch — which is also why issue #1 is invisible right now.

A few non-blocking nits I'd consider while you're in here:

  • embeddings_backend is untyped and defensively probed. The Embeddings ABC in engine/embeddings.py already guarantees a dimension: int property; the try/except around getattr(..., None) plus isinstance(..., int) and > 0 is paper over the missing type annotation. Typing the parameter and reading .dimension directly would be the right shape.
  • The consolidation test (test_consolidation_embedding_validation.py) calls the private _create_observation_directly, monkey-patches the private _filter_live_source_memories, fakes MemoryEngine with a bare class, and uses a _FailingConn that asserts on fetchrow. The validation itself is already covered by the unit tests, and "consolidation uses the validating helper" is visible by inspection — this test will break on any consolidator refactor without catching anything the unit tests miss.
  • Error message lacks input context: "embedding 1 has dimension 2; expected 3" tells you the batch offset but not the text, call site, or bank — correlating back in prod will require log fishing. A logger.error(...) with bank/operation/text length before raising would help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

batch_retain can pass zero-length embeddings to pgvector (vector(384) vs vector(0))

2 participants