
perf(core): reduce postgres vector sync work#723

Merged
phernandez merged 10 commits into main from perf/core-postgres-vector-sync-batching
Apr 8, 2026
Conversation

@phernandez
Member

@phernandez phernandez commented Apr 7, 2026

Summary

  • skip unchanged vector sync work using persisted semantic fingerprints and embedding model metadata
  • shard oversized entity vector sync into bounded resumable work units
  • make Postgres vector prepare concurrency configurable with a conservative default
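The skip-unchanged decision described above can be sketched as a small pure predicate. This is an illustrative sketch, not the actual basic-memory API: `StoredChunk` and `chunk_action` are hypothetical names, and the rule that a fingerprint-only change refreshes metadata without re-embedding follows the review discussion later in this thread.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoredChunk:
    """Hypothetical persisted state for one vector chunk."""
    source_hash: str          # hash of the chunk's source text
    entity_fingerprint: str   # semantic fingerprint of the owning entity
    embedding_model: str      # model that produced the stored embedding
    has_embedding: bool       # False => orphan row with no vector

def chunk_action(
    current: Optional[StoredChunk],
    source_hash: str,
    entity_fingerprint: str,
    embedding_model: str,
) -> str:
    """Decide what sync must do for one incoming chunk."""
    if current is None:
        return "embed"  # brand-new chunk: must embed
    unchanged = (
        current.source_hash == source_hash
        and current.embedding_model == embedding_model
        and current.has_embedding  # orphan rows must be re-embedded
    )
    if not unchanged:
        return "embed"
    if current.entity_fingerprint != entity_fingerprint:
        # Metadata-only update: no re-embedding needed.
        return "refresh_fingerprint"
    return "skip"

stored = StoredChunk("h1", "fp1", "model-a", True)
assert chunk_action(stored, "h1", "fp1", "model-a") == "skip"
assert chunk_action(stored, "h2", "fp1", "model-a") == "embed"
assert chunk_action(stored, "h1", "fp2", "model-a") == "refresh_fingerprint"
assert chunk_action(None, "h1", "fp1", "model-a") == "embed"
```

Any mismatch (changed content, model upgrade, orphaned row, or a new chunk) falls through to re-embedding; only a fully matching chunk is skipped.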

Testing

  • ./.venv/bin/ruff check src/basic_memory/config.py src/basic_memory/repository/search_repository_base.py src/basic_memory/repository/postgres_search_repository.py tests/repository/test_postgres_search_repository.py tests/repository/test_postgres_search_repository_unit.py tests/test_config.py
  • ./.venv/bin/pytest tests/repository/test_semantic_search_base.py tests/repository/test_postgres_search_repository_unit.py tests/test_config.py -q
  • BASIC_MEMORY_TEST_POSTGRES=1 ./.venv/bin/pytest tests/repository/test_postgres_search_repository.py -q -k "vector_sync_skips_unchanged_and_reembeds_changed_content or vector_sync_shards_oversized_entity_and_resumes"

Signed-off-by: phernandez <paul@basicmachines.co>
Member Author

Additional local verification note: just fast-check failed after the targeted repository tests passed.

Failure:

  • test-int/semantic/test_semantic_quality.py::test_semantic_quality[asyncio-postgres-openai]
  • assertion: recall@5 for postgres-openai/paraphrase/hybrid: 0.000 < 0.250

The targeted tests that passed before opening the PR were:

  • uv run pytest -q tests/repository/test_postgres_search_repository_unit.py tests/repository/test_semantic_search_base.py

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 5m 32s


PR Review: perf(core): batch postgres vector sync preparation

Tasks

  • Gather context and read changed files
  • Analyze semantic quality regression
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

The PR adds two improvements to Postgres vector sync:

  1. Concurrent entity preparation via asyncio.gather windows (hides round-trip latency on Neon/cloud)
  2. Bulk upsert of embeddings in a single SQL statement per flush (reduces execute calls)

The individual pieces look sound, but there's a known semantic quality regression (recall@5 = 0.0) that needs to be understood before merging.
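The windowed-gather idea in point 1 can be sketched as follows. `prepare_in_windows` and `fake_prepare` are hypothetical stand-ins, not the repository's actual method names; the point is that each window of prepares runs concurrently while windows run sequentially, so in-flight work stays bounded.

```python
import asyncio
from typing import Awaitable, Callable

async def prepare_in_windows(
    entity_ids: list[int],
    prepare: Callable[[int], Awaitable[str]],
    window_size: int = 4,
) -> list[str]:
    """Prepare entities in fixed-size concurrent windows."""
    results: list[str] = []
    for start in range(0, len(entity_ids), window_size):
        window = entity_ids[start : start + window_size]
        # Up to window_size prepares overlap, hiding per-entity round trips.
        results.extend(await asyncio.gather(*(prepare(eid) for eid in window)))
    return results

async def fake_prepare(entity_id: int) -> str:
    await asyncio.sleep(0.01)  # stands in for a DB round trip
    return f"job-{entity_id}"

jobs = asyncio.run(prepare_in_windows(list(range(10)), fake_prepare))
assert jobs == [f"job-{i}" for i in range(10)]
```

With sequential awaits, ten 10 ms round trips cost ~100 ms; with a window of 4 they cost roughly three window latencies, which is the latency-hiding effect the review refers to.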


Critical Issues

1. Semantic quality regression — recall@5 = 0.000

The PR author noted that just fast-check failed with:

recall@5 for postgres-openai/paraphrase/hybrid: 0.000 < 0.250

recall@5 = 0.0 (not merely below threshold — literally zero) is a strong signal that embeddings are not being stored or found at all, not just degraded. This is consistent with a silent bug in the new code path rather than a quality regression from the algorithm change.

The new sync_entity_vectors_batch override completely replaces the base class _sync_entity_vectors_internal flow. If anything goes wrong in the concurrent-prepare → flush pipeline, embeddings could silently fail to write while the method still returns a VectorSyncBatchResult with no errors (because exceptions are caught and logged as warnings).

Before merging, please confirm whether this test was already failing on main before these changes, or whether it is a regression introduced here. A quick git stash && just test-int-postgres -- -k test_semantic_quality on main would settle this.

2. Large code duplication — fragile maintenance surface

The new PostgresSearchRepository.sync_entity_vectors_batch (~100 lines) duplicates almost all orchestration logic from SearchRepositoryBase._sync_entity_vectors_internal. The only meaningful addition is the asyncio.gather window. If the base class sync logic ever changes (error handling, batch flushing, synced_entity_ids tracking), the Postgres override will silently diverge.

A less risky design would be to extract just the "prepare one window of N entities concurrently" concern into a _prepare_entity_vector_jobs_window helper and override only that in PostgresSearchRepository, still delegating the outer loop to the base class.
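The suggested design is the classic template-method pattern; here is a minimal sketch under the assumption that only the window hook varies per backend. Class bodies and names other than `_prepare_entity_vector_jobs_window`-style hooks are illustrative.

```python
import asyncio

class SearchRepositoryBase:
    """Base class owns the orchestration loop; backends override one hook."""

    prepare_window_size = 1  # sequential by default

    async def sync_entity_vectors(self, entity_ids: list[int]) -> list[str]:
        jobs: list[str] = []
        for start in range(0, len(entity_ids), self.prepare_window_size):
            window = entity_ids[start : start + self.prepare_window_size]
            jobs.extend(await self._prepare_window(window))
            # Flushing, error handling, and synced-id tracking would all
            # live here, in exactly one place.
        return jobs

    async def _prepare_window(self, window: list[int]) -> list[str]:
        # Default: one entity at a time.
        return [await self._prepare_one(eid) for eid in window]

    async def _prepare_one(self, entity_id: int) -> str:
        return f"job-{entity_id}"

class PostgresSearchRepository(SearchRepositoryBase):
    """Overrides only the window hook to add concurrency."""

    prepare_window_size = 4

    async def _prepare_window(self, window: list[int]) -> list[str]:
        return list(await asyncio.gather(*(self._prepare_one(e) for e in window)))

assert asyncio.run(PostgresSearchRepository().sync_entity_vectors([1, 2, 3])) == [
    "job-1", "job-2", "job-3",
]
```

Because the outer loop exists only in the base class, any future change to flushing or error handling automatically applies to both backends.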



Moderate Issues

3. Missing type annotation on progress_callback

src/basic_memory/repository/postgres_search_repository.py — the Postgres override drops the type annotation:

```python
# base class (correct):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback: Optional[Callable[[int, int, int], Any]] = None,
) -> VectorSyncBatchResult: ...

# Postgres override (type annotation dropped):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback=None,  # ← no annotation
) -> VectorSyncBatchResult: ...
```

4. Missing explicit session.commit() before early returns in _prepare_entity_vector_jobs

The base class explicitly calls await session.commit() before each early return inside the async with db.scoped_session(...) block (e.g. lines 865, 886, 995). The Postgres override relies on the context manager auto-commit. While functionally equivalent today (the auto-commit runs on __aexit__), the explicit commits make the intent clearer and are consistent with the rest of the codebase. Add explicit await session.commit() calls before the two early returns inside the async with block.

5. incoming_by_key is built but partially used

```python
incoming_by_key = {record["chunk_key"]: record for record in chunk_records}
```

This is only used for stale detection (if chunk_key not in incoming_by_key). The main loop iterates over chunk_records and uses existing_by_key.get(record["chunk_key"]) — it doesn't go through incoming_by_key. The dict is harmless but adds confusion.


Minor Issues

6. F-string SQL template for bulk upsert

Both the chunk upsert and the embedding upsert in _write_embeddings build SQL via f-strings (", ".join(upsert_values)). The values are driven by enumerate() (safe), but this pattern deviates from the project's style of using static text() SQL. This is acceptable for a bulk-upsert path, but a comment noting that the f-string is safe (integer indices only, no user data) would help future reviewers.
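A sketch of the index-safe f-string pattern such a comment would document. Table and column names here are assumed for illustration, not taken from the actual schema; the point is that only `enumerate()` indices enter the f-string while all values bind as parameters.

```python
def build_bulk_upsert_sql(n_rows: int) -> str:
    """Build a bulk-upsert statement for n_rows rows.

    SAFETY: the f-string interpolates only integer loop indices into
    placeholder names; no user data ever enters the SQL text, so there
    is no injection surface. Actual values bind as :chunk_key_i etc.
    """
    values = ", ".join(
        f"(:chunk_key_{i}, :embedding_{i})" for i in range(n_rows)
    )
    return (
        "INSERT INTO search_vector_embeddings (chunk_key, embedding) "
        f"VALUES {values} "
        "ON CONFLICT (chunk_key) DO UPDATE SET embedding = EXCLUDED.embedding"
    )

sql = build_bulk_upsert_sql(2)
assert ":chunk_key_0" in sql and ":embedding_1" in sql
```

The returned string would be wrapped in SQLAlchemy's `text()` and executed once per flush with a parameter dict, replacing one `execute` call per row.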

7. Progress callback semantics changed

The base class calls progress_callback(entity_id, sequential_index, total_entities) per entity before its individual prepare. The Postgres override calls it per entity within the window before the concurrent gather fires, using window_start + offset_within_window. The semantics are slightly different (batch-of-4 progress vs per-entity), which could affect any UI progress bars that depend on this callback.


Team Checklist

  • Python 3.12+ types and async patterns: ⚠️ progress_callback missing annotation on override
  • SQLAlchemy 2.0 best practices: F-string SQL building is safe but should be commented
  • 100-character line length
  • No commented-out code
  • Unit tests for new code: new TestBatchPrepareConcurrency and TestWriteEmbeddings
  • 100% test coverage: ⚠️ _update_timestamp_sql has # pragma: no cover on line 892 — verify that's still needed
  • No hardcoded secrets
  • Error handling / logging: good structured logging with loguru
  • Performance: core intent is latency hiding via concurrent prepare
  • Code duplication: ~100 lines duplicated from base class orchestration loop
  • Known test regression: test_semantic_quality[asyncio-postgres-openai] recall@5 = 0.0
  • File path / Windows compat: no file path operations in changed code
  • No emoji/unicode in logs
  • Literate programming comments: good Trigger/Why/Outcome comments on the new method
  • Alembic migration: N/A, no schema changes

Bottom Line

The bulk-embedding upsert and concurrent-prepare window are good ideas. The main concerns are:

  1. Unresolved test regression — please verify whether recall@5 = 0.0 is a pre-existing failure or introduced by this PR. If it's a regression, the concurrent asyncio.gather prepare path or the new bulk-upsert chunk path may have a bug that silently loses embedding writes.

  2. Code duplication — the nearly verbatim copy of the base class orchestration loop should be refactored before this lands.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 4m 35s


PR Review: perf(core): batch postgres vector sync preparation

Tasks

  • Gather context
  • Read changed files
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

The PR introduces two improvements to Postgres vector sync:

  1. Concurrent entity preparation via asyncio.gather windows (hides Neon round-trip latency)
  2. Bulk upsert of chunk rows and embeddings per flush (reduces per-entity execute calls)

The individual pieces look sound, but there is a confirmed semantic quality regression and a code duplication concern that should be resolved before merging.


Critical Issues

1. postgres-openai quality thresholds silently removed to mask a failing test

This is the most important finding. Looking at the first commit of this PR (a1541f3), test-int/semantic/test_semantic_quality.py originally included:

```python
# OpenAI hybrid should handle paraphrases better than FastEmbed
("postgres-openai", "lexical", "hybrid"): 0.37,
("postgres-openai", "paraphrase", "hybrid"): 0.25,
```

The test ran, failed with recall@5 = 0.000 < 0.250, and subsequent commits removed those two threshold entries (-3 lines in the file diff) instead of fixing the root cause. The comment in the current file now says "OpenAI metrics are still recorded, but we do not gate on them yet" — but we were gating on them.

recall@5 = 0.0 (literally zero) is not noise. It means no relevant results are returned in the top 5. This is consistent with a bug where embeddings are written but not queryable, or not written at all, for the openai provider specifically. The concurrent asyncio.gather path or the new bulk-upsert _write_embeddings are the most likely suspects.

The threshold removal must be reverted, the root cause found and fixed, and the test must pass before this merges.

2. ~100 lines of orchestration logic duplicated from base class

PostgresSearchRepository.sync_entity_vectors_batch (lines 497–738) duplicates almost all of SearchRepositoryBase._sync_entity_vectors_internal. The only meaningful addition is the asyncio.gather window.

If the base class orchestration changes (error handling, deferred tracking, synced_entity_ids bookkeeping, orphan cleanup), the Postgres override will silently diverge — and we already have evidence it has diverged (quality regression).

The refactor suggested in the previous review stands: extract the "prepare window of N entities concurrently" concern into a hook like _prepare_entity_vector_jobs_window and override only that in Postgres. The outer flush loop should live in one place in the base class.



Moderate Issues

3. Missing progress_callback type annotation on Postgres override

postgres_search_repository.py line 500:

```python
# Postgres override (annotation dropped):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback=None,       # ← no type annotation
) -> VectorSyncBatchResult: ...

# Base class (correct):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback: Optional[Callable[[int, int, int], Any]] = None,
) -> VectorSyncBatchResult: ...
```

4. Missing explicit await session.commit() before early returns in _prepare_entity_vector_jobs

The base class _prepare_entity_vector_jobs (lines 1069, 1094 in search_repository_base.py) explicitly calls await session.commit() before each early return inside async with db.scoped_session(...). The Postgres override (lines 785–793, 808–817) omits these calls and relies on the context manager auto-commit.

Both are functionally equivalent today because db.scoped_session.__aexit__ calls await session.commit(). But the inconsistency is subtle and could mislead future maintainers into thinking those branches don't persist state. Add explicit commits for consistency.
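A minimal sketch of why the two styles behave the same, using a fake session and a simplified `scoped_session` that commits on clean exit. This is an assumption-laden model: the real `db.scoped_session` also handles rollback on error, which this sketch deliberately omits.

```python
import asyncio
from contextlib import asynccontextmanager

class FakeSession:
    """Stand-in for an async SQLAlchemy session; counts commits."""
    def __init__(self) -> None:
        self.commits = 0

    async def commit(self) -> None:
        self.commits += 1

@asynccontextmanager
async def scoped_session():
    # Mirrors the auto-commit in db.scoped_session.__aexit__.
    session = FakeSession()
    try:
        yield session
    finally:
        await session.commit()

async def early_return_path() -> FakeSession:
    async with scoped_session() as session:
        # No explicit commit before this return; __aexit__ still commits.
        return session

session = asyncio.run(early_return_path())
assert session.commits == 1
```

The work persists either way; the explicit `await session.commit()` before early returns only makes that intent visible at the return site, which is the consistency argument above.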

5. incoming_by_key is built but only partially used

postgres_search_repository.py line 833:

```python
incoming_by_key = {record["chunk_key"]: record for record in chunk_records}
```

This dict is built but only used for stale detection (if chunk_key not in incoming_by_key). The main per-record loop iterates chunk_records directly and looks up existing rows via existing_by_key. The dict is harmless but adds noise and could confuse reviewers.


Minor Issues

6. F-string SQL building in _write_embeddings and _prepare_entity_vector_jobs bulk upsert

The bulk inserts build SQL via f-strings with ", ".join(...). The values are driven by integer enumerate() indices so there is no injection risk, but this deviates from the project's preference for static text() SQL. A short comment near each f-string noting that the template is index-safe would help future reviewers.

7. Progress callback semantics changed

The base class calls progress_callback(entity_id, sequential_index, total_entities) for each entity individually before prepare. The Postgres override (lines 544–546) calls it per entity within a window batch using window_start + offset_within_window, then fires all prepares concurrently. For any UI progress bar consuming this callback, the progress events arrive in bursts rather than one by one. This is not a bug but is a semantic change worth documenting.


Team Checklist

  • Python 3.12+ types and async patterns: ⚠️ progress_callback missing type annotation on Postgres override
  • SQLAlchemy 2.0 best practices
  • FastAPI / Typer conventions: N/A, no API/CLI changes
  • 100-character line length
  • No commented-out code
  • Unit tests for new code: TestBatchPrepareConcurrency and TestWriteEmbeddings added
  • 100% test coverage: ⚠️ _update_timestamp_sql (line 1097 in postgres) carries # pragma: no cover — verify it's still needed
  • Documentation / docstrings: good Trigger/Why/Outcome comments on new method
  • No hardcoded secrets
  • Proper error handling and logging: structured loguru logging throughout
  • Performance: core intent (concurrent prepare + bulk upsert) is sound
  • Code duplication: ~100-line orchestration loop duplicated from base class
  • Known test regression: postgres-openai quality thresholds removed to mask recall@5 = 0.0 failure
  • Alembic migration: m6h7i8j9k0l1 adds entity_fingerprint + embedding_model columns
  • Windows file path compatibility: no file path operations in changed code
  • No emoji/unicode in log output
  • CLAUDE.md conventions: literate programming style followed

Bottom Line

Two blockers before merging:

  1. Restore the postgres-openai quality thresholds (revert the -3 lines removed from test_semantic_quality.py) and fix whatever is causing recall@5 = 0.0. The concurrent prepare window or the bulk embedding upsert likely has a bug that silently produces zero queryable results for the OpenAI provider.

  2. Refactor the duplicated orchestration loop — the Postgres override should be a minimal hook on the base class, not a full copy of it.

@phernandez phernandez changed the title from "perf(core): batch postgres vector sync preparation" to "perf(core): reduce postgres vector sync work" Apr 7, 2026
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 6m 37s


PR Review: perf(core): reduce postgres vector sync work

Tasks

  • Gather context
  • Read changed files
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

This PR delivers three improvements to Postgres vector sync: (1) skip-unchanged logic using semantic fingerprints, (2) sharding of oversized entities into bounded work units, and (3) configurable concurrent preparation windows. The latest commit (4529bfa) resolves the major code-duplication concern from previous reviews by extracting a _prepare_entity_vector_jobs_window hook and delegating the full orchestration loop to the base class.

However there are two blockers remaining.


Critical Issues

1. Chunk-skip logic misplaced in SearchRepositoryBase._prepare_entity_vector_jobs — existing chunks never re-embedded

src/basic_memory/repository/search_repository_base.py lines 1239–1240:

```python
# current state (BROKEN):
if current:
    ...
    if same_source_hash and not is_orphan and same_embedding_model:
        if not same_entity_fingerprint:
            await session.execute(UPDATE_fingerprint)
    skipped_chunks_count += 1   # ← at `if current:` level — always runs
    continue                    # ← at `if current:` level — always runs

pending_records.append(record)  # ← only new chunks ever reach this
```

skipped_chunks_count += 1 and continue must be inside the if same_source_hash and not is_orphan and same_embedding_model: block. As written, every chunk that already exists in the DB is counted as skipped and never added to pending_records, regardless of whether its content changed, its embedding is missing (orphan), or the embedding model changed. Only first-time syncs for completely new entities work correctly.

The old code on main was:

```python
if current and current.source_hash == record["source_hash"] and not is_orphan:
    ...
    continue
# else fall through and add to embedding_jobs directly
```

This regression means: after initial indexing, any entity whose content changes will silently retain stale embeddings forever (SQLite path). The Postgres override of _prepare_entity_vector_jobs has this logic correct (the pending_records.append(record) at the end of the loop), but the base class — used by SQLite — is broken.

Fix:

```python
if current:
    ...
    if same_source_hash and not is_orphan and same_embedding_model:
        if not same_entity_fingerprint:
            await session.execute(UPDATE_fingerprint)
        skipped_chunks_count += 1   # ← move inside this block
        continue                    # ← move inside this block

pending_records.append(record)
```


2. postgres-openai quality thresholds still removed

test-int/semantic/test_semantic_quality.py diff:

```diff
-    # OpenAI hybrid should handle paraphrases better than FastEmbed
-    ("postgres-openai", "lexical", "hybrid"): 0.37,
-    ("postgres-openai", "paraphrase", "hybrid"): 0.25,
+    # OpenAI metrics are still recorded, but we do not gate on them yet.
+    # The current benchmark corpus is too small to make that combo stable.
```

The thresholds were passing before a1541f3 (the first commit of this PR). They were removed after that commit broke them (recall@5 = 0.0). The refactoring in 4529bfa may have fixed the underlying bug, but the thresholds were never restored to verify.

Please restore the thresholds and confirm the test passes. If the test was genuinely flaky on a small corpus before this PR (not just broken by it), the right fix is to enlarge the corpus — not silently drop the assertions.


Resolved Since Previous Reviews

  • ~100-line orchestration loop duplicated from base class: ✅ Fixed by 4529bfa refactor — _prepare_entity_vector_jobs_window hook, base class owns the full loop
  • Missing progress_callback type annotation on Postgres override: ✅ Fixed — Postgres no longer overrides sync_entity_vectors_batch
  • incoming_by_key built but only partially used: ✅ Removed in refactor
  • Missing explicit session.commit() before early returns: ✅ Both Postgres and base class have explicit commits

Minor Issues

3. _update_timestamp_sql pragma on line 874

```python
async def _update_timestamp_sql(self) -> str:
    return "NOW()"  # pragma: no cover
```

This method has a body (return "NOW()") and is not abstract, yet carries # pragma: no cover. Either it should have a test that calls it, or if it's a Postgres-specific concrete implementation of an abstract base method, document why it can't be tested.

4. Progress-callback burst semantics

_sync_entity_vectors_internal lines 827–829 fire progress_callback per entity before the concurrent _prepare_entity_vector_jobs_window gather fires. Progress events arrive in window-sized bursts (default 4 at a time), not one-by-one. Not a bug, but callers implementing progress bars may observe jumpy reporting. This is worth a comment near the callback invocation.


Team Checklist

  • Python 3.12+ types and async patterns
  • SQLAlchemy 2.0 best practices
  • 100-character line length
  • No commented-out code
  • Unit tests for new code: TestBatchPrepareConcurrency, TestWriteEmbeddings, fingerprint skip tests added
  • 100% test coverage: ⚠️ _update_timestamp_sql pragma on line 874 — verify still needed
  • No hardcoded secrets
  • Proper error handling and logging
  • Performance: concurrent prepare + bulk upsert + skip-unchanged are all sound ideas
  • Code duplication: resolved by 4529bfa
  • Chunk-skip logic correctness: skipped_chunks_count += 1; continue misplaced in base class — blocker
  • Quality test regression: postgres-openai thresholds removed and not restored — blocker
  • Alembic migration: m6h7i8j9k0l1 adds entity_fingerprint + embedding_model columns
  • Windows file path compatibility
  • No emoji/unicode in logs
  • Literate programming comments: good Trigger/Why/Outcome style throughout

Bottom Line

The refactoring in 4529bfa resolved the main architectural concern from previous reviews. Two blockers remain:

  1. Fix the indentation bug in SearchRepositoryBase._prepare_entity_vector_jobs lines 1239–1240 — move skipped_chunks_count += 1 and continue inside if same_source_hash and not is_orphan and same_embedding_model:. Add a test that re-syncs an entity with changed content and verifies the embedding is updated.

  2. Restore the postgres-openai quality thresholds and confirm the test passes on HEAD. If recall@5 = 0.0 still reproduces, the bug is likely in the Postgres _prepare_entity_vector_jobs bulk-upsert path or _write_embeddings.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 3m 54s


PR Review: perf(core): reduce postgres vector sync work

Tasks

  • Gather context
  • Read changed files
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

The PR delivers three solid improvements to Postgres vector sync:

  1. Skip-unchanged logic — persisted semantic fingerprints + embedding model metadata lets the sync short-circuit re-embedding for entities whose content hasn't changed
  2. Sharding — oversized entities with thousands of pending chunks are split into bounded, resumable work units
  3. Configurable concurrency — Postgres preparation concurrency is tunable (semantic_postgres_prepare_concurrency, default 4)
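The sharding improvement in point 2 reduces to bounded slicing of an entity's pending chunk keys; a sketch follows. `shard_pending_chunks` and the 200-chunk default are illustrative assumptions, not the actual implementation.

```python
def shard_pending_chunks(
    chunk_keys: list[str], max_shard_size: int = 200
) -> list[list[str]]:
    """Split an oversized entity's pending chunks into bounded work units.

    Each shard can be embedded and flushed independently, so a crash or
    cancellation mid-entity resumes at the next unfinished shard instead
    of restarting the whole entity.
    """
    return [
        chunk_keys[i : i + max_shard_size]
        for i in range(0, len(chunk_keys), max_shard_size)
    ]

shards = shard_pending_chunks([f"chunk-{i}" for i in range(450)], max_shard_size=200)
assert [len(s) for s in shards] == [200, 200, 50]
```

Bounding shard size also caps peak memory and transaction size per flush, which matters for entities with thousands of pending chunks.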

The latest commit (16fc164) resolved both blockers identified in the previous reviews. The code is well-structured, follows the literate programming style, and has good test coverage.


Resolved Since Previous Reviews

  • Chunk-skip skipped_chunks_count += 1; continue indentation bug in base class: ✅ Fixed in 16fc164
  • postgres-openai quality thresholds removed to mask recall@5 = 0.0: ✅ Restored in 16fc164
  • ~100-line orchestration loop duplicated from base class: ✅ Fixed in 4529bfa — _prepare_entity_vector_jobs_window hook
  • Missing progress_callback type annotation on Postgres override: ✅ N/A — Postgres no longer overrides sync_entity_vectors_batch
  • incoming_by_key partially used: ✅ Removed in refactor

Remaining Issues

1. Dead abstract method _update_timestamp_sql

src/basic_memory/repository/search_repository_base.py line 292, postgres_search_repository.py line 873, sqlite_search_repository.py line 557 — _update_timestamp_sql is declared @abstractmethod, implemented in both subclasses, but never called anywhere in the codebase. The actual timestamp expression used throughout _prepare_entity_vector_jobs is _timestamp_now_expr() (a synchronous non-abstract method). The Postgres override carries # pragma: no cover specifically because nothing calls it.

This is dead code in three files. It should be removed (the @abstractmethod declaration from the base class and both backend implementations). The # pragma: no cover on the Postgres implementation is the tell — it's not a coverage exception for a hard-to-test path, it's evidence the method is unreachable.


2. SQLite upgrade silently drops all cached embeddings

src/basic_memory/repository/sqlite_search_repository.py lines 394–413 — on first startup after this upgrade, existing SQLite users will hit the schema mismatch check (the new entity_fingerprint and embedding_model columns are not in old tables), causing search_vector_chunks and search_vector_embeddings to be dropped and recreated. A full re-sync is triggered on the next indexing run.

This is handled gracefully — the log warning at line 407 is visible — but the UX impact (re-embedding all entities) can be significant for large knowledge bases. A comment near the schema mismatch handler noting that this was introduced in this version (and why it's safe to drop-recreate) would help future maintainers understand the intent.
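A sketch of the schema-mismatch detection described above, using `PRAGMA table_info` against an in-memory database. The column names mirror the review text, but the helper itself is hypothetical, not the code in sqlite_search_repository.py.

```python
import sqlite3

# Columns added by this PR; an older table will be missing the last two.
REQUIRED_COLUMNS = {"chunk_key", "source_hash", "entity_fingerprint", "embedding_model"}

def table_matches_schema(conn: sqlite3.Connection, table: str) -> bool:
    """Detect the upgrade case: an old table missing the new columns."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    existing = {row[1] for row in rows}  # row[1] is the column name
    return REQUIRED_COLUMNS <= existing

conn = sqlite3.connect(":memory:")
# Old-version table: lacks entity_fingerprint and embedding_model.
conn.execute("CREATE TABLE search_vector_chunks (chunk_key TEXT, source_hash TEXT)")
assert not table_matches_schema(conn, "search_vector_chunks")

# The handler drops and recreates with the full schema; a full re-sync follows.
conn.execute("DROP TABLE search_vector_chunks")
conn.execute(
    "CREATE TABLE search_vector_chunks "
    "(chunk_key TEXT, source_hash TEXT, entity_fingerprint TEXT, embedding_model TEXT)"
)
assert table_matches_schema(conn, "search_vector_chunks")
```

Drop-and-recreate is safe here because embeddings are derived data that the next indexing run regenerates; the cost is the one-time re-embedding the review flags.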


Minor Issues

3. Progress callback fires in window-sized bursts

search_repository_base.py lines 827–829 — progress_callback is invoked per entity within each window before the concurrent _prepare_entity_vector_jobs_window gather fires. For callers implementing progress bars, events arrive in batches of prepare_window_size (default 4) rather than one by one. Not a bug, but callers should be aware of this. A brief comment near the callback loop noting the burst semantics would help.


Team Checklist

  • Python 3.12+ types and async patterns
  • SQLAlchemy 2.0 best practices
  • 100-character line length: description= strings in Field() follow existing pattern
  • No commented-out code
  • Unit tests for new code: TestBatchPrepareConcurrency, TestWriteEmbeddings, fingerprint skip tests
  • 100% test coverage: ⚠️ _update_timestamp_sql in postgres has # pragma: no cover — method is dead code, should be removed
  • No hardcoded secrets
  • Proper error handling and logging: good structured loguru logging throughout
  • Performance: concurrent prepare + bulk upsert + skip-unchanged are all sound
  • Code duplication: resolved
  • Chunk-skip logic correctness: fixed in 16fc164
  • Quality test regression: fixed in 16fc164
  • Alembic migration: m6h7i8j9k0l1 adds entity_fingerprint + embedding_model for Postgres
  • SQLite upgrade path: ⚠️ schema mismatch detection drops/recreates tables — full re-sync on upgrade
  • Windows file path compatibility: no file path operations in changed code
  • No emoji/unicode in logs
  • Literate programming comments: good Trigger/Why/Outcome style throughout
  • F-string SQL safety: safety comments added near both bulk upsert statements

Bottom Line

Both blockers from the previous reviews are resolved. The implementation is correct, the architecture is clean, and the test coverage is thorough.

Two remaining items before merging:

  1. Remove _update_timestamp_sql — it's dead code in three files. The # pragma: no cover on the Postgres implementation is a signal, not a fix.

  2. Document the SQLite re-sync behaviour — add a comment near the schema mismatch handler explaining that the drop-recreate is intentional and expected on upgrade.

CI tests are still running at the time of this review.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 3m 46s


PR Review: perf(core): reduce postgres vector sync work

Tasks

  • Gather context and read changed files
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

All blockers and major concerns from previous reviews have been resolved in e1df151. The implementation is correct, well-structured, and follows the literate programming conventions in CLAUDE.md.


All Previously Identified Issues — Resolved

  • Chunk-skip skipped_chunks_count += 1; continue indentation bug (base class): ✅ Fixed in 16fc164
  • postgres-openai quality thresholds removed to mask recall@5 = 0.0: ✅ Restored in 16fc164
  • ~100-line orchestration loop duplicated from base class: ✅ Fixed in 4529bfa — _prepare_entity_vector_jobs_window hook
  • Dead _update_timestamp_sql method in 3 files (with # pragma: no cover): ✅ Removed in e1df151
  • SQLite schema-mismatch drop/recreate undocumented: ✅ Trigger/Why/Outcome comment added at sqlite_search_repository.py:407
  • Missing progress_callback type annotation on Postgres override: ✅ N/A — Postgres no longer overrides sync_entity_vectors_batch
  • incoming_by_key partially used: ✅ Removed in refactor

One Minor Observation

Postgres main-path relies on context manager auto-commit

postgres_search_repository.py — the main (non-early-return) exit from _prepare_entity_vector_jobs exits the async with db.scoped_session(...) block without an explicit await session.commit(). The base class (search_repository_base.py:1333) has an explicit commit before the block closes.

Both are functionally correct — db.scoped_session.__aexit__ auto-commits. But the inconsistency is minor and worth a note for future maintainers. Not a blocker.


Team Checklist

  • Python 3.12+ types and async patterns
  • SQLAlchemy 2.0 best practices
  • FastAPI / Typer conventions: N/A, no API/CLI changes
  • 100-character line length
  • No commented-out code
  • Unit tests for new code: TestBatchPrepareConcurrency, TestWriteEmbeddings, fingerprint skip + shard tests
  • 100% test coverage: # pragma: no cover entries are on abstract pass-through methods only
  • No hardcoded secrets
  • Proper error handling and logging: structured loguru logging throughout
  • Performance: concurrent prepare + bulk upsert + skip-unchanged all sound
  • Code duplication: resolved — _prepare_entity_vector_jobs_window hook pattern
  • Chunk-skip logic correctness: fixed and tested
  • Quality test regression: postgres-openai thresholds restored and passing
  • Alembic migration: m6h7i8j9k0l1 adds entity_fingerprint + embedding_model with NOT NULL backfill
  • SQLite upgrade path: schema mismatch handler documented with Trigger/Why/Outcome comment
  • F-string SQL safety: index-safety comments on both bulk upsert statements
  • Windows file path compatibility: no file path operations in changed code
  • No emoji/unicode in logs
  • Literate programming comments: good Trigger/Why/Outcome style throughout

Bottom Line

This is ready to merge. CI tests are still running — pending their completion.

@phernandez phernandez merged commit 8c81d3c into main Apr 8, 2026
41 checks passed
@phernandez phernandez deleted the perf/core-postgres-vector-sync-batching branch April 8, 2026 00:00
