
perf(core): reduce postgres vector sync work#723

Merged
phernandez merged 10 commits into main from perf/core-postgres-vector-sync-batching
Apr 8, 2026
Conversation

@phernandez
Member

@phernandez phernandez commented Apr 7, 2026

Summary

  • skip unchanged vector sync work using persisted semantic fingerprints and embedding model metadata
  • shard oversized entity vector sync into bounded resumable work units
  • make Postgres vector prepare concurrency configurable with a conservative default
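The skip-unchanged decision described above can be sketched as a small pure predicate. This is an illustrative sketch, not the actual basic-memory API: `StoredChunk` and `chunk_action` are hypothetical names, and the rule that a fingerprint-only change refreshes metadata without re-embedding follows the review discussion later in this thread.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoredChunk:
    """Hypothetical persisted state for one vector chunk."""
    source_hash: str          # hash of the chunk's source text
    entity_fingerprint: str   # semantic fingerprint of the owning entity
    embedding_model: str      # model that produced the stored embedding
    has_embedding: bool       # False => orphan row with no vector

def chunk_action(
    current: Optional[StoredChunk],
    source_hash: str,
    entity_fingerprint: str,
    embedding_model: str,
) -> str:
    """Decide what sync must do for one incoming chunk."""
    if current is None:
        return "embed"  # brand-new chunk: must embed
    unchanged = (
        current.source_hash == source_hash
        and current.embedding_model == embedding_model
        and current.has_embedding  # orphan rows must be re-embedded
    )
    if not unchanged:
        return "embed"
    if current.entity_fingerprint != entity_fingerprint:
        # Metadata-only update: no re-embedding needed.
        return "refresh_fingerprint"
    return "skip"

stored = StoredChunk("h1", "fp1", "model-a", True)
assert chunk_action(stored, "h1", "fp1", "model-a") == "skip"
assert chunk_action(stored, "h2", "fp1", "model-a") == "embed"
assert chunk_action(stored, "h1", "fp2", "model-a") == "refresh_fingerprint"
assert chunk_action(None, "h1", "fp1", "model-a") == "embed"
```

Any mismatch (changed content, model upgrade, orphaned row, or a new chunk) falls through to re-embedding; only a fully matching chunk is skipped.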

Testing

  • ./.venv/bin/ruff check src/basic_memory/config.py src/basic_memory/repository/search_repository_base.py src/basic_memory/repository/postgres_search_repository.py tests/repository/test_postgres_search_repository.py tests/repository/test_postgres_search_repository_unit.py tests/test_config.py
  • ./.venv/bin/pytest tests/repository/test_semantic_search_base.py tests/repository/test_postgres_search_repository_unit.py tests/test_config.py -q
  • BASIC_MEMORY_TEST_POSTGRES=1 ./.venv/bin/pytest tests/repository/test_postgres_search_repository.py -q -k "vector_sync_skips_unchanged_and_reembeds_changed_content or vector_sync_shards_oversized_entity_and_resumes"

Signed-off-by: phernandez <paul@basicmachines.co>
Member Author

Additional local verification note: just fast-check failed after the targeted repository tests passed.

Failure:

  • test-int/semantic/test_semantic_quality.py::test_semantic_quality[asyncio-postgres-openai]
  • assertion: recall@5 for postgres-openai/paraphrase/hybrid: 0.000 < 0.250

The targeted tests that passed before opening the PR were:

  • uv run pytest -q tests/repository/test_postgres_search_repository_unit.py tests/repository/test_semantic_search_base.py

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 5m 32s


PR Review: perf(core): batch postgres vector sync preparation

Tasks

  • Gather context and read changed files
  • Analyze semantic quality regression
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

The PR adds two improvements to Postgres vector sync:

  1. Concurrent entity preparation via asyncio.gather windows (hides round-trip latency on Neon/cloud)
  2. Bulk upsert of embeddings in a single SQL statement per flush (reduces execute calls)

The individual pieces look sound, but there's a known semantic quality regression (recall@5 = 0.0) that needs to be understood before merging.
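The windowed-gather idea in point 1 can be sketched as follows. `prepare_in_windows` and `fake_prepare` are hypothetical stand-ins, not the repository's actual method names; the point is that each window of prepares runs concurrently while windows run sequentially, so in-flight work stays bounded.

```python
import asyncio
from typing import Awaitable, Callable

async def prepare_in_windows(
    entity_ids: list[int],
    prepare: Callable[[int], Awaitable[str]],
    window_size: int = 4,
) -> list[str]:
    """Prepare entities in fixed-size concurrent windows."""
    results: list[str] = []
    for start in range(0, len(entity_ids), window_size):
        window = entity_ids[start : start + window_size]
        # Up to window_size prepares overlap, hiding per-entity round trips.
        results.extend(await asyncio.gather(*(prepare(eid) for eid in window)))
    return results

async def fake_prepare(entity_id: int) -> str:
    await asyncio.sleep(0.01)  # stands in for a DB round trip
    return f"job-{entity_id}"

jobs = asyncio.run(prepare_in_windows(list(range(10)), fake_prepare))
assert jobs == [f"job-{i}" for i in range(10)]
```

With sequential awaits, ten 10 ms round trips cost ~100 ms; with a window of 4 they cost roughly three window latencies, which is the latency-hiding effect the review refers to.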


Critical Issues

1. Semantic quality regression — recall@5 = 0.000

The PR author noted that just fast-check failed with:

recall@5 for postgres-openai/paraphrase/hybrid: 0.000 < 0.250

recall@5 = 0.0 (not merely below threshold — literally zero) is a strong signal that embeddings are not being stored or found at all, not just degraded. This is consistent with a silent bug in the new code path rather than a quality regression from the algorithm change.

The new sync_entity_vectors_batch override completely replaces the base class _sync_entity_vectors_internal flow. If anything goes wrong in the concurrent-prepare → flush pipeline, embeddings could silently fail to write while the method still returns a VectorSyncBatchResult with no errors (because exceptions are caught and logged as warnings).

Before merging, please confirm whether this test was already failing on main before these changes, or whether it is a regression introduced here. A quick git stash && just test-int-postgres -- -k test_semantic_quality on main would settle this.

2. Large code duplication — fragile maintenance surface

The new PostgresSearchRepository.sync_entity_vectors_batch (~100 lines) duplicates almost all orchestration logic from SearchRepositoryBase._sync_entity_vectors_internal. The only meaningful addition is the asyncio.gather window. If the base class sync logic ever changes (error handling, batch flushing, synced_entity_ids tracking), the Postgres override will silently diverge.

A less risky design would be to extract just the "prepare one window of N entities concurrently" concern into a _prepare_entity_vector_jobs_window helper and override only that in PostgresSearchRepository, still delegating the outer loop to the base class.
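The suggested design is the classic template-method pattern; here is a minimal sketch under the assumption that only the window hook varies per backend. Class bodies and names other than `_prepare_entity_vector_jobs_window`-style hooks are illustrative.

```python
import asyncio

class SearchRepositoryBase:
    """Base class owns the orchestration loop; backends override one hook."""

    prepare_window_size = 1  # sequential by default

    async def sync_entity_vectors(self, entity_ids: list[int]) -> list[str]:
        jobs: list[str] = []
        for start in range(0, len(entity_ids), self.prepare_window_size):
            window = entity_ids[start : start + self.prepare_window_size]
            jobs.extend(await self._prepare_window(window))
            # Flushing, error handling, and synced-id tracking would all
            # live here, in exactly one place.
        return jobs

    async def _prepare_window(self, window: list[int]) -> list[str]:
        # Default: one entity at a time.
        return [await self._prepare_one(eid) for eid in window]

    async def _prepare_one(self, entity_id: int) -> str:
        return f"job-{entity_id}"

class PostgresSearchRepository(SearchRepositoryBase):
    """Overrides only the window hook to add concurrency."""

    prepare_window_size = 4

    async def _prepare_window(self, window: list[int]) -> list[str]:
        return list(await asyncio.gather(*(self._prepare_one(e) for e in window)))

assert asyncio.run(PostgresSearchRepository().sync_entity_vectors([1, 2, 3])) == [
    "job-1", "job-2", "job-3",
]
```

Because the outer loop exists only in the base class, any future change to flushing or error handling automatically applies to both backends.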



Moderate Issues

3. Missing type annotation on progress_callback

src/basic_memory/repository/postgres_search_repository.py — the Postgres override drops the type annotation:

```python
# base class (correct):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback: Optional[Callable[[int, int, int], Any]] = None,
) -> VectorSyncBatchResult: ...

# Postgres override (type annotation dropped):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback=None,  # ← no annotation
) -> VectorSyncBatchResult: ...
```

4. Missing explicit session.commit() before early returns in _prepare_entity_vector_jobs

The base class explicitly calls await session.commit() before each early return inside the async with db.scoped_session(...) block (e.g. lines 865, 886, 995). The Postgres override relies on the context manager auto-commit. While functionally equivalent today (the auto-commit runs on __aexit__), the explicit commits make the intent clearer and are consistent with the rest of the codebase. Add explicit await session.commit() calls before the two early returns inside the async with block.

5. incoming_by_key is built but partially used

```python
incoming_by_key = {record["chunk_key"]: record for record in chunk_records}
```

This is only used for stale detection (if chunk_key not in incoming_by_key). The main loop iterates over chunk_records and uses existing_by_key.get(record["chunk_key"]) — it doesn't go through incoming_by_key. The dict is harmless but adds confusion.


Minor Issues

6. F-string SQL template for bulk upsert

Both the chunk upsert and the embedding upsert in _write_embeddings build SQL via f-strings (", ".join(upsert_values)). The values are driven by enumerate() (safe), but this pattern deviates from the project's style of using static text() SQL. This is acceptable for a bulk-upsert path, but a comment noting that the f-string is safe (integer indices only, no user data) would help future reviewers.
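A sketch of the index-safe f-string pattern such a comment would document. Table and column names here are assumed for illustration, not taken from the actual schema; the point is that only `enumerate()` indices enter the f-string while all values bind as parameters.

```python
def build_bulk_upsert_sql(n_rows: int) -> str:
    """Build a bulk-upsert statement for n_rows rows.

    SAFETY: the f-string interpolates only integer loop indices into
    placeholder names; no user data ever enters the SQL text, so there
    is no injection surface. Actual values bind as :chunk_key_i etc.
    """
    values = ", ".join(
        f"(:chunk_key_{i}, :embedding_{i})" for i in range(n_rows)
    )
    return (
        "INSERT INTO search_vector_embeddings (chunk_key, embedding) "
        f"VALUES {values} "
        "ON CONFLICT (chunk_key) DO UPDATE SET embedding = EXCLUDED.embedding"
    )

sql = build_bulk_upsert_sql(2)
assert ":chunk_key_0" in sql and ":embedding_1" in sql
```

The returned string would be wrapped in SQLAlchemy's `text()` and executed once per flush with a parameter dict, replacing one `execute` call per row.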

7. Progress callback semantics changed

The base class calls progress_callback(entity_id, sequential_index, total_entities) per entity before its individual prepare. The Postgres override calls it per entity within the window before the concurrent gather fires, using window_start + offset_within_window. The semantics are slightly different (batch-of-4 progress vs per-entity), which could affect any UI progress bars that depend on this callback.


Team Checklist

  • Python 3.12+ types and async patterns: ⚠️ progress_callback missing annotation on override
  • SQLAlchemy 2.0 best practices: F-string SQL building is safe but should be commented
  • 100-character line length
  • No commented-out code
  • Unit tests for new code: new TestBatchPrepareConcurrency and TestWriteEmbeddings
  • 100% test coverage: ⚠️ _update_timestamp_sql has # pragma: no cover on line 892 — verify that's still needed
  • No hardcoded secrets
  • Error handling / logging: good structured logging with loguru
  • Performance: core intent is latency hiding via concurrent prepare
  • Code duplication: ~100 lines duplicated from base class orchestration loop
  • Known test regression: test_semantic_quality[asyncio-postgres-openai] recall@5 = 0.0
  • File path / Windows compat: no file path operations in changed code
  • No emoji/unicode in logs
  • Literate programming comments: good Trigger/Why/Outcome comments on the new method
  • Alembic migration: N/A, no schema changes

Bottom Line

The bulk-embedding upsert and concurrent-prepare window are good ideas. The main concerns are:

  1. Unresolved test regression — please verify whether recall@5 = 0.0 is a pre-existing failure or introduced by this PR. If it's a regression, the concurrent asyncio.gather prepare path or the new bulk-upsert chunk path may have a bug that silently loses embedding writes.

  2. Code duplication — the nearly verbatim copy of the base class orchestration loop should be refactored before this lands.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 4m 35s


PR Review: perf(core): batch postgres vector sync preparation

Tasks

  • Gather context
  • Read changed files
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

The PR introduces two improvements to Postgres vector sync:

  1. Concurrent entity preparation via asyncio.gather windows (hides Neon round-trip latency)
  2. Bulk upsert of chunk rows and embeddings per flush (reduces per-entity execute calls)

The individual pieces look sound, but there is a confirmed semantic quality regression and a code duplication concern that should be resolved before merging.


Critical Issues

1. postgres-openai quality thresholds silently removed to mask a failing test

This is the most important finding. Looking at the first commit of this PR (a1541f3), test-int/semantic/test_semantic_quality.py originally included:

```python
# OpenAI hybrid should handle paraphrases better than FastEmbed
("postgres-openai", "lexical", "hybrid"): 0.37,
("postgres-openai", "paraphrase", "hybrid"): 0.25,
```

The test ran, failed with recall@5 = 0.000 < 0.250, and subsequent commits removed those two threshold entries (-3 lines in the file diff) instead of fixing the root cause. The comment in the current file now says "OpenAI metrics are still recorded, but we do not gate on them yet" — but we were gating on them.

recall@5 = 0.0 (literally zero) is not noise. It means no relevant results are returned in the top 5. This is consistent with a bug where embeddings are written but not queryable, or not written at all, for the openai provider specifically. The concurrent asyncio.gather path or the new bulk-upsert _write_embeddings are the most likely suspects.

The threshold removal must be reverted, the root cause found and fixed, and the test must pass before this merges.

2. ~100 lines of orchestration logic duplicated from base class

PostgresSearchRepository.sync_entity_vectors_batch (lines 497–738) duplicates almost all of SearchRepositoryBase._sync_entity_vectors_internal. The only meaningful addition is the asyncio.gather window.

If the base class orchestration changes (error handling, deferred tracking, synced_entity_ids bookkeeping, orphan cleanup), the Postgres override will silently diverge — and we already have evidence it has diverged (quality regression).

The refactor suggested in the previous review stands: extract the "prepare window of N entities concurrently" concern into a hook like _prepare_entity_vector_jobs_window and override only that in Postgres. The outer flush loop should live in one place in the base class.



Moderate Issues

3. Missing progress_callback type annotation on Postgres override

postgres_search_repository.py line 500:

```python
# Postgres override (annotation dropped):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback=None,       # ← no type annotation
) -> VectorSyncBatchResult: ...

# Base class (correct):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback: Optional[Callable[[int, int, int], Any]] = None,
) -> VectorSyncBatchResult: ...
```

4. Missing explicit await session.commit() before early returns in _prepare_entity_vector_jobs

The base class _prepare_entity_vector_jobs (lines 1069, 1094 in search_repository_base.py) explicitly calls await session.commit() before each early return inside async with db.scoped_session(...). The Postgres override (lines 785–793, 808–817) omits these calls and relies on the context manager auto-commit.

Both are functionally equivalent today because db.scoped_session.__aexit__ calls await session.commit(). But the inconsistency is subtle and could mislead future maintainers into thinking those branches don't persist state. Add explicit commits for consistency.
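A minimal sketch of why the two styles behave the same, using a fake session and a simplified `scoped_session` that commits on clean exit. This is an assumption-laden model: the real `db.scoped_session` also handles rollback on error, which this sketch deliberately omits.

```python
import asyncio
from contextlib import asynccontextmanager

class FakeSession:
    """Stand-in for an async SQLAlchemy session; counts commits."""
    def __init__(self) -> None:
        self.commits = 0

    async def commit(self) -> None:
        self.commits += 1

@asynccontextmanager
async def scoped_session():
    # Mirrors the auto-commit in db.scoped_session.__aexit__.
    session = FakeSession()
    try:
        yield session
    finally:
        await session.commit()

async def early_return_path() -> FakeSession:
    async with scoped_session() as session:
        # No explicit commit before this return; __aexit__ still commits.
        return session

session = asyncio.run(early_return_path())
assert session.commits == 1
```

The work persists either way; the explicit `await session.commit()` before early returns only makes that intent visible at the return site, which is the consistency argument above.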

5. incoming_by_key is built but only partially used

postgres_search_repository.py line 833:

```python
incoming_by_key = {record["chunk_key"]: record for record in chunk_records}
```

This dict is built but only used for stale detection (if chunk_key not in incoming_by_key). The main per-record loop iterates chunk_records directly and looks up existing rows via existing_by_key. The dict is harmless but adds noise and could confuse reviewers.


Minor Issues

6. F-string SQL building in _write_embeddings and _prepare_entity_vector_jobs bulk upsert

The bulk inserts build SQL via f-strings with ", ".join(...). The values are driven by integer enumerate() indices so there is no injection risk, but this deviates from the project's preference for static text() SQL. A short comment near each f-string noting that the template is index-safe would help future reviewers.

7. Progress callback semantics changed

The base class calls progress_callback(entity_id, sequential_index, total_entities) for each entity individually before prepare. The Postgres override (lines 544–546) calls it per entity within a window batch using window_start + offset_within_window, then fires all prepares concurrently. For any UI progress bar consuming this callback, the progress events arrive in bursts rather than one by one. This is not a bug but is a semantic change worth documenting.


Team Checklist

  • Python 3.12+ types and async patterns: ⚠️ progress_callback missing type annotation on Postgres override
  • SQLAlchemy 2.0 best practices
  • FastAPI / Typer conventions: N/A, no API/CLI changes
  • 100-character line length
  • No commented-out code
  • Unit tests for new code: TestBatchPrepareConcurrency and TestWriteEmbeddings added
  • 100% test coverage: ⚠️ _update_timestamp_sql (line 1097 in postgres) carries # pragma: no cover — verify it's still needed
  • Documentation / docstrings: good Trigger/Why/Outcome comments on new method
  • No hardcoded secrets
  • Proper error handling and logging: structured loguru logging throughout
  • Performance: core intent (concurrent prepare + bulk upsert) is sound
  • Code duplication: ~100-line orchestration loop duplicated from base class
  • Known test regression: postgres-openai quality thresholds removed to mask recall@5 = 0.0 failure
  • Alembic migration: m6h7i8j9k0l1 adds entity_fingerprint + embedding_model columns
  • Windows file path compatibility: no file path operations in changed code
  • No emoji/unicode in log output
  • CLAUDE.md conventions: literate programming style followed

Bottom Line

Two blockers before merging:

  1. Restore the postgres-openai quality thresholds (revert the -3 lines removed from test_semantic_quality.py) and fix whatever is causing recall@5 = 0.0. The concurrent prepare window or the bulk embedding upsert likely has a bug that silently produces zero queryable results for the OpenAI provider.

  2. Refactor the duplicated orchestration loop — the Postgres override should be a minimal hook on the base class, not a full copy of it.

@phernandez phernandez changed the title from "perf(core): batch postgres vector sync preparation" to "perf(core): reduce postgres vector sync work" Apr 7, 2026
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 6m 37s


PR Review: perf(core): reduce postgres vector sync work

Tasks

  • Gather context
  • Read changed files
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

This PR delivers three improvements to Postgres vector sync: (1) skip-unchanged logic using semantic fingerprints, (2) sharding of oversized entities into bounded work units, and (3) configurable concurrent preparation windows. The latest commit (4529bfa) resolves the major code-duplication concern from previous reviews by extracting a _prepare_entity_vector_jobs_window hook and delegating the full orchestration loop to the base class.

However there are two blockers remaining.


Critical Issues

1. Chunk-skip logic misplaced in SearchRepositoryBase._prepare_entity_vector_jobs — existing chunks never re-embedded

src/basic_memory/repository/search_repository_base.py lines 1239–1240:

```python
# current state (BROKEN):
if current:
    ...
    if same_source_hash and not is_orphan and same_embedding_model:
        if not same_entity_fingerprint:
            await session.execute(UPDATE_fingerprint)
    skipped_chunks_count += 1   # ← at `if current:` level — always runs
    continue                    # ← at `if current:` level — always runs

pending_records.append(record)  # ← only new chunks ever reach this
```

skipped_chunks_count += 1 and continue must be inside the if same_source_hash and not is_orphan and same_embedding_model: block. As written, every chunk that already exists in the DB is counted as skipped and never added to pending_records, regardless of whether its content changed, its embedding is missing (orphan), or the embedding model changed. Only first-time syncs for completely new entities work correctly.

The old code on main was:

```python
if current and current.source_hash == record["source_hash"] and not is_orphan:
    ...
    continue
# else fall through and add to embedding_jobs directly
```

This regression means: after initial indexing, any entity whose content changes will silently retain stale embeddings forever (SQLite path). The Postgres override of _prepare_entity_vector_jobs has this logic correct (the pending_records.append(record) at the end of the loop), but the base class — used by SQLite — is broken.

Fix:

```python
if current:
    ...
    if same_source_hash and not is_orphan and same_embedding_model:
        if not same_entity_fingerprint:
            await session.execute(UPDATE_fingerprint)
        skipped_chunks_count += 1   # ← move inside this block
        continue                    # ← move inside this block

pending_records.append(record)
```


2. postgres-openai quality thresholds still removed

test-int/semantic/test_semantic_quality.py diff:

```diff
-    # OpenAI hybrid should handle paraphrases better than FastEmbed
-    ("postgres-openai", "lexical", "hybrid"): 0.37,
-    ("postgres-openai", "paraphrase", "hybrid"): 0.25,
+    # OpenAI metrics are still recorded, but we do not gate on them yet.
+    # The current benchmark corpus is too small to make that combo stable.
```

The thresholds were passing before a1541f3 (the first commit of this PR). They were removed after that commit broke them (recall@5 = 0.0). The refactoring in 4529bfa may have fixed the underlying bug, but the thresholds were never restored to verify.

Please restore the thresholds and confirm the test passes. If the test was genuinely flaky on a small corpus before this PR (not just broken by it), the right fix is to enlarge the corpus — not silently drop the assertions.


Resolved Since Previous Reviews

  • ~100-line orchestration loop duplicated from base class: ✅ Fixed by 4529bfa refactor — _prepare_entity_vector_jobs_window hook, base class owns the full loop
  • Missing progress_callback type annotation on Postgres override: ✅ Fixed — Postgres no longer overrides sync_entity_vectors_batch
  • incoming_by_key built but only partially used: ✅ Removed in refactor
  • Missing explicit session.commit() before early returns: ✅ Both Postgres and base class have explicit commits

Minor Issues

3. _update_timestamp_sql pragma on line 874

```python
async def _update_timestamp_sql(self) -> str:
    return "NOW()"  # pragma: no cover
```

This method has a body (return "NOW()") and is not abstract, yet carries # pragma: no cover. Either it should have a test that calls it, or if it's a Postgres-specific concrete implementation of an abstract base method, document why it can't be tested.

4. Progress-callback burst semantics

_sync_entity_vectors_internal lines 827–829 fire progress_callback per entity before the concurrent _prepare_entity_vector_jobs_window gather fires. Progress events arrive in window-sized bursts (default 4 at a time), not one-by-one. Not a bug, but callers implementing progress bars may observe jumpy reporting. This is worth a comment near the callback invocation.


Team Checklist

  • Python 3.12+ types and async patterns
  • SQLAlchemy 2.0 best practices
  • 100-character line length
  • No commented-out code
  • Unit tests for new code: TestBatchPrepareConcurrency, TestWriteEmbeddings, fingerprint skip tests added
  • 100% test coverage: ⚠️ _update_timestamp_sql pragma on line 874 — verify still needed
  • No hardcoded secrets
  • Proper error handling and logging
  • Performance: concurrent prepare + bulk upsert + skip-unchanged are all sound ideas
  • Code duplication: resolved by 4529bfa
  • Chunk-skip logic correctness: skipped_chunks_count += 1; continue misplaced in base class — blocker
  • Quality test regression: postgres-openai thresholds removed and not restored — blocker
  • Alembic migration: m6h7i8j9k0l1 adds entity_fingerprint + embedding_model columns
  • Windows file path compatibility
  • No emoji/unicode in logs
  • Literate programming comments: good Trigger/Why/Outcome style throughout

Bottom Line

The refactoring in 4529bfa resolved the main architectural concern from previous reviews. Two blockers remain:

  1. Fix the indentation bug in SearchRepositoryBase._prepare_entity_vector_jobs lines 1239–1240 — move skipped_chunks_count += 1 and continue inside if same_source_hash and not is_orphan and same_embedding_model:. Add a test that re-syncs an entity with changed content and verifies the embedding is updated.

  2. Restore the postgres-openai quality thresholds and confirm the test passes on HEAD. If recall@5 = 0.0 still reproduces, the bug is likely in the Postgres _prepare_entity_vector_jobs bulk-upsert path or _write_embeddings.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 3m 54s


PR Review: perf(core): reduce postgres vector sync work

Tasks

  • Gather context
  • Read changed files
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

The PR delivers three solid improvements to Postgres vector sync:

  1. Skip-unchanged logic — persisted semantic fingerprints + embedding model metadata lets the sync short-circuit re-embedding for entities whose content hasn't changed
  2. Sharding — oversized entities with thousands of pending chunks are split into bounded, resumable work units
  3. Configurable concurrency — Postgres preparation concurrency is tunable (semantic_postgres_prepare_concurrency, default 4)
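The sharding improvement in point 2 reduces to bounded slicing of an entity's pending chunk keys; a sketch follows. `shard_pending_chunks` and the 200-chunk default are illustrative assumptions, not the actual implementation.

```python
def shard_pending_chunks(
    chunk_keys: list[str], max_shard_size: int = 200
) -> list[list[str]]:
    """Split an oversized entity's pending chunks into bounded work units.

    Each shard can be embedded and flushed independently, so a crash or
    cancellation mid-entity resumes at the next unfinished shard instead
    of restarting the whole entity.
    """
    return [
        chunk_keys[i : i + max_shard_size]
        for i in range(0, len(chunk_keys), max_shard_size)
    ]

shards = shard_pending_chunks([f"chunk-{i}" for i in range(450)], max_shard_size=200)
assert [len(s) for s in shards] == [200, 200, 50]
```

Bounding shard size also caps peak memory and transaction size per flush, which matters for entities with thousands of pending chunks.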

The latest commit (16fc164) resolved both blockers identified in the previous reviews. The code is well-structured, follows the literate programming style, and has good test coverage.


Resolved Since Previous Reviews

  • Chunk-skip skipped_chunks_count += 1; continue indentation bug in base class: ✅ Fixed in 16fc164
  • postgres-openai quality thresholds removed to mask recall@5 = 0.0: ✅ Restored in 16fc164
  • ~100-line orchestration loop duplicated from base class: ✅ Fixed in 4529bfa — _prepare_entity_vector_jobs_window hook
  • Missing progress_callback type annotation on Postgres override: ✅ N/A — Postgres no longer overrides sync_entity_vectors_batch
  • incoming_by_key partially used: ✅ Removed in refactor

Remaining Issues

1. Dead abstract method _update_timestamp_sql

src/basic_memory/repository/search_repository_base.py line 292, postgres_search_repository.py line 873, sqlite_search_repository.py line 557 — _update_timestamp_sql is declared @abstractmethod, implemented in both subclasses, but never called anywhere in the codebase. The actual timestamp expression used throughout _prepare_entity_vector_jobs is _timestamp_now_expr() (a synchronous non-abstract method). The Postgres override carries # pragma: no cover specifically because nothing calls it.

This is dead code in three files. It should be removed (the @abstractmethod declaration from the base class and both backend implementations). The # pragma: no cover on the Postgres implementation is the tell — it's not a coverage exception for a hard-to-test path, it's evidence the method is unreachable.


2. SQLite upgrade silently drops all cached embeddings

src/basic_memory/repository/sqlite_search_repository.py lines 394–413 — on first startup after this upgrade, existing SQLite users will hit the schema mismatch check (the new entity_fingerprint and embedding_model columns are not in old tables), causing search_vector_chunks and search_vector_embeddings to be dropped and recreated. A full re-sync is triggered on the next indexing run.

This is handled gracefully — the log warning at line 407 is visible — but the UX impact (re-embedding all entities) can be significant for large knowledge bases. A comment near the schema mismatch handler noting that this was introduced in this version (and why it's safe to drop-recreate) would help future maintainers understand the intent.
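A sketch of the schema-mismatch detection described above, using `PRAGMA table_info` against an in-memory database. The column names mirror the review text, but the helper itself is hypothetical, not the code in sqlite_search_repository.py.

```python
import sqlite3

# Columns added by this PR; an older table will be missing the last two.
REQUIRED_COLUMNS = {"chunk_key", "source_hash", "entity_fingerprint", "embedding_model"}

def table_matches_schema(conn: sqlite3.Connection, table: str) -> bool:
    """Detect the upgrade case: an old table missing the new columns."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    existing = {row[1] for row in rows}  # row[1] is the column name
    return REQUIRED_COLUMNS <= existing

conn = sqlite3.connect(":memory:")
# Old-version table: lacks entity_fingerprint and embedding_model.
conn.execute("CREATE TABLE search_vector_chunks (chunk_key TEXT, source_hash TEXT)")
assert not table_matches_schema(conn, "search_vector_chunks")

# The handler drops and recreates with the full schema; a full re-sync follows.
conn.execute("DROP TABLE search_vector_chunks")
conn.execute(
    "CREATE TABLE search_vector_chunks "
    "(chunk_key TEXT, source_hash TEXT, entity_fingerprint TEXT, embedding_model TEXT)"
)
assert table_matches_schema(conn, "search_vector_chunks")
```

Drop-and-recreate is safe here because embeddings are derived data that the next indexing run regenerates; the cost is the one-time re-embedding the review flags.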


Minor Issues

3. Progress callback fires in window-sized bursts

search_repository_base.py lines 827–829 — progress_callback is invoked per entity within each window before the concurrent _prepare_entity_vector_jobs_window gather fires. For callers implementing progress bars, events arrive in batches of prepare_window_size (default 4) rather than one by one. Not a bug, but callers should be aware of this. A brief comment near the callback loop noting the burst semantics would help.


Team Checklist

  • Python 3.12+ types and async patterns
  • SQLAlchemy 2.0 best practices
  • 100-character line length: description= strings in Field() follow existing pattern
  • No commented-out code
  • Unit tests for new code: TestBatchPrepareConcurrency, TestWriteEmbeddings, fingerprint skip tests
  • 100% test coverage: ⚠️ _update_timestamp_sql in postgres has # pragma: no cover — method is dead code, should be removed
  • No hardcoded secrets
  • Proper error handling and logging: good structured loguru logging throughout
  • Performance: concurrent prepare + bulk upsert + skip-unchanged are all sound
  • Code duplication: resolved
  • Chunk-skip logic correctness: fixed in 16fc164
  • Quality test regression: fixed in 16fc164
  • Alembic migration: m6h7i8j9k0l1 adds entity_fingerprint + embedding_model for Postgres
  • SQLite upgrade path: ⚠️ schema mismatch detection drops/recreates tables — full re-sync on upgrade
  • Windows file path compatibility: no file path operations in changed code
  • No emoji/unicode in logs
  • Literate programming comments: good Trigger/Why/Outcome style throughout
  • F-string SQL safety: safety comments added near both bulk upsert statements

Bottom Line

Both blockers from the previous reviews are resolved. The implementation is correct, the architecture is clean, and the test coverage is thorough.

Two remaining items before merging:

  1. Remove _update_timestamp_sql — it's dead code in three files. The # pragma: no cover on the Postgres implementation is a signal, not a fix.

  2. Document the SQLite re-sync behaviour — add a comment near the schema mismatch handler explaining that the drop-recreate is intentional and expected on upgrade.

CI tests are still running at the time of this review.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 3m 46s


PR Review: perf(core): reduce postgres vector sync work

Tasks

  • Gather context and read changed files
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

All blockers and major concerns from previous reviews have been resolved in e1df151. The implementation is correct, well-structured, and follows the literate programming conventions in CLAUDE.md.


All Previously Identified Issues — Resolved

  • Chunk-skip skipped_chunks_count += 1; continue indentation bug (base class): ✅ Fixed in 16fc164
  • postgres-openai quality thresholds removed to mask recall@5 = 0.0: ✅ Restored in 16fc164
  • ~100-line orchestration loop duplicated from base class: ✅ Fixed in 4529bfa — _prepare_entity_vector_jobs_window hook
  • Dead _update_timestamp_sql method in 3 files (with # pragma: no cover): ✅ Removed in e1df151
  • SQLite schema-mismatch drop/recreate undocumented: ✅ Trigger/Why/Outcome comment added at sqlite_search_repository.py:407
  • Missing progress_callback type annotation on Postgres override: ✅ N/A — Postgres no longer overrides sync_entity_vectors_batch
  • incoming_by_key partially used: ✅ Removed in refactor

One Minor Observation

Postgres main-path relies on context manager auto-commit

postgres_search_repository.py — the main (non-early-return) exit from _prepare_entity_vector_jobs exits the async with db.scoped_session(...) block without an explicit await session.commit(). The base class (search_repository_base.py:1333) has an explicit commit before the block closes.

Both are functionally correct — db.scoped_session.__aexit__ auto-commits. But the inconsistency is minor and worth a note for future maintainers. Not a blocker.


Team Checklist

  • Python 3.12+ types and async patterns
  • SQLAlchemy 2.0 best practices
  • FastAPI / Typer conventions: N/A, no API/CLI changes
  • 100-character line length
  • No commented-out code
  • Unit tests for new code: TestBatchPrepareConcurrency, TestWriteEmbeddings, fingerprint skip + shard tests
  • 100% test coverage: # pragma: no cover entries are on abstract pass-through methods only
  • No hardcoded secrets
  • Proper error handling and logging: structured loguru logging throughout
  • Performance: concurrent prepare + bulk upsert + skip-unchanged all sound
  • Code duplication: resolved — _prepare_entity_vector_jobs_window hook pattern
  • Chunk-skip logic correctness: fixed and tested
  • Quality test regression: postgres-openai thresholds restored and passing
  • Alembic migration: m6h7i8j9k0l1 adds entity_fingerprint + embedding_model with NOT NULL backfill
  • SQLite upgrade path: schema mismatch handler documented with Trigger/Why/Outcome comment
  • F-string SQL safety: index-safety comments on both bulk upsert statements
  • Windows file path compatibility: no file path operations in changed code
  • No emoji/unicode in logs
  • Literate programming comments: good Trigger/Why/Outcome style throughout

Bottom Line

This is ready to merge. CI tests are still running — pending their completion.

@phernandez phernandez merged commit 8c81d3c into main Apr 8, 2026
41 checks passed
@phernandez phernandez deleted the perf/core-postgres-vector-sync-batching branch April 8, 2026 00:00
