perf(sync): batch file indexing in core #726
Conversation
Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 2m 35s — PR Review: perf(sync): batch file indexing in core

Summary

This is a well-architected PR that introduces a typed, reusable batch indexing layer. The separation of concerns is excellent.

Issues Found

🔴 Dead code:
| Category | Status | Notes |
|---|---|---|
| Python 3.12+ type annotations | ✅ | slots=True dataclasses, type alias in factory |
| Async patterns / SQLAlchemy 2.0 | ✅ | Correct asyncio.gather + Semaphore usage |
| FastAPI/Typer conventions | ✅ | |
| 100-char line length | ✅ | |
| No commented-out code | ✅ | |
| Unit tests for new functions | ✅ | |
| Integration tests for new paths | ❌ | Missing test-int/ coverage for batch sync |
| 100% test coverage | ❌ | Some _upsert_regular_file branches likely uncovered |
| MCP tools atomic/composable | N/A | No MCP changes |
| No DB migration needed | ✅ | Config fields only |
| Local-first architecture preserved | ✅ | |
| No hardcoded secrets/credentials | ✅ | |
| Input validation | ✅ | max_files <= 0 / max_bytes <= 0 guards |
| No sensitive data in logs | ✅ | |
| Windows path compatibility | ✅ | as_posix() used consistently in scan paths |
| No emojis/unicode in console/log output | ✅ | New code uses ASCII only; ✓ in db.py:251 is pre-existing |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f8d017529
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
except Exception as exc:
    errors[path] = str(exc)
    logger.warning("Batch indexing failed", path=path, error=str(exc))
```
Re-raise fatal sync exceptions in batch workers
_run_bounded currently downgrades every worker exception to a recoverable per-file error, which swallows SyncFatalError from repository writes (for example when a project is deleted and FK checks fail during sync). The pre-batch path explicitly propagated fatal sync errors to stop the run, but this code now lets sync continue and potentially update scan watermark after partial failures, leaving state inconsistent. Please allow fatal exceptions to propagate instead of handling them as normal batch errors.
```python
files[path] = IndexInputFile(
    path=metadata.path,
    size=metadata.size,
    checksum=metadata.checksum,
    content_type=metadata.content_type,
```
Recompute checksum from loaded bytes before upsert
This carries forward the checksum captured during scan() instead of deriving it from the bytes just read for indexing. If a file changes between scan and batch load, the upsert path can write new content/mtime while persisting an old checksum; because mtime and size may already match afterward, later scans may never correct the checksum. That leaves persistent checksum drift and can break checksum-based move detection. Compute checksum from content (or recompute at write time) rather than trusting scan-time checksum here.
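The requested behavior, sketched: derive the checksum from the bytes actually loaded for indexing. The sha256 choice and the helper names here are assumptions for illustration; the PR's compute_checksum may differ.

```python
import hashlib


async def compute_checksum(content: bytes) -> str:
    # Assumption: a hex digest; sha256 chosen for illustration only
    return hashlib.sha256(content).hexdigest()


async def load_for_index(path: str, read_file_bytes):
    """Load bytes for indexing and derive the checksum from those bytes,
    not from scan-time metadata that may be stale by the time we index."""
    content = await read_file_bytes(path)
    loaded_checksum = await compute_checksum(content)
    return content, loaded_checksum
```

If the file changed between scan and load, the recomputed value reflects what is actually being upserted, so later scans never see a mismatched size/mtime/checksum triple.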
Claude finished @phernandez's task in 2m 16s — PR Review: perf(sync): batch file indexing in core
New Commit:
| Severity | Issue | Status |
|---|---|---|
| 🔴 | progress_callback in index_files is never invoked from _index_changed_files | Open |
| 🔴 | get_file_path_to_permalink_map() called once per batch instead of once per sync | Open |
| 🟡 | Sequential await loop in _resolve_batch_relations fetch phase (line 434-437) | Open |
| 🟡 | existing is None evaluated after except IntegrityError mutates existing (line 411) | Open |
| 🟡 | No test-int/ integration coverage for multi-batch sync | Open |
Still-open Codex review issues
| Priority | Issue | Status |
|---|---|---|
| P1 | _run_bounded catches all exceptions, swallowing SyncFatalError from repository writes (line 554) | Open |
| P2 | Checksum stored in IndexedEntity comes from scan time; if file changes between scan and load, checksum can drift | Open |
Checklist Results
| Category | Status | Notes |
|---|---|---|
| Python 3.12+ type annotations | ✅ | slots=True dataclasses throughout |
| Async patterns / SQLAlchemy 2.0 | ✅ | |
| 100-char line length | ✅ | |
| No commented-out code | ✅ | |
| Unit tests for new behavior | ✅ | No-rewrite and rewrite cases covered |
| Integration tests | ❌ | Still missing test-int/ batch sync coverage |
| 100% test coverage | ✅ | # pragma: no cover on except in update_frontmatter_with_result — acceptable |
| No hardcoded secrets | ✅ | |
| No sensitive data in logs | ✅ | |
| Backward compatibility of update_frontmatter | ✅ | Shim delegates correctly |
| IndexFileWriter protocol updated consistently | ✅ | Test writer, prod writer, and protocol all updated |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f3cf950c0
```python
entity_id=updated.id,
checksum=prepared.final_checksum,
content_type=prepared.file.content_type,
search_content=remove_frontmatter(prepared.content),
```
Derive markdown search content from parsed markdown
Building search_content with remove_frontmatter(prepared.content) can fail on files that the parser already accepted (for example, markdown starting with --- but malformed/non-frontmatter delimiters). EntityParser.parse_markdown_content() is explicitly tolerant of malformed YAML and still produces usable EntityMarkdown, but this later remove_frontmatter() call can raise and mark the file as a batch error after the entity was already upserted. That regresses sync robustness for malformed-but-indexable notes; use the parsed markdown body (prepared.markdown.content) instead of reparsing raw text.
Claude finished @phernandez's task in 3m 33s — PR Review: perf(sync): batch file indexing in core
New Commits:
| Prior Issue | Resolution |
|---|---|
| 🔴 progress_callback dead code in index_files | Removed; replaced with existing_permalink_by_path parameter |
| 🔴 get_file_path_to_permalink_map() called once per batch | Fetched once before loop in _index_changed_files; passed as shared mutable dict |
| 🟡 Sequential await loop in _resolve_batch_relations (both fetch + count phases) | Both replaced with asyncio.gather |
| 🟡 existing is None evaluated after IntegrityError mutates existing | is_new_entity flag captured before try block |
| 🟡 Missing test-int/ integration coverage | test_sync_batching_integration.py added with 2 real-service scenarios |
| P1 _run_bounded swallows SyncFatalError | Re-raises when isinstance(exc, SyncFatalError) or isinstance(exc.__cause__, SyncFatalError) |
| P2 Checksum stored from scan time | Now recomputed via compute_checksum(content) from loaded bytes |
| 🟡 remove_frontmatter called redundantly on every markdown | Replaced with prepared.markdown.content if prepared.markdown.content is not None else prepared.content |
| 🟡 FrontmatterUpdateResult/IndexFrontmatterWriteResult adapter lacks explanation | # Why: comment added to _FileServiceIndexWriter.write_frontmatter |
| P1 Codex: remove_frontmatter can raise on malformed-but-parseable files | Addressed by the markdown content fallback; unit test added |
Also: the shared_permalink_by_path dict being passed by reference and mutated in-place by _normalize_markdown_batch (existing_permalink_by_path[path] = normalized[path].markdown.frontmatter.permalink) is an elegant way to propagate newly assigned permalinks across batches without an extra DB round-trip per batch.
New Issues Found
🟡 except FileNotFoundError on sync_service.py:629 is dead code
After d986c4d reverted the FileNotFoundError re-raise from file_service.py, read_file_bytes now wraps all exceptions — including FileNotFoundError — in FileOperationError. The handler at line 629 can never fire from read_file_bytes. The subsequent except FileOperationError (line 631) is the correct path, and it already handles the FileNotFoundError case via isinstance(exc.__cause__, FileNotFoundError).
```python
# sync_service.py:629-638
except FileNotFoundError:  # ← dead: read_file_bytes always wraps in FileOperationError
    await self.handle_delete(path)
except FileOperationError as exc:  # ← this is the actual path for missing files
    if isinstance(exc.__cause__, FileNotFoundError):
        await self.handle_delete(path)
    ...
```

The dead handler is harmless but misleads readers into thinking FileNotFoundError can surface unwrapped. Remove it, or add a comment like # safety net: kept in case callers other than read_file_bytes are added here.
🟢 Minor: batching.py line 55 — == is correct but relies on unstated invariant
The change from current_bytes >= max_bytes to current_bytes == max_bytes is logically equivalent given that the overflow-flush on lines 38–43 ensures current_bytes <= max_bytes always holds after adding a file. However, the reviewer's suggestion was to change > to >= on line 39 (to unify both checks), not to make line 55 more restrictive. The == check is less defensive: if the invariant ever breaks (e.g., a file with negative size), an over-budget batch would not flush. The prior >= was a free safety net. Consider reverting line 55 to >= or adding a comment:
```python
# current_bytes <= max_bytes is invariant here (overflow case handled above),
# so == is equivalent to >=, but >= is kept for defensive clarity.
if len(current_paths) >= max_files or current_bytes >= max_bytes:
```

Checklist Results
| Category | Status | Notes |
|---|---|---|
| Python 3.12+ type annotations | ✅ | |
| Async patterns / SQLAlchemy 2.0 | ✅ | asyncio.gather used correctly throughout |
| 100-char line length | ✅ | |
| No commented-out code | ✅ | |
| Unit tests for new behavior | ✅ | Malformed frontmatter + fatal error re-raise tests added |
| Integration tests | ✅ | test-int/test_sync_batching_integration.py added with circuit-breaker and cross-batch forward-ref tests |
| 100% coverage | ✅ | Dead except FileNotFoundError may show as uncovered — worth checking |
| SyncFatalError propagation | ✅ | Correct; handles both direct and __cause__ wrapping |
| Checksum from loaded bytes | ✅ | Scan-time drift fixed |
| Permalink map fetched once per sync | ✅ | Shared mutable dict threaded across batches correctly |
| search_content from parsed body | ✅ | Malformed-frontmatter regression closed |
| is_new_entity flag | ✅ | Post-mutation evaluation eliminated |
| No hardcoded secrets | ✅ | |
| Windows path compatibility | ✅ | |
| No emojis/unicode in logs | ✅ | |
Overall: This is in good shape. The one actionable item is the dead except FileNotFoundError handler — it should be removed to keep the error-handling contract clear. The batching.py == vs >= is a style nit. Ready to merge once the dead handler is cleaned up.
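For reference, the flush logic under discussion can be sketched end to end. This is a hypothetical make_batches; the PR's batching.py types and names may differ, but the overflow-flush and boundary-flush ordering follows the review's description.

```python
def make_batches(files, max_files, max_bytes):
    """Group (path, size) pairs into batches bounded by count and total bytes.

    Invariant: after the overflow flush below, current_bytes <= max_bytes,
    so a `>=` boundary check is equivalent to `==` but more defensive.
    """
    if max_files <= 0 or max_bytes <= 0:
        raise ValueError("max_files and max_bytes must be positive")
    batches, current_paths, current_bytes = [], [], 0
    for path, size in files:
        # Overflow flush: adding this file would exceed the byte budget
        if current_paths and current_bytes + size > max_bytes:
            batches.append(current_paths)
            current_paths, current_bytes = [], 0
        current_paths.append(path)
        current_bytes += size
        # Boundary flush: batch is exactly full by count or bytes
        if len(current_paths) >= max_files or current_bytes >= max_bytes:
            batches.append(current_paths)
            current_paths, current_bytes = [], 0
    if current_paths:
        batches.append(current_paths)
    return batches
```

A single file larger than max_bytes still lands in its own batch: the overflow flush only fires when the current batch is non-empty, and the boundary flush closes the oversized batch immediately.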
Claude finished @phernandez's task in 3m 29s — PR Review: perf(sync): batch file indexing in core
New Commits:
| Issue | Status |
|---|---|
| 🔴 progress_callback dead code in index_files | ✅ Removed entirely |
| 🔴 get_file_path_to_permalink_map() called once per batch | ✅ Pre-fetched once in _index_changed_files, passed as existing_permalink_by_path |
| 🟡 Sequential await loop in _resolve_batch_relations | ✅ Both fetch and remaining-count phases now use asyncio.gather |
| 🟡 existing is None evaluated after except IntegrityError mutates existing | ✅ is_new_entity flag captured before try block |
| 🟡 No test-int/ integration coverage | ✅ Added test-int/test_sync_batching_integration.py with two integration tests |
| 🟡 test_sync_service_batching.py sparse coverage | ✅ Added tests for empty changed_paths, circuit breaker, metadata errors, batch load errors |
| 🔴 P1: _run_bounded swallows SyncFatalError | ✅ Fixed — SyncFatalError and __cause__ re-raised |
| 🔴 P2: Checksum from scan time can drift | ✅ Fixed — loaded_checksum = await compute_checksum(content) computed from actual bytes |
| 🟡 Codex P1: remove_frontmatter on parsed body | ✅ Fixed — now uses prepared.markdown.content with fallback |
| 🟡 Adapter comment missing | ✅ Added # Why: comment on the FrontmatterUpdateResult → IndexFrontmatterWriteResult conversion |
One new issue
🟡 except FileNotFoundError at sync_service.py:629 is now dead code
Commit d986c4d changed file_service.read_file_bytes to wrap ALL errors (including FileNotFoundError) in FileOperationError. FileOperationError is not a subclass of FileNotFoundError, so the bare handler on line 629 can never be reached:
```python
# sync_service.py:629-630
except FileNotFoundError:  # ← dead: read_file_bytes never raises this bare
    await self.handle_delete(path)
except FileOperationError as exc:  # ← the actual path
    if isinstance(exc.__cause__, FileNotFoundError):
        await self.handle_delete(path)
```

Either remove the dead handler or add a comment explaining it's a defensive net.
Architecture observation (no action needed)
The shared_permalink_by_path dict is passed by reference to _normalize_markdown_batch and mutated in-place (line 228 of batch_indexer.py). This correctly propagates new permalinks from batch N into the reserved set for batch N+1. The in-place mutation is intentional — but a one-line comment at the call site in _index_changed_files noting "mutated in-place to accumulate cross-batch permalink reservations" would help future readers understand why the dict isn't re-fetched between batches.
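The cross-batch propagation described above can be sketched with toy types. All names and the permalink-derivation rule here are hypothetical; the PR's _normalize_markdown_batch works on parsed markdown objects, but the in-place-mutation pattern is the same.

```python
def normalize_batch(batch, reserved_permalinks):
    """Assign a unique permalink to each path in `batch`.

    `reserved_permalinks` (path -> permalink) is mutated in place so that
    permalinks assigned in batch N are already reserved when batch N+1
    runs, without re-fetching the map from the database between batches.
    """
    taken = set(reserved_permalinks.values())
    for path in batch:
        # Hypothetical derivation: filename without extension, slugified
        base = path.rsplit("/", 1)[-1].rsplit(".", 1)[0].lower().replace(" ", "-")
        permalink, n = base, 1
        while permalink in taken:
            n += 1
            permalink = f"{base}-{n}"
        taken.add(permalink)
        reserved_permalinks[path] = permalink  # in-place mutation is the point


def index_changed_files(batches, existing_permalinks):
    shared = dict(existing_permalinks)  # fetched once per sync, not per batch
    for batch in batches:
        normalize_batch(batch, shared)  # mutations carry into the next batch
    return shared
```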
Checklist Results
| Category | Status | Notes |
|---|---|---|
| Python 3.12+ type annotations | ✅ | slots=True dataclasses, type alias, full annotations |
| Async patterns / SQLAlchemy 2.0 | ✅ | asyncio.gather used consistently now in both relation fetch phases |
| FastAPI/Typer conventions | ✅ | |
| 100-char line length | ✅ | |
| No commented-out code | ✅ | |
| Unit tests for new functions | ✅ | Comprehensive coverage of edge cases added |
| Integration tests | ✅ | Two integration tests in test-int/ covering multi-batch + circuit breaker |
| 100% test coverage | ✅ | Codex P1 search_content fix covered by new test_batch_indexer_uses_parsed_markdown_body_for_malformed_frontmatter_delimiters test |
| MCP tools atomic/composable | N/A | No MCP changes |
| No DB migration needed | ✅ | Config fields only |
| Local-first architecture preserved | ✅ | |
| No hardcoded secrets/credentials | ✅ | |
| Input validation | ✅ | Guards for max_files <= 0, max_bytes <= 0 |
| No sensitive data in logs | ✅ | Large per-entity logger.info blocks removed |
| Windows path compatibility | ✅ | as_posix() used consistently |
| No emojis/unicode in console/log output | ✅ | |
| SyncFatalError propagation | ✅ | Both isinstance(exc, SyncFatalError) and isinstance(exc.__cause__, SyncFatalError) checked |
| Checksum consistency | ✅ | Derived from actual loaded bytes, not scan-time metadata |
This PR is in good shape. The one dead-code handler is a minor cleanup item, not a blocker.
Claude finished @phernandez's task in 2m 24s — PR Review: perf(sync): batch file indexing in core
New Commit:
| Category | Status | Notes |
|---|---|---|
| Python 3.12+ type annotations | ✅ | @staticmethod, clean type guards |
| Async patterns / SQLAlchemy 2.0 | ✅ | Correct execute_query usage |
| 100-char line length | ✅ | Line 599 (if isinstance(self.repository, SearchRepositoryBase) and not self.repository._semantic_enabled) is ~100 chars — borderline, within limit |
| No commented-out code | ✅ | |
| Unit tests for new behavior | ✅ | Single sync, batch mixed, FTS-not-excluded all covered |
| Integration tests | ✅ | Prior integration tests intact |
| 100% test coverage | ❌ | reindex_vectors with an opted-out entity has no test — will show as a gap once opt-out is threaded through |
| embed: false coercion correctness | ✅ | bool, str (normalized), int/float, unknown-default-true |
| Postgres/SQLite parity in _clear_entity_vectors | ✅ | SQLite explicitly deletes vec table; Postgres relies on ON DELETE CASCADE — correct |
| No sensitive data in logs | ✅ | |
| Local-first architecture preserved | ✅ |
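The embed: false coercion checked above can be sketched as follows. This is a hypothetical helper; the exact set of falsy strings is an assumption, but the checklist's "unknown-default-true" rule is preserved.

```python
def coerce_embed_flag(value: object) -> bool:
    """Interpret a frontmatter `embed:` value.

    Assumption: falsy strings are "false"/"no"/"off"/"0" after
    normalization; any unrecognized type defaults to True (embed).
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() not in {"false", "no", "off", "0"}
    if isinstance(value, (int, float)):
        return bool(value)
    return True  # unknown-default-true, per the review's checklist
```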
Overall: One clear bug (reindex_vectors bypassing opt-out) that should be fixed before merge. The sequential clear loop and db.py backfill gap are lower priority. The dead except FileNotFoundError handler is a cleanup carry-over from the previous round.
Claude finished @phernandez's task in 2m 17s — PR Review: perf(sync): batch file indexing in core
New Commit:
| Prior Issue | Resolution |
|---|---|
| 🔴 reindex_vectors calls self.repository.sync_entity_vectors_batch directly, bypassing opt-out | Fixed: changed to self.sync_entity_vectors_batch — routes through service-level filter |
| 🟡 Sequential _clear_entity_vectors loop in sync_entity_vectors_batch | Fixed: replaced with asyncio.gather |
| 🟡 Dead except FileNotFoundError handler in sync_service.py | Fixed: handler removed from _load_index_batch_files |
The fix to reindex_vectors is exactly the one-word change suggested in the prior review. The new test test_reindex_vectors_respects_embed_opt_out correctly validates the delegation: it mocks sync_entity_vectors_batch on the service instance and asserts that reindex_vectors invokes it with all entity IDs (the opt-out filtering is delegated to sync_entity_vectors_batch, which has its own test coverage). The assertion sync_batch.assert_awaited_once_with([41, 42], progress_callback=None) correctly captures the routing.
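The shape of that delegation test can be sketched with a toy service. Everything here is hypothetical scaffolding; the real test mocks the PR's SearchService, but the pattern — patch the service-level method, call the public entry point, assert the awaited arguments — is the same.

```python
import asyncio
from unittest.mock import AsyncMock


class ToySearchService:
    """Minimal stand-in for the service under test (hypothetical shape)."""

    def __init__(self, entity_ids):
        self._entity_ids = entity_ids

    async def find_all_ids(self):
        return self._entity_ids

    async def sync_entity_vectors_batch(self, entity_ids, progress_callback=None):
        raise NotImplementedError  # patched in the test

    async def reindex_vectors(self):
        ids = await self.find_all_ids()
        # Route through the service-level method so opt-out filtering applies
        return await self.sync_entity_vectors_batch(ids, progress_callback=None)


service = ToySearchService([41, 42])
service.sync_entity_vectors_batch = AsyncMock(return_value="ok")
result = asyncio.run(service.reindex_vectors())
# Opt-out filtering is delegated; the test only asserts the routing
service.sync_entity_vectors_batch.assert_awaited_once_with([41, 42], progress_callback=None)
```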
Still-Open Issue
🟡 Startup backfill at db.py:128 still calls the repository directly
The _run_semantic_embedding_backfill function in src/basic_memory/db.py constructs a SearchRepository and calls sync_entity_vectors_batch without going through SearchService:
```python
# src/basic_memory/db.py:123-128
search_repository = repository_class(
    session_maker,
    project_id=project_id,
    app_config=app_config,
)
batch_result = await search_repository.sync_entity_vectors_batch(entity_ids)  # ← bypasses opt-out
```

This is the startup backfill that fires when search_vector_chunks is empty. SearchService is not available at this call site, so opted-out notes will be embedded during the first-run backfill even if they have embed: false. On subsequent syncs, sync_entity_vectors_batch (service-level) would clear those vectors — so the drift is eventually corrected, but there's a window where opted-out notes have stale vectors.
This was flagged as a known limitation in the prior review. The options remain: thread an entity-metadata check into the backfill, or add a # Why: startup backfill has no SearchService; opted-out entities will be corrected on the next sync comment to document the limitation explicitly.
Checklist Results
| Category | Status | Notes |
|---|---|---|
| Python 3.12+ type annotations | ✅ | |
| Async patterns / SQLAlchemy 2.0 | ✅ | asyncio.gather in _clear_entity_vectors loop is correct |
| 100-char line length | ✅ | |
| No commented-out code | ✅ | |
| Unit tests for new behavior | ✅ | test_reindex_vectors_respects_embed_opt_out correctly validates routing via service |
| Integration tests | ✅ | Prior test-int/ coverage intact |
| 100% test coverage | ✅ | |
| embed: false opt-out respected in reindex_vectors | ✅ | Fixed |
| _clear_entity_vectors concurrent | ✅ | Fixed |
| Dead except FileNotFoundError removed | ✅ | Fixed |
| Startup backfill opt-out | ⚠️ | db.py:128 still bypasses SearchService; known limitation |
Overall: All 🔴 issues are closed. This PR is effectively ready to merge. The db.py:128 backfill gap is a pre-existing limitation with a bounded impact (vectors will be corrected on the next sync) — a comment documenting the trade-off is sufficient.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1207e8c664
```python
eligible_entity_ids = [
    entity_id
    for entity_id in entity_ids
    if entity_id in entities_by_id and entity_id not in opted_out_ids
]
```
Preserve cleanup for missing IDs in vector batch sync
sync_entity_vectors_batch() now drops IDs that are not returned by entity_repository.find_by_ids() before delegating to repository.sync_entity_vectors_batch(). This skips the repository’s missing-entity cleanup path (which deletes stale chunk/embedding rows when no search_index rows exist), so a stale/deleted entity ID in the input list will no longer have its vector rows cleaned up. This regression appears when callers pass outdated IDs (for example from stale queues or races), and leaves orphaned vector data accumulating over time.
Claude finished @phernandez's task in 1m 58s — PR Review: perf(sync): batch file indexing in core
New Commit:
| Prior Issue | Status |
|---|---|
| 🟡 db.py:128 startup backfill bypasses SearchService opt-out | ✅ Fixed — entire backfill removed; users directed to bm reindex --embeddings |
One Still-Open Item (Codex P2, from 1207e8c)
sync_entity_vectors_batch() silently drops stale IDs without cleanup
The service-level filter in sync_entity_vectors_batch (search_service.py:471–475) builds eligible_entity_ids from IDs that exist in find_by_ids(). Entity IDs not returned by find_by_ids() (i.e., already deleted from the entity table) are silently dropped — neither opted-out nor delegated to the repository. The repository-level sync_entity_vectors_batch previously handled this by cleaning up orphaned search_index and search_vector_chunks rows for missing entities.
```python
# search_service.py:471-475
eligible_entity_ids = [
    entity_id
    for entity_id in entity_ids
    if entity_id in entities_by_id and entity_id not in opted_out_ids
    # IDs not in entities_by_id are silently discarded — no stale row cleanup
]
```

Practical impact is low given actual call sites:
- reindex_vectors() seeds its IDs from find_all() — only live entities — so stale IDs can't appear there
- BatchIndexer calls sync_entity_vectors_batch only with freshly-upserted entity IDs
However, direct callers of the service method with stale IDs (e.g. from a race condition between delete and reindex) would accumulate orphaned vector rows indefinitely. A defensive fix would pass unknown IDs through to the repository for cleanup:
```python
unknown_ids = [eid for eid in entity_ids if eid not in entities_by_id]
if unknown_ids:
    await self.repository.sync_entity_vectors_batch(unknown_ids)  # cleans up stale rows
```

This is not a blocker given the current call sites, but worth noting for a follow-up.
Checklist Results
| Category | Status | Notes |
|---|---|---|
| Python 3.12+ type annotations | ✅ | |
| Async patterns / SQLAlchemy 2.0 | ✅ | |
| 100-char line length | ✅ | |
| No commented-out code | ✅ | |
| Unit tests for removed behavior | ✅ | 225 lines of now-irrelevant tests correctly deleted |
| Integration tests | ✅ | Prior test-int/ coverage intact |
| 100% test coverage | ✅ | No dead code introduced |
| Startup backfill opt-out bypass | ✅ | Fixed by removing backfill |
| _log_embedding_status still functional | ✅ | Synchronous status check retained |
| run_migrations() structurally correct | ✅ | except block intact after removal |
| No sensitive data in logs | ✅ | |
| Local-first architecture preserved | ✅ |
Overall: All 🔴 issues are resolved. The Codex P2 stale-ID cleanup gap has low practical impact but is worth a follow-up. This PR is ready to merge.
Claude finished @phernandez's task in 2m 58s — PR Review: perf(sync): batch file indexing in core
New Commit:
| Prior Issue | Status |
|---|---|
| Codex P2: sync_entity_vectors_batch() drops stale/deleted IDs without cleanup | ✅ Fixed — unknown IDs now routed to repository.sync_entity_vectors_batch for cleanup |
The fix correctly:
- Collects unknown_ids — IDs passed in that are absent from the find_by_ids() result
- Routes them to self.repository.sync_entity_vectors_batch(unknown_ids) so the repository layer can delete orphaned search_index and search_vector_chunks rows
- Builds a merged VectorSyncBatchResult by summing across both the cleanup batch and the eligible-embedding batch, with opted_out_ids added into entities_skipped
- Includes a decision-point comment (# Trigger / Why / Outcome) that is exactly right
The result merging is also a nice refactor from the prior mutable batch_result.entities_total = ... mutation pattern to an immutable construction — clearer and less fragile.
New Issues Found
🟢 Minor: unknown_ids cleanup and eligible_entity_ids embedding are sequential
Both repository.sync_entity_vectors_batch(unknown_ids) and repository.sync_entity_vectors_batch(eligible_entity_ids, ...) are await-ed in sequence. Since neither depends on the other's result, they could be done concurrently:
```python
tasks = []
if unknown_ids:
    tasks.append(self.repository.sync_entity_vectors_batch(unknown_ids))
if eligible_entity_ids:
    tasks.append(
        self.repository.sync_entity_vectors_batch(
            eligible_entity_ids, progress_callback=progress_callback
        )
    )
results = await asyncio.gather(*tasks)
```

That said, unknown_ids is empty in every real call site (reindex_vectors seeds from find_all(), BatchIndexer uses freshly-upserted IDs), so the practical impact is zero. Not a blocker.
🟢 Minor: entities_skipped double-counts repository skips for unknown IDs
When unknown_ids are routed through repository.sync_entity_vectors_batch, the repository finds no source rows for those IDs and likely reports them as entities_skipped in its result. The merge aggregation adds len(opted_out_ids) + sum(result.entities_skipped ...), which means unknown entities count as both "deleted" and "skipped". This is a subtle accounting edge case with no practical impact given the real call sites — but worth noting if the entities_skipped field is ever surfaced in a user-visible stats page.
All Previously-Open Issues
| Severity | Issue | Status |
|---|---|---|
| 🔴 | progress_callback dead code | ✅ Resolved (49a40be) |
| 🔴 | get_file_path_to_permalink_map() per-batch | ✅ Resolved (49a40be) |
| 🟡 | Sequential fetch in _resolve_batch_relations | ✅ Resolved (49a40be) |
| 🟡 | existing is None post-mutation | ✅ Resolved (49a40be) |
| 🟡 | Missing test-int/ integration coverage | ✅ Resolved (49a40be) |
| 🟡 | Dual identical result types without comment | ✅ Resolved (49a40be) |
| 🔴 | P1 Codex: _run_bounded swallows SyncFatalError | ✅ Resolved (49a40be) |
| 🔴 | P2 Codex: Checksum drift from scan time | ✅ Resolved (49a40be) |
| 🟡 | remove_frontmatter redundant call | ✅ Resolved (49a40be) |
| 🔴 | reindex_vectors bypasses embed: false opt-out | ✅ Resolved (1207e8c) |
| 🟡 | Sequential _clear_entity_vectors loop | ✅ Resolved (1207e8c) |
| 🟡 | Dead except FileNotFoundError handler | ✅ Resolved (1207e8c) |
| 🟡 | Startup backfill bypasses opt-out | ✅ Resolved (fe9fc7e — backfill removed entirely) |
| | Codex P2: sync_entity_vectors_batch drops stale IDs | ✅ Resolved (a83bee9) |
Checklist Results
| Category | Status | Notes |
|---|---|---|
| Python 3.12+ type annotations | ✅ | list[VectorSyncBatchResult] typed correctly |
| Async patterns / SQLAlchemy 2.0 | ✅ | Correct; sequential awaits are fine given real call sites |
| 100-char line length | ✅ | prepare_seconds_total=sum(...) line at 90 chars |
| No commented-out code | ✅ | |
| Unit tests for new behavior | ✅ | test_semantic_vector_sync_batch_cleans_up_unknown_ids validates both paths called, counts correct |
| Integration tests | ✅ | Prior test-int/ coverage intact |
| 100% test coverage | ✅ | Unknown-IDs path now exercised |
| Result merging correctness | ✅ | Immutable construction replaces mutable mutation; cleaner |
| opted_out_ids correctly excluded from eligible_entity_ids | ✅ | |
| progress_callback only forwarded to eligible batch | ✅ | Unknown cleanup call correctly omits it |
| No hardcoded secrets | ✅ | |
| No sensitive data in logs | ✅ |
Overall: All 🔴 issues are closed. No new 🔴 or 🟡 issues introduced. The two 🟢 observations above are non-blocking style/perf nits with zero practical impact at current call sites. This PR is ready to merge.
Claude encountered an error — I'll analyze this and get back to you.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1fee913950
```python
entities_skipped=(
    len(opted_out_ids)
    + sum(result.entities_skipped for result in repository_results)
    - len(unknown_ids)
),
```
Do not subtract unknown IDs from skipped vector count
When entity_ids contains stale/unknown IDs (for example, a sync race where an entity is deleted before embedding), repository.sync_entity_vectors_batch(unknown_ids) performs cleanup and typically reports entities_skipped == 0; subtracting len(unknown_ids) here can therefore make entities_skipped negative. This produces invalid embedding stats in sync/reindex reporting and can mislead operators about what was actually skipped.
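One way to implement the suggestion, sketched: never subtract unknown IDs from the skip total. A hypothetical helper; note it trades the possible double count flagged earlier in the thread for a count that cannot go negative.

```python
def merge_skipped(opted_out_count: int, per_batch_skipped: list[int]) -> int:
    """Merge skip counts across the cleanup and embedding batches.

    Unknown IDs are cleanup work, not skips, so nothing is subtracted
    for them. The old `- len(unknown_ids)` term could push the total
    below zero whenever the cleanup batch reports entities_skipped == 0.
    """
    return opted_out_count + sum(per_batch_skipped)
```

With opted_out_count=0, batches reporting [0, 0], and three unknown IDs, the old formula yields -3 while this yields 0.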
Summary
- SyncService.sync(): batch changed file indexing, emit typed progress, and keep forward-ref/vector phases coarse and explicit
- uv run pytest tests/indexing/test_batching.py tests/indexing/test_batch_indexer.py tests/sync/test_sync_service_batching.py tests/sync/test_sync_service_telemetry.py tests/repository/test_openai_provider.py -q
- uv run ruff check src/basic_memory/indexing src/basic_memory/sync/sync_service.py src/basic_memory/config.py src/basic_memory/repository/openai_provider.py src/basic_memory/repository/embedding_provider_factory.py src/basic_memory/cli/commands/db.py tests/indexing tests/sync/test_sync_service_batching.py tests/sync/test_sync_service_telemetry.py tests/repository/test_openai_provider.py
- uv run pyright
- just fast-check