feat: add lexical indexing of documents chunks #9

programmer-ke wants to merge 5 commits into master from
Conversation
📝 Walkthrough

This PR introduces Whoosh-backed document indexing by adding intermediate-storage abstractions, document chunk pipeline implementations, a Whoosh index adapter, and service-layer protocols. It extends the adapter layer with both filesystem and in-memory implementations, updates the CLI to invoke indexing, and includes comprehensive test coverage with sample fixtures.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI
    participant Pipeline as DocumentChunksPipeline
    participant Index as DocumentIndex
    participant Services
    CLI->>Pipeline: get_document_chunks()
    activate Pipeline
    Pipeline-->>CLI: Iterator[DocumentChunk]
    deactivate Pipeline
    CLI->>Services: index_document_chunks(pipeline, index)
    activate Services
    Services->>Pipeline: get_temp_location()
    activate Pipeline
    Pipeline-->>Services: tmp_path
    deactivate Pipeline
    Services->>Index: fit(chunks, destination=tmp_path)
    activate Index
    Index->>Index: Index chunks into tmp_path
    Index-->>Services: (complete)
    deactivate Index
    Services->>Pipeline: replace_destination(tmp_path)
    activate Pipeline
    Pipeline->>Pipeline: Move tmp_path → destination
    Pipeline-->>Services: (complete)
    deactivate Pipeline
    deactivate Services
```
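The diagram above can be sketched as a small service function. This is a hedged reconstruction from the diagram only: the method names (`get_document_chunks`, `get_temp_location`, `fit`, `replace_destination`) appear in it, but the protocol definitions below are illustrative stand-ins, not the project's actual `services` code.

```python
from contextlib import AbstractContextManager
from pathlib import Path
from typing import Iterable, Iterator, Protocol


class DocumentChunksPipeline(Protocol):
    def get_document_chunks(self) -> Iterator[dict]: ...
    def get_temp_location(self) -> AbstractContextManager[Path]: ...
    def replace_destination(self, temp_location: Path) -> None: ...


class DocumentIndex(Protocol):
    def fit(self, chunks: Iterable[dict], destination: Path) -> None: ...


def index_document_chunks(
    pipeline: DocumentChunksPipeline, index: DocumentIndex
) -> None:
    # Build the index into a temp location first, then swap it into
    # place, so a failed build never clobbers the existing index.
    with pipeline.get_temp_location() as tmp_path:
        index.fit(pipeline.get_document_chunks(), destination=tmp_path)
        pipeline.replace_destination(tmp_path)
```

The swap-on-success shape is what makes the review comments about atomic replacement below matter.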
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
🧹 Nitpick comments (2)
tests/unit/test_services.py (1)
219-241: Add the failure-path coverage for index rebuilds. This only exercises the happy path. `index_document_chunks()` uses the same temp-dir swap pattern as `update_document_artifacts()`, so it should also seed an existing destination, force the fake index to fail, and verify the old index survives while `<dest>.tmp` is cleaned up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_services.py` around lines 219 - 241: Extend test_can_index_documents to also cover the failure path: pre-seed pipeline.sink[dest] with existing items, configure adapters.FakeIndex to raise/return failure when services.index_document_chunks is called, then run index_document_chunks and assert that the original pipeline.sink[dest] items remain (i.e. the old index survives), that any created tmp directory (f"{dest}.tmp") is cleaned up (no sink entries for tmp), and that pipeline.actions reflect the attempted MKDIR and cleanup (e.g., no final MV to dest). Use the existing test helpers FakeDocumentChunksPipeline and FakeIndex, and inspect pipeline.sink and pipeline.actions to verify behavior similar to the update_document_artifacts failure path.

Makefile (1)

1-12: Add a `clean` target for generated indexing artifacts. This PR adds another generated workspace directory (`.whoosh`), and the file is already tripping `checkmake` for a missing `clean` target. A small cleanup target would make local reruns less stateful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 1 - 12: The Makefile is missing a clean target required to remove generated indexing artifacts like the new .whoosh workspace; add a phony clean target (add "clean" to the .PHONY line) and implement a "clean:" recipe that removes .whoosh (e.g. rm -rf .whoosh) and any other transient files you want cleaned (pytest cache, build artifacts) so local reruns are deterministic and checkmake passes.
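A failure-path test along the lines the test_services nitpick above asks for could look like the following sketch. The fakes here are simplified stand-ins, not the repo's `FakeDocumentChunksPipeline`/`FakeIndex` helpers, and `index_document_chunks` is re-stated inline so the example is self-contained.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


class FailingIndex:
    def fit(self, chunks, destination):
        raise RuntimeError("index build failed")


class TempSwapPipeline:
    # Simplified stand-in for the repo's filesystem-backed pipeline.
    def __init__(self, dest: Path):
        self.dest = dest

    def get_document_chunks(self):
        return iter([])

    @contextmanager
    def get_temp_location(self):
        tmp = self.dest.with_suffix(".tmp")
        tmp.mkdir()
        try:
            yield tmp
        finally:
            shutil.rmtree(tmp, ignore_errors=True)

    def replace_destination(self, tmp: Path) -> None:
        shutil.rmtree(self.dest)
        shutil.move(str(tmp), str(self.dest))


def index_document_chunks(pipeline, index):
    with pipeline.get_temp_location() as tmp:
        index.fit(pipeline.get_document_chunks(), destination=tmp)
        pipeline.replace_destination(tmp)


def test_old_index_survives_failed_rebuild(tmp_path):
    dest = tmp_path / "index"
    dest.mkdir()
    (dest / "segment").write_text("old")

    try:
        index_document_chunks(TempSwapPipeline(dest), FailingIndex())
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected the rebuild to fail")

    assert (dest / "segment").read_text() == "old"  # old index survives
    assert not dest.with_suffix(".tmp").exists()    # tmp dir cleaned up
```

The real test would additionally inspect `pipeline.sink` and `pipeline.actions` as the prompt describes; this sketch only demonstrates the survive-and-cleanup invariant.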
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 06fa7516-ef60-445a-a89f-2cf33017b0ec
📒 Files selected for processing (12)
- .gitignore
- Makefile
- pyproject.toml
- src/docs_buddy/adapters/__init__.py
- src/docs_buddy/adapters/whoosh_index.py
- src/docs_buddy/domain/__init__.py
- src/docs_buddy/entrypoints/cli.py
- src/docs_buddy/services/__init__.py
- tests/integration/test_adapters.py
- tests/unit/test_domain.py
- tests/unit/test_frontmatter.py
- tests/unit/test_services.py
```toml
[[tool.mypy.overrides]]
module = ["whoosh.*"]
follow_untyped_imports = true
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
# Direct imports of the root package need `module = ["whoosh", "whoosh.*"]`.
rg -n -C2 '^\s*(from\s+whoosh\s+import|import\s+whoosh\b)' src tests || true
# Show the current override block for comparison.
sed -n '37,43p' pyproject.toml
```

Repository: programmer-ke/docs-buddy
Add `"whoosh"` to the module pattern.

The adapter imports directly from the root whoosh package (`from whoosh import index`), which is not covered by the pattern `module = ["whoosh.*"]`. The override should be `module = ["whoosh", "whoosh.*"]` to properly suppress type errors for both the root package and its subpackages.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 41 - 43, The mypy override currently targets
only subpackages with module = ["whoosh.*"] which misses imports like from
whoosh import index; update the override so the module pattern includes the root
package as well by changing module = ["whoosh.*"] to module = ["whoosh",
"whoosh.*"] so both the root "whoosh" and its subpackages are covered.
```python
@contextmanager
def get_temp_location(self, prefix=""):
    """
    Create a temporary directory for atomic writes.

    Yields:
        Path to temporary directory
    """
    with tempfile.TemporaryDirectory(
        prefix=(prefix or f"{self._destination.name}_")
    ) as temp_dir:
        yield Path(temp_dir)


def replace_destination(self, temp_location: PathLike) -> None:
    """Replaces the destination with the provided temp_location"""

    temp_path = Path(temp_location)
    dest_path = self._destination

    if dest_path.exists():
        shutil.rmtree(dest_path)

    shutil.move(str(temp_path), str(dest_path))
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
sed -n '225,247p' src/docs_buddy/adapters/__init__.py
```

🏁 Script executed:

```shell
# Find all usages of get_temp_location and replace_destination
rg "get_temp_location|replace_destination" --type py -B 2 -A 2
```

🏁 Script executed:

```shell
# Check the initialization of FileSystemIntermediateStorage and _destination
rg "class FileSystemIntermediateStorage|def __init__" src/docs_buddy/adapters/__init__.py -A 10
```

🏁 Script executed:

```shell
# Check if there's any documentation about the design intent
rg "_destination|FileSystemIntermediateStorage" src/docs_buddy/adapters/__init__.py -B 2 -A 2 | head -100
```

🏁 Script executed:

```shell
# Check how destinations are created/initialized in actual usage
rg "FileSystemIntermediateStorage|FileSystemDocsStorage|FileSystemChunkPipeline" --type py -B 3 -A 3 src/docs_buddy/
```

🏁 Script executed:

```shell
# Check if there are any tests that verify cross-filesystem behavior or failure recovery
rg "test.*replace|test.*atomic|test.*cross" --type py -A 5 tests/
```
The destination handoff is still non-atomic and risks a cross-filesystem copy fallback.

Line 233 stages with `TemporaryDirectory()` in the system default temp area, which may reside on a different filesystem from user data directories like `.chunks` or `.whoosh`. When source and destination are on different filesystems, `shutil.move()` falls back to copy+delete instead of an atomic rename. Lines 244-247 then delete the live destination before the move, creating a window where a failed handoff leaves the last good index missing or partially copied.

Ensure staging occurs on the destination filesystem by supplying `dir=` to `TemporaryDirectory()`, and consider a safer swap strategy for the final replacement that doesn't require a delete-then-move sequence.

Minimum fix to keep staging on the destination filesystem:
```diff
 @contextmanager
 def get_temp_location(self, prefix=""):
     """
     Create a temporary directory for atomic writes.
@@
-    with tempfile.TemporaryDirectory(
-        prefix=(prefix or f"{self._destination.name}_")
-    ) as temp_dir:
+    parent = self._destination.parent
+    parent.mkdir(parents=True, exist_ok=True)
+    with tempfile.TemporaryDirectory(
+        dir=parent, prefix=(prefix or f"{self._destination.name}_")
+    ) as temp_dir:
         yield Path(temp_dir)
```

This removes the cross-filesystem copy path, but the final `rmtree()` + `move()` swap still needs a safer handoff strategy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/docs_buddy/adapters/__init__.py` around lines 225 - 247,
get_temp_location currently stages in system temp which can be on a different
filesystem than self._destination; change get_temp_location to create the
TemporaryDirectory on the destination filesystem by passing
dir=self._destination.parent (so prefix still can use self._destination.name).
Update replace_destination to avoid deleting dest before the handoff: instead
move/rename the current dest to a backup name (e.g., dest_backup), then
atomically move/rename the staged temp into the final location using an atomic
rename/replace API (os.replace or Path.replace) so the swap is atomic on the
same filesystem, and finally remove the backup only after the atomic swap
succeeds; reference functions: get_temp_location, replace_destination, and
attribute: self._destination.
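The backup-then-atomic-rename strategy described in the prompt can be sketched as follows. `replace_destination_safely` is an illustrative name, not the adapter's current method, and the sketch assumes `staged` was created on the same filesystem as `destination` (e.g. via `dir=destination.parent`), since `os.replace` is only an atomic rename within one filesystem.

```python
import os
import shutil
from pathlib import Path


def replace_destination_safely(destination: Path, staged: Path) -> None:
    # Move the live destination aside, atomically rename the staged
    # directory into place, and only drop the backup once the swap
    # has succeeded. A failed swap restores the previous destination.
    backup = destination.with_name(destination.name + ".bak")
    if backup.exists():
        shutil.rmtree(backup)
    had_previous = destination.exists()
    if had_previous:
        os.replace(destination, backup)
    try:
        os.replace(staged, destination)
    except OSError:
        if had_previous:
            os.replace(backup, destination)  # restore the old index
        raise
    if had_previous:
        shutil.rmtree(backup)
```

At no point is the last good index deleted before its replacement is in place, which closes the window the comment describes.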
```python
writer.add_document(
    chunk_id=chunk_id,
    content=chunk.chunk,
    path=chunk.path,
    path_keywords=chunk.path.split("/"),
    metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
)
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
sed -n '49,54p' src/docs_buddy/adapters/whoosh_index.py
sed -n '98,101p' src/docs_buddy/services/__init__.py
```

🏁 Script executed:

```shell
#!/bin/bash
# Check for Windows-related TODOs, FIXMEs, or comments in path handling code
rg -i "windows|backslash|native.*separator|platform.*specific" --type py src/docs_buddy/
# Check if there's any documentation about path format expectations
find . -name "*.md" -o -name "*.txt" -o -name "*.rst" | head -20 | xargs grep -l -i "path\|windows\|separator" 2>/dev/null | head -5
# Check the PathLib usage pattern to understand intended path format
rg "pathlib|Path\(" src/docs_buddy/ -A 2 -B 2 | head -50
```

🏁 Script executed:

```shell
#!/bin/bash
# Check what format chunk.path actually has by examining its source
rg "chunk\.path\s*=" src/docs_buddy/ -B 5 -A 2
# Check FileSystemDocsStorage to understand how paths are generated
rg "class FileSystemDocsStorage|get_source_paths" src/docs_buddy/services/__init__.py -A 20 | head -60
```

🏁 Script executed:

```shell
#!/bin/bash
# Check the Whoosh field definition for path_keywords
rg "path_keywords|KEYWORD|schema" src/docs_buddy/adapters/whoosh_index.py -B 5 -A 5
```

🌐 Web query: Whoosh KEYWORD field list input analyzer lowercase behavior

💡 Result: In Whoosh:

- `KEYWORD(...)` does not lowercase by default (`lowercase=False`). It only lowercases at index time if you set `KEYWORD(lowercase=True)` or supply an analyzer that lowercases. The docs explicitly say: "To automatically lowercase the keywords before indexing them, use `lowercase=True`." [1]
- `KEYWORD` normally tokenizes a single string into keywords, split on spaces by default, or on commas if you set `commas=True` (useful when keywords may contain spaces). [1]
- Important gotcha for "list input": if you pass a list/tuple of unicode strings to `writer.add_document(field=[...])`, Whoosh bypasses the field's analyzer for those strings, so any analyzer-driven behavior (including lowercasing) won't run. You must normalize (e.g., `.lower()`) yourself before indexing if you use list/tuple input. [2]
- The `KEYWORD` field type's constructor defaults are `lowercase=False, commas=False`, and its analyzer behavior can be overridden via the `analyzer=` parameter. [3]
- The built-in `KeywordAnalyzer` also defaults to `lowercase=False` (and has a `commas` option), matching the field defaults. [4]

Citations:
- 1: https://whoosh.readthedocs.io/en/latest/schema.html
- 2: https://whoosh.readthedocs.io/en/latest/api/writing.html
- 3: https://whoosh.readthedocs.io/en/latest/api/fields.html
- 4: https://whoosh.readthedocs.io/en/latest/api/analysis.html

🏁 Script executed:

```shell
#!/bin/bash
# Check if there are any tests or usage examples showing expected behavior
find . -name "*test*" -type f -name "*.py" | xargs grep -l "path_keywords\|KEYWORD" 2>/dev/null | head -3
# Check if there are comments or documentation about path handling
rg "path_keywords|chunk\.path" src/docs_buddy/ -B 3 -A 3
```
Lowercase path_keywords list items to match schema intent.

Line 53 passes a list to Whoosh's `KEYWORD(lowercase=True)` field. Per the Whoosh documentation, list input bypasses the field's analyzer, so the `lowercase=True` setting doesn't apply: path keywords won't be lowercased during indexing, making lookups case-sensitive. Lowercase each path component before passing it to `add_document()`.
🛠️ Proposed fix

```diff
 writer.add_document(
     chunk_id=chunk_id,
     content=chunk.chunk,
     path=chunk.path,
-    path_keywords=chunk.path.split("/"),
+    path_keywords=[part.lower() for part in chunk.path.split("/") if part],
     metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
 )
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
writer.add_document(
    chunk_id=chunk_id,
    content=chunk.chunk,
    path=chunk.path,
    path_keywords=[part.lower() for part in chunk.path.split("/") if part],
    metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/docs_buddy/adapters/whoosh_index.py` around lines 49 - 54, The path
components in writer.add_document(...) are passed as chunk.path.split("/") into
path_keywords, which bypasses Whoosh's analyzer and leaves them case-sensitive;
transform each component to lowercase before passing to add_document (e.g., map
chunk.path.split("/") to a list of component.lower() values) so path_keywords
matches the KEYWORD(lowercase=True) schema intent; update the call that sets
path_keywords in the writer.add_document usage referencing chunk.path and
path_keywords accordingly.
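The normalization the comment asks for can also be isolated in a small helper, which keeps the `add_document()` call tidy. The name `split_path_keywords` is illustrative, not from the repo:

```python
def split_path_keywords(path: str) -> list[str]:
    # Lowercase each component ourselves: Whoosh skips the field's
    # analyzer (and its lowercase=True) when given a list of strings.
    # Empty parts from leading or doubled slashes are dropped.
    return [part.lower() for part in path.split("/") if part]


print(split_path_keywords("Docs/Community/Open-Source.md"))
# ['docs', 'community', 'open-source.md']
```

Queries can then lowercase the user's path terms the same way, keeping lookups case-insensitive end to end.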
```python
def test_frontmatter_extractor() -> None:
    metadata, content = adapters.frontmatter_metadata_extractor(SAMPLE_DOC_1)

    assert all(
        [t in metadata for t in ["title", "description", "centeredHeader", "pubDate"]]
    )
    assert metadata["title"] == "Open Source Community"
    assert metadata["pubDate"] == "2020-01-19"
    assert SAMPLE_DOC_1.strip().endswith(content)
```
Make the content assertion exact.

`SAMPLE_DOC_1.strip().endswith(content)` is too weak here: it still passes for `content == ""` and for outputs that leave the frontmatter attached. Assert the extracted markdown body directly so this test proves the parser stripped the header.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/test_frontmatter.py` around lines 16 - 24, The existing test uses
SAMPLE_DOC_1.strip().endswith(content) which is too weak; change
test_frontmatter_extractor to assert that content exactly equals the expected
markdown body (i.e., SAMPLE_DOC_1 with the YAML frontmatter removed and trimmed)
rather than using endswith. Locate the test function and
adapters.frontmatter_metadata_extractor call, compute or inline the
expected_body (the original SAMPLE_DOC_1 content after removing the frontmatter
block and any surrounding whitespace), then replace the endswith assertion with
assert content == expected_body (or assert content.strip() ==
expected_body.strip()) so the test verifies the header is actually stripped.
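To see why the exact comparison is stronger, here is a self-contained sketch; the inline sample and `strip_frontmatter` are stand-ins for the repo's SAMPLE_DOC_1 fixture and extractor, not the real implementations:

```python
SAMPLE = """---
title: Open Source Community
pubDate: 2020-01-19
---

# Open Source Community

Some body text.
"""


def strip_frontmatter(doc: str) -> str:
    # Minimal stand-in for the project's frontmatter extractor:
    # drop everything between the opening and closing "---" fences.
    if doc.startswith("---"):
        _, _, rest = doc.partition("---")
        _, _, body = rest.partition("---")
        return body.strip()
    return doc.strip()


content = strip_frontmatter(SAMPLE)

# Weak: endswith passes even for an empty extraction result.
assert SAMPLE.strip().endswith("")

# Exact: proves the header was actually removed.
expected_body = "# Open Source Community\n\nSome body text."
assert content == expected_body
```

An extractor that returned the whole document (frontmatter included) would also satisfy the `endswith` check against itself in spirit, but only the equality check pins the body down.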
As a prerequisite for search, we need to implement indexing of the chunks. This PR adds lexical indexing with Whoosh.

Summary by CodeRabbit

- New Features
- Dependencies
- Tests