feat: add lexical indexing of documents chunks #9

programmer-ke wants to merge 5 commits into master from
Conversation
📝 Walkthrough

This PR introduces Whoosh-backed document indexing by adding intermediate-storage abstractions, document chunk pipeline implementations, a Whoosh index adapter, and service-layer protocols. It extends the adapter layer with both filesystem and in-memory implementations, updates the CLI to invoke indexing, and includes comprehensive test coverage with sample fixtures.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI
    participant Pipeline as DocumentChunksPipeline
    participant Index as DocumentIndex
    participant Services
    CLI->>Pipeline: get_document_chunks()
    activate Pipeline
    Pipeline-->>CLI: Iterator[DocumentChunk]
    deactivate Pipeline
    CLI->>Services: index_document_chunks(pipeline, index)
    activate Services
    Services->>Pipeline: get_temp_location()
    activate Pipeline
    Pipeline-->>Services: tmp_path
    deactivate Pipeline
    Services->>Index: fit(chunks, destination=tmp_path)
    activate Index
    Index->>Index: Index chunks into tmp_path
    Index-->>Services: (complete)
    deactivate Index
    Services->>Pipeline: replace_destination(tmp_path)
    activate Pipeline
    Pipeline->>Pipeline: Move tmp_path → destination
    Pipeline-->>Services: (complete)
    deactivate Pipeline
    deactivate Services
```
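The diagram above can be sketched as a small service function. This is a hedged reconstruction from the diagram only: the method names (`get_document_chunks`, `get_temp_location`, `fit`, `replace_destination`) appear in it, but the protocol definitions below are illustrative stand-ins, not the project's actual `services` code.

```python
from contextlib import AbstractContextManager
from pathlib import Path
from typing import Iterable, Iterator, Protocol


class DocumentChunksPipeline(Protocol):
    def get_document_chunks(self) -> Iterator[dict]: ...
    def get_temp_location(self) -> AbstractContextManager[Path]: ...
    def replace_destination(self, temp_location: Path) -> None: ...


class DocumentIndex(Protocol):
    def fit(self, chunks: Iterable[dict], destination: Path) -> None: ...


def index_document_chunks(
    pipeline: DocumentChunksPipeline, index: DocumentIndex
) -> None:
    # Build the index into a temp location first, then swap it into
    # place, so a failed build never clobbers the existing index.
    with pipeline.get_temp_location() as tmp_path:
        index.fit(pipeline.get_document_chunks(), destination=tmp_path)
        pipeline.replace_destination(tmp_path)
```

The swap-on-success shape is what makes the review comments about atomic replacement below matter.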
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
🧹 Nitpick comments (2)
tests/unit/test_services.py (1)
219-241: Add the failure-path coverage for index rebuilds. This only exercises the happy path. `index_document_chunks()` uses the same temp-dir swap pattern as `update_document_artifacts()`, so it should also seed an existing destination, force the fake index to fail, and verify the old index survives while `<dest>.tmp` is cleaned up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_services.py` around lines 219 - 241: Extend test_can_index_documents to also cover the failure path: pre-seed pipeline.sink[dest] with existing items, configure adapters.FakeIndex to raise/return failure when services.index_document_chunks is called, then run index_document_chunks and assert that the original pipeline.sink[dest] items remain (i.e. the old index survives), that any created tmp directory (f"{dest}.tmp") is cleaned up (no sink entries for tmp), and that pipeline.actions reflect the attempted MKDIR and cleanup (e.g., no final MV to dest). Use the existing test helpers FakeDocumentChunksPipeline and FakeIndex, and inspect pipeline.sink and pipeline.actions to verify behavior similar to the update_document_artifacts failure path.

Makefile (1)

1-12: Add a `clean` target for generated indexing artifacts. This PR adds another generated workspace directory (`.whoosh`), and the file is already tripping `checkmake` for a missing `clean` target. A small cleanup target would make local reruns less stateful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 1 - 12: The Makefile is missing a clean target required to remove generated indexing artifacts like the new .whoosh workspace; add a phony clean target (add "clean" to the .PHONY line) and implement a "clean:" recipe that removes .whoosh (e.g. rm -rf .whoosh) and any other transient files you want cleaned (pytest cache, build artifacts) so local reruns are deterministic and checkmake passes.
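A failure-path test along the lines the test_services nitpick above asks for could look like the following sketch. The fakes here are simplified stand-ins, not the repo's `FakeDocumentChunksPipeline`/`FakeIndex` helpers, and `index_document_chunks` is re-stated inline so the example is self-contained.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


class FailingIndex:
    def fit(self, chunks, destination):
        raise RuntimeError("index build failed")


class TempSwapPipeline:
    # Simplified stand-in for the repo's filesystem-backed pipeline.
    def __init__(self, dest: Path):
        self.dest = dest

    def get_document_chunks(self):
        return iter([])

    @contextmanager
    def get_temp_location(self):
        tmp = self.dest.with_suffix(".tmp")
        tmp.mkdir()
        try:
            yield tmp
        finally:
            shutil.rmtree(tmp, ignore_errors=True)

    def replace_destination(self, tmp: Path) -> None:
        shutil.rmtree(self.dest)
        shutil.move(str(tmp), str(self.dest))


def index_document_chunks(pipeline, index):
    with pipeline.get_temp_location() as tmp:
        index.fit(pipeline.get_document_chunks(), destination=tmp)
        pipeline.replace_destination(tmp)


def test_old_index_survives_failed_rebuild(tmp_path):
    dest = tmp_path / "index"
    dest.mkdir()
    (dest / "segment").write_text("old")

    try:
        index_document_chunks(TempSwapPipeline(dest), FailingIndex())
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected the rebuild to fail")

    assert (dest / "segment").read_text() == "old"  # old index survives
    assert not dest.with_suffix(".tmp").exists()    # tmp dir cleaned up
```

The real test would additionally inspect `pipeline.sink` and `pipeline.actions` as the prompt describes; this sketch only demonstrates the survive-and-cleanup invariant.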
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 06fa7516-ef60-445a-a89f-2cf33017b0ec
📒 Files selected for processing (12)
- .gitignore
- Makefile
- pyproject.toml
- src/docs_buddy/adapters/__init__.py
- src/docs_buddy/adapters/whoosh_index.py
- src/docs_buddy/domain/__init__.py
- src/docs_buddy/entrypoints/cli.py
- src/docs_buddy/services/__init__.py
- tests/integration/test_adapters.py
- tests/unit/test_domain.py
- tests/unit/test_frontmatter.py
- tests/unit/test_services.py
```toml
[[tool.mypy.overrides]]
module = ["whoosh.*"]
follow_untyped_imports = true
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
# Direct imports of the root package need `module = ["whoosh", "whoosh.*"]`.
rg -n -C2 '^\s*(from\s+whoosh\s+import|import\s+whoosh\b)' src tests || true
# Show the current override block for comparison.
sed -n '37,43p' pyproject.toml
```

Repository: programmer-ke/docs-buddy
Add `"whoosh"` to the module pattern.

The adapter imports directly from the root whoosh package (`from whoosh import index`), which is not covered by the pattern `module = ["whoosh.*"]`. The override should be `module = ["whoosh", "whoosh.*"]` to properly suppress type errors for both the root package and its subpackages.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 41 - 43, The mypy override currently targets
only subpackages with module = ["whoosh.*"] which misses imports like from
whoosh import index; update the override so the module pattern includes the root
package as well by changing module = ["whoosh.*"] to module = ["whoosh",
"whoosh.*"] so both the root "whoosh" and its subpackages are covered.
```python
@contextmanager
def get_temp_location(self, prefix=""):
    """
    Create a temporary directory for atomic writes.

    Yields:
        Path to temporary directory
    """
    with tempfile.TemporaryDirectory(
        prefix=(prefix or f"{self._destination.name}_")
    ) as temp_dir:
        yield Path(temp_dir)


def replace_destination(self, temp_location: PathLike) -> None:
    """Replaces the destination with the provided temp_location"""

    temp_path = Path(temp_location)
    dest_path = self._destination

    if dest_path.exists():
        shutil.rmtree(dest_path)

    shutil.move(str(temp_path), str(dest_path))
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
sed -n '225,247p' src/docs_buddy/adapters/__init__.py
```

🏁 Script executed:

```shell
# Find all usages of get_temp_location and replace_destination
rg "get_temp_location|replace_destination" --type py -B 2 -A 2
```

🏁 Script executed:

```shell
# Check the initialization of FileSystemIntermediateStorage and _destination
rg "class FileSystemIntermediateStorage|def __init__" src/docs_buddy/adapters/__init__.py -A 10
```

🏁 Script executed:

```shell
# Check if there's any documentation about the design intent
rg "_destination|FileSystemIntermediateStorage" src/docs_buddy/adapters/__init__.py -B 2 -A 2 | head -100
```

🏁 Script executed:

```shell
# Check how destinations are created/initialized in actual usage
rg "FileSystemIntermediateStorage|FileSystemDocsStorage|FileSystemChunkPipeline" --type py -B 3 -A 3 src/docs_buddy/
```

🏁 Script executed:

```shell
# Check if there are any tests that verify cross-filesystem behavior or failure recovery
rg "test.*replace|test.*atomic|test.*cross" --type py -A 5 tests/
```
The destination handoff is still non-atomic and risks a cross-filesystem copy fallback.

Line 233 stages with `TemporaryDirectory()` in the system default temp area, which may reside on a different filesystem from user data directories like `.chunks` or `.whoosh`. When source and destination are on different filesystems, `shutil.move()` falls back to copy+delete instead of an atomic rename. Lines 244-247 then delete the live destination before the move, creating a window where a failed handoff leaves the last good index missing or partially copied.

Ensure staging occurs on the destination filesystem by supplying `dir=` to `TemporaryDirectory()`, and consider a safer swap strategy for the final replacement that doesn't require a delete-then-move sequence.

Minimum fix to keep staging on the destination filesystem:
```diff
 @contextmanager
 def get_temp_location(self, prefix=""):
     """
     Create a temporary directory for atomic writes.
@@
-    with tempfile.TemporaryDirectory(
-        prefix=(prefix or f"{self._destination.name}_")
-    ) as temp_dir:
+    parent = self._destination.parent
+    parent.mkdir(parents=True, exist_ok=True)
+    with tempfile.TemporaryDirectory(
+        dir=parent, prefix=(prefix or f"{self._destination.name}_")
+    ) as temp_dir:
         yield Path(temp_dir)
```

This removes the cross-filesystem copy path, but the final `rmtree()` + `move()` swap still needs a safer handoff strategy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/docs_buddy/adapters/__init__.py` around lines 225 - 247,
get_temp_location currently stages in system temp which can be on a different
filesystem than self._destination; change get_temp_location to create the
TemporaryDirectory on the destination filesystem by passing
dir=self._destination.parent (so prefix still can use self._destination.name).
Update replace_destination to avoid deleting dest before the handoff: instead
move/rename the current dest to a backup name (e.g., dest_backup), then
atomically move/rename the staged temp into the final location using an atomic
rename/replace API (os.replace or Path.replace) so the swap is atomic on the
same filesystem, and finally remove the backup only after the atomic swap
succeeds; reference functions: get_temp_location, replace_destination, and
attribute: self._destination.
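The backup-then-atomic-rename strategy described in the prompt can be sketched as follows. `replace_destination_safely` is an illustrative name, not the adapter's current method, and the sketch assumes `staged` was created on the same filesystem as `destination` (e.g. via `dir=destination.parent`), since `os.replace` is only an atomic rename within one filesystem.

```python
import os
import shutil
from pathlib import Path


def replace_destination_safely(destination: Path, staged: Path) -> None:
    # Move the live destination aside, atomically rename the staged
    # directory into place, and only drop the backup once the swap
    # has succeeded. A failed swap restores the previous destination.
    backup = destination.with_name(destination.name + ".bak")
    if backup.exists():
        shutil.rmtree(backup)
    had_previous = destination.exists()
    if had_previous:
        os.replace(destination, backup)
    try:
        os.replace(staged, destination)
    except OSError:
        if had_previous:
            os.replace(backup, destination)  # restore the old index
        raise
    if had_previous:
        shutil.rmtree(backup)
```

At no point is the last good index deleted before its replacement is in place, which closes the window the comment describes.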
```python
writer.add_document(
    chunk_id=chunk_id,
    content=chunk.chunk,
    path=chunk.path,
    path_keywords=chunk.path.split("/"),
    metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
)
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
sed -n '49,54p' src/docs_buddy/adapters/whoosh_index.py
sed -n '98,101p' src/docs_buddy/services/__init__.py
```

🏁 Script executed:

```shell
#!/bin/bash
# Check for Windows-related TODOs, FIXMEs, or comments in path handling code
rg -i "windows|backslash|native.*separator|platform.*specific" --type py src/docs_buddy/
# Check if there's any documentation about path format expectations
find . -name "*.md" -o -name "*.txt" -o -name "*.rst" | head -20 | xargs grep -l -i "path\|windows\|separator" 2>/dev/null | head -5
# Check the PathLib usage pattern to understand intended path format
rg "pathlib|Path\(" src/docs_buddy/ -A 2 -B 2 | head -50
```

🏁 Script executed:

```shell
#!/bin/bash
# Check what format chunk.path actually has by examining its source
rg "chunk\.path\s*=" src/docs_buddy/ -B 5 -A 2
# Check FileSystemDocsStorage to understand how paths are generated
rg "class FileSystemDocsStorage|get_source_paths" src/docs_buddy/services/__init__.py -A 20 | head -60
```

🏁 Script executed:

```shell
#!/bin/bash
# Check the Whoosh field definition for path_keywords
rg "path_keywords|KEYWORD|schema" src/docs_buddy/adapters/whoosh_index.py -B 5 -A 5
```

🌐 Web query: Whoosh KEYWORD field list input analyzer lowercase behavior

💡 Result: In Whoosh:

- `KEYWORD(...)` does not lowercase by default (`lowercase=False`). It only lowercases at index time if you set `KEYWORD(lowercase=True)` or supply an analyzer that lowercases. The docs explicitly say: "To automatically lowercase the keywords before indexing them, use `lowercase=True`." [1]
- `KEYWORD` normally tokenizes a single string into keywords, split on spaces by default, or on commas if you set `commas=True` (useful when keywords may contain spaces). [1]
- Important gotcha for "list input": if you pass a list/tuple of unicode strings to `writer.add_document(field=[...])`, Whoosh bypasses the field's analyzer for those strings, so any analyzer-driven behavior (including lowercasing) won't run. You must normalize (e.g., `.lower()`) yourself before indexing if you use list/tuple input. [2]
- The `KEYWORD` field type's constructor defaults are `lowercase=False, commas=False`, and its analyzer behavior can be overridden via the `analyzer=` parameter. [3]
- The built-in `KeywordAnalyzer` also defaults to `lowercase=False` (and has a `commas` option), matching the field defaults. [4]

Citations:
- 1: https://whoosh.readthedocs.io/en/latest/schema.html
- 2: https://whoosh.readthedocs.io/en/latest/api/writing.html
- 3: https://whoosh.readthedocs.io/en/latest/api/fields.html
- 4: https://whoosh.readthedocs.io/en/latest/api/analysis.html

🏁 Script executed:

```shell
#!/bin/bash
# Check if there are any tests or usage examples showing expected behavior
find . -name "*test*" -type f -name "*.py" | xargs grep -l "path_keywords\|KEYWORD" 2>/dev/null | head -3
# Check if there are comments or documentation about path handling
rg "path_keywords|chunk\.path" src/docs_buddy/ -B 3 -A 3
```
Lowercase path_keywords list items to match schema intent.

Line 53 passes a list to Whoosh's `KEYWORD(lowercase=True)` field. Per the Whoosh documentation, list input bypasses the field's analyzer, so the `lowercase=True` setting doesn't apply: path keywords won't be lowercased during indexing, making lookups case-sensitive. Lowercase each path component before passing it to `add_document()`.
🛠️ Proposed fix

```diff
 writer.add_document(
     chunk_id=chunk_id,
     content=chunk.chunk,
     path=chunk.path,
-    path_keywords=chunk.path.split("/"),
+    path_keywords=[part.lower() for part in chunk.path.split("/") if part],
     metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
 )
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
writer.add_document(
    chunk_id=chunk_id,
    content=chunk.chunk,
    path=chunk.path,
    path_keywords=[part.lower() for part in chunk.path.split("/") if part],
    metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/docs_buddy/adapters/whoosh_index.py` around lines 49 - 54, The path
components in writer.add_document(...) are passed as chunk.path.split("/") into
path_keywords, which bypasses Whoosh's analyzer and leaves them case-sensitive;
transform each component to lowercase before passing to add_document (e.g., map
chunk.path.split("/") to a list of component.lower() values) so path_keywords
matches the KEYWORD(lowercase=True) schema intent; update the call that sets
path_keywords in the writer.add_document usage referencing chunk.path and
path_keywords accordingly.
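The normalization the comment asks for can also be isolated in a small helper, which keeps the `add_document()` call tidy. The name `split_path_keywords` is illustrative, not from the repo:

```python
def split_path_keywords(path: str) -> list[str]:
    # Lowercase each component ourselves: Whoosh skips the field's
    # analyzer (and its lowercase=True) when given a list of strings.
    # Empty parts from leading or doubled slashes are dropped.
    return [part.lower() for part in path.split("/") if part]


print(split_path_keywords("Docs/Community/Open-Source.md"))
# ['docs', 'community', 'open-source.md']
```

Queries can then lowercase the user's path terms the same way, keeping lookups case-insensitive end to end.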
```python
def test_frontmatter_extractor() -> None:
    metadata, content = adapters.frontmatter_metadata_extractor(SAMPLE_DOC_1)

    assert all(
        [t in metadata for t in ["title", "description", "centeredHeader", "pubDate"]]
    )
    assert metadata["title"] == "Open Source Community"
    assert metadata["pubDate"] == "2020-01-19"
    assert SAMPLE_DOC_1.strip().endswith(content)
```
Make the content assertion exact.

`SAMPLE_DOC_1.strip().endswith(content)` is too weak here: it still passes for `content == ""` and for outputs that leave the frontmatter attached. Assert the extracted markdown body directly so this test proves the parser stripped the header.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/test_frontmatter.py` around lines 16 - 24, The existing test uses
SAMPLE_DOC_1.strip().endswith(content) which is too weak; change
test_frontmatter_extractor to assert that content exactly equals the expected
markdown body (i.e., SAMPLE_DOC_1 with the YAML frontmatter removed and trimmed)
rather than using endswith. Locate the test function and
adapters.frontmatter_metadata_extractor call, compute or inline the
expected_body (the original SAMPLE_DOC_1 content after removing the frontmatter
block and any surrounding whitespace), then replace the endswith assertion with
assert content == expected_body (or assert content.strip() ==
expected_body.strip()) so the test verifies the header is actually stripped.
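To see why the exact comparison is stronger, here is a self-contained sketch; the inline sample and `strip_frontmatter` are stand-ins for the repo's SAMPLE_DOC_1 fixture and extractor, not the real implementations:

```python
SAMPLE = """---
title: Open Source Community
pubDate: 2020-01-19
---

# Open Source Community

Some body text.
"""


def strip_frontmatter(doc: str) -> str:
    # Minimal stand-in for the project's frontmatter extractor:
    # drop everything between the opening and closing "---" fences.
    if doc.startswith("---"):
        _, _, rest = doc.partition("---")
        _, _, body = rest.partition("---")
        return body.strip()
    return doc.strip()


content = strip_frontmatter(SAMPLE)

# Weak: endswith passes even for an empty extraction result.
assert SAMPLE.strip().endswith("")

# Exact: proves the header was actually removed.
expected_body = "# Open Source Community\n\nSome body text."
assert content == expected_body
```

An extractor that returned the whole document (frontmatter included) would also satisfy the `endswith` check against itself in spirit, but only the equality check pins the body down.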
As a prerequisite for search, we need to implement indexing of the chunks. This PR adds lexical indexing with Whoosh.

Summary by CodeRabbit

- New Features
- Dependencies
- Tests