
feat: add lexical indexing of documents chunks#9

Open
programmer-ke wants to merge 5 commits into master from feat/lexical-indexing

Conversation

@programmer-ke
Owner

@programmer-ke programmer-ke commented Apr 8, 2026

As a prerequisite for search, we need to index the document chunks.

This PR adds lexical indexing with Whoosh.

Summary by CodeRabbit

  • New Features

    • Added document indexing capability for efficient search support.
    • Added verbose test mode to display detailed test output.
  • Dependencies

    • Added Whoosh-Reloaded package for indexing functionality.
  • Tests

    • Expanded test coverage with integration tests for the indexing pipeline and adapter components.

@coderabbitai

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Walkthrough

This PR introduces Whoosh-backed document indexing by adding intermediate-storage abstractions, document chunk pipeline implementations, a Whoosh index adapter, and service-layer protocols. It extends the adapter layer with both filesystem and in-memory implementations, updates the CLI to invoke indexing, and includes comprehensive test coverage with sample fixtures.

Changes

Cohort / File(s) Summary
Configuration & Build
.gitignore, Makefile, pyproject.toml
Added .whoosh to .gitignore, introduced test_verbose Makefile target for unbuffered pytest output, and declared Whoosh-Reloaded>=2.7.5 dependency with MyPy overrides for whoosh module namespace.
Core Adapters & Storage
src/docs_buddy/adapters/__init__.py
Added FakeIntermediateStorage and FileSystemIntermediateStorage abstractions for temp-location and atomic-replace operations; introduced FakeDocumentChunksPipeline and FileSystemDocumentChunksPipeline for chunk sourcing and discovery; refactored FakeDocsStorage to route writes through intermediate storage; added FakeIndex with fit capability; extracted _read_file to module level; renamed samples to private and added chunk samples.
Whoosh Indexing
src/docs_buddy/adapters/whoosh_index.py
New module introducing WhooshDocumentIndex with Whoosh schema (chunk_id, content, path, path_keywords, metadata) and fit() method to index DocumentChunk iterables with deterministic chunk IDs and JSON-serialized metadata.
Domain Layer
src/docs_buddy/domain/__init__.py
Added DocumentChunk.fromstring() classmethod to parse JSON strings into DocumentChunk instances.
Service Protocols
src/docs_buddy/services/__init__.py
Extracted SupportsIntermediateStorage base protocol; refactored DocsArtifactStorage to inherit from it; added DocumentIndex and DocumentChunksPipeline protocols; implemented index_document_chunks() function orchestrating chunk retrieval, indexing via temp-location, and destination replacement.
CLI Integration
src/docs_buddy/entrypoints/cli.py
Updated CLI to instantiate FileSystemDocumentChunksPipeline and WhooshDocumentIndex, then invoke services.index_document_chunks() after artifact updates.
Integration Tests
tests/integration/test_adapters.py
Replaced frontmatter unit test with comprehensive adapter suite: FileSystemIntermediateStorage temp/replace semantics, Whoosh index creation and querying, and FileSystemDocumentChunksPipeline chunk discovery with large sample chunk fixtures.
Unit Tests
tests/unit/test_domain.py, tests/unit/test_frontmatter.py, tests/unit/test_services.py
Added DocumentChunk.fromstring() and serialization tests; extracted frontmatter extractor test into dedicated module; added index_document_chunks() behavior validation using fake implementations.
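The DocumentChunk.fromstring() classmethod mentioned in the Domain Layer row could look roughly like this. The field names (path, chunk, metadata) are inferred from the Whoosh adapter summary and are assumptions, not the repo's actual code:

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DocumentChunk:
    """Minimal stand-in for the domain model described in this PR."""

    path: str
    chunk: str
    metadata: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def fromstring(cls, raw: str) -> "DocumentChunk":
        """Parse a JSON string into a DocumentChunk instance."""
        data = json.loads(raw)
        return cls(
            path=data["path"],
            chunk=data["chunk"],
            metadata=data.get("metadata", {}),
        )


chunk = DocumentChunk.fromstring(
    '{"path": "docs/intro.md", "chunk": "Hello world", "metadata": {"title": "Intro"}}'
)
```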

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant Pipeline as DocumentChunksPipeline
    participant Index as DocumentIndex
    participant Services
    
    CLI->>Pipeline: get_document_chunks()
    activate Pipeline
    Pipeline-->>CLI: Iterator[DocumentChunk]
    deactivate Pipeline
    
    CLI->>Services: index_document_chunks(pipeline, index)
    activate Services
    Services->>Pipeline: get_temp_location()
    activate Pipeline
    Pipeline-->>Services: tmp_path
    deactivate Pipeline
    
    Services->>Index: fit(chunks, destination=tmp_path)
    activate Index
    Index->>Index: Index chunks into tmp_path
    Index-->>Services: (complete)
    deactivate Index
    
    Services->>Pipeline: replace_destination(tmp_path)
    activate Pipeline
    Pipeline->>Pipeline: Move tmp_path → destination
    Pipeline-->>Services: (complete)
    deactivate Pipeline
    deactivate Services
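The sequence above maps onto a small orchestration function plus fakes for testing. A sketch under the assumption that the protocols look roughly as the walkthrough describes (the exact signatures are guesses):

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Iterable, Iterator, Protocol


class DocumentChunksPipeline(Protocol):
    def get_document_chunks(self) -> Iterator[str]: ...
    def get_temp_location(self): ...  # context manager yielding a temp Path
    def replace_destination(self, temp_location: Path) -> None: ...


class DocumentIndex(Protocol):
    def fit(self, chunks: Iterable[str], destination: Path) -> None: ...


def index_document_chunks(pipeline: DocumentChunksPipeline, index: DocumentIndex) -> None:
    """Build the index into a temp location, then swap it into place."""
    chunks = pipeline.get_document_chunks()
    with pipeline.get_temp_location() as tmp_path:
        index.fit(chunks, destination=tmp_path)
        pipeline.replace_destination(tmp_path)


class FakePipeline:
    """In-memory fake mirroring the FakeDocumentChunksPipeline idea."""

    def __init__(self) -> None:
        self.replaced_with: Path | None = None

    def get_document_chunks(self) -> Iterator[str]:
        return iter(["chunk-a", "chunk-b"])

    @contextmanager
    def get_temp_location(self):
        yield Path("/tmp/index.tmp")

    def replace_destination(self, temp_location: Path) -> None:
        self.replaced_with = temp_location


class FakeIndex:
    def __init__(self) -> None:
        self.fitted: list[str] = []

    def fit(self, chunks: Iterable[str], destination: Path) -> None:
        self.fitted = list(chunks)


pipeline = FakePipeline()
index = FakeIndex()
index_document_chunks(pipeline, index)
```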

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #7: Both modify document chunking/indexing pipeline architecture, introduce domain-level methods like DocumentChunk.fromstring(), and update CLI components to orchestrate the artifact-to-chunks flow.
  • PR #6: Both refactor src/docs_buddy/adapters/__init__.py with adapter implementations and storage abstractions; this PR evolves the adapter layer further with intermediate-storage patterns.

Poem

🐰 Whoosh goes the index, chunks in the burrow,
Temp dirs and moves, no paths to sorrow,
Pipelines and storage, abstractions so neat,
A rabbit's refactor—indexing complete! 🌾✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 37.31%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'feat: add lexical indexing of documents chunks' clearly describes the main change: introducing lexical indexing for document chunks using Whoosh. It is concise, specific, and accurately reflects the PR's primary objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
tests/unit/test_services.py (1)

219-241: Add the failure-path coverage for index rebuilds.

This only exercises the happy path. index_document_chunks() uses the same temp-dir swap pattern as update_document_artifacts(), so it should also seed an existing destination, force the fake index to fail, and verify the old index survives while <dest>.tmp is cleaned up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_services.py` around lines 219 - 241, Extend
test_can_index_documents to also cover the failure-path: pre-seed
pipeline.sink[dest] with existing items, configure adapters.FakeIndex to
raise/return failure when services.index_document_chunks is called, then run
index_document_chunks and assert that the original pipeline.sink[dest] items
remain (i.e. old index survives), that any created tmp directory (f"{dest}.tmp")
is cleaned up (no sink entries for tmp), and that pipeline.actions reflect the
attempted MKDIR and cleanup (e.g., no final MV to dest). Use the existing test
helpers FakeDocumentChunksPipeline, FakeIndex, and inspect pipeline.sink and
pipeline.actions to verify behavior similar to update_document_artifacts
failure-path.
Makefile (1)

1-12: Add a clean target for generated indexing artifacts.

This PR adds another generated workspace directory (.whoosh), and the file is already tripping checkmake for a missing clean target. A small cleanup target would make local reruns less stateful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 1 - 12, The Makefile is missing a clean target
required to remove generated indexing artifacts like the new .whoosh workspace;
add a phony clean target (add "clean" to the .PHONY line) and implement a
"clean:" recipe that removes .whoosh (e.g. rm -rf .whoosh) and any other
transient files you want cleaned (pytest cache, build artifacts) so local reruns
are deterministic and checkmake passes.
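A minimal version of the suggested target, assuming .whoosh is the generated index directory and following the existing .PHONY convention (any directory names beyond .whoosh are guesses):

```make
.PHONY: clean

clean:
	rm -rf .whoosh .pytest_cache
```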
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06fa7516-ef60-445a-a89f-2cf33017b0ec

📥 Commits

Reviewing files that changed from the base of the PR and between 5817cf3 and 24f4a56.

📒 Files selected for processing (12)
  • .gitignore
  • Makefile
  • pyproject.toml
  • src/docs_buddy/adapters/__init__.py
  • src/docs_buddy/adapters/whoosh_index.py
  • src/docs_buddy/domain/__init__.py
  • src/docs_buddy/entrypoints/cli.py
  • src/docs_buddy/services/__init__.py
  • tests/integration/test_adapters.py
  • tests/unit/test_domain.py
  • tests/unit/test_frontmatter.py
  • tests/unit/test_services.py

Comment on lines +41 to +43
[[tool.mypy.overrides]]
module = ["whoosh.*"]
follow_untyped_imports = true

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Direct imports of the root package need `module = ["whoosh", "whoosh.*"]`.
rg -n -C2 '^\s*(from\s+whoosh\s+import|import\s+whoosh\b)' src tests || true

# Show the current override block for comparison.
sed -n '37,43p' pyproject.toml

Repository: programmer-ke/docs-buddy

Length of output: 871


Add "whoosh" to the module pattern.

The adapter imports directly from the root whoosh package (from whoosh import index), which is not covered by the pattern module = ["whoosh.*"]. The override should be module = ["whoosh", "whoosh.*"] to properly suppress type errors for both the root package and its subpackages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 41 - 43, The mypy override currently targets
only subpackages with module = ["whoosh.*"] which misses imports like from
whoosh import index; update the override so the module pattern includes the root
package as well by changing module = ["whoosh.*"] to module = ["whoosh",
"whoosh.*"] so both the root "whoosh" and its subpackages are covered.

Comment on lines +225 to +247
    @contextmanager
    def get_temp_location(self, prefix=""):
        """
        Create a temporary directory for atomic writes.

        Yields:
            Path to temporary directory
        """
        with tempfile.TemporaryDirectory(
            prefix=(prefix or f"{self._destination.name}_")
        ) as temp_dir:
            yield Path(temp_dir)

    def replace_destination(self, temp_location: PathLike) -> None:
        """Replaces the destination with the provided temp_location"""

        temp_path = Path(temp_location)
        dest_path = self._destination

        if dest_path.exists():
            shutil.rmtree(dest_path)

        shutil.move(str(temp_path), str(dest_path))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
sed -n '225,247p' src/docs_buddy/adapters/__init__.py

Repository: programmer-ke/docs-buddy

Length of output: 767


🏁 Script executed:

# Find all usages of get_temp_location and replace_destination
rg "get_temp_location|replace_destination" --type py -B 2 -A 2

Repository: programmer-ke/docs-buddy

Length of output: 9509


🏁 Script executed:

# Check the initialization of FileSystemIntermediateStorage and _destination
rg "class FileSystemIntermediateStorage|def __init__" src/docs_buddy/adapters/__init__.py -A 10

Repository: programmer-ke/docs-buddy

Length of output: 3206


🏁 Script executed:

# Check if there's any documentation about the design intent
rg "_destination|FileSystemIntermediateStorage" src/docs_buddy/adapters/__init__.py -B 2 -A 2 | head -100

Repository: programmer-ke/docs-buddy

Length of output: 3231


🏁 Script executed:

# Check how destinations are created/initialized in actual usage
rg "FileSystemIntermediateStorage|FileSystemDocsStorage|FileSystemChunkPipeline" --type py -B 3 -A 3 src/docs_buddy/

Repository: programmer-ke/docs-buddy

Length of output: 2582


🏁 Script executed:

# Check if there are any tests that verify cross-filesystem behavior or failure recovery
rg "test.*replace|test.*atomic|test.*cross" --type py -A 5 tests/

Repository: programmer-ke/docs-buddy

Length of output: 1380


The destination handoff is still non-atomic and risks cross-filesystem copy fallback.

Line 233 stages with TemporaryDirectory() using the system default temp area, which may reside on a different filesystem from user data directories like .chunks or .whoosh. When source and destination are on different filesystems, shutil.move() falls back to copy+delete instead of atomic rename. Lines 244-247 then delete the live destination before the move, creating a window where a failed handoff leaves the last good index missing or partially copied.

Ensure staging occurs on the destination filesystem by supplying dir= to TemporaryDirectory(), and consider a safer swap strategy for the final replacement that doesn't require a delete-then-move sequence.

Minimum fix to keep staging on the destination filesystem
     @contextmanager
     def get_temp_location(self, prefix=""):
         """
         Create a temporary directory for atomic writes.
@@
-        with tempfile.TemporaryDirectory(
-            prefix=(prefix or f"{self._destination.name}_")
-        ) as temp_dir:
+        parent = self._destination.parent
+        parent.mkdir(parents=True, exist_ok=True)
+        with tempfile.TemporaryDirectory(
+            dir=parent, prefix=(prefix or f"{self._destination.name}_")
+        ) as temp_dir:
             yield Path(temp_dir)

This removes the cross-filesystem copy path, but the final rmtree() + move() swap still needs a safer handoff strategy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/docs_buddy/adapters/__init__.py` around lines 225 - 247,
get_temp_location currently stages in system temp which can be on a different
filesystem than self._destination; change get_temp_location to create the
TemporaryDirectory on the destination filesystem by passing
dir=self._destination.parent (so prefix still can use self._destination.name).
Update replace_destination to avoid deleting dest before the handoff: instead
move/rename the current dest to a backup name (e.g., dest_backup), then
atomically move/rename the staged temp into the final location using an atomic
rename/replace API (os.replace or Path.replace) so the swap is atomic on the
same filesystem, and finally remove the backup only after the atomic swap
succeeds; reference functions: get_temp_location, replace_destination, and
attribute: self._destination.

Comment on lines +49 to +54
            writer.add_document(
                chunk_id=chunk_id,
                content=chunk.chunk,
                path=chunk.path,
                path_keywords=chunk.path.split("/"),
                metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
            )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
sed -n '49,54p' src/docs_buddy/adapters/whoosh_index.py
sed -n '98,101p' src/docs_buddy/services/__init__.py

Repository: programmer-ke/docs-buddy

Length of output: 544


🏁 Script executed:

#!/bin/bash
# Check for Windows-related TODOs, FIXMEs, or comments in path handling code
rg -i "windows|backslash|native.*separator|platform.*specific" --type py src/docs_buddy/

# Check if there's any documentation about path format expectations
find . -name "*.md" -o -name "*.txt" -o -name "*.rst" | head -20 | xargs grep -l -i "path\|windows\|separator" 2>/dev/null | head -5

# Check the PathLib usage pattern to understand intended path format
rg "pathlib|Path\(" src/docs_buddy/ -A 2 -B 2 | head -50

Repository: programmer-ke/docs-buddy

Length of output: 3154


🏁 Script executed:

#!/bin/bash
# Check what format chunk.path actually has by examining its source
rg "chunk\.path\s*=" src/docs_buddy/ -B 5 -A 2

# Check FileSystemDocsStorage to understand how paths are generated
rg "class FileSystemDocsStorage|get_source_paths" src/docs_buddy/services/__init__.py -A 20 | head -60

Repository: programmer-ke/docs-buddy

Length of output: 1509


🏁 Script executed:

#!/bin/bash
# Check the Whoosh field definition for path_keywords
rg "path_keywords|KEYWORD|schema" src/docs_buddy/adapters/whoosh_index.py -B 5 -A 5

Repository: programmer-ke/docs-buddy

Length of output: 1374


🌐 Web query:

Whoosh KEYWORD field list input analyzer lowercase behavior

💡 Result:

In Whoosh:

  • KEYWORD(...) does not lowercase by default (lowercase=False). It will only lowercase at index time if you set KEYWORD(lowercase=True) (or supply an analyzer that lowercases). The docs explicitly say: “To automatically lowercase the keywords before indexing them, use lowercase=True.” [1]
  • KEYWORD normally tokenizes a single string into keywords, split on spaces by default, or on commas if you set commas=True (useful when keywords may contain spaces). [1]
  • Important gotcha for “list input”: if you pass a list/tuple of unicode strings to writer.add_document(field=[...]), Whoosh says this bypasses the field’s analyzer for those strings—so any analyzer-driven behavior (including lowercasing) won’t run. You must normalize (e.g., .lower()) yourself before indexing if you use list/tuple input. [2]
  • Reference: the KEYWORD field type’s constructor defaults are lowercase=False, commas=False, and its analyzer behavior can be overridden via the analyzer= parameter. [3]
  • The built-in KeywordAnalyzer also defaults to lowercase=False (and has a commas option), matching the field defaults. [4]

Sources:
[1] Whoosh “Designing a schema” (KEYWORD, lowercase=True, commas=True) (whoosh.readthedocs.io)
[2] Whoosh writing docs (list/tuple of unicode strings bypasses analyzer) (whoosh.readthedocs.io)
[3] Whoosh fields.KEYWORD API (default params, analyzer=) (whoosh.readthedocs.io)
[4] Whoosh analysis.KeywordAnalyzer(lowercase=False, commas=False) (whoosh.readthedocs.io)

Citations:


🏁 Script executed:

#!/bin/bash
# Check if there are any tests or usage examples showing expected behavior
find . -name "*test*" -type f -name "*.py" | xargs grep -l "path_keywords\|KEYWORD" 2>/dev/null | head -3

# Check if there are comments or documentation about path handling
rg "path_keywords|chunk\.path" src/docs_buddy/ -B 3 -A 3

Repository: programmer-ke/docs-buddy

Length of output: 1548


Lowercase path_keywords list items to match schema intent.

Line 53 passes a list to Whoosh's KEYWORD(lowercase=True) field. Per Whoosh documentation, list input bypasses the field's analyzer, so the lowercase=True setting doesn't apply. Path keywords won't be lowercased during indexing, causing case-sensitive lookups. Lowercase each path component before passing to add_document().

🛠️ Proposed fix
             writer.add_document(
                 chunk_id=chunk_id,
                 content=chunk.chunk,
                 path=chunk.path,
-                path_keywords=chunk.path.split("/"),
+                path_keywords=[part.lower() for part in chunk.path.split("/") if part],
                 metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
             writer.add_document(
                 chunk_id=chunk_id,
                 content=chunk.chunk,
                 path=chunk.path,
-                path_keywords=chunk.path.split("/"),
+                path_keywords=[part.lower() for part in chunk.path.split("/") if part],
                 metadata=json.dumps(chunk.metadata, default=json_datetime_handler),
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/docs_buddy/adapters/whoosh_index.py` around lines 49 - 54, The path
components in writer.add_document(...) are passed as chunk.path.split("/") into
path_keywords, which bypasses Whoosh's analyzer and leaves them case-sensitive;
transform each component to lowercase before passing to add_document (e.g., map
chunk.path.split("/") to a list of component.lower() values) so path_keywords
matches the KEYWORD(lowercase=True) schema intent; update the call that sets
path_keywords in the writer.add_document usage referencing chunk.path and
path_keywords accordingly.

Comment on lines +16 to +24
def test_frontmatter_extractor() -> None:
    metadata, content = adapters.frontmatter_metadata_extractor(SAMPLE_DOC_1)

    assert all(
        [t in metadata for t in ["title", "description", "centeredHeader", "pubDate"]]
    )
    assert metadata["title"] == "Open Source Community"
    assert metadata["pubDate"] == "2020-01-19"
    assert SAMPLE_DOC_1.strip().endswith(content)

⚠️ Potential issue | 🟡 Minor

Make the content assertion exact.

SAMPLE_DOC_1.strip().endswith(content) is too weak here: it still passes for content == "" and for outputs that leave the frontmatter attached. Assert the extracted markdown body directly so this test proves the parser stripped the header.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_frontmatter.py` around lines 16 - 24, The existing test uses
SAMPLE_DOC_1.strip().endswith(content) which is too weak; change
test_frontmatter_extractor to assert that content exactly equals the expected
markdown body (i.e., SAMPLE_DOC_1 with the YAML frontmatter removed and trimmed)
rather than using endswith. Locate the test function and
adapters.frontmatter_metadata_extractor call, compute or inline the
expected_body (the original SAMPLE_DOC_1 content after removing the frontmatter
block and any surrounding whitespace), then replace the endswith assertion with
assert content == expected_body (or assert content.strip() ==
expected_body.strip()) so the test verifies the header is actually stripped.
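The weakness is easy to demonstrate in isolation: endswith is satisfied by the empty string, while an exact comparison pins down the extracted body. A toy illustration with a made-up sample doc (not the repo's SAMPLE_DOC_1):

```python
SAMPLE_DOC = """---
title: Open Source Community
pubDate: 2020-01-19
---

# Open Source Community

Body text here.
"""

# Naive frontmatter strip for illustration: take everything after the second "---".
expected_body = SAMPLE_DOC.split("---", 2)[2].strip()

# endswith passes even for the degenerate empty-string "body"...
assert SAMPLE_DOC.strip().endswith("")

# ...while an exact comparison actually proves the header was removed.
assert expected_body == "# Open Source Community\n\nBody text here."
```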
