
Added multi-format ingestion: DOCX, TXT, Markdown #27

Merged: markgewhite merged 4 commits into main from feature/6-multi-format-ingestion on Apr 11, 2026

Conversation

Owner

@markgewhite markgewhite commented Apr 11, 2026

Summary

  • Extended load_folder() to handle DOCX (Docx2txtLoader), TXT and MD (TextLoader) alongside PDF
  • Added Markdown-aware chunking: MarkdownHeaderTextSplitter splits by headers first, then by size
  • Markdown chunks include section_header metadata (e.g., "Guide > Setup > Prerequisites")
  • Citations show section headers for Markdown sources: [README.md, Section: Installation]
  • Non-Markdown formats continue using RecursiveCharacterTextSplitter unchanged
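
The header-then-size strategy described in the bullets above can be sketched with the standard library alone. This is an illustration, not the code from this PR, which relies on MarkdownHeaderTextSplitter. The function name `split_markdown_by_headers`, the `max_chars` parameter, and the " > " joiner (inferred from the "Guide > Setup > Prerequisites" example) are assumptions.

```python
import re

# ATX-style Markdown headers: one to six '#' characters, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_headers(text, max_chars=500):
    """Hypothetical helper: split by headers first, then by size."""
    chunks = []
    stack = []   # (level, title) pairs for the current header path
    buffer = []  # body lines collected under the current header path

    def flush():
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        breadcrumb = " > ".join(title for _, title in stack)
        # Second pass: cut oversized sections into fixed-size pieces.
        for start in range(0, len(body), max_chars):
            chunks.append({
                "text": body[start:start + max_chars],
                "section_header": breadcrumb,
            })

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before descending.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks
```

The sketch cuts oversized sections at fixed character offsets and does not skip `#` lines inside fenced code blocks, so it is only a rough stand-in for the real splitters.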

Closes #6

Test plan

  • Loader: DOCX loading with correct metadata (3 tests)
  • Loader: TXT and MD loading with correct doc_type (2 tests)
  • Loader: Mixed format folder loads all formats (1 test)
  • Chunker: Markdown section_header metadata (2 tests)
  • Chunker: Non-Markdown has no section_header (1 test)
  • Answerer: Section header in citations (2 tests)
  • Manual: add .md and .docx to test_docs, query across formats
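
The loader and answerer cases in the test plan could be exercised against helpers shaped like the sketch below. `doc_type_for` and `format_citation` are hypothetical names, and the doc_type strings, the `source` metadata key, and the bare `[filename]` fallback for sources without a section are assumptions. Only the `[README.md, Section: Installation]` format and the `section_header` key come from the description above.

```python
from pathlib import Path

# Assumed mapping: the PR names the formats but not the doc_type strings.
DOC_TYPES = {".pdf": "pdf", ".docx": "docx", ".txt": "txt", ".md": "markdown"}


def doc_type_for(path):
    """Return the assumed doc_type for a file, or None if unsupported."""
    return DOC_TYPES.get(Path(path).suffix.lower())


def format_citation(metadata):
    """Build a citation, appending the section only when one is present."""
    name = Path(metadata["source"]).name
    section = metadata.get("section_header")
    if section:
        return f"[{name}, Section: {section}]"
    # Assumed fallback for sources that carry no section_header.
    return f"[{name}]"
```

Keeping the section optional means non-Markdown chunks, which carry no `section_header`, pass through the same formatter unchanged.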

🤖 Generated with Claude Code

markgewhite and others added 4 commits April 11, 2026 21:40
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@markgewhite markgewhite merged commit 08e77d7 into main Apr 11, 2026
1 check passed
@markgewhite markgewhite deleted the feature/6-multi-format-ingestion branch April 11, 2026 21:22

Development

Successfully merging this pull request may close these issues.

Multi-format ingestion: DOCX, TXT, Markdown

1 participant