feat: support PDF uploads for RAG documents #356
I have covered the requested scope and suggested tests. Current test coverage includes:
The implementation also covers:
I also checked the dependency lock situation. The repository did not have an existing
Local checks:
Pull request overview
Adds text-based PDF support to the RAG document upload path so PDFs can be used in both vector index building and graph extraction (closes #345).
Changes:
- Add pypdf-based PDF text extraction (read_pdf_text()) and wire it into read_documents().
- Add unit tests covering TXT/DOCX/PDF success cases and common PDF failure modes (encrypted, unreadable, no extractable text).
- Update demo UI copy to include PDF alongside TXT and DOCX.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| hugegraph-llm/src/tests/test_vector_index_utils.py | Adds unit tests for read_documents() across TXT/DOCX/PDF and PDF error cases. |
| hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py | Implements PDF text extraction via pypdf and updates extension handling + error messages. |
| hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py | Updates upload instructions to mention PDF support. |
| hugegraph-llm/pyproject.toml | Adds pypdf dependency required for PDF parsing. |
    try:
        reader = PdfReader(full_path)

        if reader.is_encrypted:
            raise gr.Error(
                "Encrypted PDF files are not supported. "
                "Please upload an unencrypted PDF."
            )

        page_texts = []
        for page in reader.pages:
            page_text = page.extract_text() or ""
            if page_text.strip():
                page_texts.append(page_text)

        text = "\n".join(page_texts).strip()
        if not text:
            raise gr.Error(
                "No extractable text was found in this PDF. "
                "Scanned-image PDFs are not supported without OCR."
            )

        return text
I updated read_pdf_text() to open PDF files with a context manager and pass the binary stream to PdfReader, so the file handle is closed after extraction, including error paths. I also fixed the PDF text join syntax and verified the updated files with py_compile.
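The approach described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in the PR: `PdfTextError` stands in for `gr.Error` so the sketch runs without Gradio, and the `reader_factory` parameter is a hypothetical hook added here so the function can be exercised without pypdf installed. It relies on `PdfReader` accepting an open binary stream.

```python
from typing import Any, Callable, Optional


class PdfTextError(Exception):
    """Stand-in for gr.Error (assumption: keeps the sketch free of Gradio)."""


def read_pdf_text(
    full_path: str,
    reader_factory: Optional[Callable[[Any], Any]] = None,
) -> str:
    """Extract text page by page, closing the file even when extraction fails."""
    if reader_factory is None:
        # Imported lazily so the sketch can be tested with a fake reader.
        from pypdf import PdfReader

        reader_factory = PdfReader

    # The context manager closes the handle on success and on every error path.
    with open(full_path, "rb") as stream:
        reader = reader_factory(stream)

        if reader.is_encrypted:
            raise PdfTextError(
                "Encrypted PDF files are not supported. "
                "Please upload an unencrypted PDF."
            )

        page_texts = []
        for page in reader.pages:
            # Guard against an empty or missing extraction result.
            page_text = page.extract_text() or ""
            if page_text.strip():
                page_texts.append(page_text)

    text = "\n".join(page_texts).strip()
    if not text:
        raise PdfTextError(
            "No extractable text was found in this PDF. "
            "Scanned-image PDFs are not supported without OCR."
        )
    return text
```

Reading all pages inside the `with` block matters because a reader built from a stream may read from it lazily, so the stream has to stay open until extraction finishes.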
    for offset in offsets:
        pdf += f"{offset:010d} 00000 n \n".encode()

    pdf += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n").encode()
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"
    ),
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    (b"<< /Length " + str(len(content_stream)).encode() + b" >>\nstream\n" + content_stream + b"\nendstream"),
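The two excerpts above appear to come from a test helper that assembles a small PDF by hand. A self-contained sketch of how such a helper might fit together is shown here. The function name `build_minimal_pdf`, the catalog/pages/page objects, the page size, and the content stream are assumptions made for illustration, and the sketch assumes the text is plain ASCII without parentheses or backslashes, which would need escaping in a PDF string.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF whose only content is a single line of text."""
    # Content stream: select font F1 at 12pt, move to (72, 720), draw the text.
    content_stream = f"BT /F1 12 Tf 72 720 Td ({text}) Tj ET".encode("ascii")

    # Objects are numbered 1..5 in list order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (
            b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
            b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"
        ),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        (
            b"<< /Length " + str(len(content_stream)).encode()
            + b" >>\nstream\n" + content_stream + b"\nendstream"
        ),
    ]

    pdf = b"%PDF-1.4\n"
    offsets = []
    for number, body in enumerate(objects, start=1):
        # Record the byte offset of each object for the cross-reference table.
        offsets.append(len(pdf))
        pdf += f"{number} 0 obj\n".encode() + body + b"\nendobj\n"

    xref_offset = len(pdf)
    pdf += f"xref\n0 {len(objects) + 1}\n".encode()
    # Entry 0 is the head of the free list; every entry is exactly 20 bytes.
    pdf += b"0000000000 65535 f \n"
    for offset in offsets:
        pdf += f"{offset:010d} 00000 n \n".encode()

    pdf += (
        f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
        f"startxref\n{xref_offset}\n%%EOF\n"
    ).encode()
    return pdf
```

Building the fixture in code keeps binary files out of the repository and makes the extracted text easy to assert against.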
    def test_read_documents_reads_txt_file(tmp_path):
        txt_path = tmp_path / "sample.txt"
        txt_path.write_text("hello hugegraph", encoding="utf-8")

        result = read_documents([SimpleNamespace(name=str(txt_path))], "")
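The test above passes `SimpleNamespace(name=...)` in place of a real upload object, which suggests the reader only relies on the `name` attribute. A minimal sketch of that pattern follows. It uses a hypothetical `read_txt_uploads()` stand-in rather than the project's `read_documents()`, and `tempfile` in place of pytest's `tmp_path` fixture so it runs on its own.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace
from typing import Any, Iterable, List


def read_txt_uploads(files: Iterable[Any]) -> List[str]:
    """Hypothetical stand-in: read each upload through its .name path."""
    texts = []
    for file in files:
        path = Path(file.name)
        if path.suffix.lower() != ".txt":
            raise ValueError(f"Unsupported file type: {path.suffix}")
        texts.append(path.read_text(encoding="utf-8"))
    return texts


def demo() -> List[str]:
    with tempfile.TemporaryDirectory() as tmp_dir:
        txt_path = Path(tmp_dir) / "sample.txt"
        txt_path.write_text("hello hugegraph", encoding="utf-8")
        # Only the .name attribute is used, so a SimpleNamespace is enough.
        return read_txt_uploads([SimpleNamespace(name=str(txt_path))])
```

Faking the upload object this way avoids starting the demo UI just to exercise the file-reading path.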
Summary
Support text-based PDF uploads in the RAG document upload path.
Changes
- read_documents(): use pypdf to extract text from PDF files page by page
Test
    python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py
Closes #345