feat: support PDF uploads for RAG documents by nw9663644-eng · Pull Request #356 · apache/hugegraph-ai

nw9663644-eng · 2026-06-01T15:19:56Z

Summary

Support text-based PDF uploads in the RAG document upload path.

Changes

Add PDF text extraction support in read_documents()
Use pypdf to extract text from PDF files page by page
Handle encrypted, unreadable, and scanned-image-only PDFs with clear Gradio errors
Keep existing TXT and DOCX behavior unchanged
Update the demo upload copy to mention TXT, DOCX, and PDF

Test

python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py

Closes #345

nw9663644-eng · 2026-06-01T16:12:04Z

I have covered the requested scope and suggested tests.

Current test coverage includes:

TXT file reading regression
DOCX file reading regression
text-based PDF reading
PDF files without extractable text
unreadable PDF behavior
encrypted PDF behavior
unsupported file type behavior

The implementation also covers:

adding pypdf to hugegraph-llm/pyproject.toml
extracting PDF text page by page in stable order
replacing the previous PDF TODO error path
keeping existing TXT and DOCX behavior unchanged
updating the demo upload copy to mention TXT, DOCX, and PDF
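
For reference, the extension handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in vector_index_utils.py: the names read_document, READERS, and UnsupportedDocumentError are hypothetical, and a plain exception stands in for gr.Error so the sketch runs without Gradio. Only the TXT reader is implemented; DOCX and PDF readers are assumed to be registered the same way.

```python
from pathlib import Path
from typing import Callable, Dict


class UnsupportedDocumentError(ValueError):
    """Hypothetical stand-in for the gr.Error raised in the Gradio UI."""


def read_txt(path: Path) -> str:
    # Plain-text files are read as UTF-8.
    return path.read_text(encoding="utf-8")


def read_document(path: Path, readers: Dict[str, Callable[[Path], str]]) -> str:
    """Pick a reader by lower-cased file extension and reject unknown types."""
    suffix = path.suffix.lower()
    reader = readers.get(suffix)
    if reader is None:
        supported = ", ".join(sorted(readers))
        raise UnsupportedDocumentError(
            f"Unsupported file type {suffix!r}. Supported types: {supported}."
        )
    return reader(path)


# Only the TXT reader is real in this sketch; DOCX and PDF readers would be
# added under ".docx" and ".pdf" (assumed registration, not the PR's code).
READERS = {".txt": read_txt}
```

Keeping the readers in a mapping means an unsupported extension fails in one place with one message, and existing TXT and DOCX entries are untouched when a new type is added.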

I also checked the dependency lock situation. The repository did not have an existing uv.lock file before running uv lock; running it locally generated a new root-level uv.lock. To avoid introducing a large new lock file unrelated to this focused change, I did not include it in this PR.

Local checks:

python -m py_compile hugegraph-llm/src/tests/test_vector_index_utils.py
python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py
git diff upstream/main --check

Copilot

Pull request overview

Adds text-based PDF support to the RAG document upload path so PDFs can be used in both vector index building and graph extraction (closes #345).

Changes:

Add pypdf-based PDF text extraction (read_pdf_text()) and wire it into read_documents().
Add unit tests covering TXT/DOCX/PDF success cases and common PDF failure modes (encrypted, unreadable, no extractable text).
Update demo UI copy to include PDF alongside TXT and DOCX.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
hugegraph-llm/src/tests/test_vector_index_utils.py	Adds unit tests for `read_documents()` across TXT/DOCX/PDF and PDF error cases.
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py	Implements PDF text extraction via `pypdf` and updates extension handling + error messages.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py	Updates upload instructions to mention PDF support.
hugegraph-llm/pyproject.toml	Adds `pypdf` dependency required for PDF parsing.


nw9663644-eng · 2026-06-02T06:23:37Z

+    try:
+        reader = PdfReader(full_path)
+
+        if reader.is_encrypted:
+            raise gr.Error(
+                "Encrypted PDF files are not supported. "
+                "Please upload an unencrypted PDF."
+            )
+
+        page_texts = []
+        for page in reader.pages:
+            page_text = page.extract_text() or ""
+            if page_text.strip():
+                page_texts.append(page_text)
+
+        text = "\n".join(page_texts).strip()
+        if not text:
+            raise gr.Error(
+                "No extractable text was found in this PDF. "
+                "Scanned-image PDFs are not supported without OCR."
+            )
+
+        return text


I updated read_pdf_text() to open PDF files with a context manager and pass the binary stream to PdfReader, so the file handle is closed after extraction, including error paths. I also fixed the PDF text join syntax and verified the updated files with py_compile.
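
A rough sketch of the context-manager shape described above, under stated assumptions: PdfUploadError is a hypothetical stand-in for gr.Error, the reader_factory parameter is added here only so the sketch can be exercised without pypdf installed, and the wording of the "unreadable" message is assumed rather than taken from the PR. The encrypted and no-text messages are copied from the hunk above.

```python
from typing import Any, Callable, Optional


class PdfUploadError(Exception):
    """Hypothetical stand-in for gr.Error so the sketch runs without Gradio."""


def read_pdf_text(
    full_path: str,
    reader_factory: Optional[Callable[[Any], Any]] = None,
) -> str:
    if reader_factory is None:
        # Imported lazily so a fake reader can be injected in tests.
        from pypdf import PdfReader

        reader_factory = PdfReader

    # The with-block closes the file handle on success and on every error path.
    with open(full_path, "rb") as stream:
        try:
            reader = reader_factory(stream)
        except Exception as exc:
            # Assumed message; the PR's exact wording is not shown here.
            raise PdfUploadError("The PDF file could not be read.") from exc

        if reader.is_encrypted:
            raise PdfUploadError(
                "Encrypted PDF files are not supported. "
                "Please upload an unencrypted PDF."
            )

        # Pages are read lazily, so extraction stays inside the with-block.
        page_texts = []
        for page in reader.pages:
            page_text = page.extract_text() or ""
            if page_text.strip():
                page_texts.append(page_text)

    text = "\n".join(page_texts).strip()
    if not text:
        raise PdfUploadError(
            "No extractable text was found in this PDF. "
            "Scanned-image PDFs are not supported without OCR."
        )
    return text
```

Passing the open binary stream to the reader, rather than the path, is what ties the handle's lifetime to the with-block.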

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

+    for offset in offsets:
+        pdf += f"{offset:010d} 00000 n \n".encode()
+
+    pdf += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n").encode()


+            b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"
+        ),
+        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
+        (b"<< /Length " + str(len(content_stream)).encode() + b" >>\nstream\n" + content_stream + b"\nendstream"),
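
The two hunks above appear to come from a test helper that assembles a small PDF by hand. The full helper is not visible here, so the following is only a reconstruction of the general technique from the visible fragments: the function name build_minimal_pdf, the catalog/pages/page objects, and the content stream operators are assumptions. It also assumes the text fits in Latin-1.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF whose content stream draws `text` in Helvetica."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content_stream = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    # Object numbers are positional: the first entry is object 1, and so on.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (
            b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
            b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"
        ),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        (
            b"<< /Length " + str(len(content_stream)).encode()
            + b" >>\nstream\n" + content_stream + b"\nendstream"
        ),
    ]

    pdf = b"%PDF-1.4\n"
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(pdf))  # byte offset where "N 0 obj" begins
        pdf += f"{number} 0 obj\n".encode() + body + b"\nendobj\n"

    # The cross-reference table records each object's byte offset.
    xref_offset = len(pdf)
    pdf += f"xref\n0 {len(objects) + 1}\n".encode()
    pdf += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        pdf += f"{offset:010d} 00000 n \n".encode()  # 20 bytes per entry

    pdf += (
        f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
        f"startxref\n{xref_offset}\n%%EOF\n"
    ).encode()
    return pdf
```

Generating the fixture in the test keeps binary files out of the repository, at the cost of having to keep the /Length value and the xref offsets consistent with the bytes actually written.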


+def test_read_documents_reads_txt_file(tmp_path):
+    txt_path = tmp_path / "sample.txt"
+    txt_path.write_text("hello hugegraph", encoding="utf-8")
+
+    result = read_documents([SimpleNamespace(name=str(txt_path))], "")


feat: support PDF uploads for RAG documents

f511c44

dosubot Bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and enhancement (New feature or request) labels Jun 1, 2026

github-actions Bot added the llm label Jun 1, 2026

wn12222 added 3 commits June 1, 2026 23:45

chore: add PDF document reading tests

295d50c

chore: add document upload regression tests

837ece9

chore: add PDF upload edge case tests

f112387

nw9663644-eng mentioned this pull request Jun 1, 2026

[Feature] Support PDF uploads for RAG index and graph extraction input #345

Open

3 tasks

chore: add license header to test file

5aacb8c

imbajin requested a review from Copilot June 1, 2026 21:54

Copilot started reviewing on behalf of imbajin June 1, 2026 21:55

Copilot AI reviewed Jun 1, 2026

wn12222 added 3 commits June 2, 2026 14:07

chore: close PDF file stream after reading

5f95a27

fix: repair PDF text join syntax

73e6987

chore: format PDF upload changes

91410d2

imbajin requested a review from Copilot June 2, 2026 19:10

Copilot started reviewing on behalf of imbajin June 2, 2026 19:10

Copilot AI reviewed Jun 2, 2026

nw9663644-eng wants to merge 8 commits into apache:main from nw9663644-eng:feat-support-pdf-upload

3 participants
