feat: support PDF uploads for RAG documents #356

Open
nw9663644-eng wants to merge 8 commits into apache:main from nw9663644-eng:feat-support-pdf-upload
@nw9663644-eng

Summary

Support text-based PDF uploads in the RAG document upload path.

Changes

  • Add PDF text extraction support in read_documents()
  • Use pypdf to extract text from PDF files page by page
  • Handle encrypted, unreadable, and scanned-image-only PDFs with clear Gradio errors
  • Keep existing TXT and DOCX behavior unchanged
  • Update the demo upload copy to mention TXT, DOCX, and PDF

Test

  • python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py

Closes #345

dosubot (bot) added the size:M (this PR changes 30-99 lines, ignoring generated files) and enhancement (new feature or request) labels on Jun 1, 2026
github-actions (bot) added the llm label on Jun 1, 2026
Author

nw9663644-eng commented Jun 1, 2026

I have covered the requested scope and suggested tests.

Current test coverage includes:

  • TXT file reading regression
  • DOCX file reading regression
  • text-based PDF reading
  • PDF files without extractable text
  • unreadable PDF behavior
  • encrypted PDF behavior
  • unsupported file type behavior

The implementation also covers:

  • adding pypdf to hugegraph-llm/pyproject.toml
  • extracting PDF text page by page in stable order
  • replacing the previous PDF TODO error path
  • keeping existing TXT and DOCX behavior unchanged
  • updating the demo upload copy to mention TXT, DOCX, and PDF

I also checked the dependency lock situation. The repository did not have an existing uv.lock file before running uv lock; running it locally generated a new root-level uv.lock. To avoid introducing a large new lock file unrelated to this focused change, I did not include it in this PR.

Local checks:

  • python -m py_compile hugegraph-llm/src/tests/test_vector_index_utils.py
  • python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py
  • git diff upstream/main --check

Contributor

Copilot AI left a comment

Pull request overview

Adds text-based PDF support to the RAG document upload path so PDFs can be used in both vector index building and graph extraction (closes #345).

Changes:

  • Add pypdf-based PDF text extraction (read_pdf_text()) and wire it into read_documents().
  • Add unit tests covering TXT/DOCX/PDF success cases and common PDF failure modes (encrypted, unreadable, no extractable text).
  • Update demo UI copy to include PDF alongside TXT and DOCX.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| hugegraph-llm/src/tests/test_vector_index_utils.py | Adds unit tests for read_documents() across TXT/DOCX/PDF and PDF error cases. |
| hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py | Implements PDF text extraction via pypdf and updates extension handling and error messages. |
| hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py | Updates upload instructions to mention PDF support. |
| hugegraph-llm/pyproject.toml | Adds the pypdf dependency required for PDF parsing. |

Comment on lines +33 to +55
    try:
        reader = PdfReader(full_path)

        if reader.is_encrypted:
            raise gr.Error(
                "Encrypted PDF files are not supported. "
                "Please upload an unencrypted PDF."
            )

        page_texts = []
        for page in reader.pages:
            page_text = page.extract_text() or ""
            if page_text.strip():
                page_texts.append(page_text)

        text = "\n".join(page_texts).strip()
        if not text:
            raise gr.Error(
                "No extractable text was found in this PDF. "
                "Scanned-image PDFs are not supported without OCR."
            )

        return text
Author

I updated read_pdf_text() to open PDF files with a context manager and pass the binary stream to PdfReader, so the file handle is closed after extraction, including error paths. I also fixed the PDF text join syntax and verified the updated files with py_compile.
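
For readers following along, the context-manager change described in this reply might look roughly like the sketch below. It is not the code from the PR: `PdfUploadError` is a hypothetical stand-in for `gr.Error` so the snippet runs without Gradio, the `reader_factory` parameter is an assumption added only to make the function testable without pypdf, and the PR's handling of unreadable PDFs is omitted because it is not shown in this thread.

```python
try:
    # pypdf is the dependency added in this PR.
    from pypdf import PdfReader
except ImportError:
    # Lets the sketch be imported where pypdf is not installed.
    PdfReader = None


class PdfUploadError(Exception):
    """Hypothetical stand-in for gr.Error (assumption, not from the PR)."""


def read_pdf_text(full_path, reader_factory=PdfReader):
    """Extract text page by page while the file handle is open."""
    if reader_factory is None:
        raise RuntimeError("pypdf is not installed")

    # The with-block closes the handle on success and on every error path.
    with open(full_path, "rb") as pdf_file:
        reader = reader_factory(pdf_file)

        if reader.is_encrypted:
            raise PdfUploadError(
                "Encrypted PDF files are not supported. "
                "Please upload an unencrypted PDF."
            )

        # Extraction stays inside the with-block because the reader
        # pulls page data from the stream on demand.
        page_texts = []
        for page in reader.pages:
            page_text = page.extract_text() or ""
            if page_text.strip():
                page_texts.append(page_text)

    text = "\n".join(page_texts).strip()
    if not text:
        raise PdfUploadError(
            "No extractable text was found in this PDF. "
            "Scanned-image PDFs are not supported without OCR."
        )
    return text
```

Keeping the page loop inside the `with` block matters here: moving it after the block would try to read from a stream that is already closed.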

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

    for offset in offsets:
        pdf += f"{offset:010d} 00000 n \n".encode()

    pdf += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n").encode()

            b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"
        ),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        (b"<< /Length " + str(len(content_stream)).encode() + b" >>\nstream\n" + content_stream + b"\nendstream"),
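
The two fragments above come from a test helper that assembles a small PDF by hand. Since only pieces of it are quoted in this thread, here is a self-contained sketch of how such a fixture could be built. The function name `build_minimal_pdf`, the catalog/pages/page object layout, the MediaBox, and the text-drawing operators are assumptions filled in for illustration; only the font object, the `/Length` stream wrapper, the xref entry format, and the trailer mirror the quoted lines.

```python
def build_minimal_pdf(text):
    """Assemble a one-page PDF whose content stream draws `text` in Helvetica."""
    # Escape characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content_stream = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    # Objects 1..5: catalog, page tree, page, font, content stream.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (
            b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
            b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"
        ),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        (b"<< /Length " + str(len(content_stream)).encode() + b" >>\nstream\n" + content_stream + b"\nendstream"),
    ]

    pdf = b"%PDF-1.4\n"
    offsets = []
    for number, body in enumerate(objects, start=1):
        # Record the byte offset of each object for the xref table.
        offsets.append(len(pdf))
        pdf += f"{number} 0 obj\n".encode() + body + b"\nendobj\n"

    xref_offset = len(pdf)
    pdf += f"xref\n0 {len(objects) + 1}\n".encode()
    pdf += b"0000000000 65535 f \n"  # entry 0 is the free-list head
    for offset in offsets:
        # Each xref entry is exactly 20 bytes, hence the trailing space.
        pdf += f"{offset:010d} 00000 n \n".encode()

    pdf += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n").encode()
    return pdf
```

A test could write the returned bytes to `tmp_path / "sample.pdf"` and pass that path to `read_documents()`. Building the fixture in code avoids committing a binary file to the repository.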
Comment on lines +55 to +59
def test_read_documents_reads_txt_file(tmp_path):
    txt_path = tmp_path / "sample.txt"
    txt_path.write_text("hello hugegraph", encoding="utf-8")

    result = read_documents([SimpleNamespace(name=str(txt_path))], "")
Development

Successfully merging this pull request may close these issues.

[Feature] Support PDF uploads for RAG index and graph extraction input

3 participants