A Python RAG engine focused on document ingestion for PDF and Markdown files. It loads documents, extracts text, applies recursive chunking with dynamic overlap, optionally detects semantic topic breaks with embeddings, and emits deduplicated chunks ready for indexing.
- PDF, `.md`, and `.markdown` ingestion.
- PDF page extraction with `pypdf`.
- Recursive chunking through `RecursiveCharacterTextSplitter`.
- Dynamic overlap based on the configured chunk size.
- Optional semantic paragraph chunking with user-provided embeddings.
- SHA-256 content hashes for duplicate detection.
- JSON-ready output with `text`, `metadata`, and `content_hash`.
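As a rough sketch of the dynamic-overlap and hashing behavior (illustrative only, not the package's internal code; the `overlap_ratio` default of `0.12` comes from the parameter list later in this README):

```python
import hashlib

def dynamic_overlap(chunk_size: int, overlap_ratio: float = 0.12) -> int:
    # Overlap grows with the configured chunk size instead of being fixed.
    return int(chunk_size * overlap_ratio)

def content_hash(text: str) -> str:
    # SHA-256 over the chunk text yields a stable key for duplicate detection.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

print(dynamic_overlap(1200))  # 144 characters of overlap at the default ratio
```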
Project layout:

```text
PY-RAG-ENGINE/
├── data/
│   ├── gdp_document_0.metadata.json   # generated metadata for practical review
│   ├── gdp_first_rows.json            # Hugging Face first-rows response
│   └── hf_rows.json                   # Markdown dataset sample
├── docs/
│   └── architecture.md
├── examples/
│   └── sample_hf.md
├── scripts/
│   └── process_document.py            # local CLI for PDF/Markdown processing
├── src/
│   └── py_rag_engine/
│       ├── chunking/
│       │   ├── recursive.py           # recursive splitter and dynamic overlap
│       │   └── semantic.py            # embedding-based semantic splitting
│       ├── embeddings/
│       │   └── hashing.py             # content hash helpers
│       ├── ingestion/
│       │   ├── loaders.py             # PDF and Markdown loaders
│       │   └── pipeline.py            # ingestion orchestration
│       ├── retrieval/
│       ├── domain.py                  # DocumentChunk and ChunkMetadata
│       └── vector_math.py
├── tests/
│   ├── test_chunking.py
│   ├── test_embeddings.py
│   ├── test_ingestion.py
│   └── test_retrieval.py
├── pyproject.toml
└── README.md
```
Generated document files and full processing outputs are intentionally kept out of Git:

```text
data/*.pdf
results/
outputs/
```

Metadata JSON files under `data/` stay under version control so the team can review practical processing results without committing large documents or full chunk payloads.
PowerShell:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .
pip install pytest
```

Install optional embedding support when semantic chunking should use `sentence-transformers`:

```powershell
pip install -e ".[embeddings]"
```

PostgreSQL persistence requires a database with the pgvector extension available. The Python dependencies are installed with the default package:

```powershell
pip install -e .
```

The Hugging Face endpoint returns dataset rows with signed PDF URLs:
```powershell
Invoke-WebRequest `
  -Uri "https://datasets-server.huggingface.co/first-rows?dataset=surgeai%2FGDP.pdf&config=default&split=train" `
  -OutFile "data\gdp_first_rows.json"
```

Download the first PDF referenced by the response:
```powershell
$rows = Get-Content -Raw data\gdp_first_rows.json | ConvertFrom-Json
$url = $rows.rows[0].row.pdf.src
Invoke-WebRequest -Uri $url -OutFile "data\gdp_document_0.pdf"
```

Process the PDF with the local CLI:

```powershell
$env:PYTHONPATH='src'
python scripts\process_document.py `
  data\gdp_document_0.pdf `
  --output results\gdp_chunks.json `
  --metadata-output data\gdp_document_0.metadata.json
```

Expected output for the current test PDF:
```text
chunks=50
output=results\gdp_chunks.json
metadata_output=data\gdp_document_0.metadata.json
```
`results\gdp_chunks.json` contains the full chunk list, including text. It is ignored by Git because it can become large.

`data\gdp_document_0.metadata.json` contains lightweight metadata suitable for team review:
```json
{
  "chunk_count": 50,
  "unique_hash_count": 50,
  "sources": ["data\\gdp_document_0.pdf"],
  "pages": [1, 2, 3],
  "chunks": [
    {
      "content_hash": "8276b4d156b68d5ced56231850149a1e8ded0133bfab80375275e9702e45dbcb",
      "source": "data\\gdp_document_0.pdf",
      "page": 1,
      "chunk_index": 0,
      "text_chars": 234
    }
  ]
}
```

Inspect the full chunk output:

```powershell
$env:PYTHONPATH='src'
Get-Content -Raw results\gdp_chunks.json
```

Each chunk has this shape:
```json
[
  {
    "text": "GEOTECHNICAL INVESTIGATION REPORT ...",
    "metadata": {
      "source": "C:\\Users\\...\\data\\gdp_document_0.pdf",
      "page": 1,
      "chunk_index": 0,
      "extra": {}
    },
    "content_hash": "8276b4d156b68d5ced56231850149a1e8ded0133bfab80375275e9702e45dbcb"
  }
]
```

Process a Markdown document the same way:

```powershell
$env:PYTHONPATH='src'
python scripts\process_document.py `
  examples\sample_hf.md `
  --output results\sample_hf_chunks.json `
  --metadata-output data\sample_hf.metadata.json
```

For Markdown documents, `metadata.page` is `null` because the source format has no page boundaries.
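A quick way to confirm this from Python (assuming `DocumentChunk` exposes a `metadata` object with the fields shown in the chunk shape above):

```python
from py_rag_engine.ingestion import ingest_file

chunks = ingest_file("examples/sample_hf.md")
# Markdown has no page boundaries, so every chunk's page stays unset.
assert all(chunk.metadata.page is None for chunk in chunks)
```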
Run ingestion directly from Python:

```powershell
$env:PYTHONPATH='src'
python -c "from py_rag_engine.ingestion import ingest_file; chunks = ingest_file('data/gdp_document_0.pdf'); print(len(chunks))"
```

This command prints only the number of generated chunks.
Semantic chunking requires an embedding function. With `sentence-transformers`:

```powershell
$env:PYTHONPATH='src'
python -c "from py_rag_engine.ingestion import ingest_file, make_sentence_transformer_embed; embed = make_sentence_transformer_embed('all-MiniLM-L6-v2'); chunks = ingest_file('examples/sample_hf.md', use_semantic_chunking=True, embed=embed); print(len(chunks))"
```
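Conceptually, semantic chunking compares consecutive paragraph embeddings and starts a new chunk when their similarity drops below a threshold. A minimal sketch of that idea (not the package's implementation; `embed` maps a string to a vector, and the `0.75` threshold is an arbitrary placeholder):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def split_on_topic_breaks(paragraphs, embed, threshold=0.75):
    # Assumes at least one paragraph; each topic break closes the current group.
    vectors = [embed(p) for p in paragraphs]
    groups, current = [], [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        if cosine_similarity(vectors[i - 1], vectors[i]) < threshold:
            groups.append("\n\n".join(current))
            current = []
        current.append(paragraphs[i])
    groups.append("\n\n".join(current))
    return groups
```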
Useful parameters (an example call follows the list):

- `chunk_size`: target chunk size, default `1200`.
- `chunk_overlap`: fixed overlap. When omitted, dynamic overlap is used.
- `overlap_ratio`: dynamic overlap ratio, default `0.12`.
- `use_semantic_chunking`: enables embedding-based topic break detection.
- `semantic_similarity_threshold`: paragraph similarity threshold.
- `deduplicate_by_hash`: removes duplicated chunks, default `True`.
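For example, assuming `ingest_file` accepts these as keyword arguments (consistent with the `use_semantic_chunking` call above):

```python
from py_rag_engine.ingestion import ingest_file

# Smaller chunks with a fixed overlap; hash deduplication stays enabled.
chunks = ingest_file(
    "data/gdp_document_0.pdf",
    chunk_size=800,
    chunk_overlap=100,  # fixed overlap; omit to fall back to overlap_ratio
    deduplicate_by_hash=True,
)
print(len(chunks))
```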
Convert chunks to JSON-ready dictionaries:

```python
from py_rag_engine.ingestion import chunks_to_dicts, ingest_file

chunks = ingest_file("data/gdp_document_0.pdf")
payload = chunks_to_dicts(chunks)
```

`payload` is a list of dictionaries:

```python
[
    {
        "text": "...",
        "metadata": {
            "source": "...",
            "page": 1,
            "chunk_index": 0,
            "extra": {},
        },
        "content_hash": "...",
    }
]
```

The storage layer uses SQLAlchemy with PostgreSQL, JSONB metadata, and pgvector.
It creates an `embeddings` table with a fixed vector dimension and an HNSW ANN index for cosine distance.
Supported dimensions:

- `text-embedding-3-small` or `openai-3-small`: 1536 dimensions.
- `bge-m3`: 1024 dimensions.
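If you need those dimensions programmatically, a small lookup mirroring the list above can help (a convenience sketch, not a helper shipped by the package):

```python
# Embedding model name -> expected pgvector dimension.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "openai-3-small": 1536,
    "bge-m3": 1024,
}
```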
Example:

```python
from sqlalchemy import create_engine

from py_rag_engine.storage import EmbeddingInput, PostgresEmbeddingStore

engine = create_engine("postgresql+psycopg://rag:rag@localhost:5432/rag")
store = PostgresEmbeddingStore(engine, embedding_model="openai-3-small")
store.create_schema()
store.add_embedding(
    EmbeddingInput(
        text="Original chunk text",
        embedding=[0.0] * 1536,
        metadata={"source": "examples/sample_hf.md", "page": None, "chunk_index": 0},
        content_hash="chunk-sha256",
    )
)
results = store.similarity_search([0.0] * 1536, top_k=5, ef_search=80)
```

`create_schema()` runs `CREATE EXTENSION IF NOT EXISTS vector`, creates the table, and creates these indexes:
```sql
CREATE INDEX ix_embeddings_embedding_hnsw_cosine
    ON embeddings USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX ix_embeddings_metadata_gin
    ON embeddings USING gin (metadata);
```

Similarity search orders by pgvector cosine distance and returns `cosine_similarity = 1 - cosine_distance`. Keep the `ORDER BY` and `LIMIT` pattern for PostgreSQL to use the HNSW ANN index.
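Putting ingestion and storage together looks roughly like this (a sketch: `embed_text` is a placeholder for your own 1536-dimensional embedding function, and the chunk attributes mirror the JSON chunk shape shown earlier):

```python
from sqlalchemy import create_engine

from py_rag_engine.ingestion import ingest_file
from py_rag_engine.storage import EmbeddingInput, PostgresEmbeddingStore

def embed_text(text: str) -> list[float]:
    # Placeholder: call a real embedding model here. The vector length must
    # match the store's configured dimension (1536 for "openai-3-small").
    return [0.0] * 1536

engine = create_engine("postgresql+psycopg://rag:rag@localhost:5432/rag")
store = PostgresEmbeddingStore(engine, embedding_model="openai-3-small")
store.create_schema()

for chunk in ingest_file("data/gdp_document_0.pdf"):
    store.add_embedding(
        EmbeddingInput(
            text=chunk.text,
            embedding=embed_text(chunk.text),
            metadata={
                "source": chunk.metadata.source,
                "page": chunk.metadata.page,
                "chunk_index": chunk.metadata.chunk_index,
            },
            content_hash=chunk.content_hash,
        )
    )
```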
Run the full test suite:

```powershell
python -m pytest -q
```

Expected result in the current project state:

```text
16 passed
```
Run a quick PDF ingestion check:

```powershell
$env:PYTHONPATH='src'
python -c "from py_rag_engine.ingestion import ingest_file; chunks = ingest_file('data/gdp_document_0.pdf'); print(len(chunks), len({c.content_hash for c in chunks}))"
```

The first number is the total chunk count; the second is the number of unique content hashes. With hash deduplication enabled (the default), the two numbers match.
- Source code: `src/py_rag_engine/`.
- Tests: `tests/`.
- Domain models: `src/py_rag_engine/domain.py`.
- Ingestion pipeline: `src/py_rag_engine/ingestion/pipeline.py`.
- Document loaders: `src/py_rag_engine/ingestion/loaders.py`.
- Recursive chunking: `src/py_rag_engine/chunking/recursive.py`.
- Semantic chunking: `src/py_rag_engine/chunking/semantic.py`.
This repository is licensed under the MIT license. See LICENSE.