A Python RAG engine focused on document ingestion for PDF and Markdown files. It loads documents, extracts text, applies recursive chunking with dynamic overlap, optionally detects semantic topic breaks with embeddings, and emits deduplicated chunks ready for indexing.
- PDF, `.md`, and `.markdown` ingestion.
- PDF page extraction with `pypdf`.
- Recursive chunking through `RecursiveCharacterTextSplitter`.
- Dynamic overlap based on the configured chunk size.
- Optional semantic paragraph chunking with user-provided embeddings.
- SHA-256 content hashes for duplicate detection.
- JSON-ready output with `text`, `metadata`, and `content_hash`.
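As a rough sketch of the dynamic-overlap and hashing behavior (illustrative only, not the package's internal code; the `overlap_ratio` default of `0.12` comes from the parameter list later in this README):

```python
import hashlib

def dynamic_overlap(chunk_size: int, overlap_ratio: float = 0.12) -> int:
    # Overlap grows with the configured chunk size instead of being fixed.
    return int(chunk_size * overlap_ratio)

def content_hash(text: str) -> str:
    # SHA-256 over the chunk text yields a stable key for duplicate detection.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

print(dynamic_overlap(1200))  # 144 characters of overlap at the default ratio
```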
Project layout:

```text
PY-RAG-ENGINE/
├── data/
│   ├── gdp_document_0.metadata.json   # generated metadata for practical review
│   ├── gdp_first_rows.json            # Hugging Face first-rows response
│   └── hf_rows.json                   # Markdown dataset sample
├── docs/
│   └── architecture.md
├── examples/
│   └── sample_hf.md
├── scripts/
│   └── process_document.py            # local CLI for PDF/Markdown processing
├── src/
│   └── py_rag_engine/
│       ├── chunking/
│       │   ├── recursive.py           # recursive splitter and dynamic overlap
│       │   └── semantic.py            # embedding-based semantic splitting
│       ├── embeddings/
│       │   └── hashing.py             # content hash helpers
│       ├── ingestion/
│       │   ├── loaders.py             # PDF and Markdown loaders
│       │   └── pipeline.py            # ingestion orchestration
│       ├── retrieval/
│       ├── domain.py                  # DocumentChunk and ChunkMetadata
│       └── vector_math.py
├── tests/
│   ├── test_chunking.py
│   ├── test_embeddings.py
│   ├── test_ingestion.py
│   └── test_retrieval.py
├── pyproject.toml
└── README.md
```
Generated document files and full processing outputs are intentionally kept out of Git:

```text
data/*.pdf
results/
outputs/
```

Metadata JSON files under `data/` stay under version control so the team can review practical processing results without committing large documents or full chunk payloads.
PowerShell:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .
pip install pytest
```

Install optional embedding support when semantic chunking should use `sentence-transformers`:

```powershell
pip install -e ".[embeddings]"
```

PostgreSQL persistence requires a database with the pgvector extension available. The Python dependencies are installed with the default package:

```powershell
pip install -e .
```

The Hugging Face endpoint returns dataset rows with signed PDF URLs:
```powershell
Invoke-WebRequest `
  -Uri "https://datasets-server.huggingface.co/first-rows?dataset=surgeai%2FGDP.pdf&config=default&split=train" `
  -OutFile "data\gdp_first_rows.json"
```

Download the first PDF referenced by the response:
```powershell
$rows = Get-Content -Raw data\gdp_first_rows.json | ConvertFrom-Json
$url = $rows.rows[0].row.pdf.src
Invoke-WebRequest -Uri $url -OutFile "data\gdp_document_0.pdf"
```

Process the PDF with the local CLI:

```powershell
$env:PYTHONPATH='src'
python scripts\process_document.py `
  data\gdp_document_0.pdf `
  --output results\gdp_chunks.json `
  --metadata-output data\gdp_document_0.metadata.json
```

Expected output for the current test PDF:
```text
chunks=50
output=results\gdp_chunks.json
metadata_output=data\gdp_document_0.metadata.json
```
`results\gdp_chunks.json` contains the full chunk list, including text. It is ignored by Git because it can become large.

`data\gdp_document_0.metadata.json` contains lightweight metadata suitable for team review:
```json
{
  "chunk_count": 50,
  "unique_hash_count": 50,
  "sources": ["data\\gdp_document_0.pdf"],
  "pages": [1, 2, 3],
  "chunks": [
    {
      "content_hash": "8276b4d156b68d5ced56231850149a1e8ded0133bfab80375275e9702e45dbcb",
      "source": "data\\gdp_document_0.pdf",
      "page": 1,
      "chunk_index": 0,
      "text_chars": 234
    }
  ]
}
```

Inspect the full chunk output:

```powershell
$env:PYTHONPATH='src'
Get-Content -Raw results\gdp_chunks.json
```

Each chunk has this shape:
```json
[
  {
    "text": "GEOTECHNICAL INVESTIGATION REPORT ...",
    "metadata": {
      "source": "C:\\Users\\...\\data\\gdp_document_0.pdf",
      "page": 1,
      "chunk_index": 0,
      "extra": {}
    },
    "content_hash": "8276b4d156b68d5ced56231850149a1e8ded0133bfab80375275e9702e45dbcb"
  }
]
```

Process a Markdown document the same way:

```powershell
$env:PYTHONPATH='src'
python scripts\process_document.py `
  examples\sample_hf.md `
  --output results\sample_hf_chunks.json `
  --metadata-output data\sample_hf.metadata.json
```

For Markdown documents, `metadata.page` is `null` because the source format has no page boundaries.
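A quick way to confirm this from Python (assuming `DocumentChunk` exposes a `metadata` object with the fields shown in the chunk shape above):

```python
from py_rag_engine.ingestion import ingest_file

chunks = ingest_file("examples/sample_hf.md")
# Markdown has no page boundaries, so every chunk's page stays unset.
assert all(chunk.metadata.page is None for chunk in chunks)
```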
Run ingestion directly from Python:

```powershell
$env:PYTHONPATH='src'
python -c "from py_rag_engine.ingestion import ingest_file; chunks = ingest_file('data/gdp_document_0.pdf'); print(len(chunks))"
```

This command prints only the number of generated chunks.
Semantic chunking requires an embedding function. With `sentence-transformers`:

```powershell
$env:PYTHONPATH='src'
python -c "from py_rag_engine.ingestion import ingest_file, make_sentence_transformer_embed; embed = make_sentence_transformer_embed('all-MiniLM-L6-v2'); chunks = ingest_file('examples/sample_hf.md', use_semantic_chunking=True, embed=embed); print(len(chunks))"
```
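Conceptually, semantic chunking compares consecutive paragraph embeddings and starts a new chunk when their similarity drops below a threshold. A minimal sketch of that idea (not the package's implementation; `embed` maps a string to a vector, and the `0.75` threshold is an arbitrary placeholder):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def split_on_topic_breaks(paragraphs, embed, threshold=0.75):
    # Assumes at least one paragraph; each topic break closes the current group.
    vectors = [embed(p) for p in paragraphs]
    groups, current = [], [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        if cosine_similarity(vectors[i - 1], vectors[i]) < threshold:
            groups.append("\n\n".join(current))
            current = []
        current.append(paragraphs[i])
    groups.append("\n\n".join(current))
    return groups
```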
Useful parameters (an example call follows the list):

- `chunk_size`: target chunk size, default `1200`.
- `chunk_overlap`: fixed overlap. When omitted, dynamic overlap is used.
- `overlap_ratio`: dynamic overlap ratio, default `0.12`.
- `use_semantic_chunking`: enables embedding-based topic break detection.
- `semantic_similarity_threshold`: paragraph similarity threshold.
- `deduplicate_by_hash`: removes duplicated chunks, default `True`.
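For example, assuming `ingest_file` accepts these as keyword arguments (consistent with the `use_semantic_chunking` call above):

```python
from py_rag_engine.ingestion import ingest_file

# Smaller chunks with a fixed overlap; hash deduplication stays enabled.
chunks = ingest_file(
    "data/gdp_document_0.pdf",
    chunk_size=800,
    chunk_overlap=100,  # fixed overlap; omit to fall back to overlap_ratio
    deduplicate_by_hash=True,
)
print(len(chunks))
```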
Convert chunks to JSON-ready dictionaries:

```python
from py_rag_engine.ingestion import chunks_to_dicts, ingest_file

chunks = ingest_file("data/gdp_document_0.pdf")
payload = chunks_to_dicts(chunks)
```

`payload` is a list of dictionaries:

```python
[
    {
        "text": "...",
        "metadata": {
            "source": "...",
            "page": 1,
            "chunk_index": 0,
            "extra": {},
        },
        "content_hash": "...",
    }
]
```

The storage layer uses SQLAlchemy with PostgreSQL, JSONB metadata, and pgvector.
It creates an `embeddings` table with a fixed vector dimension and an HNSW ANN index for cosine distance.
Supported dimensions:

- `text-embedding-3-small` or `openai-3-small`: 1536 dimensions.
- `bge-m3`: 1024 dimensions.
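If you need those dimensions programmatically, a small lookup mirroring the list above can help (a convenience sketch, not a helper shipped by the package):

```python
# Embedding model name -> expected pgvector dimension.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "openai-3-small": 1536,
    "bge-m3": 1024,
}
```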
Example:

```python
from sqlalchemy import create_engine

from py_rag_engine.storage import EmbeddingInput, PostgresEmbeddingStore

engine = create_engine("postgresql+psycopg://rag:rag@localhost:5432/rag")
store = PostgresEmbeddingStore(engine, embedding_model="openai-3-small")
store.create_schema()
store.add_embedding(
    EmbeddingInput(
        text="Original chunk text",
        embedding=[0.0] * 1536,
        metadata={"source": "examples/sample_hf.md", "page": None, "chunk_index": 0},
        content_hash="chunk-sha256",
    )
)
results = store.similarity_search([0.0] * 1536, top_k=5, ef_search=80)
```

`create_schema()` runs `CREATE EXTENSION IF NOT EXISTS vector`, creates the table, and creates these indexes:
```sql
CREATE INDEX ix_embeddings_embedding_hnsw_cosine
    ON embeddings USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX ix_embeddings_metadata_gin
    ON embeddings USING gin (metadata);
```

Similarity search orders by pgvector cosine distance and returns `cosine_similarity = 1 - cosine_distance`. Keep the `ORDER BY` and `LIMIT` pattern for PostgreSQL to use the HNSW ANN index.
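Putting ingestion and storage together looks roughly like this (a sketch: `embed_text` is a placeholder for your own 1536-dimensional embedding function, and the chunk attributes mirror the JSON chunk shape shown earlier):

```python
from sqlalchemy import create_engine

from py_rag_engine.ingestion import ingest_file
from py_rag_engine.storage import EmbeddingInput, PostgresEmbeddingStore

def embed_text(text: str) -> list[float]:
    # Placeholder: call a real embedding model here. The vector length must
    # match the store's configured dimension (1536 for "openai-3-small").
    return [0.0] * 1536

engine = create_engine("postgresql+psycopg://rag:rag@localhost:5432/rag")
store = PostgresEmbeddingStore(engine, embedding_model="openai-3-small")
store.create_schema()

for chunk in ingest_file("data/gdp_document_0.pdf"):
    store.add_embedding(
        EmbeddingInput(
            text=chunk.text,
            embedding=embed_text(chunk.text),
            metadata={
                "source": chunk.metadata.source,
                "page": chunk.metadata.page,
                "chunk_index": chunk.metadata.chunk_index,
            },
            content_hash=chunk.content_hash,
        )
    )
```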
Run the full test suite:

```powershell
python -m pytest -q
```

Expected result in the current project state:

```text
16 passed
```
Run a quick PDF ingestion check:

```powershell
$env:PYTHONPATH='src'
python -c "from py_rag_engine.ingestion import ingest_file; chunks = ingest_file('data/gdp_document_0.pdf'); print(len(chunks), len({c.content_hash for c in chunks}))"
```

The first number is the total chunk count; the second is the number of unique content hashes. With hash deduplication enabled (the default), the two numbers match.
- Source code: `src/py_rag_engine/`.
- Tests: `tests/`.
- Domain models: `src/py_rag_engine/domain.py`.
- Ingestion pipeline: `src/py_rag_engine/ingestion/pipeline.py`.
- Document loaders: `src/py_rag_engine/ingestion/loaders.py`.
- Recursive chunking: `src/py_rag_engine/chunking/recursive.py`.
- Semantic chunking: `src/py_rag_engine/chunking/semantic.py`.
This repository is licensed under the MIT license. See LICENSE.