Unified Python toolkit for visual document understanding
Documentation • Installation • Quick Start • Tasks • Contributing
OmniDocs provides a single, consistent API for document AI tasks: layout detection, OCR, text extraction, table parsing, structured extraction, and reading order. Swap models and backends without changing your code.
```python
result = extractor.extract(image)
```
Why OmniDocs?
- One API — `.extract()` for every task
- Multi-backend — PyTorch, VLLM, MLX, API
- VLM API — Use any cloud VLM (Gemini, OpenRouter, Azure, OpenAI) with zero GPU
- Type-safe — Pydantic configs and outputs
- Structured extraction — Extract data into Pydantic schemas
- Production-ready — Modal deployment, batch processing
Install from PyPI:
```bash
pip install omnidocs
```
Or with uv:
```bash
uv pip install omnidocs
```
Cloud API access (Gemini, OpenRouter, Azure, OpenAI, ANANNAS AI) works out of the box — LiteLLM is included as a core dependency.
Install extras
```bash
pip install omnidocs[pytorch]  # Local GPU inference
pip install omnidocs[vllm]     # High-throughput production
pip install omnidocs[mlx]      # Apple Silicon
pip install omnidocs[ocr]      # Tesseract, EasyOCR, PaddleOCR
pip install omnidocs[all]      # Everything
```
From source
```bash
git clone https://github.com/adithya-s-k/Omnidocs.git
cd Omnidocs
uv sync
```
Flash Attention (optional, for PyTorch VLMs)
Download a pre-built wheel from the Flash Attention Releases page:
```bash
# Example: Python 3.12, CUDA 12, PyTorch 2.5
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```
Use any cloud VLM through a single, provider-agnostic API:
```python
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor
# Just set your env var: OPENROUTER_API_KEY, GOOGLE_API_KEY, etc.
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")
extractor = VLMTextExtractor(config=config)
result = extractor.extract("document.png", output_format="markdown")
print(result.content)
```
Works with any provider: OpenRouter, Gemini, Azure, OpenAI, ANANNAS AI, self-hosted VLLM — if it speaks the OpenAI API, it works.
Extract typed data directly into Pydantic schemas:
```python
from pydantic import BaseModel
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.structured_extraction import VLMStructuredExtractor
class Invoice(BaseModel):
    vendor: str
    total: float
    items: list[str]
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
extractor = VLMStructuredExtractor(config=config)
result = extractor.extract(
    image="invoice.png",
    schema=Invoice,
    prompt="Extract invoice details from this document.",
)
print(result.data.vendor, result.data.total)
```
Run a local model with the same API, for example Qwen3-VL on a VLLM backend:
```python
from omnidocs import Document
from omnidocs.tasks.text_extraction import QwenTextExtractor
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
doc = Document.from_pdf("report.pdf")
extractor = QwenTextExtractor(
    backend=QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct")
)
result = extractor.extract(doc.get_page(0), output_format="markdown")
print(result.content)
```
Detect layout elements on a page:
```python
from omnidocs import Document
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig
doc = Document.from_pdf("paper.pdf")
detector = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda"))
result = detector.extract(doc.get_page(0))
for box in result.bboxes:
    print(f"{box.label.value}: {box.confidence:.2f}")
```
Parse table structure and export it as a DataFrame or HTML:
```python
from omnidocs.tasks.table_extraction import TableFormerExtractor, TableFormerConfig
extractor = TableFormerExtractor(config=TableFormerConfig(device="cuda"))
result = extractor.extract(table_image)
df = result.to_dataframe()
html = result.to_html()
```

Tasks

| Task | Description | Output |
|---|---|---|
| Text Extraction | Convert documents to Markdown/HTML | Formatted text |
| Layout Analysis | Detect titles, tables, figures, etc. | Bounding boxes + labels |
| OCR | Extract text with coordinates | Text blocks + positions |
| Table Extraction | Parse table structure | Cells, rows, columns |
| Structured Extraction | Extract typed data into Pydantic schemas | Validated model instances |
| Reading Order | Determine logical reading sequence | Ordered elements |

Supported Models

Text Extraction

| Model | Backends | Notes |
|---|---|---|
| VLM API | Any cloud API | Provider-agnostic via LiteLLM |
| Qwen3-VL | PyTorch, VLLM, MLX, API | Best quality |
| MinerU VL | PyTorch, VLLM, API | Layout-aware extraction |
| Nanonets OCR2 | PyTorch, VLLM, MLX | Fast, accurate |
| Granite Docling | PyTorch, VLLM, MLX, API | IBM research model |
| DotsOCR | PyTorch, VLLM, API | Layout-aware |

Layout Analysis

| Model | Backends | Notes |
|---|---|---|
| VLM API | Any cloud API | Custom labels support |
| DocLayoutYOLO | PyTorch | Fast (0.1s/page) |
| RT-DETR | PyTorch | Transformer-based |
| Qwen Layout | PyTorch, VLLM, MLX, API | Custom labels |
| MinerU VL Layout | PyTorch, VLLM, API | High accuracy |

Structured Extraction

| Model | Backends | Notes |
|---|---|---|
| VLM API | Any cloud API | Pydantic schema output |

OCR

| Model | Backends | Notes |
|---|---|---|
| Tesseract | CPU | 100+ languages |
| EasyOCR | PyTorch | 80+ languages |
| PaddleOCR | PaddlePaddle | CJK optimized |
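
The OCR engines plug into the same `.extract()` pattern as the other tasks. The sketch below is illustrative only: the `omnidocs.tasks.ocr` import path, the `TesseractOCRExtractor` and `TesseractOCRConfig` names, the `languages` parameter, and the result fields are assumptions rather than the confirmed API, so check the OCR docs for the actual names.

```python
from omnidocs import Document

# Hypothetical import path and class names, following the pattern of the other tasks.
from omnidocs.tasks.ocr import TesseractOCRExtractor, TesseractOCRConfig

doc = Document.from_pdf("scan.pdf")

# Tesseract runs on CPU; the languages parameter is assumed, not confirmed.
extractor = TesseractOCRExtractor(config=TesseractOCRConfig(languages=["eng"]))
result = extractor.extract(doc.get_page(0))

# Per the Tasks table, OCR output is text blocks with positions;
# the field names below are illustrative.
for block in result.blocks:
    print(block.text, block.bbox)
```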

Table Extraction

| Model | Backends | Notes |
|---|---|---|
| TableFormer | PyTorch | Structure + content |

Reading Order

| Model | Backends | Notes |
|---|---|---|
| Rule-based | CPU | R-tree indexing |
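
Reading order is computed over detected layout regions. The sketch below shows the intended flow only; the `RuleBasedReadingOrder` class, its import path, and the call signature are assumptions rather than the confirmed API.

```python
from omnidocs import Document
from omnidocs.tasks.layout_extraction import DocLayoutYOLO, DocLayoutYOLOConfig

# Hypothetical class name and import path for the rule-based sorter.
from omnidocs.tasks.reading_order import RuleBasedReadingOrder

doc = Document.from_pdf("paper.pdf")
page = doc.get_page(0)

# Detect layout regions first, then order them into a logical reading sequence.
layout = DocLayoutYOLO(config=DocLayoutYOLOConfig(device="cuda")).extract(page)
ordered = RuleBasedReadingOrder().extract(page, layout=layout)  # illustrative signature

# Per the Tasks table, the output is ordered elements; field names are illustrative.
for element in ordered.elements:
    print(element.label, element.bbox)
```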
Use any VLM through a single config — just change the model string:
```python
from omnidocs.vlm import VLMAPIConfig
# OpenRouter (100+ vision models)
config = VLMAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")
# Google Gemini
config = VLMAPIConfig(model="gemini/gemini-2.5-flash")
# Azure OpenAI
config = VLMAPIConfig(model="azure/gpt-5-mini", api_version="2024-12-01-preview")
# OpenAI
config = VLMAPIConfig(model="openai/gpt-4o")
# Any OpenAI-compatible API (ANANNAS AI, self-hosted VLLM, etc.)
config = VLMAPIConfig(
    model="openai/model-name",
    api_base="https://your-provider.com/v1",
)
```
See the VLM API docs for full provider setup and model lists.
All VLM models support multiple inference backends:
```python
# PyTorch (local GPU)
from omnidocs.tasks.text_extraction.qwen import QwenTextPyTorchConfig
config = QwenTextPyTorchConfig(model="Qwen/Qwen3-VL-8B-Instruct", device="cuda")
# VLLM (high-throughput)
from omnidocs.tasks.text_extraction.qwen import QwenTextVLLMConfig
config = QwenTextVLLMConfig(model="Qwen/Qwen3-VL-8B-Instruct", tensor_parallel_size=2)
# MLX (Apple Silicon)
from omnidocs.tasks.text_extraction.qwen import QwenTextMLXConfig
config = QwenTextMLXConfig(model="mlx-community/Qwen3-VL-8B-Instruct-4bit")
# API (provider-agnostic via litellm)
from omnidocs.tasks.text_extraction.qwen import QwenTextAPIConfig
config = QwenTextAPIConfig(model="openrouter/qwen/qwen3-vl-8b-instruct")
```
Load documents from files, URLs, or image lists:
```python
from omnidocs import Document
# From file
doc = Document.from_pdf("file.pdf", page_range=(0, 9))
# From URL
doc = Document.from_url("https://arxiv.org/pdf/1706.03762")
# From images
doc = Document.from_images(["page1.png", "page2.png"])
# Access pages
image = doc.get_page(0)  # PIL Image
```
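
For batch processing, the same pieces compose directly. A minimal sketch that reuses the VLM API extractor from the quick start and loops over the ten pages loaded via `page_range`:

```python
from omnidocs import Document
from omnidocs.vlm import VLMAPIConfig
from omnidocs.tasks.text_extraction import VLMTextExtractor

# Load the first ten pages and convert each one to Markdown.
doc = Document.from_pdf("report.pdf", page_range=(0, 9))
extractor = VLMTextExtractor(config=VLMAPIConfig(model="gemini/gemini-2.5-flash"))

pages = [
    extractor.extract(doc.get_page(i), output_format="markdown").content
    for i in range(10)
]
print("\n\n".join(pages))
```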
See the full Roadmap for planned features. Coming soon:
- Math Recognition (LaTeX extraction)
- Chart Understanding
- Surya OCR + Layout
Contributions are welcome! See our Contributing Guide to get started.
```bash
# Setup
git clone https://github.com/adithya-s-k/Omnidocs.git
cd Omnidocs && uv sync
# Test
uv run pytest tests/ -v
# Lint
uv run ruff check . && uv run ruff format .
# Docs
uv run mkdocs serve
```
Resources:
Apache 2.0 — See LICENSE for details.
