Docman (v0.5.0) is a document processing pipeline built on the Heddle framework. It extracts content from PDF, DOCX, PPTX, XLSX, and HTML files using an adaptive two-tier extraction strategy (MarkItDown for speed, Docling for depth), with LLM-based classification and summarization stages.
This is a consumer of the Heddle framework — it provides concrete worker configs, processing backends, and pipeline definitions. The Heddle framework itself lives in a separate repo.
```
src/docman/
  contracts.py             # Pydantic I/O models — source of truth for worker schemas
  backends/
    docling_backend.py     # DoclingBackend — deep PDF/DOCX extraction via IBM Docling (OCR, tables)
    markitdown_backend.py  # MarkItDownBackend — fast extraction via Microsoft MarkItDown (no ML)
    smart_extractor.py     # SmartExtractorBackend — MarkItDown-first, Docling fallback
    duckdb_ingest.py       # DuckDBIngestBackend — document persistence (serialize_writes=True)
    duckdb_query.py        # DocmanQueryBackend — thin subclass of heddle.contrib.duckdb.DuckDBQueryBackend
  tools/
    vector_search.py       # DuckDBVectorTool — thin wrapper around heddle.contrib.duckdb.DuckDBVectorTool
manifest.yaml              # App manifest for Heddle Workshop deployment
configs/
  workers/                 # YAML configs for doc_extractor, doc_classifier, doc_summarizer, doc_ingest, doc_query
  orchestrators/           # Pipeline configs (doc_pipeline, doc_pipeline_local, doc_pipeline_smart)
  mcp/                     # MCP gateway config (docman.yaml)
scripts/
  dev-start.sh             # Local development launcher
  dev-start.ps1            # Windows development launcher
  build-app.sh             # Build deployment ZIP for Heddle Workshop
docs/
  ARCHITECTURE.md          # System architecture overview
  CONTRIBUTING.md          # Contribution standards and CLA
  setup-macos.md           # Full macOS environment setup
  setup-windows.md         # Full Windows environment setup
  docling-setup.md         # Docling configuration and performance tuning
tests/                     # Unit tests (mock backends, in-memory DuckDB, no infrastructure)
```
Docman depends on `heddle[duckdb]` as a package. It uses:

- `ProcessingBackendABC` — DoclingBackend, MarkItDownBackend, SmartExtractorBackend, and DuckDBIngestBackend implement this
- `resolve_schema_refs()` — worker configs use `input_schema_ref`/`output_schema_ref` pointing to `docman.contracts.*` Pydantic models (Heddle resolves them to JSON Schema at load time)
- `heddle.contrib.duckdb.DuckDBQueryBackend` — DocmanQueryBackend subclasses this with Docman-specific schema defaults
- `heddle.contrib.duckdb.DuckDBVectorTool` — DuckDBVectorTool wraps this with Docman-specific column/table defaults
- `heddle.contrib.duckdb.DuckDBViewTool` — used directly (no Docman wrapper needed; already generic)
- `ProcessorWorker` — runs extraction and DuckDB backends via the `heddle processor` CLI
- `LLMWorker` — runs the classifier and summarizer via the `heddle worker` CLI
- `PipelineOrchestrator` — orchestrates the 4-stage pipeline via the `heddle pipeline` CLI (with dependency-aware parallel stage execution)
The CLI loads backends by fully qualified class path from worker configs:

```yaml
processing_backend: "docman.backends.smart_extractor.SmartExtractorBackend"
```
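Under the hood this is plain dotted-path importing. A minimal sketch of the idea (this is not Heddle's actual loader; the helper name and usage lines are illustrative):

```python
import importlib

def load_backend_class(class_path: str) -> type:
    """Resolve a fully qualified class path such as
    'docman.backends.smart_extractor.SmartExtractorBackend'."""
    module_path, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_path)  # import the containing module
    return getattr(module, class_name)             # fetch the class object

# Hypothetical usage, assuming a parsed worker config dict:
# backend_cls = load_backend_class(config["processing_backend"])
# backend = backend_cls(**config.get("backend_config", {}))
```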
Docman provides three extraction backends, all producing the same output contract (`ExtractorOutput`):

- MarkItDownBackend — Uses Microsoft MarkItDown for fast, lightweight document-to-Markdown conversion. No ML models, no torch dependency. Supports PDF, DOCX, PPTX, XLSX, HTML, and more. Cannot OCR scanned PDFs or extract complex table structures. Derives metadata (sections, tables, page count) from the Markdown output.
- DoclingBackend — Uses IBM Docling for deep extraction with OCR, table structure recognition, and layout analysis. Requires torch. Best for scanned PDFs and complex layouts.
- SmartExtractorBackend (recommended) — Composite backend that tries MarkItDown first and falls back to Docling when needed. Fallback triggers: extracted text shorter than `min_text_length` (default: 50 chars), a MarkItDown error, or a file extension in the `force_docling_extensions` list. Reports `model_used: "markitdown"` or `"docling"` so you can see which path ran. A sketch of the fallback logic follows this list.
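The fallback decision itself is small. A sketch of the logic under stated assumptions (the placeholder extract functions and the example force-list value are illustrative; the real code lives in src/docman/backends/smart_extractor.py):

```python
from pathlib import Path

MIN_TEXT_LENGTH = 50                  # default threshold, configurable
FORCE_DOCLING_EXTENSIONS = {".tiff"}  # illustrative value, configurable

def markitdown_extract(path: Path) -> dict:
    # Placeholder: call Microsoft MarkItDown here.
    return {"text": ""}

def docling_extract(path: Path) -> dict:
    # Placeholder: call IBM Docling here.
    return {"text": "", "model_used": "docling"}

def smart_extract(path: Path) -> dict:
    """MarkItDown first; fall back to Docling when the result looks poor."""
    if path.suffix.lower() in FORCE_DOCLING_EXTENSIONS:
        return docling_extract(path)        # extension forces the deep path
    try:
        result = markitdown_extract(path)
    except Exception:
        return docling_extract(path)        # MarkItDown error, fall back
    if len(result.get("text", "")) < MIN_TEXT_LENGTH:
        return docling_extract(path)        # too little text, likely scanned
    result["model_used"] = "markitdown"     # record which path ran
    return result
```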
The pipeline chains four workers:

- doc_extractor (ProcessorWorker + SmartExtractorBackend or DoclingBackend) — Extracts text, tables, and structure from documents. Writes extracted JSON to the workspace, returns a file_ref plus a metadata summary.
- doc_classifier (LLMWorker) — LLM classifies document type from text_preview + metadata. Returns document_type, confidence, reasoning.
- doc_summarizer (LLMWorker) — LLM summarizes based on document type and extracted content. Returns summary, key_points, word_count.
- doc_ingest (ProcessorWorker + DuckDBIngestBackend) — Persists all pipeline results (metadata, classification, summary, full text) into DuckDB. Reads full extracted text from workspace JSON. Returns document_id.
Pipeline execution order: Heddle's PipelineOrchestrator auto-infers dependencies from input_mapping paths and runs independent stages concurrently. Docman's pipeline has genuinely sequential dependencies (classify depends on extract, summarize depends on both, ingest depends on all three), so it produces 4 levels of 1 stage each — sequential execution.
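To make the inference concrete, here is a toy sketch (not Heddle's code) that reads stage dependencies off input_mapping references and groups stages into concurrently runnable levels. The mapping values are hypothetical but modeled on the four Docman stages:

```python
# A value like "extract.file_ref" means "take file_ref from the extract
# stage's output", which creates a dependency on that stage.
input_mappings = {
    "extract":   {"file_ref": "goal.file_ref"},
    "classify":  {"text_preview": "extract.text_preview"},
    "summarize": {"file_ref": "extract.file_ref",
                  "document_type": "classify.document_type"},
    "ingest":    {"extraction": "extract.file_ref",
                  "classification": "classify.document_type",
                  "summary": "summarize.summary"},
}

# Dependencies are the stage names referenced by each mapping.
deps = {
    stage: {ref.split(".")[0] for ref in mapping.values()
            if ref.split(".")[0] in input_mappings}
    for stage, mapping in input_mappings.items()
}

# Layer the stages: a stage can run once all of its dependencies have run.
levels, done = [], set()
while len(done) < len(deps):
    level = [s for s in deps if s not in done and deps[s] <= done]
    levels.append(level)
    done.update(level)

print(levels)  # [['extract'], ['classify'], ['summarize'], ['ingest']]
```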
Pipeline variants:
- `doc_pipeline.yaml` — Standard (Docling extraction, standard-tier summarizer)
- `doc_pipeline_local.yaml` — All-local (Docling extraction, local-tier summarizer)
- `doc_pipeline_smart.yaml` — Smart extraction (MarkItDown-first, standard-tier summarizer)
Scaling note: To process multiple documents concurrently, run multiple pipeline orchestrator instances. NATS queue groups automatically load-balance across replicas:

```bash
# Process 3 documents concurrently — each instance handles one goal
heddle pipeline --config configs/orchestrators/doc_pipeline_smart.yaml &
heddle pipeline --config configs/orchestrators/doc_pipeline_smart.yaml &
heddle pipeline --config configs/orchestrators/doc_pipeline_smart.yaml &
```

- doc_query (ProcessorWorker + DuckDBQueryBackend) — Not part of the pipeline. Accepts structured query requests against the DuckDB database. Supports five actions:
  `search` (full-text via DuckDB FTS), `filter` (by document_type, has_tables, page range), `stats` (aggregate counts/averages), `get` (single document by ID), and `vector_search` (semantic similarity via embeddings).
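For illustration, a doc_query request might look like one of the following. The action names come from the list above; every other field name is an assumption, not the actual QueryInput schema:

```python
# Hypothetical request payloads for the standalone doc_query worker.
search_request = {"action": "search", "query": "quarterly revenue"}
filter_request = {
    "action": "filter",
    "document_type": "report",   # filterable fields per the list above
    "has_tables": True,
}
stats_request = {"action": "stats"}
get_request = {"action": "get", "document_id": "doc-123"}  # returns full content
```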
Worker I/O schemas are defined as Pydantic models in src/docman/contracts.py. Worker YAML configs reference them via input_schema_ref / output_schema_ref, and Heddle's resolve_schema_refs() converts them to JSON Schema at load time.
Models: ExtractorInput, ExtractorOutput, ClassifierInput, ClassifierOutput, SummarizerInput, SummarizerOutput, IngestInput, IngestOutput, QueryInput, QueryOutput.
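For flavor, here is an illustrative sketch of what one of these contracts might look like. The field names are inferred from descriptions elsewhere in this README; see src/docman/contracts.py for the real definitions:

```python
from pydantic import BaseModel

class ExtractorOutput(BaseModel):
    """Illustrative shape only; the real model lives in docman.contracts."""
    file_ref: str                  # workspace-relative path to extracted JSON
    text_preview: str              # first ~500 words, inlined for the classifier
    model_used: str                # "markitdown" or "docling"
    page_count: int | None = None  # metadata summary fields
    num_tables: int | None = None
```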
- Large data passes via file references in a shared workspace directory (`--workspace-dir`)
- Messages carry only `file_ref` strings, not inline content
- Extraction backends (MarkItDown, Docling) read the source file from the workspace and write extracted JSON back to the workspace
- Summarizer file resolution: `resolve_file_refs: ["file_ref"]` and `workspace_dir` are set in the summarizer config, so Heddle's LLMWorker reads the extracted JSON from the workspace automatically
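A minimal sketch of the write side of this pattern (paths and the helper name are illustrative; the real backends handle this internally):

```python
import json
from pathlib import Path

def write_extraction(workspace_dir: Path, doc_id: str, extraction: dict) -> str:
    """Write the full extraction into the shared workspace and return
    the small file_ref string that goes on the NATS message."""
    file_ref = f"extracted/{doc_id}.json"        # relative to the workspace
    out_path = workspace_dir / file_ref
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(extraction))
    return file_ref                              # downstream stages resolve this
```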
DoclingBackend reads tuning options from the `backend_config` section of `doc_extractor.yaml`. Key settings for Apple Silicon (M1 Pro, 32 GB):

- `device: "mps"` — GPU acceleration via Metal Performance Shaders
- `num_threads: 8` — matches the M1 Pro's 8 performance cores
- `ocr_engine: "ocrmac"` — native macOS Vision framework OCR
- `layout_batch_size: 4` / `ocr_batch_size: 4` — balanced for 32 GB RAM
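Deserialized, that section amounts to a plain dict handed to the backend. A sketch using the values above (the exact nesting inside doc_extractor.yaml is an assumption):

```python
# What DoclingBackend might receive from doc_extractor.yaml's backend_config.
backend_config = {
    "device": "mps",         # Metal Performance Shaders on Apple Silicon
    "num_threads": 8,        # one per M1 Pro performance core
    "ocr_engine": "ocrmac",  # macOS Vision framework OCR
    "layout_batch_size": 4,  # batch sizes balanced for 32 GB RAM
    "ocr_batch_size": 4,
}
```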
Pre-download detection models: `docling-tools models download`
Full guide: `docs/docling-setup.md`
Docman can be exposed as an MCP (Model Context Protocol) server using Heddle's built-in MCP gateway — zero MCP-specific code needed.
```bash
# Start Docman as an MCP server (requires heddle[mcp] and NATS + workers running)
heddle mcp --config configs/mcp/docman.yaml
# Or with streamable-http transport
heddle mcp --config configs/mcp/docman.yaml --transport streamable-http --port 8000
```

The MCP config (`configs/mcp/docman.yaml`) maps Docman's workers and query backend to MCP tools:
- Pipeline → `process_document` tool (full extract → classify → summarize → ingest)
- Query backend → `docman_search`, `docman_filter`, `docman_stats`, `docman_get` tools
- Workspace files exposed as MCP resources
See Heddle's Building Workflows Part 11 for full MCP gateway documentation.
- All extraction backends extend `SyncProcessingBackend` — synchronous work runs in a thread pool (`asyncio.run_in_executor`)
- SmartExtractorBackend creates its inner backends lazily — importing docman does not pull in torch or markitdown
- DuckDB backends also run synchronously via `SyncProcessingBackend` (DuckDB is synchronous)
- Path traversal validation: `file_ref` must resolve within `workspace_dir` (see the sketch after this list)
- `text_preview` (first ~500 words) is included inline in extractor output so the classifier doesn't need file access
- Workspace directory is a shared filesystem, configured per deployment
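The traversal check is the standard resolve-and-contain pattern. A minimal sketch (function name illustrative):

```python
from pathlib import Path

def resolve_file_ref(workspace_dir: Path, file_ref: str) -> Path:
    """Resolve file_ref inside workspace_dir, rejecting escapes like '../'."""
    workspace = workspace_dir.resolve()
    candidate = (workspace / file_ref).resolve()  # collapses .. and symlinks
    if not candidate.is_relative_to(workspace):   # Python 3.9+
        raise ValueError(f"file_ref escapes workspace: {file_ref!r}")
    return candidate
```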
- DuckDB database file path is configurable via `backend_config.db_path` (defaults to the workspace)
- DuckDB schema is auto-created on first ingestion (no migration step needed)
- DuckDB FTS extension enables full-text search across document content and summaries
- Query results exclude the `full_text` column by default to keep NATS messages small; use the `get` action for full content
- Vector embeddings use a `FLOAT[]` (variable-length) column in DuckDB — use `list_cosine_similarity` (NOT `array_cosine_similarity`, which requires a fixed-size `FLOAT[N]`); a query sketch follows this list
- Embedding generation is optional — controlled by the `embedding` config section in `doc_ingest.yaml`. When absent, the embedding column stores NULL
- DuckDBViewTool and DuckDBVectorTool implement Heddle's `SyncToolProvider` for LLM function-calling via the `knowledge_silos` config
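To illustrate the list_cosine_similarity point, a self-contained example against an in-memory DuckDB. The table and column names here are hypothetical, not Docman's actual schema:

```python
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE documents (id INTEGER, embedding FLOAT[])")
con.execute("INSERT INTO documents VALUES (1, [1.0, 0.0]), (2, [0.6, 0.8])")

# list_cosine_similarity works on variable-length FLOAT[] columns;
# array_cosine_similarity would require a fixed-size FLOAT[N] column.
rows = con.execute(
    """
    SELECT id, list_cosine_similarity(embedding, ?::FLOAT[]) AS score
    FROM documents
    ORDER BY score DESC
    """,
    [[1.0, 0.0]],  # query embedding
).fetchall()
print(rows)  # doc 1 scores 1.0, doc 2 roughly 0.6
```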
```bash
# Install all dependencies (requires Python 3.11+, uses uv)
# Heddle is resolved from ../heddle via [tool.uv.sources] in pyproject.toml
uv sync --extra dev
# Pre-download Docling detection models (avoids delay on first run)
uv run docling-tools models download
# Run unit tests (no infrastructure needed)
uv run pytest tests/ -v
# Run with infrastructure (needs NATS + Heddle installed)
# Terminal 1: docker run -p 4222:4222 nats:latest
# Terminal 2: uv run heddle router --nats-url nats://localhost:4222
# Terminal 3: uv run heddle processor --config configs/workers/doc_extractor.yaml --nats-url nats://localhost:4222
# Terminal 4: OLLAMA_URL=http://localhost:11434 uv run heddle worker --config configs/workers/doc_classifier.yaml --tier local --nats-url nats://localhost:4222
# Terminal 5: ANTHROPIC_API_KEY=sk-... uv run heddle worker --config configs/workers/doc_summarizer.yaml --tier standard --nats-url nats://localhost:4222
# Terminal 6: uv run heddle processor --config configs/workers/doc_ingest.yaml --nats-url nats://localhost:4222
# Terminal 7: uv run heddle processor --config configs/workers/doc_query.yaml --nats-url nats://localhost:4222
# Terminal 8: uv run heddle pipeline --config configs/orchestrators/doc_pipeline.yaml --nats-url nats://localhost:4222
# Submit: uv run heddle submit "Process document" --context file_ref=test.pdf --nats-url nats://localhost:4222
```

The following items are implemented and working:
- Pydantic I/O contracts (`src/docman/contracts.py`) — source of truth for all worker schemas, resolved at load time via Heddle's `resolve_schema_refs()`
- MarkItDownBackend (`src/docman/backends/markitdown_backend.py`) — fast extraction via Microsoft MarkItDown, derives metadata from Markdown output
- SmartExtractorBackend (`src/docman/backends/smart_extractor.py`) — composite MarkItDown-first with Docling fallback, configurable thresholds
- DoclingBackend (`src/docman/backends/docling_backend.py`) — deep extraction with OCR, table structure, layout analysis
- DuckDBIngestBackend (`src/docman/backends/duckdb_ingest.py`) — persists pipeline results to DuckDB with auto-schema creation, FTS index, optional vector embeddings
- DocmanQueryBackend (`src/docman/backends/duckdb_query.py`) — thin subclass with Docman document schema defaults
- DuckDBVectorTool (`src/docman/tools/vector_search.py`) — thin wrapper with Docman-specific defaults
- Worker configs for all pipeline stages plus the standalone query worker — using `input_schema_ref`/`output_schema_ref`
- Pipeline configs: `doc_pipeline.yaml` (Docling), `doc_pipeline_local.yaml` (all local), `doc_pipeline_smart.yaml` (MarkItDown-first)
- App manifest (`manifest.yaml`) — declares all configs, the Python package, and required Heddle extras
- Build script (`scripts/build-app.sh`) — generates the deployment ZIP for Heddle Workshop
Not yet done:

- End-to-end test — with NATS, Valkey, and Ollama running locally
- Design a parallel pipeline variant — the current pipeline is inherently sequential, but a variant could run classify and summarize concurrently if the summarizer doesn't need `document_type` (Heddle's pipeline parallelism would auto-detect this from `input_mapping`)
- MCP progress notifications — when Heddle's MCP bridge wires progress callbacks to MCP progress tokens, Docman's pipeline will automatically report per-stage progress to MCP clients
Environment and dependencies:

- Apple Silicon Mac
- Python >=3.11 (per pyproject.toml); 3.13 recommended for compatibility
- MarkItDown >=0.1.0 for fast document extraction (lightweight, no ML)
- Docling >=2.0.0 for deep document extraction (pulls torch, torchvision)
- DuckDB >=1.0.0 for embedded analytics database
- Ollama for local LLM tier (llama3.2:3b recommended)