Document pipeline parallelism and horizontal scaling

hooman · claude · hooman · commit 858448461e1f · 2026-03-13T01:20:05.000-07:00
- CLAUDE.md: Add pipeline execution order explanation, horizontal
  scaling note with NATS queue group examples, future work items
  for parallel pipeline variant and MCP progress notifications
- docs/ARCHITECTURE.md: Update pipeline stages with execution level
  annotations, note dependency-aware parallelism
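
Illustration of the execution levels mentioned above. This is a minimal sketch, not Loom's actual implementation: the function name `execution_levels` and the hand-written dependency maps are hypothetical, standing in for whatever `PipelineOrchestrator` infers from `input_mapping`. It groups stages so that each level contains only stages whose dependencies completed in earlier levels.

```python
def execution_levels(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group stages into levels; a stage is ready once all its deps are done."""
    remaining = {stage: set(d) for stage, d in deps.items()}
    done: set[str] = set()
    levels: list[list[str]] = []
    while remaining:
        # Stages whose dependencies have all completed can run concurrently.
        ready = sorted(s for s, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        levels.append(ready)
        done.update(ready)
        for stage in ready:
            del remaining[stage]
    return levels


# Hypothetical dependency map mirroring Docman's current pipeline.
docman = {
    "extract": set(),
    "classify": {"extract"},
    "summarize": {"extract", "classify"},
    "ingest": {"extract", "classify", "summarize"},
}

# Hypothetical variant: summarizer no longer needs the classifier's output.
variant = dict(docman, summarize={"extract"})

print(execution_levels(docman))   # four levels, one stage each
print(execution_levels(variant))  # classify and summarize share a level
```

Under these assumed dependencies the current pipeline yields four single-stage levels, while the variant listed under future work would collapse classify and summarize into one level.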

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -38,7 +38,7 @@ Docman depends on `loom[duckdb]` as a package. It uses:
 - `loom.contrib.duckdb.DuckDBViewTool` — used directly (no Docman wrapper needed, already generic)
 - `ProcessorWorker` — runs DoclingBackend and DuckDB backends via `loom processor` CLI
 - `LLMWorker` — runs classifier and summarizer via `loom worker` CLI
-- `PipelineOrchestrator` — orchestrates the 4-stage pipeline via `loom pipeline` CLI
+- `PipelineOrchestrator` — orchestrates the 4-stage pipeline via `loom pipeline` CLI (now with dependency-aware parallel stage execution)
 
 The CLI loads backends by fully qualified class path from worker configs:
 ```yaml
@@ -52,6 +52,16 @@ processing_backend: "docman.backends.docling_backend.DoclingBackend"
 3. **doc_summarizer** (LLMWorker) — LLM summarizes based on document type and extracted content. Returns summary, key_points, word_count.
 4. **doc_ingest** (ProcessorWorker + DuckDBIngestBackend) — Persists all pipeline results (metadata, classification, summary, full text) into DuckDB. Reads full extracted text from workspace JSON. Returns document_id.
 
+**Pipeline execution order:** Loom's `PipelineOrchestrator` now auto-infers dependencies from `input_mapping` paths and runs independent stages concurrently. Docman's pipeline has genuinely sequential dependencies (classify depends on extract, summarize depends on both, ingest depends on all three), so it produces 4 levels of 1 stage each — identical sequential execution to before. No config changes were needed.
+
+**Scaling note:** To process multiple documents concurrently, run multiple pipeline orchestrator instances. NATS queue groups automatically load-balance across replicas:
+```bash
+# Process 3 documents concurrently: each instance handles one goal at a time
+loom pipeline --config configs/orchestrators/doc_pipeline.yaml &
+loom pipeline --config configs/orchestrators/doc_pipeline.yaml &
+loom pipeline --config configs/orchestrators/doc_pipeline.yaml &
+```
+
 ## Standalone workers
 
 - **doc_query** (ProcessorWorker + DuckDBQueryBackend) — Not part of the pipeline. Accepts structured query requests against the DuckDB database. Supports 5 actions: `search` (full-text via DuckDB FTS), `filter` (by document_type, has_tables, page range), `stats` (aggregate counts/averages), `get` (single document by ID), `vector_search` (semantic similarity via embeddings).
@@ -152,6 +162,8 @@ The following items are **implemented and working**:
 
 1. **Wire summarizer file_ref resolution** — Add `resolve_file_refs: ["file_ref"]` and `workspace_dir` to `doc_summarizer.yaml` (Loom now supports this natively)
 2. **End-to-end test** — With NATS, Redis, and Ollama running locally
+3. **Design a parallel pipeline variant** — The current pipeline is inherently sequential, but a variant could run classify and summarize concurrently if the summarizer doesn't need `document_type` (Loom's pipeline parallelism would auto-detect this from `input_mapping`)
+4. **MCP progress notifications** — When Loom's MCP bridge wires progress callbacks to MCP progress tokens, Docman's pipeline would automatically report per-stage progress to MCP clients
 
 ## Known issues
 
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -63,12 +63,19 @@ tests/                      # 40 unit tests (no infrastructure needed)
 
 ## Pipeline Stages
 
-The pipeline processes documents through four sequential stages:
+The pipeline processes documents through four stages. Loom's `PipelineOrchestrator`
+auto-infers dependencies from `input_mapping` paths and runs independent stages
+concurrently. In Docman's case, each stage depends on the previous one, so
+execution remains sequential:
 
 ```
 PDF/DOCX → [Extract] → [Classify] → [Summarize] → [Ingest] → DuckDB
+           Level 0      Level 1      Level 2       Level 3
 ```
 
+To process multiple documents concurrently, run multiple pipeline instances —
+NATS queue groups handle load balancing automatically.
+
 ### Stage 1: Extract (`doc_extractor`)
 
 - **Type:** ProcessorWorker + DoclingBackend
@@ -187,7 +194,7 @@ Docman depends on `loom[duckdb]` as a package and uses these Loom components:
 | `DuckDBViewTool` | Used directly (no wrapper) |
 | `ProcessorWorker` | Runs extraction and ingestion stages |
 | `LLMWorker` | Runs classification and summarization stages |
-| `PipelineOrchestrator` | Orchestrates 4-stage pipeline |
+| `PipelineOrchestrator` | Orchestrates 4-stage pipeline (dependency-aware parallelism) |
 | `WorkspaceManager` | File-ref resolution with path traversal protection |
 | `MCPGateway` | Exposes Docman as MCP server via `configs/mcp/docman.yaml` |