Commit 8584484

hooman and claude committed
Document pipeline parallelism and horizontal scaling
- CLAUDE.md: Add pipeline execution order explanation, horizontal scaling note with NATS queue group examples, future work items for parallel pipeline variant and MCP progress notifications
- docs/ARCHITECTURE.md: Update pipeline stages with execution level annotations, note dependency-aware parallelism

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 408c666 commit 8584484

2 files changed

Lines changed: 22 additions & 3 deletions

CLAUDE.md

Lines changed: 13 additions & 1 deletion
@@ -38,7 +38,7 @@ Docman depends on `loom[duckdb]` as a package. It uses:
 - `loom.contrib.duckdb.DuckDBViewTool` — used directly (no Docman wrapper needed, already generic)
 - `ProcessorWorker` — runs DoclingBackend and DuckDB backends via `loom processor` CLI
 - `LLMWorker` — runs classifier and summarizer via `loom worker` CLI
-- `PipelineOrchestrator` — orchestrates the 4-stage pipeline via `loom pipeline` CLI
+- `PipelineOrchestrator` — orchestrates the 4-stage pipeline via `loom pipeline` CLI (now with dependency-aware parallel stage execution)

 The CLI loads backends by fully qualified class path from worker configs:
 ```yaml
@@ -52,6 +52,16 @@ processing_backend: "docman.backends.docling_backend.DoclingBackend"
 3. **doc_summarizer** (LLMWorker) — LLM summarizes based on document type and extracted content. Returns summary, key_points, word_count.
 4. **doc_ingest** (ProcessorWorker + DuckDBIngestBackend) — Persists all pipeline results (metadata, classification, summary, full text) into DuckDB. Reads full extracted text from workspace JSON. Returns document_id.

+**Pipeline execution order:** Loom's `PipelineOrchestrator` now auto-infers dependencies from `input_mapping` paths and runs independent stages concurrently. Docman's pipeline has genuinely sequential dependencies (classify depends on extract, summarize depends on both, ingest depends on all three), so it produces 4 levels of 1 stage each — identical sequential execution to before. No config changes were needed.
+
+**Scaling note:** To process multiple documents concurrently, run multiple pipeline orchestrator instances. NATS queue groups automatically load-balance across replicas:
+```bash
+# Process 3 documents concurrently — each instance handles one goal
+loom pipeline --config configs/orchestrators/doc_pipeline.yaml &
+loom pipeline --config configs/orchestrators/doc_pipeline.yaml &
+loom pipeline --config configs/orchestrators/doc_pipeline.yaml &
+```
+
 ## Standalone workers

 - **doc_query** (ProcessorWorker + DuckDBQueryBackend) — Not part of the pipeline. Accepts structured query requests against the DuckDB database. Supports 5 actions: `search` (full-text via DuckDB FTS), `filter` (by document_type, has_tables, page range), `stats` (aggregate counts/averages), `get` (single document by ID), `vector_search` (semantic similarity via embeddings).
@@ -152,6 +162,8 @@ The following items are **implemented and working**:

 1. **Wire summarizer file_ref resolution** — Add `resolve_file_refs: ["file_ref"]` and `workspace_dir` to `doc_summarizer.yaml` (Loom now supports this natively)
 2. **End-to-end test** — With NATS, Redis, and Ollama running locally
+3. **Design a parallel pipeline variant** — Current pipeline is inherently sequential, but a variant could run classify and summarize concurrently if the summarizer doesn't need `document_type` (Loom's pipeline parallelism would auto-detect this from input_mapping)
+4. **MCP progress notifications** — When Loom's MCP bridge wires progress callbacks to MCP progress tokens, Docman's pipeline would automatically report per-stage progress to MCP clients

 ## Known issues
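
Reviewer note: the level grouping described in the CLAUDE.md hunks above can be illustrated with a minimal sketch. This is not Loom's implementation. The `infer_levels` helper, the `"<source>.<field>"` path format, the `doc_classifier` stage name, and every field name below are assumptions made for illustration only; the real `input_mapping` syntax may differ.

```python
def infer_levels(stages: dict[str, dict[str, str]]) -> list[list[str]]:
    """Group stages into execution levels from their input_mapping paths.

    Assumption: a mapping path looks like "<source>.<field>...", and a stage
    depends on <source> whenever <source> is the name of another stage.
    """
    deps: dict[str, set[str]] = {}
    for name, mapping in stages.items():
        sources = {path.split(".", 1)[0] for path in mapping.values()}
        deps[name] = {s for s in sources if s in stages and s != name}

    levels: list[list[str]] = []
    done: set[str] = set()
    while len(done) < len(stages):
        # A stage is ready once every stage it reads from has finished.
        ready = [n for n in stages if n not in done and deps[n] <= done]
        if not ready:
            raise ValueError("cyclic dependency between stages")
        levels.append(ready)
        done.update(ready)
    return levels


# Hypothetical mappings mirroring the dependencies stated in the commit.
docman = {
    "doc_extractor": {"file_ref": "goal.file_ref"},
    "doc_classifier": {"text": "doc_extractor.text_preview"},
    "doc_summarizer": {
        "file_ref": "doc_extractor.file_ref",
        "document_type": "doc_classifier.document_type",
    },
    "doc_ingest": {
        "metadata": "doc_extractor.metadata",
        "classification": "doc_classifier.document_type",
        "summary": "doc_summarizer.summary",
    },
}

# Future-work variant: the summarizer no longer reads document_type.
variant = dict(docman, doc_summarizer={"file_ref": "doc_extractor.file_ref"})

sequential_levels = infer_levels(docman)
parallel_levels = infer_levels(variant)
print(sequential_levels)
print(parallel_levels)
```

With these assumed mappings, the current pipeline yields four levels of one stage each, while the variant places classify and summarize on the same level, which is the behaviour future-work item 3 anticipates.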

docs/ARCHITECTURE.md

Lines changed: 9 additions & 2 deletions
@@ -63,12 +63,19 @@ tests/ # 40 unit tests (no infrastructure needed)

 ## Pipeline Stages

-The pipeline processes documents through four sequential stages:
+The pipeline processes documents through four stages. Loom's `PipelineOrchestrator`
+auto-infers dependencies from `input_mapping` paths and runs independent stages
+concurrently. In Docman's case, each stage depends on the previous one, so
+execution remains sequential:

 ```
 PDF/DOCX → [Extract] → [Classify] → [Summarize] → [Ingest] → DuckDB
+            Level 0     Level 1       Level 2     Level 3
 ```

+To process multiple documents concurrently, run multiple pipeline instances —
+NATS queue groups handle load balancing automatically.
+
 ### Stage 1: Extract (`doc_extractor`)

 - **Type:** ProcessorWorker + DoclingBackend
@@ -187,7 +194,7 @@ Docman depends on `loom[duckdb]` as a package and uses these Loom components:
 | `DuckDBViewTool` | Used directly (no wrapper) |
 | `ProcessorWorker` | Runs extraction and ingestion stages |
 | `LLMWorker` | Runs classification and summarization stages |
-| `PipelineOrchestrator` | Orchestrates 4-stage pipeline |
+| `PipelineOrchestrator` | Orchestrates 4-stage pipeline (dependency-aware parallelism) |
 | `WorkspaceManager` | File-ref resolution with path traversal protection |
 | `MCPGateway` | Exposes Docman as MCP server via `configs/mcp/docman.yaml` |
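
Reviewer note: the scaling claim in both files relies on queue-group semantics, where each message published to a subject is delivered to only one member of the group. The sketch below simulates that behaviour with a plain `asyncio.Queue` standing in for NATS, so it runs without a server. It is an illustration of competing consumers, not Loom or NATS client code; the `orchestrator` and `run` functions, the replica names, and the goal IDs are all hypothetical.

```python
import asyncio


async def orchestrator(name: str, goals: asyncio.Queue, handled: list) -> None:
    """One simulated pipeline replica: take a goal, process it, repeat."""
    while True:
        goal = await goals.get()
        await asyncio.sleep(0.01)  # stand-in for running the four stages
        handled.append((name, goal))
        goals.task_done()


async def run(goal_ids: list[str], replicas: int = 3) -> list[tuple[str, str]]:
    goals: asyncio.Queue = asyncio.Queue()
    handled: list[tuple[str, str]] = []
    workers = [
        asyncio.create_task(orchestrator(f"pipeline-{i}", goals, handled))
        for i in range(replicas)
    ]
    for goal_id in goal_ids:
        goals.put_nowait(goal_id)
    await goals.join()  # wait until every goal has been processed
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return handled


handled = asyncio.run(run(["a.pdf", "b.pdf", "c.pdf"]))
print(handled)
```

Each goal is taken by exactly one replica, so adding replicas raises document throughput without changing the per-document stage order.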
