You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Document pipeline parallelism and horizontal scaling
- CLAUDE.md: Add pipeline execution order explanation, horizontal
scaling note with NATS queue group examples, future work items
for parallel pipeline variant and MCP progress notifications
- docs/ARCHITECTURE.md: Update pipeline stages with execution level
annotations, note dependency-aware parallelism
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3. **doc_summarizer** (LLMWorker) — LLM summarizes based on document type and extracted content. Returns summary, key_points, word_count.
53
53
4. **doc_ingest** (ProcessorWorker + DuckDBIngestBackend) — Persists all pipeline results (metadata, classification, summary, full text) into DuckDB. Reads full extracted text from workspace JSON. Returns document_id.
54
54
55
+
**Pipeline execution order:** Loom's `PipelineOrchestrator` now auto-infers dependencies from `input_mapping` paths and runs independent stages concurrently. Docman's pipeline has genuinely sequential dependencies (classify depends on extract, summarize depends on both, ingest depends on all three), so it produces 4 levels of 1 stage each — identical sequential execution to before. No config changes were needed.
56
+
57
+
**Scaling note:** To process multiple documents concurrently, run multiple pipeline orchestrator instances. NATS queue groups automatically load-balance across replicas:
58
+
```bash
59
+
# Process 3 documents concurrently — each instance handles one goal
- **doc_query** (ProcessorWorker + DuckDBQueryBackend) — Not part of the pipeline. Accepts structured query requests against the DuckDB database. Supports 5 actions: `search` (full-text via DuckDB FTS), `filter` (by document_type, has_tables, page range), `stats` (aggregate counts/averages), `get` (single document by ID), `vector_search` (semantic similarity via embeddings).
@@ -152,6 +162,8 @@ The following items are **implemented and working**:
152
162
153
163
1. **Wire summarizer file_ref resolution** — Add `resolve_file_refs: ["file_ref"]` and `workspace_dir` to `doc_summarizer.yaml` (Loom now supports this natively)
154
164
2. **End-to-end test** — With NATS, Redis, and Ollama running locally
165
+
3. **Design a parallel pipeline variant** — Current pipeline is inherently sequential, but a variant could run classify and summarize concurrently if the summarizer doesn't need `document_type` (Loom's pipeline parallelism would auto-detect this from input_mapping)
166
+
4. **MCP progress notifications** — When Loom's MCP bridge wires progress callbacks to MCP progress tokens, Docman's pipeline would automatically report per-stage progress to MCP clients
0 commit comments