## What this project is
Docman (v0.5.0) is a document processing pipeline built on the Heddle framework. It extracts content from PDF, DOCX, PPTX, XLSX, and HTML files using an adaptive two-tier extraction strategy (MarkItDown for speed, Docling for depth), with LLM-based classification and summarization stages.
This is a **consumer** of the Heddle framework — it provides concrete worker configs, processing backends, and pipeline definitions. The Heddle framework itself lives in a separate repo.
## Project structure
```
src/docman/
  backends/
    markitdown_backend.py  # MarkItDownBackend — fast extraction via Microsoft MarkItDown (no ML)
scripts/
  build-app.sh             # Build deployment ZIP for Heddle Workshop
docs/
  ARCHITECTURE.md          # System architecture overview
  CONTRIBUTING.md          # Contribution standards and CLA
tests/                     # Unit tests (mock backends, in-memory DuckDB, no infrastructure)
```
## Relationship to Heddle
Docman depends on `heddle[duckdb]` as a package. It uses:
- `ProcessingBackend` ABC — DoclingBackend, MarkItDownBackend, SmartExtractorBackend, DuckDBIngestBackend implement this
- `resolve_schema_refs()` — worker configs use `input_schema_ref` / `output_schema_ref` pointing to `docman.contracts.*` Pydantic models (Heddle resolves to JSON Schema at load time)
- `heddle.contrib.duckdb.DuckDBQueryBackend` — DocmanQueryBackend subclasses this with Docman-specific schema defaults
- `heddle.contrib.duckdb.DuckDBVectorTool` — DuckDBVectorTool wraps this with Docman-specific column/table defaults
- `heddle.contrib.duckdb.DuckDBViewTool` — used directly (no Docman wrapper needed; already generic)
- `ProcessorWorker` — runs extraction and DuckDB backends via the `heddle processor` CLI
- `LLMWorker` — runs classifier and summarizer via the `heddle worker` CLI
- `PipelineOrchestrator` — orchestrates the 4-stage pipeline via the `heddle pipeline` CLI (with dependency-aware parallel stage execution)
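As a sketch of this consumer pattern, a Docman backend implementation might look like the following. The real `ProcessingBackend` ABC lives in Heddle and its method names and signatures may differ; the single `process()` method and the stand-in ABC below are assumptions for illustration only.

```python
# Hypothetical sketch of a Docman processing backend. The ProcessingBackend
# class below is a stand-in for Heddle's real ABC, whose interface may differ.
from abc import ABC, abstractmethod


class ProcessingBackend(ABC):
    """Stand-in for Heddle's ProcessingBackend ABC (illustrative only)."""

    @abstractmethod
    def process(self, payload: dict) -> dict: ...


class EchoExtractBackend(ProcessingBackend):
    """Toy backend that echoes the file reference it was given."""

    def process(self, payload: dict) -> dict:
        return {"file_ref": payload["file_ref"], "backend": "echo"}


backend = EchoExtractBackend()
print(backend.process({"file_ref": "doc-0001.pdf"}))
# → {'file_ref': 'doc-0001.pdf', 'backend': 'echo'}
```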
The CLI loads backends by fully qualified class path from worker configs:
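As an illustration of that shape, a worker config might look like the fragment below. Only `input_schema_ref` / `output_schema_ref` and the `docman.backends.markitdown_backend.MarkItDownBackend` class path are confirmed by this document; the other keys are assumptions.

```yaml
# Hypothetical worker config sketch — key names other than the schema refs
# are assumptions, not Heddle's actual config schema.
name: doc_extractor
backend: docman.backends.markitdown_backend.MarkItDownBackend
input_schema_ref: docman.contracts.ExtractInput
output_schema_ref: docman.contracts.ExtractOutput
```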
Docman provides three extraction backends, all producing the same output contract.
3. **doc_summarizer** (LLMWorker) — LLM summarizes based on document type and extracted content. Returns summary, key_points, word_count.
4. **doc_ingest** (ProcessorWorker + DuckDBIngestBackend) — Persists all pipeline results (metadata, classification, summary, full text) into DuckDB. Reads full extracted text from workspace JSON. Returns document_id.
**Pipeline execution order:** Heddle's `PipelineOrchestrator` auto-infers dependencies from `input_mapping` paths and runs independent stages concurrently. Docman's pipeline has genuinely sequential dependencies (classify depends on extract, summarize depends on both, ingest depends on all three), so it produces 4 levels of 1 stage each — sequential execution.
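The leveling described above can be sketched in Python. This is an illustration of the idea, not Heddle's implementation; the stage names and dependencies are taken from the pipeline description in this document.

```python
# Sketch of dependency-aware level inference: stages whose dependencies are
# all satisfied are grouped into one level and can run concurrently.
def execution_levels(deps: dict[str, set[str]]) -> list[list[str]]:
    levels, placed = [], set()
    while len(placed) < len(deps):
        level = sorted(s for s, d in deps.items() if s not in placed and d <= placed)
        if not level:
            raise ValueError("dependency cycle detected")
        levels.append(level)
        placed.update(level)
    return levels


# Docman's pipeline: each stage depends on all previous ones, so every
# level contains exactly one stage — sequential execution.
docman = {
    "extract": set(),
    "classify": {"extract"},
    "summarize": {"extract", "classify"},
    "ingest": {"extract", "classify", "summarize"},
}
print(execution_levels(docman))
# → [['extract'], ['classify'], ['summarize'], ['ingest']]
```

A pipeline with independent stages would instead collapse into fewer levels, with each level's stages dispatched concurrently.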
**Pipeline variants:**
```bash
# Process 3 documents concurrently — each instance handles one goal
```
Worker I/O schemas are defined as Pydantic models in `src/docman/contracts.py`. Worker YAML configs reference them via `input_schema_ref` / `output_schema_ref`, and Heddle's `resolve_schema_refs()` converts them to JSON Schema at load time.
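The contract pattern can be sketched with a small Pydantic model. The model below is hypothetical — the real contracts live in `src/docman/contracts.py` — but the `model_json_schema()` conversion is standard Pydantic v2.

```python
# Illustrative sketch: a Pydantic model converted to JSON Schema, as Heddle's
# resolve_schema_refs() does at load time. ClassifyOutput is a made-up example.
from pydantic import BaseModel


class ClassifyOutput(BaseModel):
    document_type: str
    confidence: float


schema = ClassifyOutput.model_json_schema()
print(sorted(schema["properties"]))  # ['confidence', 'document_type']
```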
- Large data passes via **file references** in a shared workspace directory (`--workspace-dir`)
- Messages carry only file_ref strings, not inline content
- Extraction backends (MarkItDown, Docling) read source file from workspace, write extracted JSON to workspace
- **Summarizer file resolution:** `resolve_file_refs: ["file_ref"]` and `workspace_dir` are set in the summarizer config — Heddle's LLMWorker reads extracted JSON from workspace automatically.
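The file-reference pattern above can be sketched as follows. Paths, file names, and JSON keys here are illustrative, not Docman's actual on-disk layout.

```python
# Minimal sketch of workspace file references: large payloads live in a shared
# workspace directory, and messages carry only a short file_ref string.
import json
import tempfile
from pathlib import Path

workspace = Path(tempfile.mkdtemp())

# "Extraction" writes its large output into the workspace…
extracted = {"text": "full extracted document text", "page_count": 12}
ref = "doc-0001.extracted.json"
(workspace / ref).write_text(json.dumps(extracted))

# …and the message that crosses NATS carries only the reference.
message = {"file_ref": ref}

# A downstream worker resolves the reference back to the content.
resolved = json.loads((workspace / message["file_ref"]).read_text())
print(resolved["page_count"])  # 12
```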
## Docling configuration
Full guide: `docs/docling-setup.md`
## MCP gateway
Docman can be exposed as an MCP (Model Context Protocol) server using Heddle's built-in MCP gateway — zero MCP-specific code needed.
```bash
# Start Docman as an MCP server (requires heddle[mcp] and NATS + workers running)
heddle mcp --config configs/mcp/docman.yaml
```

See Heddle's [Building Workflows](https://github.com/getheddle/heddle/blob/main/docs/building-workflows.md) Part 11 for full MCP gateway documentation.
## Key design rules
- Query results exclude `full_text` column by default to keep NATS messages small; use `get` action for full content
- Vector embeddings use `FLOAT[]` (variable-length) column in DuckDB — use `list_cosine_similarity` (NOT `array_cosine_similarity` which requires fixed-size `FLOAT[N]`)
- Embedding generation is optional — controlled by `embedding` config section in `doc_ingest.yaml`. When absent, embedding column stores NULL
- DuckDBViewTool and DuckDBVectorTool implement Heddle's `SyncToolProvider` for LLM function-calling via `knowledge_silos` config
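For instance, a vector search over ingested documents might look like the query below. The table and column names (`documents`, `document_id`, `embedding`) are assumptions — the actual schema is defined by the DuckDB ingest config.

```sql
-- Hypothetical query: list_cosine_similarity works on variable-length FLOAT[],
-- unlike array_cosine_similarity, which requires fixed-size FLOAT[N].
SELECT document_id,
       list_cosine_similarity(embedding, [0.1, 0.2, 0.3]::FLOAT[]) AS score
FROM documents
WHERE embedding IS NOT NULL
ORDER BY score DESC
LIMIT 5;
```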
## Build and test commands
```bash
# Install all dependencies (requires Python 3.11+, uses uv)
# Heddle is resolved from ../heddle via [tool.uv.sources] in pyproject.toml
uv sync --extra dev

# Pre-download Docling detection models (avoids delay on first run)
uv run docling-tools models download

# Run unit tests (no infrastructure needed)
uv run pytest tests/ -v

# Run with infrastructure (needs NATS + Heddle installed)
# Terminal 1: docker run -p 4222:4222 nats:latest
# Terminal 2: uv run heddle router --nats-url nats://localhost:4222
```
The following items are **implemented and working**:
- Pydantic I/O contracts (`src/docman/contracts.py`) — source of truth for all worker schemas, resolved at load time via Heddle's `resolve_schema_refs()`
- MarkItDownBackend (`src/docman/backends/markitdown_backend.py`) — fast extraction via Microsoft MarkItDown, derives metadata from Markdown output
- App manifest (`manifest.yaml`) — declares all configs, Python package, and required Heddle extras
- Build script (`scripts/build-app.sh`) — generates deployment ZIP for Heddle Workshop
## What to implement next
1. **End-to-end test** — With NATS, Valkey, and Ollama running locally
2. **Design a parallel pipeline variant** — Current pipeline is inherently sequential, but a variant could run classify and summarize concurrently if the summarizer doesn't need `document_type` (Heddle's pipeline parallelism would auto-detect this from input_mapping)
3. **MCP progress notifications** — When Heddle's MCP bridge wires progress callbacks to MCP progress tokens, Docman's pipeline would automatically report per-stage progress to MCP clients