From 11bc2f0e319a1613e09944bfd9578afde63cb362 Mon Sep 17 00:00:00 2001
From: Vexa <vexa.tski@gmail.com>
Date: Sun, 5 Apr 2026 09:04:45 -0400
Subject: [PATCH 1/2] docs: add .repo-docs with comprehensive repository
 documentation

Generates standardized documentation covering project overview, architecture,
development setup, integration map, conventions, troubleshooting, and
architectural decision records for the data-gen ETL pipeline.
---
 .repo-docs/architecture.md                    | 158 ++++++++++++++++++
 .repo-docs/conventions.md                     |  92 ++++++++++
 .../2026-04-05-just-as-orchestrator.md        |  43 +++++
 .../2026-04-05-markdown-html-roundtrip.md     |  35 ++++
 .../2026-04-05-multiple-output-variants.md    |  48 ++++++
 .../2026-04-05-scc-classification.md          |  40 +++++
 .repo-docs/decisions/README.md                |  79 +++++++++
 .repo-docs/development.md                     | 146 ++++++++++++++++
 .repo-docs/index.md                           | 113 +++++++++++++
 .repo-docs/integration.md                     | 121 ++++++++++++++
 .repo-docs/project.md                         |  90 ++++++++++
 .repo-docs/troubleshooting.md                 |  89 ++++++++++
 12 files changed, 1054 insertions(+)
 create mode 100644 .repo-docs/architecture.md
 create mode 100644 .repo-docs/conventions.md
 create mode 100644 .repo-docs/decisions/2026-04-05-just-as-orchestrator.md
 create mode 100644 .repo-docs/decisions/2026-04-05-markdown-html-roundtrip.md
 create mode 100644 .repo-docs/decisions/2026-04-05-multiple-output-variants.md
 create mode 100644 .repo-docs/decisions/2026-04-05-scc-classification.md
 create mode 100644 .repo-docs/decisions/README.md
 create mode 100644 .repo-docs/development.md
 create mode 100644 .repo-docs/index.md
 create mode 100644 .repo-docs/integration.md
 create mode 100644 .repo-docs/project.md
 create mode 100644 .repo-docs/troubleshooting.md

diff --git a/.repo-docs/architecture.md b/.repo-docs/architecture.md
new file mode 100644
index 0000000..fb489af
--- /dev/null
+++ b/.repo-docs/architecture.md
@@ -0,0 +1,158 @@
+---
+repo: data-gen
+updated: 2026-04-05
+---
+
+# Architecture
+
+## System Overview
+
+data-gen is a multi-stage ETL pipeline orchestrated by `just` (a command runner). Each stage transforms the source material closer to the final structured output.
+
+```
+                         ┌──────────────┐
+                         │  PDF Source   │
+                         └──────┬───────┘
+                                │ (manual, pdf_to_md/)
+                                v
+                    ┌───────────────────────┐
+                    │  Markdown Source Docs  │
+                    │  (input/heroes/*.md)   │
+                    │  (input/monsters/*.md) │
+                    └───────────┬───────────┘
+                                │
+              ┌─────────────────┼─────────────────┐
+              v                 v                  v
+        ┌──────────┐    ┌────────────┐    ┌──────────────┐
+        │  Heroes  │    │  Monsters  │    │  Adventures  │
+        │ Pipeline │    │  Pipeline  │    │  (stub)      │
+        └────┬─────┘    └─────┬──────┘    └──────┬───────┘
+             │                │                   │
+             v                v                   v
+     ┌──────────────────────────────────────────────────┐
+     │              Output data-* repos                 │
+     │  data-rules-md, data-rules-json, data-rules-yaml │
+     │  data-bestiary-md, data-bestiary-json, ...       │
+     │  data-rules-md-dse, data-bestiary-md-dse, ...    │
+     │  data-rules-md-linked, data-md-linked, ...       │
+     └──────────────────────┬───────────────────────────┘
+                            │
+                            v
+                    ┌───────────────┐
+                    │   Unification │
+                    │  (data-md,    │
+                    │   data-md-dse)│
+                    └───────────────┘
+```
+
+## Pipeline Stages (Heroes)
+
+The heroes pipeline is the most complete. Each stage except the metadata step writes its output to a numbered subdirectory under `staging/heroes/` (the numbering skips 4-6):
+
+| Stage | Directory | What happens |
+|-------|-----------|--------------|
+| 0 | `0_features/` | Extract table of contents from source markdown; generate `features.yml` config |
+| 1 | `1_html/` | Convert source markdown to HTML via pandoc |
+| 2 | `2_html_sections/` | Extract individual sections from HTML using XPath (driven by YAML configs) |
+| 3 | `3_md_sections/` | Convert HTML sections back to markdown via pandoc with Lua filters |
+| -- | (metadata) | Generate SCC/SCDC classification, frontmatter, and indexes |
+| 7 | `7_preformatted/` | Assemble all markdown sections; clean up original source doc |
+| 8 | `8_formatted_md/` | Lint/format all markdown with mdformat |
+| 9 | `9_linking_md/` | Split into linked (SCC links applied) and unlinked variants |
+| 10 | `10_conversions/` | Convert markdown to JSON and YAML; generate DSE (`draw-steel-elements`) markdown with embedded YAML; build aggregates |
+
+## Components
+
+| Component | Location | Responsibility | Depends on |
+|-----------|----------|---------------|------------|
+| **Main justfile** | `etl/justfile` | Entry point, shared constants, utility recipes | All modules |
+| **heroes.just** | `etl/heroes.just` | Orchestrates full heroes pipeline | markdown, html, heroes_sections, features, frontmatter, sc_classification, linker, index |
+| **monsters.just** | `etl/monsters.just` | Orchestrates full monsters pipeline | markdown, html, monsters_sections, statblocks, featureblocks, frontmatter, sc_classification |
+| **unified.just** | `etl/unified.just` | Merges per-book repos into unified `data-md` | Output from heroes + monsters + adventures |
+| **adventures.just** | `etl/adventures.just` | Placeholder for adventures pipeline | -- |
+| **markdown.just** | `etl/markdown.just` | MD-to-HTML, TOC extraction, linting, blockquote separation, YAML embedding | pandoc, mdformat |
+| **html.just** | `etl/html.just` | HTML-to-MD conversion with Lua filter | pandoc |
+| **features.just** | `etl/features.just` | Feature metadata generation, conversion to JSON/YAML, DSE embedding, indexes | sc-convert (npm) |
+| **statblocks.just** | `etl/statblocks.just` | Statblock MD-to-JSON/YAML conversion | sc-convert (npm) |
+| **featureblocks.just** | `etl/featureblocks.just` | Featureblock conversion (malice, dynamic terrain) | sc-convert (npm) |
+| **sc_classification.just** | `etl/sc_classification.just` | SCC/SCDC classification assignment (Python) | frontmatter in MD files |
+| **frontmatter.just** | `etl/frontmatter.just` | Frontmatter processing: cleanup, sorting (Python) | PyYAML |
+| **linker.just** | `etl/linker.just` | Apply/remove SCC cross-reference links (Python) | scc_to_path.json |
+| **index.just** | `etl/index.just` | Generate markdown index tables from directory contents | yq |
+| **aggregate.just** | `etl/aggregate.just` | Build single-file aggregates of all features/statblocks | -- |
+| **section_config.just** | `etl/section_config.just` | Expand section config YAMLs with XPath data | -- |
+| **extract_html_sections.just** | `etl/extract_html_sections.just` | Extract HTML sections using XPath | -- |
+| **heroes_sections.just** | `etl/heroes_sections.just` | Heroes-specific section extraction config | section_config, extract_html_sections |
+| **monsters_sections.just** | `etl/monsters_sections.just` | Monsters-specific section extraction | section_config, extract_html_sections |
+| **heroes_frontmatter.just** | `etl/heroes_frontmatter.just` | Heroes-specific frontmatter enrichment | -- |
+| **monsters_frontmatter.just** | `etl/monsters_frontmatter.just` | Monsters-specific frontmatter enrichment | -- |
+| **features_config.just** | `etl/features_config.just` | Generate features.yml from TOC | -- |
+| **link_md/** | `etl/link_md/` | Obsidian-style auto-linker (legacy, Python) | inflect |
+| **pre_automation_tools/** | `etl/pre_automation_tools/` | Legacy one-time preprocessing scripts | -- |
+| **pdf_to_md/** | `pdf_to_md/` | PDF-to-markdown conversion | marker, OpenAI API |
+
+## Data Flow
+
+Primary input-to-output path (heroes):
+
+1. **Input**: `input/heroes/Draw Steel Heroes.md` (cleaned markdown of the full Heroes book)
+2. **Config**: `input/heroes/*.yml` (YAML files defining how to extract each section)
+3. **HTML conversion**: Pandoc converts the source markdown to a single HTML file
+4. **Section extraction**: XPath queries (derived from YAML configs) extract individual sections as HTML files
+5. **MD conversion**: Pandoc converts each HTML section back to markdown (with Lua filters to fix links)
+6. **Metadata**: Python scripts assign SCC classifications, generate frontmatter, create index files
+7. **Formatting**: mdformat lints all markdown; blockquotes are separated
+8. **Linking**: SCC-based cross-references are applied (linked variant) or stripped (unlinked variant)
+9. **Conversion**: `sc-convert` transforms markdown to JSON and YAML; the YAML is embedded in the markdown for the DSE variant
+10. **Output**: Results are copied to sibling `data-*` repos
+
+## Key Design Decisions
+
+| Decision | Rationale |
+|----------|-----------|
+| Markdown-first pipeline | Source material is authored in markdown; preserving it as the canonical intermediate format keeps the pipeline inspectable |
+| Round-trip MD -> HTML -> MD | XPath section extraction is much easier on HTML than markdown; the HTML intermediate is a pragmatic choice |
+| Per-section files | Each ability, class, ancestry, etc. gets its own file for granular access and independent versioning |
+| Multiple output variants | Different consumers need different formats: plain MD for the website, JSON/YAML for APIs, DSE for web components, linked/unlinked for different rendering contexts |
+| `just` as orchestrator | Declarative recipe dependencies, file-level modularity, and embedded Python/Bash scripts in a single toolchain |
+| SCC classification system | A universal, hierarchical identifier that works across all content types and supports both human-readable (string) and machine-efficient (decimal) forms |
+
+## Dependencies
+
+| Dependency | Version | Why |
+|------------|---------|-----|
+| bash | latest (via devbox) | Script execution |
+| python | latest (via devbox) | Frontmatter processing, classification, linking |
+| just | latest (via devbox) | Task runner / pipeline orchestration |
+| jq | latest (via devbox) | JSON processing |
+| pandoc | latest (via devbox) | Markdown/HTML conversion |
+| html-tidy | latest (via devbox) | HTML cleanup |
+| rsync | latest (via devbox) | Directory synchronization (unified pipeline) |
+| yq-go | latest (via devbox) | YAML/frontmatter extraction |
+| iconv | latest (via devbox) | Character encoding normalization |
+| go | latest (via devbox) | Required for go-based tools |
+| perl | via devbox | Text processing in some ETL steps |
+| figlet | latest (via devbox) | Decorative section headers in pipeline output |
+| nodejs | latest (via devbox) | Required for `steel-compendium-sdk` / `sc-convert` CLI |
+| python-frontmatter | pip | Frontmatter parsing/writing |
+| PyYAML | pip | YAML processing |
+| mdformat + plugins | pip | Markdown formatting |
+| lxml | pip | XML/HTML processing |
+| inflect | pip | Singular/plural form generation (auto-linker) |
+| rapidfuzz | pip | Fuzzy matching |
+| python-slugify | pip | Slug generation |
+| steel-compendium-sdk | npm | `sc-convert` CLI for feature/statblock conversion |
+
+## Extension Points
+
+- **New book**: Add a new `<book>.just` module alongside `heroes.just` and `monsters.just`. Wire it into the main `justfile`'s `gen` recipe and `unified.just`.
+- **New section type**: Add a new YAML config in `input/<book>/` and register it in the relevant `*_sections.just` module.
+- **New output format**: Add conversion recipes in the relevant pipeline module, using `sc-convert` or custom scripts.
+- **New frontmatter fields**: Extend `*_frontmatter.just` modules with additional metadata generation.
+
+## Constraints
+
+- The pipeline must run inside a devbox shell (or equivalent environment with all dependencies).
+- The `staging/` directory is ephemeral and wiped at the start of each `gen` run.
+- Output repos (`data-*`) must exist as sibling directories in the workspace.
+- The `sc-convert` CLI must be installed via npm within the devbox environment.
diff --git a/.repo-docs/conventions.md b/.repo-docs/conventions.md
new file mode 100644
index 0000000..8dcb814
--- /dev/null
+++ b/.repo-docs/conventions.md
@@ -0,0 +1,92 @@
+---
+repo: data-gen
+updated: 2026-04-05
+---
+
+# Conventions
+
+## File and Directory Naming
+
+- **Justfile modules**: `snake_case.just` (e.g., `heroes_sections.just`, `sc_classification.just`)
+- **Input YAML configs**: `snake_case.yml` (e.g., `abilities.yml`, `feature_fixes.yml`)
+- **Source documents**: Title Case with spaces (e.g., `Draw Steel Heroes.md`, `Draw Steel Monsters.md`)
+- **Output directories**: Title Case (e.g., `Features/`, `Abilities/`, `Chapters/`, `Statblocks/`)
+- **Generated markdown files**: Title Case with spaces (matching the game content name)
+- **Staging directories**: Numbered prefixes for pipeline order (e.g., `0_features/`, `1_html/`, `8_formatted_md/`)
+
+## Code Style
+
+### Justfile recipes
+
+- Module declaration at top: `# Justfile module expected to be named "<name>"`
+- Constants section, then public recipes, then private recipes -- each separated by `#` banner comments
+- Private recipes prefixed with `_`
+- All Bash scripts use `#!/usr/bin/env bash` with `set -euo pipefail`
+- Python scripts are embedded directly in justfile recipes using `#!/usr/bin/env python3`
+- `BASH_ENV` and shell settings declared per module as needed
+
+### Python (embedded)
+
+- Python scripts are embedded in `.just` files, not standalone `.py` files (exception: `link_md/obs-auto-linker.py`)
+- Uses `pathlib.Path` for file operations
+- Uses `python-frontmatter` library for markdown frontmatter parsing
+- Uses `yaml.safe_load` / `yaml.safe_dump` for YAML processing
+
+### Variable naming (justfile)
+
+- Directory paths: `*_dpath` suffix (e.g., `heroes_staging_dpath`, `input_dpath`)
+- File paths: `*_fpath` suffix (e.g., `html_fpath`, `scc_to_path_json_fpath`)
+- Log prefixes: `log_prefix` variable for module identification
+
+## Commit Messages
+
+Observed pattern: third-person present-tense or gerund descriptions, without Conventional Commits prefixes.
+
+Examples from history:
+- `Corrects Malice ability on Ogre Jug and adds noncombatant`
+- `Adds aggregate files for json and yaml`
+- `Correcting a lot of issues with type uniqueness`
+- `Cleanup from errata`
+- `Bugfixes!`
+
+Style: informal, content-focused. Describes what changed in game-data terms rather than code terms.
+
+## Markdown Header Levels
+
+The source documents use header levels beyond the six (H1-H6) that standard markdown defines:
+
+| Level | Meaning |
+|-------|---------|
+| H1-H6 | Standard chapter/section hierarchy |
+| H7 (7 `#`) | Statblocks (converted to bold+span in output) |
+| H8 (8 `#`) | Abilities/features (converted to bold+span in output) |
+| H9 (9 `#`) | Featureblocks: malice, dynamic terrain (converted to bold+span in output) |
+
+## Justfile Module Pattern
+
+Every module follows this structure:
+
+```just
+# Justfile module expected to be named "<module_name>"
+
+##################################################
+# Constants and env vars
+##################################################
+
+# ... module-specific constants
+
+##################################################
+# Public Recipes
+##################################################
+
+export BASH_ENV := ".utils/.utilsrc"
+set shell := ["bash", "-c"]
+
+# public recipes here
+
+##################################################
+# Private Recipes
+##################################################
+
+# private recipes here (prefixed with _)
+```
diff --git a/.repo-docs/decisions/2026-04-05-just-as-orchestrator.md b/.repo-docs/decisions/2026-04-05-just-as-orchestrator.md
new file mode 100644
index 0000000..8321c41
--- /dev/null
+++ b/.repo-docs/decisions/2026-04-05-just-as-orchestrator.md
@@ -0,0 +1,43 @@
+---
+status: accepted
+date: 2026-04-05
+---
+
+# Just as pipeline orchestrator
+
+## Context
+
+The ETL pipeline involves dozens of interdependent steps across multiple tools (pandoc, Python scripts, shell utilities, npm CLIs). A task runner is needed to orchestrate these steps.
+
+## Options Considered
+
+### Option A: Makefile
+- Pros: Universal, well-known, file-based dependency tracking
+- Cons: Arcane syntax, poor string handling, hard to embed multi-line scripts
+
+### Option B: Shell scripts
+- Pros: No extra tool needed
+- Cons: No dependency management, poor modularity, hard to compose
+
+### Option C: Just
+- Pros: Clean syntax, module system (`mod`), inline Python/Bash scripts, parameter passing, built-in functions (`titlecase`, `file_stem`)
+- Cons: Less common than Make, no file-based dependency tracking
+
+### Option D: Python/Invoke
+- Pros: Full programming language, good for complex logic
+- Cons: Overhead for simple shell commands, requires Python environment for orchestration layer
+
+## Decision
+
+Option C: Just. The module system enables splitting the pipeline into focused, composable files. Inline multi-line scripts (both Bash and Python) keep logic close to where it's used. The clean syntax reduces boilerplate compared to Make.
+
+## Consequences
+
+- Each pipeline component is a `.just` module imported into the main justfile
+- Python logic is embedded directly in justfile recipes rather than standalone scripts
+- No file-based dependency tracking -- the pipeline always runs all stages
+- `devbox` provides the `just` binary
+
+## Outcome
+
+The modular justfile structure has scaled well to 25+ modules. Embedding Python scripts in recipes works but can make individual recipes long. The lack of file-based dependency tracking means every run is a full rebuild, which takes a few minutes but is acceptable for the current scale.
diff --git a/.repo-docs/decisions/2026-04-05-markdown-html-roundtrip.md b/.repo-docs/decisions/2026-04-05-markdown-html-roundtrip.md
new file mode 100644
index 0000000..47eb645
--- /dev/null
+++ b/.repo-docs/decisions/2026-04-05-markdown-html-roundtrip.md
@@ -0,0 +1,35 @@
+---
+status: accepted
+date: 2026-04-05
+---
+
+# Markdown-first pipeline with HTML round-trip
+
+## Context
+
+The source material is a TTRPG rulebook provided as a single large markdown file. The pipeline needs to split this into hundreds of individual files (one per ability, class, ancestry, etc.) while preserving formatting and hierarchy.
+
+## Options Considered
+
+### Option A: Parse markdown directly with regex/AST
+- Pros: No intermediate format, simpler toolchain
+- Cons: Markdown parsing is fragile for deep nesting (H7-H9 headers); extracting arbitrary sections by path is hard without a DOM
+
+### Option B: Convert to HTML, extract via XPath, convert back
+- Pros: XPath provides reliable, configurable section extraction; HTML is a well-defined DOM; pandoc handles both conversions
+- Cons: Round-trip conversion can introduce formatting artifacts; more pipeline stages
+
+## Decision
+
+Option B: round-trip through HTML. Pandoc's markdown-to-HTML and HTML-to-markdown conversions are robust enough, and XPath extraction is far more reliable than regex-based markdown splitting. A Lua filter fixes link extensions during the HTML-to-markdown conversion.
+
+## Consequences
+
+- Pipeline has more stages (MD -> HTML -> section HTML -> section MD), making debugging harder
+- HTML Tidy is required to clean up pandoc's HTML output
+- Character encoding and entity handling require explicit normalization at multiple stages
+- Section extraction is highly configurable via YAML config files with XPath expressions
+
+## Outcome
+
+This approach has worked reliably for the Heroes book (400+ pages, hundreds of abilities). The XPath-based extraction handles the complex nesting of Draw Steel's content hierarchy well. The main pain point is HTML entity normalization (smart quotes, special characters).
diff --git a/.repo-docs/decisions/2026-04-05-multiple-output-variants.md b/.repo-docs/decisions/2026-04-05-multiple-output-variants.md
new file mode 100644
index 0000000..28e0e1c
--- /dev/null
+++ b/.repo-docs/decisions/2026-04-05-multiple-output-variants.md
@@ -0,0 +1,48 @@
+---
+status: accepted
+date: 2026-04-05
+---
+
+# Multiple output variants
+
+## Context
+
+Different downstream consumers need the same content in different formats and with different features:
+
+- The compendium website needs plain markdown with cross-reference links
+- The `draw-steel-elements` web components need markdown with embedded YAML
+- API consumers need structured JSON/YAML
+- Some tools want cross-reference links; others don't
+
+## Options Considered
+
+### Option A: Single canonical format, consumers transform
+- Pros: One output to maintain
+- Cons: Pushes complexity to every consumer; consumers may not have the tooling
+
+### Option B: Multiple output variants, all generated by this pipeline
+- Pros: Each consumer gets exactly what it needs; transformation logic is centralized
+- Cons: More repos to maintain, more pipeline complexity, larger total output
+
+## Decision
+
+Option B: generate all variants centrally. The pipeline produces:
+
+- **Plain markdown** (unlinked): `data-rules-md`, `data-bestiary-md`
+- **Linked markdown**: `data-rules-md-linked` (TODO: linked variant of `data-bestiary-md`)
+- **DSE markdown** (embedded YAML, unlinked): `data-rules-md-dse`, `data-bestiary-md-dse`
+- **DSE markdown** (linked): `data-rules-md-dse-linked`
+- **JSON**: `data-rules-json`, `data-bestiary-json`
+- **YAML**: `data-rules-yaml`, `data-bestiary-yaml`
+- **Unified**: `data-md`, `data-md-linked`, `data-md-dse`, `data-md-dse-linked`
+
+## Consequences
+
+- 14+ downstream repos are generated from a single pipeline run
+- Adding a new variant requires adding conversion steps and a new target repo
+- The `_copy_data_to_repo` utility handles wiping and populating target repos
+- Unified repos (`data-md*`) aggregate content from all book-specific repos
+
+## Outcome
+
+The approach works but has created a large number of repos. The unified variants (`data-md*`) were added later to simplify consumption for downstream tools that want all content in one place.
diff --git a/.repo-docs/decisions/2026-04-05-scc-classification.md b/.repo-docs/decisions/2026-04-05-scc-classification.md
new file mode 100644
index 0000000..df9cca3
--- /dev/null
+++ b/.repo-docs/decisions/2026-04-05-scc-classification.md
@@ -0,0 +1,40 @@
+---
+status: accepted
+date: 2026-04-05
+---
+
+# SCC classification system
+
+## Context
+
+Content from the Draw Steel rulebooks needs a universal, stable identifier that works across different output formats, supports cross-referencing between items, and is both human-readable and machine-processable.
+
+## Options Considered
+
+### Option A: File path as identifier
+- Pros: Simple, no extra tooling
+- Cons: Coupled to directory structure, not portable across repos, no semantic meaning
+
+### Option B: UUIDs
+- Pros: Globally unique, no coordination needed
+- Cons: Not human-readable, no hierarchy information, can't derive from content
+
+### Option C: Hierarchical classification (SCC)
+- Pros: Human-readable string form (`mcdm.heroes.v1:abilities.fury:gouge`), compact decimal form (`1.1.1:2.4:28`), encodes source, type, and item; supports multiple classifications per item
+- Cons: Requires a registration/state system, more complex to implement
+
+## Decision
+
+Option C: Steel Compendium Classification (SCC). The three-component `source:type:item` structure maps naturally to the TTRPG content hierarchy. The dual string/decimal representation serves different use cases (human vs. machine).
+
+## Consequences
+
+- A state file (`classification.json`) tracks the type/source tree and assigned IDs
+- Each markdown file gets `scc` and `scdc` frontmatter fields
+- Cross-references can use `scc:` protocol links (e.g., `[Gouge](scc:mcdm.heroes.v1:abilities.fury:gouge)`)
+- One item can have multiple SCC codes (e.g., an ability classified by class, by level, and globally)
+- Decimal codes are positional and depend on registration order
+
+## Outcome
+
+Works well for cross-referencing and downstream consumption. The main challenge is that `classification.json` is regenerated on each run, so decimal codes can shift if source material changes. The string form is stable as long as item names don't change.
diff --git a/.repo-docs/decisions/README.md b/.repo-docs/decisions/README.md
new file mode 100644
index 0000000..930bb74
--- /dev/null
+++ b/.repo-docs/decisions/README.md
@@ -0,0 +1,79 @@
+---
+repo: data-gen
+updated: 2026-04-05
+---
+
+# Decision Log
+
+A running record of architectural and design decisions made in this project.
+
+## Why log decisions?
+
+Context doesn't survive in memory. Logging decisions prevents relitigating past choices and helps new contributors understand why things are the way they are.
+
+## What to log
+
+Every decision: library choices, format changes, rejected approaches, conventions, reverted experiments. Small decisions compound -- logging them builds a navigable history.
+
+## How to create a record
+
+1. Copy the template below
+2. Name the file `YYYY-MM-DD-short-description.md`
+3. Fill in all sections
+4. Set status to `proposed`, `accepted`, `tried`, `superseded`, or `deprecated`
+
+### Template
+
+```markdown
+---
+status: proposed
+date: YYYY-MM-DD
+---
+
+# Title
+
+## Context
+
+Why was this decision needed?
+
+## Options Considered
+
+### Option A
+- Pros: ...
+- Cons: ...
+
+### Option B
+- Pros: ...
+- Cons: ...
+
+## Decision
+
+What was chosen and why.
+
+## Consequences
+
+Positive outcomes and accepted tradeoffs.
+
+## Outcome
+
+Leave blank until there is real experience to report.
+```
+
+### Status definitions
+
+| Status | Meaning |
+|--------|---------|
+| `proposed` | Under consideration, not yet implemented |
+| `accepted` | Agreed upon and implemented |
+| `tried` | Implemented but didn't work; reverted or abandoned |
+| `superseded` | Replaced by a newer decision |
+| `deprecated` | Still in place but scheduled for removal |
+
+## Index
+
+| Date | Decision | Status |
+|------|----------|--------|
+| 2026-04-05 | [Markdown-first pipeline with HTML round-trip](2026-04-05-markdown-html-roundtrip.md) | accepted |
+| 2026-04-05 | [Just as pipeline orchestrator](2026-04-05-just-as-orchestrator.md) | accepted |
+| 2026-04-05 | [SCC classification system](2026-04-05-scc-classification.md) | accepted |
+| 2026-04-05 | [Multiple output variants](2026-04-05-multiple-output-variants.md) | accepted |
diff --git a/.repo-docs/development.md b/.repo-docs/development.md
new file mode 100644
index 0000000..34cd79c
--- /dev/null
+++ b/.repo-docs/development.md
@@ -0,0 +1,146 @@
+---
+repo: data-gen
+updated: 2026-04-05
+---
+
+# Development
+
+## Prerequisites
+
+| Tool | Version | How to get |
+|------|---------|-----------|
+| devbox | any recent | [jetify.com/devbox](https://www.jetify.com/devbox) |
+| git | any recent | System package manager |
+
+All other dependencies (bash, python, just, pandoc, node, etc.) are installed automatically by devbox.
+
+## Setup
+
+1. Clone the repo and its sibling data repos:
+   ```bash
+   # From the workspace root (steel_compendium/workspace/)
+   just clone-all
+   ```
+
+2. Enter the devbox environment:
+   ```bash
+   cd data-gen/etl
+   devbox shell
+   ```
+   This installs all system packages, creates a Python venv, installs pip packages, installs `steel-compendium-sdk` via npm, and installs Go tools.
+
+3. Ensure source documents exist:
+   - `input/heroes/Draw Steel Heroes.md` -- the cleaned-up Heroes book markdown
+   - `input/monsters/Draw Steel Monsters.md` -- the cleaned-up Monsters book markdown
+   - `input/heroes/*.yml` -- section config files (already committed)
+
+4. Run the full pipeline:
+   ```bash
+   devbox run gen
+   # or directly:
+   just gen
+   ```
+
+## Required Environment Variables
+
+| Variable | Required by | Description |
+|----------|------------|-------------|
+| `OPEN_AI_KEY` | `pdf_to_md/` only | OpenAI API key for LLM-assisted PDF conversion |
+
+The main ETL pipeline (`etl/`) does not require any environment variables.
+
+## Common Workflows
+
+### Generate all output
+
+```bash
+cd etl && devbox shell
+just gen
+```
+
+This runs: wipe staging -> heroes pipeline -> monsters pipeline -> adventures (stub) -> unify.
+
+### Generate heroes only
+
+```bash
+just gen_heroes
+```
+
+### Generate monsters only
+
+```bash
+just gen_monsters
+```
+
+### Convert a PDF to markdown
+
+```bash
+cd pdf_to_md && devbox shell
+# Requires OPEN_AI_KEY env var
+just convert_heroes "Draw Steel Heroes.pdf"
+just convert_monsters "Draw_Steel_Monsters_v1.pdf"
+```
+
+### Switch all data repos to a branch
+
+```bash
+just switch_repos_to develop
+```
+
+### Push generated data to repos
+
+After `just gen`, the output is already copied to the sibling `data-*` repos. Commit and push each one manually (the path below is relative to the `data-gen` repo root):
+
+```bash
+cd ../data-rules-md && git add -A && git commit -m "Update from data-gen" && git push
+# Repeat for each data-* repo
+```
+
+### Convert a single feature/statblock
+
+```bash
+# Feature markdown to JSON
+just features convert "path/to/feature.md" json
+
+# Statblock markdown to YAML
+just statblocks convert "path/to/statblock.md" yaml
+```
+
+### Lint markdown files
+
+```bash
+just markdown lint "path/to/directory"
+```
+
+## Testing
+
+There is no automated test suite. Validation is manual:
+
+1. Run `just gen` and inspect the `staging/` directory for intermediate output.
+2. Check the `data-*` repos for final output.
+3. Verify the compendium website renders correctly after updating.
+
+## Debugging
+
+### Pipeline output
+
+Each stage writes to a numbered subdirectory in `staging/`. To inspect a specific stage:
+
+```bash
+ls staging/heroes/
+# 0_features/  1_html/  2_html_sections/  3_md_sections/  7_preformatted/  8_formatted_md/  9_linking_md/  10_conversions/
+```
+
+### Verbose just output
+
+`just` recipes use `set -euo pipefail` and print section headers. Check stderr for progress messages.
+
+### Frontmatter inspection
+
+```bash
+yq e --front-matter=extract '.' path/to/file.md
+```
+
+### Classification state
+
+The SCC classification tree state is stored in `input/classification.json`. Inspect it to understand current type/source assignments.
diff --git a/.repo-docs/index.md b/.repo-docs/index.md
new file mode 100644
index 0000000..58df9e8
--- /dev/null
+++ b/.repo-docs/index.md
@@ -0,0 +1,113 @@
+---
+repo: data-gen
+type: tool
+status: active
+tech:
+  - just (task runner)
+  - bash (ETL scripts)
+  - python (frontmatter, classification, linking)
+  - pandoc (markdown/html conversion)
+  - devbox (environment management)
+updated: 2026-04-05
+---
+
+# data-gen
+
+ETL pipeline that converts Draw Steel TTRPG source documents (markdown) into structured, multi-format output distributed across multiple `data-*` repos. Part of the Steel Compendium project.
+
+**This repo is not:** a data repo itself, a web application, or a content authoring tool. It processes existing markdown source documents and outputs to sibling `data-*` repos.
+
+## Quick Reference
+
+| Action | Command |
+|--------|---------|
+| Enter dev environment | `devbox shell` (from `etl/` directory) |
+| Generate all outputs | `devbox run gen` or `just gen` (from `etl/`) |
+| Generate heroes only | `devbox run gen_heroes` or `just gen_heroes` |
+| Generate monsters only | `just gen_monsters` |
+| Convert PDF to markdown | `just convert_heroes <pdf>` (from `pdf_to_md/`, requires `OPEN_AI_KEY`) |
+| Switch data repo branches | `just switch_repos_to <branch>` |
+
+| Resource | URL |
+|----------|-----|
+| Repository | https://github.com/SteelCompendium/data-gen |
+| Bug reports | [Google Form](https://docs.google.com/forms/d/e/1FAIpQLSc6m-pZ0NLt2EArE-Tcxr-XbAPMyhu40ANHJKtyRvvwBd2LSw/viewform) |
+| Issue tracker | https://github.com/SteelCompendium/data-gen/issues |
+
+## Repo Structure
+
+```
+data-gen/
+  input/                    # Source material and config
+    heroes/                 # Heroes book markdown + section config YAMLs
+      Draw Steel Heroes.md  # Primary source document
+      abilities.yml         # Section extraction config
+      classes.yml           # Section extraction config
+      ...                   # Other section configs
+    monsters/               # Monsters book markdown + section configs
+      Draw Steel Monsters.md
+      monsters.yml
+    classification.json     # SCC type/source tree state
+  etl/                      # ETL pipeline (justfiles + scripts)
+    justfile                # Main entry point
+    heroes.just             # Heroes book pipeline
+    monsters.just           # Monsters book pipeline
+    adventures.just         # Adventures pipeline (stub)
+    unified.just            # Unification across books
+    html.just               # HTML conversion utilities
+    markdown.just           # Markdown conversion/linting
+    features.just           # Feature extraction and conversion
+    statblocks.just         # Statblock conversion
+    sc_classification.just  # SCC classification generation
+    frontmatter.just        # Frontmatter processing
+    linker.just             # SCC link application/removal
+    index.just              # Index file generation
+    aggregate.just          # Aggregate data file generation
+    link_md/                # Obsidian-style auto-linker (Python)
+    pre_automation_tools/   # Legacy pre-processing helpers
+    devbox.json             # Dev environment packages
+  pdf_to_md/                # PDF-to-markdown converter (uses marker + OpenAI)
+    justfile
+    devbox.json
+  staging/                  # Generated intermediate files (gitignored)
+```
+
+## Reading Guide by Role
+
+### Human Roles
+
+| Role | Start here | Then read |
+|------|-----------|-----------|
+| **New to this repo** | This file | [project.md](project.md) |
+| **Developer** | [development.md](development.md) | [architecture.md](architecture.md), [conventions.md](conventions.md) |
+| **Architect** | [architecture.md](architecture.md) | [integration.md](integration.md), [decisions/](decisions/) |
+| **DevOps / SRE** | [development.md](development.md) | [integration.md](integration.md) |
+
+### Agent Roles
+
+| Agent Role | Start here | Then read |
+|------------|-----------|-----------|
+| **Code review** | [conventions.md](conventions.md) | [architecture.md](architecture.md) |
+| **Bug fix / debug** | [troubleshooting.md](troubleshooting.md) | [development.md](development.md), [architecture.md](architecture.md) |
+| **Feature implementation** | [architecture.md](architecture.md) | [conventions.md](conventions.md), [development.md](development.md), [decisions/](decisions/) |
+| **Documentation** | This file | [project.md](project.md), [architecture.md](architecture.md) |
+| **Onboarding / Q&A** | This file | [project.md](project.md), [development.md](development.md) |
+
+## Current Status
+
+- **Health:** Active development
+- **Last significant change:** Aggregate files for JSON/YAML, SCC link application, monster book parsing
+- **Known blockers:** Adventures pipeline not yet implemented; monster book linking not yet wired up
+
+## Documents in This Directory
+
+| File | Description |
+|------|-------------|
+| [index.md](index.md) | This file -- overview, quick reference, structure |
+| [project.md](project.md) | Domain context, glossary, feature inventory |
+| [architecture.md](architecture.md) | Pipeline stages, components, data flow |
+| [development.md](development.md) | Setup, prerequisites, workflows |
+| [integration.md](integration.md) | Upstream/downstream repos, data contracts |
+| [conventions.md](conventions.md) | Naming, commit style, code patterns |
+| [troubleshooting.md](troubleshooting.md) | Known issues, common errors |
+| [decisions/](decisions/) | Architectural decision records |
diff --git a/.repo-docs/integration.md b/.repo-docs/integration.md
new file mode 100644
index 0000000..b317090
--- /dev/null
+++ b/.repo-docs/integration.md
@@ -0,0 +1,121 @@
+---
+repo: data-gen
+updated: 2026-04-05
+---
+
+# Integration
+
+## Dependency Map
+
+```
+  PDF Source Docs          Section Config YAMLs
+       |                         |
+       v                         v
+  ┌─────────┐              ┌──────────┐
+  │pdf_to_md│              │  input/  │
+  └────┬────┘              └────┬─────┘
+       |                        |
+       v                        v
+  input/*.md ──────────> [ data-gen ETL ] <── steel-compendium-sdk (npm)
+                                |
+                 ┌──────────────┼──────────────────┐
+                 v              v                   v
+          data-rules-*    data-bestiary-*    data-adventures-md
+                 |              |                   |
+                 v              v                   v
+              data-md    data-md-linked    data-md-dse    data-md-dse-linked
+                 |
+                 v
+          compendium website (steelCompendium.io)
+          draw-steel-elements (web components)
+          third-party tools
+```
+
+## Upstream Dependencies
+
+| Dependency | Type | Description |
+|------------|------|-------------|
+| Draw Steel rulebook PDFs | Manual input | Source material, converted to markdown via `pdf_to_md/` |
+| `input/heroes/Draw Steel Heroes.md` | File | Cleaned-up Heroes book markdown (manually curated) |
+| `input/monsters/Draw Steel Monsters.md` | File | Cleaned-up Monsters book markdown |
+| `input/heroes/*.yml` | Config files | Section extraction configs defining how to split the book |
+| `input/classification.json` | State file | SCC type/source tree (regenerated on each run) |
+| `steel-compendium-sdk` | npm package | Provides `sc-convert` CLI for markdown-to-JSON/YAML conversion |
+
+## Downstream Dependents
+
+| Repo | Format | Content |
+|------|--------|---------|
+| `data-rules-md` | Markdown | Heroes book sections (unlinked) |
+| `data-rules-md-linked` | Markdown | Heroes book sections (with SCC links) |
+| `data-rules-md-dse` | Markdown + YAML | Heroes sections with embedded YAML for DSE web components (unlinked) |
+| `data-rules-md-dse-linked` | Markdown + YAML | Heroes sections with embedded YAML for DSE web components (linked) |
+| `data-rules-json` | JSON | Heroes features and abilities as JSON |
+| `data-rules-yaml` | YAML | Heroes features and abilities as YAML |
+| `data-bestiary-md` | Markdown | Monsters book sections (unlinked) |
+| `data-bestiary-md-dse` | Markdown + YAML | Monsters sections with embedded YAML for DSE |
+| `data-bestiary-json` | JSON | Monster statblocks, featureblocks as JSON |
+| `data-bestiary-yaml` | YAML | Monster statblocks, featureblocks as YAML |
+| `data-adventures-md` | Markdown | Adventures content (stub) |
+| `data-md` | Markdown | Unified: all books combined (unlinked) |
+| `data-md-linked` | Markdown | Unified: all books combined (linked) |
+| `data-md-dse` | Markdown + YAML | Unified: all books with DSE (unlinked) |
+| `data-md-dse-linked` | Markdown + YAML | Unified: all books with DSE (linked) |
+
+## API Surface
+
+data-gen does not expose an API. It is a batch pipeline that writes files to sibling repositories.
+
+**Output formats:**
+- Markdown files with YAML frontmatter (metadata: title, type, source, SCC, SCDC, etc.)
+- JSON files (structured feature/statblock data)
+- YAML files (structured feature/statblock data)
+- Aggregate files (`features.json`, `statblocks.json`, etc.) containing all items in a single file
+
+## Data Contracts
+
+### Frontmatter schema (markdown output)
+
+Every generated markdown file includes YAML frontmatter with at minimum:
+
+```yaml
+---
+title: "Ability Name"
+type: abilities/fury
+source: mcdm.heroes.v1
+item_id: gouge
+scc:
+  - "mcdm.heroes.v1:abilities.fury:gouge"
+scdc:
+  - "1.1.1:2.4:28"
+item_index: "28"
+---
+```
+
+### SCC format
+
+`source:type:item` where each component is a dot-separated path of slug values.
+
+- Source: `mcdm.heroes.v1`, `mcdm.monsters.v1`
+- Type: `abilities.fury`, `chapters`, `monster.statblock`
+- Item: slug of the item name
+
+### JSON/YAML structure
+
+Generated by `sc-convert` from the `steel-compendium-sdk`. The exact schema is defined by that tool.
+
+## Cross-Repo Workflows
+
+### Full regeneration workflow
+
+1. Run `just gen` in `data-gen/etl/` to regenerate all output
+2. Output is automatically copied to all `data-*` sibling repos
+3. Commit and push each `data-*` repo
+4. In the `compendium` project, run `just update` to pull new `data-*` commits into the website
+
+### Source material update workflow
+
+1. If updating from a new PDF: run `pdf_to_md/` converter
+2. Clean up the resulting markdown manually
+3. Update section config YAMLs if the book structure changed
+4. Run `just gen` to regenerate
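
The SCC format described in `integration.md` (`source:type:item`, each component a dot-separated path) can be illustrated with a small parser. This is a sketch under stated assumptions: the `Scc` class and `parse_scc` function are hypothetical names, and the validation rules (exactly three components, no empty path segments) are inferred from the format description rather than taken from the pipeline's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scc:
    """One SCC identifier split into its three dot-separated components."""
    source: tuple[str, ...]
    type: tuple[str, ...]
    item: tuple[str, ...]


def parse_scc(value: str) -> Scc:
    """Split 'source:type:item' into components; raise ValueError if malformed."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 colon-separated components, got {len(parts)}: {value!r}"
        )
    # Each component is itself a dot-separated path of slugs.
    split = [tuple(part.split(".")) for part in parts]
    if any(segment == "" for component in split for segment in component):
        raise ValueError(f"empty path segment in {value!r}")
    return Scc(*split)


# Example value taken from the frontmatter schema in integration.md.
scc = parse_scc("mcdm.heroes.v1:abilities.fury:gouge")
print(scc.source, scc.type, scc.item)
```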
diff --git a/.repo-docs/project.md b/.repo-docs/project.md
new file mode 100644
index 0000000..bd9a586
--- /dev/null
+++ b/.repo-docs/project.md
@@ -0,0 +1,90 @@
+---
+repo: data-gen
+updated: 2026-04-05
+---
+
+# Project Context
+
+## Product Overview
+
+data-gen is the ETL backbone of the Steel Compendium project. It takes the Draw Steel TTRPG rulebooks (in cleaned-up markdown form) and breaks them into individual, structured files -- one per ability, class, ancestry, monster statblock, etc. These files are output in multiple formats (markdown, JSON, YAML) with rich frontmatter metadata, then distributed to purpose-specific `data-*` repos that downstream tools consume.
+
+The problem it solves: the source books are monolithic documents. The compendium website, custom web components (`draw-steel-elements`), and third-party tools need granular, structured, machine-readable data. This pipeline bridges the gap.
+
+## Domain Context
+
+This is a content processing pipeline for a tabletop RPG rulebook. The source material is the Draw Steel TTRPG by MCDM Productions. Understanding the domain requires familiarity with:
+
+- **TTRPG structure**: rulebooks contain hierarchical content -- chapters contain classes, classes contain abilities, abilities have levels and costs.
+- **Draw Steel specifics**: the game has ancestries, careers, cultures, classes (Censor, Fury, Shadow, Tactician, etc.), kits, titles, conditions, skills, and treasures for heroes; and monster statblocks, dynamic terrain, and retainers for the bestiary.
+- **Content hierarchy**: a single "ability" lives inside a class, at a specific level, with a cost tier. The pipeline must preserve and expose this hierarchy.
+
+## Key Concepts
+
+| Concept | Description |
+|---------|-------------|
+| **Feature** | A discrete game mechanic (ability, class feature, ancestry benefit, etc.) extracted as its own file with frontmatter |
+| **Statblock** | A monster's stat block, extracted and converted to structured JSON/YAML |
+| **Featureblock** | A block of related features (e.g., malice abilities, dynamic terrain effects) |
+| **Section** | A logical portion of a book extracted via HTML XPath (e.g., "Ancestries", "Kits") |
+| **Frontmatter** | YAML metadata prepended to markdown files (type, source, SCC codes, etc.) |
+| **SCC** | Steel Compendium Classification -- hierarchical `source:type:item` identifier |
+| **SCDC** | Steel Compendium Decimal Classification -- numeric equivalent of SCC |
+| **Linked/Unlinked** | Whether cross-references in markdown use `scc:` protocol links or plain text |
+| **DSE (Draw Steel Elements)** | Web components that render features/statblocks; the `md-dse` format embeds YAML for these components |
+
+## Glossary
+
+| Term | Meaning |
+|------|---------|
+| `data-*` repos | Sibling repositories that hold generated output (e.g., `data-rules-md`, `data-bestiary-json`) |
+| `devbox` | Nix-based development environment manager |
+| `just` | Command runner (like `make` but simpler), used as the task runner for all ETL |
+| `marker` | PDF-to-markdown conversion tool used in `pdf_to_md/` |
+| `mdformat` | Markdown formatter/linter |
+| `pandoc` | Universal document converter -- used for markdown-to-HTML and HTML-to-markdown |
+| `sc-convert` | CLI from `steel-compendium-sdk` npm package; converts feature/statblock markdown to JSON/YAML |
+| `tidy` | HTML Tidy -- used to clean up pandoc HTML output |
+| `yq` | YAML/JSON processor (Go version) -- used for frontmatter extraction |
+
+## Audiences
+
+| Audience | How they use this repo |
+|----------|----------------------|
+| **Steel Compendium maintainers** | Run the pipeline when source material updates; fix parsing issues |
+| **Compendium website** | Consumes `data-md` and `data-md-linked` repos generated by this pipeline |
+| **draw-steel-elements** | Consumes `data-*-md-dse` repos with embedded YAML for rendering web components |
+| **Third-party tool authors** | Consume `data-rules-json`, `data-bestiary-yaml`, etc. for custom tooling |
+
+## Feature Inventory
+
+### Shipped
+
+- Heroes book full pipeline (markdown -> HTML -> sections -> formatted markdown -> JSON/YAML)
+- Monsters book pipeline (statblocks, featureblocks, dynamic terrain)
+- SCC/SCDC classification generation
+- SCC-based cross-reference linking (linked and unlinked variants)
+- DSE markdown generation (embedded YAML for web components)
+- Aggregate data files (combined JSON/YAML of all features or statblocks)
+- Index file generation for each section
+- PDF-to-markdown conversion (via marker + OpenAI)
+- Unified `data-md` repo assembly from multiple source-specific repos
+- Markdown formatting/linting (mdformat)
+
+### In Progress
+
+- Monster book SCC linking (TODO in code)
+- Adventures pipeline (stub only)
+
+### Planned
+
+- Cards (downloadable images/HTML for abilities/statblocks)
+- Auto-linking improvements
+
+## Constraints and Risks
+
+- **License**: Published under the DRAW STEEL Creator License. Not affiliated with MCDM Productions.
+- **Source material dependency**: The pipeline is tightly coupled to the structure of the Draw Steel rulebooks. Structural changes in source documents may require ETL updates.
+- **No tests**: The pipeline has no automated test suite. Validation is manual.
+- **Staging is ephemeral**: The `staging/` directory is wiped on each run. Intermediate outputs are not preserved.
+- **sc-convert dependency**: The `steel-compendium-sdk` npm package provides the `sc-convert` CLI. Breaking changes there break this pipeline.
diff --git a/.repo-docs/troubleshooting.md b/.repo-docs/troubleshooting.md
new file mode 100644
index 0000000..2421939
--- /dev/null
+++ b/.repo-docs/troubleshooting.md
@@ -0,0 +1,89 @@
+---
+repo: data-gen
+updated: 2026-04-05
+---
+
+# Troubleshooting
+
+## Do NOT
+
+- **Do not edit files in `staging/`** -- they are wiped on every `just gen` run.
+- **Do not run `just gen` outside of a devbox shell** -- required tools (pandoc, yq, mdformat, sc-convert, etc.) will be missing.
+- **Do not delete `input/classification.json` manually** -- it is regenerated by the pipeline but its state accumulates across runs. If deleted, SCDC (decimal) codes may be reassigned differently.
+- **Do not use `H8` or `H9` headers for non-ability/non-feature content** in source markdown -- the pipeline treats these header levels specially and will attempt to extract them as features.
+- **Do not add new content to `data-*` repos directly** -- they are overwritten by `data-gen` output on each run (except `README.md` and `.git/`).
+
+## Known Issues
+
+- **Adventures pipeline is a stub**: `adventures.just` creates a placeholder file. No actual adventure content processing exists yet.
+- **Monster linking not wired up**: The monsters pipeline generates unlinked output only. The `link_md_files` recipe is commented out.
+- **`classification.json` is deleted and regenerated on each heroes/monsters run**: This means running `gen_heroes` alone will lose monster classifications and vice versa. Run `just gen` for a full build.
+- **tidy warnings**: `html-tidy` emits warnings during HTML cleanup. Its non-zero exit status is ignored with `|| true`, which may also mask real HTML issues.
+- **Encoding issues**: Some content (especially in the Perks section) has encoding problems. The auto-linker force-writes all files as UTF-8 regardless of changes as a workaround.
+
+## Common Errors
+
+### `sc-convert: command not found`
+
+**Cause**: Not running inside a devbox shell, or `steel-compendium-sdk` npm package not installed.
+
+**Fix**:
+```bash
+cd etl && devbox shell
+# devbox init_hook installs it automatically
+```
+
+### `figlet: command not found`
+
+**Cause**: Not in devbox shell.
+
+**Fix**: Enter devbox shell.
+
+### `yq: command not found` or wrong yq version
+
+**Cause**: System `yq` (Python version) conflicts with the Go version (`yq-go`) required by this project.
+
+**Fix**: Use devbox shell which provides the correct `yq-go`.
+
+### SCC duplicate error: `Adding scc entry for '...' but there is already a value`
+
+**Cause**: Two markdown files generated the same SCC classification string. Usually a data issue in the source markdown (duplicate headers or section configs).
+
+**Fix**: Check for duplicate entries in the relevant `input/*.yml` config or duplicate headers in the source markdown file.
+
+### pandoc conversion produces garbled output
+
+**Cause**: Source markdown contains non-standard encoding or HTML entities that pandoc doesn't handle well.
+
+**Fix**: Check the source markdown for special characters. The pipeline runs `iconv -f UTF-8 -t UTF-8//TRANSLIT` to normalize encoding, but some characters may still cause issues.
+
+## Debug Playbooks
+
+### Inspect a specific pipeline stage
+
+1. Run the full pipeline: `just gen`
+2. Look at the numbered staging directories:
+   ```bash
+   ls staging/heroes/
+   ```
+3. Compare input and output of the failing stage to identify where data is lost or corrupted.
+
+### Debug frontmatter issues
+
+1. Check the raw frontmatter on a generated file:
+   ```bash
+   yq e --front-matter=extract '.' staging/heroes/3_md_sections/path/to/file.md
+   ```
+2. Check `input/classification.json` for the SCC tree state.
+3. Check `staging/scc_to_path.json` for the SCC-to-path mapping.
+
+### Debug section extraction
+
+1. Check the expanded section config:
+   ```bash
+   cat staging/heroes/0_sections/<config>.yml
+   ```
+2. Verify the XPath matches content in the HTML:
+   ```bash
+   grep -i "<section" staging/heroes/1_html/Draw\ Steel\ Heroes.html
+   ```
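
The SCC duplicate error described in `troubleshooting.md` arises when two files produce the same SCC string. A rough way to locate the colliding files is sketched below. It assumes the `scc` lists have already been read from each file's frontmatter into a dictionary; `find_duplicate_sccs` and the example paths are hypothetical and do not reflect how the pipeline itself performs this check.

```python
from collections import defaultdict


def find_duplicate_sccs(scc_by_file: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return {scc: [files...]} for every SCC string claimed by more than one file."""
    files_by_scc: dict[str, list[str]] = defaultdict(list)
    for path, sccs in scc_by_file.items():
        for scc in sccs:
            files_by_scc[scc].append(path)
    # Keep only SCC strings that map to two or more files.
    return {
        scc: sorted(paths)
        for scc, paths in files_by_scc.items()
        if len(paths) > 1
    }


# Hypothetical input: file paths mapped to the `scc` list from their frontmatter.
example = {
    "abilities/fury/gouge.md": ["mcdm.heroes.v1:abilities.fury:gouge"],
    "abilities/fury/gouge-copy.md": ["mcdm.heroes.v1:abilities.fury:gouge"],
    "chapters/introduction.md": ["mcdm.heroes.v1:chapters:introduction"],
}
for scc, paths in find_duplicate_sccs(example).items():
    print(scc, "->", ", ".join(paths))
```

Once the colliding files are known, the fix remains the one given above: look for duplicate entries in the relevant `input/*.yml` config or duplicate headers in the source markdown.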

From 208bbe0d80c8dbe631913ac5c103ce4a7f49c4b0 Mon Sep 17 00:00:00 2001
From: Vexa <vexa.tski@gmail.com>
Date: Tue, 14 Apr 2026 16:31:01 -0400
Subject: [PATCH 2/2] Moving to match schema with data-sdk-npm: phase 2

---
 .repo-docs/architecture.md     |  5 ---
 .repo-docs/ci-cd.md            | 60 ++++++++++++++++++++++++++++++++++
 .repo-docs/conventions.md      |  5 ---
 .repo-docs/decisions/README.md |  5 ---
 .repo-docs/development.md      |  5 ---
 .repo-docs/index.md            |  7 ++--
 .repo-docs/integration.md      |  5 ---
 .repo-docs/project.md          |  5 ---
 .repo-docs/troubleshooting.md  |  5 ---
 CLAUDE.md                      |  8 +++++
 10 files changed, 72 insertions(+), 38 deletions(-)
 create mode 100644 .repo-docs/ci-cd.md
 create mode 100644 CLAUDE.md

diff --git a/.repo-docs/architecture.md b/.repo-docs/architecture.md
index fb489af..3812633 100644
--- a/.repo-docs/architecture.md
+++ b/.repo-docs/architecture.md
@@ -1,8 +1,3 @@
----
-repo: data-gen
-updated: 2026-04-05
----
-
 # Architecture
 
 ## System Overview
diff --git a/.repo-docs/ci-cd.md b/.repo-docs/ci-cd.md
new file mode 100644
index 0000000..875f80f
--- /dev/null
+++ b/.repo-docs/ci-cd.md
@@ -0,0 +1,60 @@
+# CI/CD
+
+## Pipeline Overview
+
+data-gen has no CI/CD pipeline. The ETL pipeline is run manually on a developer's machine inside a devbox shell. There are no automated builds, tests, or deployments.
+
+## Build Process
+
+All builds are local. The canonical build command:
+
+```bash
+cd etl && devbox shell
+just gen
+```
+
+This runs the full pipeline: wipe staging, heroes pipeline, monsters pipeline, adventures (stub), and unification. Output is copied to sibling `data-*` repos.
+
+### Build Artifacts
+
+| Artifact | Location | Description |
+|----------|----------|-------------|
+| Staging intermediates | `staging/` (gitignored) | Ephemeral; wiped on each run |
+| Final output | Sibling `data-*` repos | Markdown, JSON, YAML, DSE variants |
+| Classification state | `input/classification.json` | SCC type/source tree; committed to this repo |
+
+## Branch Strategy
+
+| Branch | Purpose |
+|--------|---------|
+| `main` | Stable, current output |
+| `develop` | Active development |
+| `monsters` | Monster book pipeline work |
+| `links` | SCC linking feature development |
+| Various feature branches | Short-lived, merged to develop or main |
+
+No branch protection rules are configured.
+
+## Release Process
+
+There are no versioned releases or tags. The workflow is:
+
+1. Run `just gen` locally
+2. Commit and push changes to each `data-*` output repo manually
+3. In the `compendium` project, run `just update` to pull new data commits into the website
+
+### Rollback
+
+Roll back by reverting commits in the affected `data-*` repos and re-running `just update` in the compendium project.
+
+## Environments
+
+| Environment | Description |
+|-------------|-------------|
+| Local (devbox shell) | Only environment; all pipeline work happens here |
+
+## Secrets and Configuration
+
+| Secret | Used by | Description |
+|--------|---------|-------------|
+| `OPEN_AI_KEY` | `pdf_to_md/` only | OpenAI API key for LLM-assisted PDF conversion. Not needed for the main ETL pipeline. |
diff --git a/.repo-docs/conventions.md b/.repo-docs/conventions.md
index 8dcb814..be60e39 100644
--- a/.repo-docs/conventions.md
+++ b/.repo-docs/conventions.md
@@ -1,8 +1,3 @@
----
-repo: data-gen
-updated: 2026-04-05
----
-
 # Conventions
 
 ## File and Directory Naming
diff --git a/.repo-docs/decisions/README.md b/.repo-docs/decisions/README.md
index 930bb74..e2ec8d5 100644
--- a/.repo-docs/decisions/README.md
+++ b/.repo-docs/decisions/README.md
@@ -1,8 +1,3 @@
----
-repo: data-gen
-updated: 2026-04-05
----
-
 # Decision Log
 
 A running record of architectural and design decisions made in this project.
diff --git a/.repo-docs/development.md b/.repo-docs/development.md
index 34cd79c..8e9878b 100644
--- a/.repo-docs/development.md
+++ b/.repo-docs/development.md
@@ -1,8 +1,3 @@
----
-repo: data-gen
-updated: 2026-04-05
----
-
 # Development
 
 ## Prerequisites
diff --git a/.repo-docs/index.md b/.repo-docs/index.md
index 58df9e8..123adc4 100644
--- a/.repo-docs/index.md
+++ b/.repo-docs/index.md
@@ -8,7 +8,7 @@ tech:
   - python (frontmatter, classification, linking)
   - pandoc (markdown/html conversion)
   - devbox (environment management)
-updated: 2026-04-05
+updated: 2026-04-07
 ---
 
 # data-gen
@@ -81,7 +81,7 @@ data-gen/
 | **New to this repo** | This file | [project.md](project.md) |
 | **Developer** | [development.md](development.md) | [architecture.md](architecture.md), [conventions.md](conventions.md) |
 | **Architect** | [architecture.md](architecture.md) | [integration.md](integration.md), [decisions/](decisions/) |
-| **DevOps / SRE** | [development.md](development.md) | [integration.md](integration.md) |
+| **DevOps / SRE** | [ci-cd.md](ci-cd.md) | [development.md](development.md), [integration.md](integration.md) |
 
 ### Agent Roles
 
@@ -90,6 +90,7 @@ data-gen/
 | **Code review** | [conventions.md](conventions.md) | [architecture.md](architecture.md) |
 | **Bug fix / debug** | [troubleshooting.md](troubleshooting.md) | [development.md](development.md), [architecture.md](architecture.md) |
 | **Feature implementation** | [architecture.md](architecture.md) | [conventions.md](conventions.md), [development.md](development.md), [decisions/](decisions/) |
+| **CI/CD / DevOps** | [ci-cd.md](ci-cd.md) | [development.md](development.md), [integration.md](integration.md) |
 | **Documentation** | This file | [project.md](project.md), [architecture.md](architecture.md) |
 | **Onboarding / Q&A** | This file | [project.md](project.md), [development.md](development.md) |
 
@@ -103,11 +104,11 @@ data-gen/
 
 | File | Description |
 |------|-------------|
-| [index.md](index.md) | This file -- overview, quick reference, structure |
 | [project.md](project.md) | Domain context, glossary, feature inventory |
 | [architecture.md](architecture.md) | Pipeline stages, components, data flow |
 | [development.md](development.md) | Setup, prerequisites, workflows |
 | [integration.md](integration.md) | Upstream/downstream repos, data contracts |
+| [ci-cd.md](ci-cd.md) | Build process, release workflow, branch strategy |
 | [conventions.md](conventions.md) | Naming, commit style, code patterns |
 | [troubleshooting.md](troubleshooting.md) | Known issues, common errors |
 | [decisions/](decisions/) | Architectural decision records |
diff --git a/.repo-docs/integration.md b/.repo-docs/integration.md
index b317090..c598d76 100644
--- a/.repo-docs/integration.md
+++ b/.repo-docs/integration.md
@@ -1,8 +1,3 @@
----
-repo: data-gen
-updated: 2026-04-05
----
-
 # Integration
 
 ## Dependency Map
diff --git a/.repo-docs/project.md b/.repo-docs/project.md
index bd9a586..69946e0 100644
--- a/.repo-docs/project.md
+++ b/.repo-docs/project.md
@@ -1,8 +1,3 @@
----
-repo: data-gen
-updated: 2026-04-05
----
-
 # Project Context
 
 ## Product Overview
diff --git a/.repo-docs/troubleshooting.md b/.repo-docs/troubleshooting.md
index 2421939..515f059 100644
--- a/.repo-docs/troubleshooting.md
+++ b/.repo-docs/troubleshooting.md
@@ -1,8 +1,3 @@
----
-repo: data-gen
-updated: 2026-04-05
----
-
 # Troubleshooting
 
 ## Do NOT
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..2ef88f0
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,8 @@
+# data-gen
+
+ETL pipeline that converts Draw Steel TTRPG source documents (markdown) into structured, multi-format output distributed across multiple `data-*` repos.
+
+## Repository Documentation
+
+This repo uses standardized `.repo-docs/` documentation. **Read `.repo-docs/index.md`
+first** -- it contains the reading guide, role-based routing, and links to all other docs.