From 11bc2f0e319a1613e09944bfd9578afde63cb362 Mon Sep 17 00:00:00 2001 From: Vexa Date: Sun, 5 Apr 2026 09:04:45 -0400 Subject: [PATCH 1/2] docs: add .repo-docs with comprehensive repository documentation Generates standardized documentation covering project overview, architecture, development setup, integration map, conventions, troubleshooting, and architectural decision records for the data-gen ETL pipeline. --- .repo-docs/architecture.md | 158 ++++++++++++++++++ .repo-docs/conventions.md | 92 ++++++++++ .../2026-04-05-just-as-orchestrator.md | 43 +++++ .../2026-04-05-markdown-html-roundtrip.md | 35 ++++ .../2026-04-05-multiple-output-variants.md | 48 ++++++ .../2026-04-05-scc-classification.md | 40 +++++ .repo-docs/decisions/README.md | 79 +++++++++ .repo-docs/development.md | 146 ++++++++++++++++ .repo-docs/index.md | 113 +++++++++++++ .repo-docs/integration.md | 121 ++++++++++++++ .repo-docs/project.md | 90 ++++++++++ .repo-docs/troubleshooting.md | 89 ++++++++++ 12 files changed, 1054 insertions(+) create mode 100644 .repo-docs/architecture.md create mode 100644 .repo-docs/conventions.md create mode 100644 .repo-docs/decisions/2026-04-05-just-as-orchestrator.md create mode 100644 .repo-docs/decisions/2026-04-05-markdown-html-roundtrip.md create mode 100644 .repo-docs/decisions/2026-04-05-multiple-output-variants.md create mode 100644 .repo-docs/decisions/2026-04-05-scc-classification.md create mode 100644 .repo-docs/decisions/README.md create mode 100644 .repo-docs/development.md create mode 100644 .repo-docs/index.md create mode 100644 .repo-docs/integration.md create mode 100644 .repo-docs/project.md create mode 100644 .repo-docs/troubleshooting.md diff --git a/.repo-docs/architecture.md b/.repo-docs/architecture.md new file mode 100644 index 0000000..fb489af --- /dev/null +++ b/.repo-docs/architecture.md @@ -0,0 +1,158 @@ +--- +repo: data-gen +updated: 2026-04-05 +--- + +# Architecture + +## System Overview + +data-gen is a multi-stage ETL pipeline 
orchestrated by `just` (a command runner). Each stage transforms the source material closer to the final structured output. + +``` + ┌──────────────┐ + │ PDF Source │ + └──────┬───────┘ + │ (manual, pdf_to_md/) + v + ┌───────────────────────┐ + │ Markdown Source Docs │ + │ (input/heroes/*.md) │ + │ (input/monsters/*.md) │ + └───────────┬───────────┘ + │ + ┌─────────────────┼─────────────────┐ + v v v + ┌──────────┐ ┌────────────┐ ┌──────────────┐ + │ Heroes │ │ Monsters │ │ Adventures │ + │ Pipeline │ │ Pipeline │ │ (stub) │ + └────┬─────┘ └─────┬──────┘ └──────┬───────┘ + │ │ │ + v v v + ┌──────────────────────────────────────────────────┐ + │ Output data-* repos │ + │ data-rules-md, data-rules-json, data-rules-yaml │ + │ data-bestiary-md, data-bestiary-json, ... │ + │ data-rules-md-dse, data-bestiary-md-dse, ... │ + │ data-rules-md-linked, data-md-linked, ... │ + └──────────────────────┬───────────────────────────┘ + │ + v + ┌───────────────┐ + │ Unification │ + │ (data-md, │ + │ data-md-dse)│ + └───────────────┘ +``` + +## Pipeline Stages (Heroes) + +The heroes pipeline is the most complete. 
Each stage produces output in a numbered subdirectory under `staging/heroes/`: + +| Stage | Directory | What happens | +|-------|-----------|--------------| +| 0 | `0_features/` | Extract table of contents from source markdown; generate `features.yml` config | +| 1 | `1_html/` | Convert source markdown to HTML via pandoc | +| 2 | `2_html_sections/` | Extract individual sections from HTML using XPath (driven by YAML configs) | +| 3 | `3_md_sections/` | Convert HTML sections back to markdown via pandoc with Lua filters | +| -- | (metadata) | Generate SCC/SCDC classification, frontmatter, and indexes | +| 7 | `7_preformatted/` | Assemble all markdown sections; clean up original source doc | +| 8 | `8_formatted_md/` | Lint/format all markdown with mdformat | +| 9 | `9_linking_md/` | Split into linked (SCC links applied) and unlinked variants | +| 10 | `10_conversions/` | Convert markdown to JSON, YAML; generate DSE markdown; build aggregates | + +## Components + +| Component | Location | Responsibility | Depends on | +|-----------|----------|---------------|------------| +| **Main justfile** | `etl/justfile` | Entry point, shared constants, utility recipes | All modules | +| **heroes.just** | `etl/heroes.just` | Orchestrates full heroes pipeline | markdown, html, heroes_sections, features, frontmatter, sc_classification, linker, index | +| **monsters.just** | `etl/monsters.just` | Orchestrates full monsters pipeline | markdown, html, monsters_sections, statblocks, featureblocks, frontmatter, sc_classification | +| **unified.just** | `etl/unified.just` | Merges per-book repos into unified `data-md` | Output from heroes + monsters + adventures | +| **adventures.just** | `etl/adventures.just` | Placeholder for adventures pipeline | -- | +| **markdown.just** | `etl/markdown.just` | MD-to-HTML, TOC extraction, linting, blockquote separation, YAML embedding | pandoc, mdformat | +| **html.just** | `etl/html.just` | HTML-to-MD conversion with Lua filter | pandoc | +| 
**features.just** | `etl/features.just` | Feature metadata generation, conversion to JSON/YAML, DSE embedding, indexes | sc-convert (npm) | +| **statblocks.just** | `etl/statblocks.just` | Statblock MD-to-JSON/YAML conversion | sc-convert (npm) | +| **featureblocks.just** | `etl/featureblocks.just` | Featureblock conversion (malice, dynamic terrain) | sc-convert (npm) | +| **sc_classification.just** | `etl/sc_classification.just` | SCC/SCDC classification assignment (Python) | frontmatter in MD files | +| **frontmatter.just** | `etl/frontmatter.just` | Frontmatter processing: cleanup, sorting (Python) | PyYAML | +| **linker.just** | `etl/linker.just` | Apply/remove SCC cross-reference links (Python) | scc_to_path.json | +| **index.just** | `etl/index.just` | Generate markdown index tables from directory contents | yq | +| **aggregate.just** | `etl/aggregate.just` | Build single-file aggregates of all features/statblocks | -- | +| **section_config.just** | `etl/section_config.just` | Expand section config YAMLs with XPath data | -- | +| **extract_html_sections.just** | `etl/extract_html_sections.just` | Extract HTML sections using XPath | -- | +| **heroes_sections.just** | `etl/heroes_sections.just` | Heroes-specific section extraction config | section_config, extract_html_sections | +| **monsters_sections.just** | `etl/monsters_sections.just` | Monsters-specific section extraction | section_config, extract_html_sections | +| **heroes_frontmatter.just** | `etl/heroes_frontmatter.just` | Heroes-specific frontmatter enrichment | -- | +| **monsters_frontmatter.just** | `etl/monsters_frontmatter.just` | Monsters-specific frontmatter enrichment | -- | +| **features_config.just** | `etl/features_config.just` | Generate features.yml from TOC | -- | +| **link_md/** | `etl/link_md/` | Obsidian-style auto-linker (legacy, Python) | inflect | +| **pre_automation_tools/** | `etl/pre_automation_tools/` | Legacy one-time preprocessing scripts | -- | +| **pdf_to_md/** | 
`pdf_to_md/` | PDF-to-markdown conversion | marker, OpenAI API | + +## Data Flow + +Primary input-to-output path (heroes): + +1. **Input**: `input/heroes/Draw Steel Heroes.md` (cleaned markdown of the full Heroes book) +2. **Config**: `input/heroes/*.yml` (YAML files defining how to extract each section) +3. **HTML conversion**: Pandoc converts the source markdown to a single HTML file +4. **Section extraction**: XPath queries (derived from YAML configs) extract individual sections as HTML files +5. **MD conversion**: Pandoc converts each HTML section back to markdown (with Lua filters to fix links) +6. **Metadata**: Python scripts assign SCC classifications, generate frontmatter, create index files +7. **Formatting**: mdformat lints all markdown; blockquotes are separated +8. **Linking**: SCC-based cross-references are applied (linked variant) or stripped (unlinked variant) +9. **Conversion**: `sc-convert` transforms markdown to JSON and YAML; YAML is embedded for DSE variant +10. **Output**: Results are copied to sibling `data-*` repos + +## Key Design Decisions + +| Decision | Rationale | +|----------|-----------| +| Markdown-first pipeline | Source material is authored in markdown; preserving it as the canonical intermediate format keeps the pipeline inspectable | +| Round-trip MD -> HTML -> MD | XPath section extraction is much easier on HTML than markdown; the HTML intermediate is a pragmatic choice | +| Per-section files | Each ability, class, ancestry, etc. 
gets its own file for granular access and independent versioning | +| Multiple output variants | Different consumers need different formats: plain MD for the website, JSON/YAML for APIs, DSE for web components, linked/unlinked for different rendering contexts | +| `just` as orchestrator | Declarative recipe dependencies, file-level modularity, and embedded Python/Bash scripts in a single toolchain | +| SCC classification system | A universal, hierarchical identifier that works across all content types and supports both human-readable (string) and machine-efficient (decimal) forms | + +## Dependencies + +| Dependency | Version | Why | +|------------|---------|-----| +| bash | latest (via devbox) | Script execution | +| python | latest (via devbox) | Frontmatter processing, classification, linking | +| just | latest (via devbox) | Task runner / pipeline orchestration | +| jq | latest (via devbox) | JSON processing | +| pandoc | latest (via devbox) | Markdown/HTML conversion | +| html-tidy | latest (via devbox) | HTML cleanup | +| rsync | latest (via devbox) | Directory synchronization (unified pipeline) | +| yq-go | latest (via devbox) | YAML/frontmatter extraction | +| iconv | latest (via devbox) | Character encoding normalization | +| go | latest (via devbox) | Required for go-based tools | +| perl | via devbox | Text processing in some ETL steps | +| figlet | latest (via devbox) | Decorative section headers in pipeline output | +| nodejs | latest (via devbox) | Required for `steel-compendium-sdk` / `sc-convert` CLI | +| python-frontmatter | pip | Frontmatter parsing/writing | +| PyYAML | pip | YAML processing | +| mdformat + plugins | pip | Markdown formatting | +| lxml | pip | XML/HTML processing | +| inflect | pip | Singular/plural form generation (auto-linker) | +| rapidfuzz | pip | Fuzzy matching | +| python-slugify | pip | Slug generation | +| steel-compendium-sdk | npm | `sc-convert` CLI for feature/statblock conversion | + +## Extension Points + +- **New 
book**: Add a new `.just` module alongside `heroes.just` and `monsters.just`. Wire it into the main `justfile`'s `gen` recipe and `unified.just`. +- **New section type**: Add a new YAML config in `input//` and register it in the relevant `*_sections.just` module. +- **New output format**: Add conversion recipes in the relevant pipeline module, using `sc-convert` or custom scripts. +- **New frontmatter fields**: Extend `*_frontmatter.just` modules with additional metadata generation. + +## Constraints + +- The pipeline must run inside a devbox shell (or equivalent environment with all dependencies). +- The `staging/` directory is ephemeral and wiped at the start of each `gen` run. +- Output repos (`data-*`) must exist as sibling directories in the workspace. +- The `sc-convert` CLI must be installed via npm within the devbox environment. diff --git a/.repo-docs/conventions.md b/.repo-docs/conventions.md new file mode 100644 index 0000000..8dcb814 --- /dev/null +++ b/.repo-docs/conventions.md @@ -0,0 +1,92 @@ +--- +repo: data-gen +updated: 2026-04-05 +--- + +# Conventions + +## File and Directory Naming + +- **Justfile modules**: `snake_case.just` (e.g., `heroes_sections.just`, `sc_classification.just`) +- **Input YAML configs**: `snake_case.yml` (e.g., `abilities.yml`, `feature_fixes.yml`) +- **Source documents**: Title Case with spaces (e.g., `Draw Steel Heroes.md`, `Draw Steel Monsters.md`) +- **Output directories**: Title Case (e.g., `Features/`, `Abilities/`, `Chapters/`, `Statblocks/`) +- **Generated markdown files**: Title Case with spaces (matching the game content name) +- **Staging directories**: Numbered prefixes for pipeline order (e.g., `0_features/`, `1_html/`, `8_formatted_md/`) + +## Code Style + +### Justfile recipes + +- Module declaration at top: `# Justfile module expected to be named ""` +- Constants section, then public recipes, then private recipes -- each separated by `#` banner comments +- Private recipes prefixed with `_` +- All scripts use 
`#!/usr/bin/env bash` with `set -euo pipefail` +- Python scripts are embedded directly in justfile recipes using `#!/usr/bin/env python3` +- `BASH_ENV` and shell settings declared per module as needed + +### Python (embedded) + +- Python scripts are embedded in `.just` files, not standalone `.py` files (exception: `link_md/obs-auto-linker.py`) +- Uses `pathlib.Path` for file operations +- Uses `python-frontmatter` library for markdown frontmatter parsing +- Uses `yaml.safe_load` / `yaml.safe_dump` for YAML processing + +### Variable naming (justfile) + +- Directory paths: `*_dpath` suffix (e.g., `heroes_staging_dpath`, `input_dpath`) +- File paths: `*_fpath` suffix (e.g., `html_fpath`, `scc_to_path_json_fpath`) +- Log prefixes: `log_prefix` variable for module identification + +## Commit Messages + +Observed pattern: imperative or present-tense descriptions without conventional commit prefixes. + +Examples from history: +- `Corrects Malice ability on Ogre Jug and adds noncombatant` +- `Adds aggregate files for json and yaml` +- `Correcting a lot of issues with type uniqueness` +- `Cleanup from errata` +- `Bugfixes!` + +Style: informal, content-focused. Describes what changed in game-data terms rather than code terms. + +## Markdown Header Levels + +The source documents use extended header levels beyond standard markdown: + +| Level | Meaning | +|-------|---------| +| H1-H6 | Standard chapter/section hierarchy | +| H7 (7 `#`) | Statblocks (converted to bold+span in output) | +| H8 (8 `#`) | Abilities/features (converted to bold+span in output) | +| H9 (9 `#`) | Featureblocks: malice, dynamic terrain (converted to bold+span in output) | + +## Justfile Module Pattern + +Every module follows this structure: + +```just +# Justfile module expected to be named "" + +################################################## +# Constants and env vars +################################################## + +# ... 
module-specific constants + +################################################## +# Public Recipes +################################################## + +export BASH_ENV := ".utils/.utilsrc" +set shell := ["bash", "-c"] + +# public recipes here + +################################################## +# Private Recipes +################################################## + +# private recipes here (prefixed with _) +``` diff --git a/.repo-docs/decisions/2026-04-05-just-as-orchestrator.md b/.repo-docs/decisions/2026-04-05-just-as-orchestrator.md new file mode 100644 index 0000000..8321c41 --- /dev/null +++ b/.repo-docs/decisions/2026-04-05-just-as-orchestrator.md @@ -0,0 +1,43 @@ +--- +status: accepted +date: 2026-04-05 +--- + +# Just as pipeline orchestrator + +## Context + +The ETL pipeline involves dozens of interdependent steps across multiple tools (pandoc, Python scripts, shell utilities, npm CLIs). A task runner is needed to orchestrate these steps. + +## Options Considered + +### Option A: Makefile +- Pros: Universal, well-known, file-based dependency tracking +- Cons: Arcane syntax, poor string handling, hard to embed multi-line scripts + +### Option B: Shell scripts +- Pros: No extra tool needed +- Cons: No dependency management, poor modularity, hard to compose + +### Option C: Just +- Pros: Clean syntax, module system (`mod`), inline Python/Bash scripts, parameter passing, built-in functions (`titlecase`, `file_stem`) +- Cons: Less common than Make, no file-based dependency tracking + +### Option D: Python/Invoke +- Pros: Full programming language, good for complex logic +- Cons: Overhead for simple shell commands, requires Python environment for orchestration layer + +## Decision + +Option C: Just. The module system enables splitting the pipeline into focused, composable files. Inline multi-line scripts (both Bash and Python) keep logic close to where it's used. The clean syntax reduces boilerplate compared to Make. 
+ +## Consequences + +- Each pipeline component is a `.just` module imported into the main justfile +- Python logic is embedded directly in justfile recipes rather than standalone scripts +- No file-based dependency tracking -- the pipeline always runs all stages +- `devbox` provides the `just` binary + +## Outcome + +The modular justfile structure has scaled well to 25+ modules. Embedding Python scripts in recipes works but can make individual recipes long. The lack of file-based dependency tracking means every run is a full rebuild, which takes a few minutes but is acceptable for the current scale. diff --git a/.repo-docs/decisions/2026-04-05-markdown-html-roundtrip.md b/.repo-docs/decisions/2026-04-05-markdown-html-roundtrip.md new file mode 100644 index 0000000..47eb645 --- /dev/null +++ b/.repo-docs/decisions/2026-04-05-markdown-html-roundtrip.md @@ -0,0 +1,35 @@ +--- +status: accepted +date: 2026-04-05 +--- + +# Markdown-first pipeline with HTML round-trip + +## Context + +The source material is a TTRPG rulebook provided as a single large markdown file. The pipeline needs to split this into hundreds of individual files (one per ability, class, ancestry, etc.) while preserving formatting and hierarchy. + +## Options Considered + +### Option A: Parse markdown directly with regex/AST +- Pros: No intermediate format, simpler toolchain +- Cons: Markdown parsing is fragile for deep nesting (H7-H9 headers); extracting arbitrary sections by path is hard without a DOM + +### Option B: Convert to HTML, extract via XPath, convert back +- Pros: XPath provides reliable, configurable section extraction; HTML is a well-defined DOM; pandoc handles both conversions +- Cons: Round-trip conversion can introduce formatting artifacts; more pipeline stages + +## Decision + +Option B: round-trip through HTML. Pandoc's markdown-to-HTML and HTML-to-markdown conversions are robust enough, and XPath extraction is far more reliable than regex-based markdown splitting. 
A Lua filter fixes link extensions during the HTML-to-markdown conversion. + +## Consequences + +- Pipeline has more stages (MD -> HTML -> section HTML -> section MD), making debugging harder +- HTML Tidy is required to clean up pandoc's HTML output +- Character encoding and entity handling require explicit normalization at multiple stages +- Section extraction is highly configurable via YAML config files with XPath expressions + +## Outcome + +The approach has worked reliably for the Heroes book (400+ pages, hundreds of abilities). The XPath-based extraction handles the complex nesting of Draw Steel's content hierarchy well. The main pain point is HTML entity normalization (smart quotes, special characters). diff --git a/.repo-docs/decisions/2026-04-05-multiple-output-variants.md b/.repo-docs/decisions/2026-04-05-multiple-output-variants.md new file mode 100644 index 0000000..28e0e1c --- /dev/null +++ b/.repo-docs/decisions/2026-04-05-multiple-output-variants.md @@ -0,0 +1,48 @@ +--- +status: accepted +date: 2026-04-05 +--- + +# Multiple output variants + +## Context + +Different downstream consumers need the same content in different formats and with different features: + +- The compendium website needs plain markdown with cross-reference links +- The `draw-steel-elements` web components need markdown with embedded YAML +- API consumers need structured JSON/YAML +- Some tools want cross-reference links; others don't + +## Options Considered + +### Option A: Single canonical format, consumers transform +- Pros: One output to maintain +- Cons: Pushes complexity to every consumer; consumers may not have the tooling + +### Option B: Multiple output variants, all generated by this pipeline +- Pros: Each consumer gets exactly what it needs; transformation logic is centralized +- Cons: More repos to maintain, more pipeline complexity, larger total output + +## Decision + +Option B: generate all variants centrally. 
The pipeline produces: + +- **Plain markdown** (unlinked): `data-rules-md`, `data-bestiary-md` +- **Linked markdown**: `data-rules-md-linked`, `data-bestiary-md` (TODO: linked variant) +- **DSE markdown** (embedded YAML, unlinked): `data-rules-md-dse`, `data-bestiary-md-dse` +- **DSE markdown** (linked): `data-rules-md-dse-linked` +- **JSON**: `data-rules-json`, `data-bestiary-json` +- **YAML**: `data-rules-yaml`, `data-bestiary-yaml` +- **Unified**: `data-md`, `data-md-linked`, `data-md-dse`, `data-md-dse-linked` + +## Consequences + +- 14+ downstream repos are generated from a single pipeline run +- Adding a new variant requires adding conversion steps and a new target repo +- The `_copy_data_to_repo` utility handles wiping and populating target repos +- Unified repos (`data-md*`) aggregate content from all book-specific repos + +## Outcome + +The approach works but has created a large number of repos. The unified variants (`data-md*`) were added later to simplify consumption for downstream tools that want all content in one place. diff --git a/.repo-docs/decisions/2026-04-05-scc-classification.md b/.repo-docs/decisions/2026-04-05-scc-classification.md new file mode 100644 index 0000000..df9cca3 --- /dev/null +++ b/.repo-docs/decisions/2026-04-05-scc-classification.md @@ -0,0 +1,40 @@ +--- +status: accepted +date: 2026-04-05 +--- + +# SCC classification system + +## Context + +Content from the Draw Steel rulebooks needs a universal, stable identifier that works across different output formats, supports cross-referencing between items, and is both human-readable and machine-processable. 
+ +## Options Considered + +### Option A: File path as identifier +- Pros: Simple, no extra tooling +- Cons: Coupled to directory structure, not portable across repos, no semantic meaning + +### Option B: UUIDs +- Pros: Globally unique, no coordination needed +- Cons: Not human-readable, no hierarchy information, can't derive from content + +### Option C: Hierarchical classification (SCC) +- Pros: Human-readable string form (`mcdm.heroes.v1:abilities.fury:gouge`), compact decimal form (`1.1.1:2.4:28`), encodes source, type, and item; supports multiple classifications per item +- Cons: Requires a registration/state system, more complex to implement + +## Decision + +Option C: Steel Compendium Classification (SCC). The three-component `source:type:item` structure maps naturally to the TTRPG content hierarchy. The dual string/decimal representation serves different use cases (human vs. machine). + +## Consequences + +- A state file (`classification.json`) tracks the type/source tree and assigned IDs +- Each markdown file gets `scc` and `scdc` frontmatter fields +- Cross-references can use `scc:` protocol links (e.g., `[Gouge](scc:mcdm.heroes.v1:abilities.fury:gouge)`) +- One item can have multiple SCC codes (e.g., an ability classified by class, by level, and globally) +- Decimal codes are positional and depend on registration order + +## Outcome + +Works well for cross-referencing and downstream consumption. The main challenge is that `classification.json` is regenerated on each run, so decimal codes can shift if source material changes. The string form is stable as long as item names don't change. diff --git a/.repo-docs/decisions/README.md b/.repo-docs/decisions/README.md new file mode 100644 index 0000000..930bb74 --- /dev/null +++ b/.repo-docs/decisions/README.md @@ -0,0 +1,79 @@ +--- +repo: data-gen +updated: 2026-04-05 +--- + +# Decision Log + +A running record of architectural and design decisions made in this project. + +## Why log decisions? 
+ +Context doesn't survive in memory. Logging decisions prevents relitigating past choices and helps new contributors understand why things are the way they are. + +## What to log + +Every decision: library choices, format changes, rejected approaches, conventions, reverted experiments. Small decisions compound -- logging them builds a navigable history. + +## How to create a record + +1. Copy the template below +2. Name the file `YYYY-MM-DD-short-description.md` +3. Fill in all sections +4. Set status to `proposed`, `accepted`, `tried`, `superseded`, or `deprecated` + +### Template + +```markdown +--- +status: proposed +date: YYYY-MM-DD +--- + +# Title + +## Context + +Why was this decision needed? + +## Options Considered + +### Option A +- Pros: ... +- Cons: ... + +### Option B +- Pros: ... +- Cons: ... + +## Decision + +What was chosen and why. + +## Consequences + +Positive outcomes and accepted tradeoffs. + +## Outcome + +Leave blank until there is real experience to report. +``` + +### Status definitions + +| Status | Meaning | +|--------|---------| +| `proposed` | Under consideration, not yet implemented | +| `accepted` | Agreed upon and implemented | +| `tried` | Implemented but didn't work; reverted or abandoned | +| `superseded` | Replaced by a newer decision | +| `deprecated` | Still in place but scheduled for removal | + +## Index + +| Date | Decision | Status | +|------|----------|--------| +| 2026-04-05 | [Markdown-first pipeline with HTML round-trip](2026-04-05-markdown-html-roundtrip.md) | accepted | +| 2026-04-05 | [Just as pipeline orchestrator](2026-04-05-just-as-orchestrator.md) | accepted | +| 2026-04-05 | [SCC classification system](2026-04-05-scc-classification.md) | accepted | +| 2026-04-05 | [Multiple output variants](2026-04-05-multiple-output-variants.md) | accepted | diff --git a/.repo-docs/development.md b/.repo-docs/development.md new file mode 100644 index 0000000..34cd79c --- /dev/null +++ b/.repo-docs/development.md @@ -0,0 +1,146 
@@ +--- +repo: data-gen +updated: 2026-04-05 +--- + +# Development + +## Prerequisites + +| Tool | Version | How to get | +|------|---------|-----------| +| devbox | any recent | [jetify.com/devbox](https://www.jetify.com/devbox) | +| git | any recent | System package manager | + +All other dependencies (bash, python, just, pandoc, node, etc.) are installed automatically by devbox. + +## Setup + +1. Clone the repo and its sibling data repos: + ```bash + # From the workspace root (steel_compendium/workspace/) + just clone-all + ``` + +2. Enter the devbox environment: + ```bash + cd data-gen/etl + devbox shell + ``` + This installs all system packages, creates a Python venv, installs pip packages, installs `steel-compendium-sdk` via npm, and installs Go tools. + +3. Ensure source documents exist: + - `input/heroes/Draw Steel Heroes.md` -- the cleaned-up Heroes book markdown + - `input/monsters/Draw Steel Monsters.md` -- the cleaned-up Monsters book markdown + - `input/heroes/*.yml` -- section config files (already committed) + +4. Run the full pipeline: + ```bash + devbox run gen + # or directly: + just gen + ``` + +## Required Environment Variables + +| Variable | Required by | Description | +|----------|------------|-------------| +| `OPEN_AI_KEY` | `pdf_to_md/` only | OpenAI API key for LLM-assisted PDF conversion | + +The main ETL pipeline (`etl/`) does not require any environment variables. + +## Common Workflows + +### Generate all output + +```bash +cd etl && devbox shell +just gen +``` + +This runs: wipe staging -> heroes pipeline -> monsters pipeline -> adventures (stub) -> unify. 
+ +### Generate heroes only + +```bash +just gen_heroes +``` + +### Generate monsters only + +```bash +just gen_monsters +``` + +### Convert a PDF to markdown + +```bash +cd pdf_to_md && devbox shell +# Requires OPEN_AI_KEY env var +just convert_heroes "Draw Steel Heroes.pdf" +just convert_monsters "Draw_Steel_Monsters_v1.pdf" +``` + +### Switch all data repos to a branch + +```bash +just switch_repos_to develop +``` + +### Push generated data to repos + +After `just gen`, the output is already copied to the sibling `data-*` repos. Commit and push each one manually: + +```bash +cd ../data-rules-md && git add -A && git commit -m "Update from data-gen" && git push +# Repeat for each data-* repo +``` + +### Convert a single feature/statblock + +```bash +# Feature markdown to JSON +just features convert "path/to/feature.md" json + +# Statblock markdown to YAML +just statblocks convert "path/to/statblock.md" yaml +``` + +### Lint markdown files + +```bash +just markdown lint "path/to/directory" +``` + +## Testing + +There is no automated test suite. Validation is manual: + +1. Run `just gen` and inspect the `staging/` directory for intermediate output. +2. Check the `data-*` repos for final output. +3. Verify the compendium website renders correctly after updating. + +## Debugging + +### Pipeline output + +Each stage writes to a numbered subdirectory in `staging/`. To inspect a specific stage: + +```bash +ls staging/heroes/ +# 0_features/ 1_html/ 2_html_sections/ 3_md_sections/ 7_preformatted/ 8_formatted_md/ 9_linking_md/ 10_conversions/ +``` + +### Verbose just output + +`just` recipes use `set -euo pipefail` and print section headers. Check stderr for progress messages. + +### Frontmatter inspection + +```bash +yq e --front-matter=markdown '.' path/to/file.md +``` + +### Classification state + +The SCC classification tree state is stored in `input/classification.json`. Inspect it to understand current type/source assignments. 
diff --git a/.repo-docs/index.md b/.repo-docs/index.md new file mode 100644 index 0000000..58df9e8 --- /dev/null +++ b/.repo-docs/index.md @@ -0,0 +1,113 @@ +--- +repo: data-gen +type: tool +status: active +tech: + - just (task runner) + - bash (ETL scripts) + - python (frontmatter, classification, linking) + - pandoc (markdown/html conversion) + - devbox (environment management) +updated: 2026-04-05 +--- + +# data-gen + +ETL pipeline that converts Draw Steel TTRPG source documents (markdown) into structured, multi-format output distributed across multiple `data-*` repos. Part of the Steel Compendium project. + +**This repo is not:** a data repo itself, a web application, or a content authoring tool. It processes existing markdown source documents and outputs to sibling `data-*` repos. + +## Quick Reference + +| Action | Command | +|--------|---------| +| Enter dev environment | `devbox shell` (from `etl/` directory) | +| Generate all outputs | `devbox run gen` or `just gen` (from `etl/`) | +| Generate heroes only | `devbox run gen_heroes` or `just gen_heroes` | +| Generate monsters only | `just gen_monsters` | +| Convert PDF to markdown | `just convert_heroes` (from `pdf_to_md/`, requires `OPEN_AI_KEY`) | +| Switch data repo branches | `just switch_repos_to ` | + +| Resource | URL | +|----------|-----| +| Repository | https://github.com/SteelCompendium/data-gen | +| Bug reports | [Google Form](https://docs.google.com/forms/d/e/1FAIpQLSc6m-pZ0NLt2EArE-Tcxr-XbAPMyhu40ANHJKtyRvvwBd2LSw/viewform) | +| Issue tracker | https://github.com/SteelCompendium/data-gen/issues | + +## Repo Structure + +``` +data-gen/ + input/ # Source material and config + heroes/ # Heroes book markdown + section config YAMLs + Draw Steel Heroes.md # Primary source document + abilities.yml # Section extraction config + classes.yml # Section extraction config + ... 
# Other section configs + monsters/ # Monsters book markdown + section configs + Draw Steel Monsters.md + monsters.yml + classification.json # SCC type/source tree state + etl/ # ETL pipeline (justfiles + scripts) + justfile # Main entry point + heroes.just # Heroes book pipeline + monsters.just # Monsters book pipeline + adventures.just # Adventures pipeline (stub) + unified.just # Unification across books + html.just # HTML conversion utilities + markdown.just # Markdown conversion/linting + features.just # Feature extraction and conversion + statblocks.just # Statblock conversion + sc_classification.just # SCC classification generation + frontmatter.just # Frontmatter processing + linker.just # SCC link application/removal + index.just # Index file generation + aggregate.just # Aggregate data file generation + link_md/ # Obsidian-style auto-linker (Python) + pre_automation_tools/ # Legacy pre-processing helpers + devbox.json # Dev environment packages + pdf_to_md/ # PDF-to-markdown converter (uses marker + OpenAI) + justfile + devbox.json + staging/ # Generated intermediate files (gitignored) +``` + +## Reading Guide by Role + +### Human Roles + +| Role | Start here | Then read | +|------|-----------|-----------| +| **New to this repo** | This file | [project.md](project.md) | +| **Developer** | [development.md](development.md) | [architecture.md](architecture.md), [conventions.md](conventions.md) | +| **Architect** | [architecture.md](architecture.md) | [integration.md](integration.md), [decisions/](decisions/) | +| **DevOps / SRE** | [development.md](development.md) | [integration.md](integration.md) | + +### Agent Roles + +| Agent Role | Start here | Then read | +|------------|-----------|-----------| +| **Code review** | [conventions.md](conventions.md) | [architecture.md](architecture.md) | +| **Bug fix / debug** | [troubleshooting.md](troubleshooting.md) | [development.md](development.md), [architecture.md](architecture.md) | +| **Feature implementation** 
| [architecture.md](architecture.md) | [conventions.md](conventions.md), [development.md](development.md), [decisions/](decisions/) | +| **Documentation** | This file | [project.md](project.md), [architecture.md](architecture.md) | +| **Onboarding / Q&A** | This file | [project.md](project.md), [development.md](development.md) | + +## Current Status + +- **Health:** Active development +- **Last significant change:** Aggregate files for JSON/YAML, SCC link application, monster book parsing +- **Known blockers:** Adventures pipeline not yet implemented; monster book linking not yet wired up + +## Documents in This Directory + +| File | Description | +|------|-------------| +| [index.md](index.md) | This file -- overview, quick reference, structure | +| [project.md](project.md) | Domain context, glossary, feature inventory | +| [architecture.md](architecture.md) | Pipeline stages, components, data flow | +| [development.md](development.md) | Setup, prerequisites, workflows | +| [integration.md](integration.md) | Upstream/downstream repos, data contracts | +| [conventions.md](conventions.md) | Naming, commit style, code patterns | +| [troubleshooting.md](troubleshooting.md) | Known issues, common errors | +| [decisions/](decisions/) | Architectural decision records | diff --git a/.repo-docs/integration.md b/.repo-docs/integration.md new file mode 100644 index 0000000..b317090 --- /dev/null +++ b/.repo-docs/integration.md @@ -0,0 +1,121 @@ +--- +repo: data-gen +updated: 2026-04-05 +--- + +# Integration + +## Dependency Map + +``` + PDF Source Docs Section Config YAMLs + | | + v v + ┌─────────┐ ┌──────────┐ + │pdf_to_md│ │ input/ │ + └────┬────┘ └────┬─────┘ + | | + v v + input/*.md ──────────> [ data-gen ETL ] <── steel-compendium-sdk (npm) + | + ┌──────────────┼──────────────────┐ + v v v + data-rules-* data-bestiary-* data-adventures-md + | | | + v v v + data-md data-md-linked data-md-dse data-md-dse-linked + | + v + compendium website (steelCompendium.io) + 
draw-steel-elements (web components) + third-party tools +``` + +## Upstream Dependencies + +| Dependency | Type | Description | +|------------|------|-------------| +| Draw Steel rulebook PDFs | Manual input | Source material, converted to markdown via `pdf_to_md/` | +| `input/heroes/Draw Steel Heroes.md` | File | Cleaned-up Heroes book markdown (manually curated) | +| `input/monsters/Draw Steel Monsters.md` | File | Cleaned-up Monsters book markdown | +| `input/heroes/*.yml` | Config files | Section extraction configs defining how to split the book | +| `input/classification.json` | State file | SCC type/source tree (regenerated on each run) | +| `steel-compendium-sdk` | npm package | Provides `sc-convert` CLI for markdown-to-JSON/YAML conversion | + +## Downstream Dependents + +| Repo | Format | Content | +|------|--------|---------| +| `data-rules-md` | Markdown | Heroes book sections (unlinked) | +| `data-rules-md-linked` | Markdown | Heroes book sections (with SCC links) | +| `data-rules-md-dse` | Markdown + YAML | Heroes sections with embedded YAML for DSE web components (unlinked) | +| `data-rules-md-dse-linked` | Markdown + YAML | Heroes sections with embedded YAML for DSE web components (linked) | +| `data-rules-json` | JSON | Heroes features and abilities as JSON | +| `data-rules-yaml` | YAML | Heroes features and abilities as YAML | +| `data-bestiary-md` | Markdown | Monsters book sections (unlinked) | +| `data-bestiary-md-dse` | Markdown + YAML | Monsters sections with embedded YAML for DSE | +| `data-bestiary-json` | JSON | Monster statblocks, featureblocks as JSON | +| `data-bestiary-yaml` | YAML | Monster statblocks, featureblocks as YAML | +| `data-adventures-md` | Markdown | Adventures content (stub) | +| `data-md` | Markdown | Unified: all books combined (unlinked) | +| `data-md-linked` | Markdown | Unified: all books combined (linked) | +| `data-md-dse` | Markdown + YAML | Unified: all books with DSE (unlinked) | +| `data-md-dse-linked` | 
Markdown + YAML | Unified: all books with DSE (linked) | + +## API Surface + +data-gen does not expose an API. It is a batch pipeline that writes files to sibling repositories. + +**Output formats:** +- Markdown files with YAML frontmatter (metadata: title, type, source, SCC, SCDC, etc.) +- JSON files (structured feature/statblock data) +- YAML files (structured feature/statblock data) +- Aggregate files (`features.json`, `statblocks.json`, etc.) containing all items in a single file + +## Data Contracts + +### Frontmatter schema (markdown output) + +Every generated markdown file includes YAML frontmatter with at minimum: + +```yaml +--- +title: "Ability Name" +type: abilities/fury +source: mcdm.heroes.v1 +item_id: gouge +scc: + - "mcdm.heroes.v1:abilities.fury:gouge" +scdc: + - "1.1.1:2.4:28" +item_index: "28" +--- +``` + +### SCC format + +`source:type:item` where each component is a dot-separated path of slug values. + +- Source: `mcdm.heroes.v1`, `mcdm.monsters.v1` +- Type: `abilities.fury`, `chapters`, `monster.statblock` +- Item: slug of the item name + +### JSON/YAML structure + +Generated by `sc-convert` from the `steel-compendium-sdk`. The exact schema is defined by that tool. + +## Cross-Repo Workflows + +### Full regeneration workflow + +1. Run `just gen` in `data-gen/etl/` to regenerate all output +2. Output is automatically copied to all `data-*` sibling repos +3. Commit and push each `data-*` repo +4. In the `compendium` project, run `just update` to pull new `data-*` commits into the website + +### Source material update workflow + +1. If updating from a new PDF: run `pdf_to_md/` converter +2. Clean up the resulting markdown manually +3. Update section config YAMLs if the book structure changed +4. 
Run `just gen` to regenerate diff --git a/.repo-docs/project.md b/.repo-docs/project.md new file mode 100644 index 0000000..bd9a586 --- /dev/null +++ b/.repo-docs/project.md @@ -0,0 +1,90 @@ +--- +repo: data-gen +updated: 2026-04-05 +--- + +# Project Context + +## Product Overview + +data-gen is the ETL backbone of the Steel Compendium project. It takes the Draw Steel TTRPG rulebooks (in cleaned-up markdown form) and breaks them into individual, structured files -- one per ability, class, ancestry, monster statblock, etc. These files are output in multiple formats (markdown, JSON, YAML) with rich frontmatter metadata, then distributed to purpose-specific `data-*` repos that downstream tools consume. + +The problem it solves: the source books are monolithic documents. The compendium website, custom web components (`draw-steel-elements`), and third-party tools need granular, structured, machine-readable data. This pipeline bridges the gap. + +## Domain Context + +This is a content processing pipeline for a tabletop RPG rulebook. The source material is the Draw Steel TTRPG by MCDM Productions. Understanding the domain requires familiarity with: + +- **TTRPG structure**: rulebooks contain hierarchical content -- chapters contain classes, classes contain abilities, abilities have levels and costs. +- **Draw Steel specifics**: the game has ancestries, careers, cultures, classes (Censor, Fury, Shadow, Tactician, etc.), kits, titles, conditions, skills, and treasures for heroes; and monster statblocks, dynamic terrain, and retainers for the bestiary. +- **Content hierarchy**: a single "ability" lives inside a class, at a specific level, with a cost tier. The pipeline must preserve and expose this hierarchy. + +## Key Concepts + +| Concept | Description | +|---------|-------------| +| **Feature** | A discrete game mechanic (ability, class feature, ancestry benefit, etc.) 
extracted as its own file with frontmatter | +| **Statblock** | A monster's stat block, extracted and converted to structured JSON/YAML | +| **Featureblock** | A block of related features (e.g., malice abilities, dynamic terrain effects) | +| **Section** | A logical portion of a book extracted via HTML XPath (e.g., "Ancestries", "Kits") | +| **Frontmatter** | YAML metadata prepended to markdown files (type, source, SCC codes, etc.) | +| **SCC** | Steel Compendium Classification -- hierarchical `source:type:item` identifier | +| **SCDC** | Steel Compendium Decimal Classification -- numeric equivalent of SCC | +| **Linked/Unlinked** | Whether cross-references in markdown use `scc:` protocol links or plain text | +| **DSE (Draw Steel Elements)** | Web components that render features/statblocks; the `md-dse` format embeds YAML for these components | + +## Glossary + +| Term | Meaning | +|------|---------| +| `data-*` repos | Sibling repositories that hold generated output (e.g., `data-rules-md`, `data-bestiary-json`) | +| `devbox` | Nix-based development environment manager | +| `just` | Command runner (like `make` but simpler), used as the task runner for all ETL | +| `marker` | PDF-to-markdown conversion tool used in `pdf_to_md/` | +| `mdformat` | Markdown formatter/linter | +| `pandoc` | Universal document converter -- used for markdown-to-HTML and HTML-to-markdown | +| `sc-convert` | CLI from `steel-compendium-sdk` npm package; converts feature/statblock markdown to JSON/YAML | +| `tidy` | HTML Tidy -- used to clean up pandoc HTML output | +| `yq` | YAML/JSON processor (Go version) -- used for frontmatter extraction | + +## Audiences + +| Audience | How they use this repo | +|----------|----------------------| +| **Steel Compendium maintainers** | Run the pipeline when source material updates; fix parsing issues | +| **Compendium website** | Consumes `data-md` and `data-md-linked` repos generated by this pipeline | +| **draw-steel-elements** | Consumes 
`data-*-md-dse` repos with embedded YAML for rendering web components | +| **Third-party tool authors** | Consume `data-rules-json`, `data-bestiary-yaml`, etc. for custom tooling | + +## Feature Inventory + +### Shipped + +- Heroes book full pipeline (markdown -> HTML -> sections -> formatted markdown -> JSON/YAML) +- Monsters book pipeline (statblocks, featureblocks, dynamic terrain) +- SCC/SCDC classification generation +- SCC-based cross-reference linking (linked and unlinked variants) +- DSE markdown generation (embedded YAML for web components) +- Aggregate data files (combined JSON/YAML of all features or statblocks) +- Index file generation for each section +- PDF-to-markdown conversion (via marker + OpenAI) +- Unified `data-md` repo assembly from multiple source-specific repos +- Markdown formatting/linting (mdformat) + +### In Progress + +- Monster book SCC linking (TODO in code) +- Adventures pipeline (stub only) + +### Planned + +- Cards (downloadable images/HTML for abilities/statblocks) +- Auto-linking improvements + +## Constraints and Risks + +- **License**: Published under the DRAW STEEL Creator License. Not affiliated with MCDM Productions. +- **Source material dependency**: The pipeline is tightly coupled to the structure of the Draw Steel rulebooks. Structural changes in source documents may require ETL updates. +- **No tests**: The pipeline has no automated test suite. Validation is manual. +- **Staging is ephemeral**: The `staging/` directory is wiped on each run. Intermediate outputs are not preserved. +- **sc-convert dependency**: The `steel-compendium-sdk` npm package provides the `sc-convert` CLI. Breaking changes there break this pipeline. 
diff --git a/.repo-docs/troubleshooting.md b/.repo-docs/troubleshooting.md new file mode 100644 index 0000000..2421939 --- /dev/null +++ b/.repo-docs/troubleshooting.md @@ -0,0 +1,89 @@ +--- +repo: data-gen +updated: 2026-04-05 +--- + +# Troubleshooting + +## Do NOT + +- **Do not edit files in `staging/`** -- they are wiped on every `just gen` run. +- **Do not run `just gen` outside of a devbox shell** -- required tools (pandoc, yq, mdformat, sc-convert, etc.) will be missing. +- **Do not delete `input/classification.json` manually** -- it is regenerated by the pipeline but its state accumulates across runs. If deleted, SCC decimal codes may be reassigned differently. +- **Do not use `H8` or `H9` headers for non-ability/non-feature content** in source markdown -- the pipeline treats these header levels specially and will attempt to extract them as features. +- **Do not add new content to `data-*` repos directly** -- they are overwritten by `data-gen` output on each run (except `README.md` and `.git/`). + +## Known Issues + +- **Adventures pipeline is a stub**: `adventures.just` creates a placeholder file. No actual adventure content processing exists yet. +- **Monster linking not wired up**: The monsters pipeline generates unlinked output only. The `link_md_files` recipe is commented out. +- **`classification.json` is deleted and regenerated on each heroes/monsters run**: This means running `gen_heroes` alone will lose monster classifications and vice versa. Run `just gen` for a full build. +- **tidy warnings**: `html-tidy` outputs warnings during HTML cleanup. These are suppressed with `|| true` but may mask real HTML issues. +- **Encoding issues**: Some content (especially in the Perks section) has encoding problems. The auto-linker force-writes all files as UTF-8 regardless of changes as a workaround. + +## Common Errors + +### `sc-convert: command not found` + +**Cause**: Not running inside a devbox shell, or `steel-compendium-sdk` npm package not installed. 
+ +**Fix**: +```bash +cd etl && devbox shell +# devbox init_hook installs it automatically +``` + +### `figlet: command not found` + +**Cause**: Not in devbox shell. + +**Fix**: Enter devbox shell. + +### `yq: command not found` or wrong yq version + +**Cause**: System `yq` (Python version) conflicts with the Go version (`yq-go`) required by this project. + +**Fix**: Use devbox shell which provides the correct `yq-go`. + +### SCC duplicate error: `Adding scc entry for '...' but there is already a value` + +**Cause**: Two markdown files generated the same SCC classification string. Usually a data issue in the source markdown (duplicate headers or section configs). + +**Fix**: Check for duplicate entries in the relevant `input/*.yml` config or duplicate headers in the source markdown file. + +### pandoc conversion produces garbled output + +**Cause**: Source markdown contains non-standard encoding or HTML entities that pandoc doesn't handle well. + +**Fix**: Check the source markdown for special characters. The pipeline runs `iconv -f UTF-8 -t UTF-8//TRANSLIT` to normalize encoding, but some characters may still cause issues. + +## Debug Playbooks + +### Inspect a specific pipeline stage + +1. Run the full pipeline: `just gen` +2. Look at the numbered staging directories: + ```bash + ls staging/heroes/ + ``` +3. Compare input and output of the failing stage to identify where data is lost or corrupted. + +### Debug frontmatter issues + +1. Check the raw frontmatter on a generated file: + ```bash + yq e --front-matter=markdown '.' staging/heroes/3_md_sections/path/to/file.md + ``` +2. Check `input/classification.json` for the SCC tree state. +3. Check `staging/scc_to_path.json` for the SCC-to-path mapping. + +### Debug section extraction + +1. Check the expanded section config: + ```bash + cat staging/heroes/0_sections/.yml + ``` +2. 
Verify the XPath matches content in the HTML: + ```bash + cat staging/heroes/1_html/Draw\ Steel\ Heroes.html | grep -i " Date: Tue, 14 Apr 2026 16:31:01 -0400 Subject: [PATCH 2/2] Moving to match schema with data-sdk-npm: phase 2 --- .repo-docs/architecture.md | 5 --- .repo-docs/ci-cd.md | 60 ++++++++++++++++++++++++++++++++++ .repo-docs/conventions.md | 5 --- .repo-docs/decisions/README.md | 5 --- .repo-docs/development.md | 5 --- .repo-docs/index.md | 7 ++-- .repo-docs/integration.md | 5 --- .repo-docs/project.md | 5 --- .repo-docs/troubleshooting.md | 5 --- CLAUDE.md | 8 +++++ 10 files changed, 72 insertions(+), 38 deletions(-) create mode 100644 .repo-docs/ci-cd.md create mode 100644 CLAUDE.md diff --git a/.repo-docs/architecture.md b/.repo-docs/architecture.md index fb489af..3812633 100644 --- a/.repo-docs/architecture.md +++ b/.repo-docs/architecture.md @@ -1,8 +1,3 @@ ---- -repo: data-gen -updated: 2026-04-05 ---- - # Architecture ## System Overview diff --git a/.repo-docs/ci-cd.md b/.repo-docs/ci-cd.md new file mode 100644 index 0000000..875f80f --- /dev/null +++ b/.repo-docs/ci-cd.md @@ -0,0 +1,60 @@ +# CI/CD + +## Pipeline Overview + +data-gen has no CI/CD pipeline. The ETL pipeline is run manually on a developer's machine inside a devbox shell. There are no automated builds, tests, or deployments. + +## Build Process + +All builds are local. The canonical build command: + +```bash +cd etl && devbox shell +just gen +``` + +This runs the full pipeline: wipe staging, heroes pipeline, monsters pipeline, adventures (stub), and unification. Output is copied to sibling `data-*` repos. 
+ +### Build Artifacts + +| Artifact | Location | Description | +|----------|----------|-------------| +| Staging intermediates | `staging/` (gitignored) | Ephemeral; wiped on each run | +| Final output | Sibling `data-*` repos | Markdown, JSON, YAML, DSE variants | +| Classification state | `input/classification.json` | SCC type/source tree; committed to this repo | + +## Branch Strategy + +| Branch | Purpose | +|--------|---------| +| `main` | Stable, current output | +| `develop` | Active development | +| `monsters` | Monster book pipeline work | +| `links` | SCC linking feature development | +| Various feature branches | Short-lived, merged to develop or main | + +No branch protection rules are configured. + +## Release Process + +There are no versioned releases or tags. The workflow is: + +1. Run `just gen` locally +2. Commit and push changes to each `data-*` output repo manually +3. In the `compendium` project, run `just update` to pull new data commits into the website + +### Rollback + +Roll back by reverting commits in the affected `data-*` repos and re-running `just update` in the compendium project. + +## Environments + +| Environment | Description | +|-------------|-------------| +| Local (devbox shell) | Only environment; all pipeline work happens here | + +## Secrets and Configuration + +| Secret | Used by | Description | +|--------|---------|-------------| +| `OPEN_AI_KEY` | `pdf_to_md/` only | OpenAI API key for LLM-assisted PDF conversion. Not needed for the main ETL pipeline. 
| diff --git a/.repo-docs/conventions.md b/.repo-docs/conventions.md index 8dcb814..be60e39 100644 --- a/.repo-docs/conventions.md +++ b/.repo-docs/conventions.md @@ -1,8 +1,3 @@ ---- -repo: data-gen -updated: 2026-04-05 ---- - # Conventions ## File and Directory Naming diff --git a/.repo-docs/decisions/README.md b/.repo-docs/decisions/README.md index 930bb74..e2ec8d5 100644 --- a/.repo-docs/decisions/README.md +++ b/.repo-docs/decisions/README.md @@ -1,8 +1,3 @@ ---- -repo: data-gen -updated: 2026-04-05 ---- - # Decision Log A running record of architectural and design decisions made in this project. diff --git a/.repo-docs/development.md b/.repo-docs/development.md index 34cd79c..8e9878b 100644 --- a/.repo-docs/development.md +++ b/.repo-docs/development.md @@ -1,8 +1,3 @@ ---- -repo: data-gen -updated: 2026-04-05 ---- - # Development ## Prerequisites diff --git a/.repo-docs/index.md b/.repo-docs/index.md index 58df9e8..123adc4 100644 --- a/.repo-docs/index.md +++ b/.repo-docs/index.md @@ -8,7 +8,7 @@ tech: - python (frontmatter, classification, linking) - pandoc (markdown/html conversion) - devbox (environment management) -updated: 2026-04-05 +updated: 2026-04-07 --- # data-gen @@ -81,7 +81,7 @@ data-gen/ | **New to this repo** | This file | [project.md](project.md) | | **Developer** | [development.md](development.md) | [architecture.md](architecture.md), [conventions.md](conventions.md) | | **Architect** | [architecture.md](architecture.md) | [integration.md](integration.md), [decisions/](decisions/) | -| **DevOps / SRE** | [development.md](development.md) | [integration.md](integration.md) | +| **DevOps / SRE** | [ci-cd.md](ci-cd.md) | [development.md](development.md), [integration.md](integration.md) | ### Agent Roles @@ -90,6 +90,7 @@ data-gen/ | **Code review** | [conventions.md](conventions.md) | [architecture.md](architecture.md) | | **Bug fix / debug** | [troubleshooting.md](troubleshooting.md) | [development.md](development.md), 
[architecture.md](architecture.md) | | **Feature implementation** | [architecture.md](architecture.md) | [conventions.md](conventions.md), [development.md](development.md), [decisions/](decisions/) | +| **CI/CD / DevOps** | [ci-cd.md](ci-cd.md) | [development.md](development.md), [integration.md](integration.md) | | **Documentation** | This file | [project.md](project.md), [architecture.md](architecture.md) | | **Onboarding / Q&A** | This file | [project.md](project.md), [development.md](development.md) | @@ -103,11 +104,11 @@ data-gen/ | File | Description | |------|-------------| -| [index.md](index.md) | This file -- overview, quick reference, structure | | [project.md](project.md) | Domain context, glossary, feature inventory | | [architecture.md](architecture.md) | Pipeline stages, components, data flow | | [development.md](development.md) | Setup, prerequisites, workflows | | [integration.md](integration.md) | Upstream/downstream repos, data contracts | +| [ci-cd.md](ci-cd.md) | Build process, release workflow, branch strategy | | [conventions.md](conventions.md) | Naming, commit style, code patterns | | [troubleshooting.md](troubleshooting.md) | Known issues, common errors | | [decisions/](decisions/) | Architectural decision records | diff --git a/.repo-docs/integration.md b/.repo-docs/integration.md index b317090..c598d76 100644 --- a/.repo-docs/integration.md +++ b/.repo-docs/integration.md @@ -1,8 +1,3 @@ ---- -repo: data-gen -updated: 2026-04-05 ---- - # Integration ## Dependency Map diff --git a/.repo-docs/project.md b/.repo-docs/project.md index bd9a586..69946e0 100644 --- a/.repo-docs/project.md +++ b/.repo-docs/project.md @@ -1,8 +1,3 @@ ---- -repo: data-gen -updated: 2026-04-05 ---- - # Project Context ## Product Overview diff --git a/.repo-docs/troubleshooting.md b/.repo-docs/troubleshooting.md index 2421939..515f059 100644 --- a/.repo-docs/troubleshooting.md +++ b/.repo-docs/troubleshooting.md @@ -1,8 +1,3 @@ ---- -repo: data-gen -updated: 
2026-04-05 ---- - # Troubleshooting ## Do NOT diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..2ef88f0 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,8 @@ +# data-gen + +ETL pipeline that converts Draw Steel TTRPG source documents (markdown) into structured, multi-format output distributed across multiple `data-*` repos. + +## Repository Documentation + +This repo uses standardized `.repo-docs/` documentation. **Read `.repo-docs/index.md` +first** -- it contains the reading guide, role-based routing, and links to all other docs.