docs: add plan for docs-driven skill generation by Ethan-Arrowood · Pull Request #35 · HarperFast/skills

Ethan-Arrowood · 2026-05-22T19:02:50Z

Summary

Adds docs/plans/docs-driven-skills.md — the design for auto-generating Harper skill rules from the documentation repo so they stop drifting from the source of truth.

This PR is the plan document only. No implementation, no behavior change. Merging it commits the team to the approach; the work itself is broken into phases inside the doc.

Background

Today the 20 rule files under harper-best-practices/rules/ are maintained by hand. Agents sometimes assist, but a human still has to notice a docs change, prompt the rewrite, and open a PR. Drift example: the `rest: true` config prerequisite existed in reference/rest/overview.md long before the skill was patched to mention it.

What the plan covers

Concepts. Defines rule vs. skill vs. AGENTS.md vs. manifest, so the file purposes are explicit. Two generation modes: generate (auto-produced from docs) and synthesized (hand-authored, for content with no canonical docs source).
Guiding principle. Humans own the rule taxonomy; automation owns keeping rule bodies in sync with their declared sources.
User stories. Five workflows covering automated regen on docs prose changes, adding a new rule manually, authoring synthesized rules, fixing the manifest when docs structure changes, and adding a whole new skill.
Phased migration.
- Phase 0 — manifest + lightweight validator, every existing rule mapped as synthesized (no behavior change today)
- Phase 1 — one rule end-to-end (vector-indexing)
- Phase 2 — expand to obvious .md-only rules
- Phase 3 — flat-markdown export in HarperFast/documentation (Docusaurus plugin that flattens MDX components to plain markdown alongside the HTML build)
- Phase 4 — MDX-sourced rules + observability
- Phase 5 — steady state
Developer documentation. Substantially expands the existing .github/CONTRIBUTING.MD to explain repo anatomy, the generation pipeline, common contributor tasks, and what's automated vs. manual.
Validation layer. Spec for validate-generated.mjs (manifest completeness, provenance comments, must-cover assertions, MDX leakage, cross-link integrity, AGENTS.md round-trip).
Alternative: pointer strategy. Documented as a known fallback only — explicitly not in scope of this plan. If the generation approach disappoints, the team can revisit.
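
Two of the checks listed for validate-generated.mjs (must-cover assertions and MDX leakage) can be sketched as below. This is a minimal illustration, not the plan's implementation: the function names, the assumption that `must_cover` is a list of plain phrases, and the leakage pattern are all hypothetical.

```javascript
// Hypothetical helpers; names and the shape of `must_cover` are assumptions.

// Returns the must-cover phrases that never appear in the rule body.
// Matching is a naive case-insensitive substring test.
function findMissingCoverage(body, mustCover) {
  const haystack = body.toLowerCase();
  return mustCover.filter((phrase) => !haystack.includes(phrase.toLowerCase()));
}

// Returns lines outside fenced code that look like leaked MDX:
// import/export statements or capitalized JSX tags such as <Tabs>.
function findMdxLeakage(body) {
  const leakPattern = /^\s*(import|export)\s|<\/?[A-Z][A-Za-z0-9]*[\s/>]/;
  const leaks = [];
  let inFence = false;
  for (const line of body.split("\n")) {
    if (line.trimStart().startsWith("```")) {
      inFence = !inFence; // fenced examples may legitimately contain imports
      continue;
    }
    if (!inFence && leakPattern.test(line)) leaks.push(line);
  }
  return leaks;
}
```

Both checks are deliberately naive. A prose line that starts with a lowercase "export", or inline code that mentions a component, would be flagged, so a real validator would likely work on a parsed markdown tree instead of raw lines.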

Reviewer notes

The plan does not lock down implementation specifics that should be debated during Phase 0 (manifest YAML shape is illustrative, not final).
Phase 3 is the largest cross-repo commitment and the one most worth a careful read. It assumes the docs team (same person, in this case) is willing to ship a flat-markdown export. If that's contentious, surface it now.
Phase 3 can run in parallel with Phase 2 — Phase 2 candidates all source from .md files.
The plan file itself is a planning artifact; once Phase 5 lands it can be archived. .github/CONTRIBUTING.MD is the long-lived companion.

🤖 Generated with Claude Code

Captures the design for auto-generating Harper skill rules from the documentation repo. Covers concepts (rule/skill/manifest/modes), user stories for automatic and manual workflows, a phased migration starting from today's hand-authored rules, the validation layer, and a documented (but out-of-scope) pointer-strategy fallback. Phase 3 commits to a flat-markdown export from HarperFast/documentation as the source of truth for MDX content, rather than parsing MDX statically from the skills side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cb1kenobi

Fascinating!

kriszyp

This is fantastic, this is an excellent approach, thank you for putting this together!
I left some comments to consider as you build this, but I certainly think you can move forward with it.

kriszyp · 2026-05-23T13:08:03Z

+2. Skills `generate.yaml` runs: reads `rules.manifest.yaml`, fetches docs at that SHA, detects that the input hash for `querying-rest-apis` changed.
+3. The generator calls Claude under the rule template, produces a new `rules/querying-rest-apis.md`, refreshes `AGENTS.md`, and updates the lock file.
+4. Workflow opens a PR: `docs: regenerate rules from documentation@a1b2c3d`. The PR body lists which rules changed and links the upstream docs commit.
+5. A maintainer reviews the diff — agent-facing prose still reads cleanly, the new edge case is mentioned. Merge.


This is the right approach for the beginning, but the goal would be to hopefully eliminate this, based on your guiding principle, right?

Anything that re-renders prose when docs change is automated

You mean like AI reviewed (or just no review if we trust it enough?)

kriszyp · 2026-05-23T13:09:41Z

+   ```
+2. Runs `npm run generate` locally. The script produces `rules/streaming-uploads.md`, rebuilds `AGENTS.md`, updates the lock file.
+3. Opens a PR titled `feat: add streaming-uploads rule`. PR includes the manifest change _and_ the generated body so reviewers can see what the agent will read.
+4. After review and merge, semantic-release publishes a minor version (because `feat:`).


And likewise once we gain confidence in updates, hopefully additions would be fully automated (no human) as well per the guiding principle.

kriszyp · 2026-05-23T13:15:32Z

+
+Work in `HarperFast/documentation`:
+
+- Add a Docusaurus plugin (or remark/rehype pass in the existing build pipeline) that, for each MDX page, walks the AST and emits a flat-markdown rendering at `build/flat/<source-path>.md`. Component handling:


Just to clarify, the analysis here is that our current MDX files have too much noise from JSX components/tags, and that we should strip them down to cleaner MD files? And that simply offering agent guidance for the renderer ("please ignore JSX components") is likely to be less efficient (for agents/LLMs) than reading docs with AST cleansing?
I don't know if this influences the technique, but I believe our source files are much closer to what we want agents to read than our generated HTML. A technique that can directly translate source to "flat" markdown without dealing with the HTML seems ideal.
I think this also solves the long-standing question of providing agent-optimized Markdown for public AI crawlers, hopefully in an efficient manner.

Yes and I've started using MDX for reusing certain bits of info. See this example: https://github.com/HarperFast/documentation/blob/547bbfc679fd772f591c88d032be22ab5da67133/learn/getting-started/create-your-first-application.mdx?plain=1#L30

We don't do this in the reference materials yet, but we likely will.

While the <Tab> MDX element isn't much of an issue, the content components in https://github.com/HarperFast/documentation/tree/main/src/components/learn do need to be rendered for their content to be properly included.

kriszyp · 2026-05-23T13:18:53Z

+
+### Phase 4 — Awkward and MDX-sourced rules + observability
+
+With flat-markdown available, take on the remaining rules — including those that source from `/learn` MDX content — and stand up the observability layer that catches automation failures.


We are also going to consider regenerating existing skills content from the documentation source, right? I believe we should at least try that. Perhaps there are some existing skills (that would be considered "synthesized") that might be deemed high quality, but in general we want to actually replace our existing content in skills with the generated content (otherwise they are stuck in synthesized state hindering more automated regeneration).

Yes - this is part of the plan somewhere. We start with everything synthesized, and then we migrate slowly once things start working.

kriszyp · 2026-05-23T13:21:36Z

+
+These should be resolved before Phase 1 begins:
+
+- Anthropic API key provisioning for the skills repo's Actions runner — who owns it.


We have already gone down the path of acquiring an Anthropic API key for PR reviews, so hopefully that can be followed for this. I believe the API key generation is easy, just making sure we have the secret set up.
It seems like it is also worth considering the use of Gemini, and maybe Claude can build an option to use either. Again, we have tons of credits, so if economics start to play into this, that could be helpful (although I suspect this should be relatively inexpensive).

kriszyp · 2026-05-23T13:27:22Z

+
+This section documents a secondary strategy we may pivot to in the future. **It is not part of the implementation scope of this plan** — we are not building for it, designing flags around it, or constraining the generation work to accommodate it. It exists in this document so the team has a known fallback if the generation approach disappoints.
+
+If after Phase 2 or 3 the team decides generation isn't pulling its weight — auto-PRs are too noisy, prompt tuning never converges, or reviewer fatigue sets in — we pivot to **pointer mode**: embed the docs source directly into the skills repo (git submodule, subtree, or sparse checkout of `HarperFast/documentation`), and have each rule become a thin pointer file (frontmatter + "when to use" + a link into the embedded docs).


Is the hypothesis that generated skills/rules should be more succinct and conducive to LLMs remaining attentive to reading them, rather than LLMs starting to "skim" long embedded/linked documentation? Or is it partly the cleansing of JSX that benefits the skill? A "release-asset" (of cleansed flat markdown) as the source of skills could address that. Perhaps we might also want to consider more flexibility/hybrid-ness and offer "synthesized", "generated", and "flat" (or "direct") with the third option indicating that the source (flat) markdown file should be imported as-is without any LLM summarization.
I will say that I do believe the hypothesis that LLM summarization is likely to be better. But these might be good options to retain and compare.

I love this idea. Really this is just based on the brief conversation in Slack. @dawsontoth seems to prefer we generate rather than use direct. I don't have any reason to back one way or another. I specifically included the direct method as a backup in case we want to go that way. I'll have the plan incorporate it as a third option instead so we can have the best of both worlds immediately.

The particular word I used was "transform", not generate.

If it's not direct, then what would be the difference between generate and transform?

Are these the three modes, with (expanded) alternate names to consider?

synthesized/manual/human-crafted - Source of truth is in /skills

generated/transformed - Source of truth in docs (llms.txt plugin output), but summarized by AI into skills.

direct/flat/linked - Source of truth in docs (llms.txt plugin output), directly copied into skills
I know this is merged, and I am flexible on naming.

yeah this is just tooling so we can change it easily. I like what exists, but open to changing if others prefer some of the alternatives

Significant revisions to the docs-driven skills plan based on review discussion:

- Added formal Manifest Schema and Rule Frontmatter Schema sections with field reference tables. Manifest is the declarative source of truth; rule frontmatter (via `metadata` block) snapshots what was last generated.
- Added Generation Lifecycle section detailing the step-by-step regen flow: manifest lint, source resolution, hash skip-check, body production, validation, PR open.
- Introduced `direct` mode as a third generation mode alongside `generate` and `synthesized`. Verbatim flat-markdown import with no LLM call, for cases where docs prose is already agent-friendly.
- Restructured Validation Layer into four named layers (skill schema, manifest lint, manifest↔frontmatter reconciliation, per-mode body checks) with explicit applicability matrix.
- Removed the Alternative: Pointer Strategy section. `direct` mode subsumes its purpose without operational complexity.
- Phase 3 now commits to adopting `@signalwire/docusaurus-plugin-llms-txt` rather than building a custom MDX→MD pipeline. Multi-instance docs plugin support and rendered-HTML processing make it the right fit.
- Reordered phases so plugin adoption (Phase 1) comes before single-rule end-to-end (Phase 2), since all source resolution depends on the plugin's output.
- Offline-first data flow: skills workflow checks out and builds the docs repo locally, reads from build/ output. No network calls to fetch docs content. Local contributors point at a sibling docs checkout via --docs-path or DOCS_PATH.
- Developer documentation now targets the existing .github/CONTRIBUTING.MD rather than a new doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
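
For the offline-first bullet, locating the sibling docs checkout might look like the sketch below. Only the `--docs-path` flag and the `DOCS_PATH` variable come from the commit message. The function name, the two accepted flag forms, and the flag-over-environment precedence are assumptions.

```javascript
// Hypothetical resolver. Precedence (flag first, then env) is an assumption.
function resolveDocsPath(argv, env) {
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    // Form 1 (assumed): --docs-path ../documentation
    if (arg === "--docs-path" && i + 1 < argv.length) return argv[i + 1];
    // Form 2 (assumed): --docs-path=../documentation
    if (arg.startsWith("--docs-path=")) return arg.slice("--docs-path=".length);
  }
  // Fall back to the environment variable; null means "not configured".
  return env.DOCS_PATH || null;
}
```

A script would typically call this as `resolveDocsPath(process.argv.slice(2), process.env)` and fail with a clear message when the result is `null`.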

Ethan-Arrowood · 2026-05-26T16:22:59Z

Update for previously-approving reviewers — second commit (4712e52) summary

This commit substantially revises the plan based on follow-up design discussion. The high-level goal and guiding principle are unchanged; the implementation strategy and structure are meaningfully different. Worth re-reviewing.

TL;DR: The plan grew formal schema sections, added a third generation mode (direct), switched Phase 3 from a custom MDX→MD pipeline to adopting an existing community plugin, reordered phases so plugin adoption comes first, and committed to an offline-first data flow (no network fetches at sync time).

What's new

Manifest Schema section — formal field reference table for rules.manifest.yaml with a worked example. Defines every field (rule, description, category, priority, order, mode, sources[], must_cover, cross_links), which are required, and their types/semantics.
Rule Frontmatter Schema section — formal field reference for the metadata block in each rules/*.md frontmatter. Makes the manifest↔frontmatter relationship explicit: manifest is the declaration; frontmatter is the last-generated snapshot.
Generation Lifecycle section — step-by-step description of what a regen run does: manifest lint → per-rule resolve/hash/skip-check/produce-body → AGENTS.md refresh → validate → diff check → PR open.
direct mode — third generation mode alongside generate and synthesized. Verbatim flat-markdown import with no LLM call, for cases where docs prose is already concise and agent-friendly. Per-rule mode is reversible at any time.
Story 6 — new user story showing direct mode picked for the automatic-apis rule.
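
The per-rule skip-check in the lifecycle can be sketched as a small decision function. The mode names and the `inputHash` field come from this PR. The action names, the function signature, and the choice to regenerate when the recorded mode differs from the manifest are assumptions made for illustration.

```javascript
// Sketch only: action names and the mode-change rule are assumptions.
const ACTION_BY_MODE = new Map([
  ["generate", "call-llm"], // LLM rewrites docs prose into a rule body
  ["direct", "import-verbatim"], // flat markdown copied as-is, no LLM call
]);

function planRuleAction(manifestEntry, metadata, currentInputHash) {
  // Hand-authored rules have no docs source, so a regen run leaves them alone.
  if (manifestEntry.mode === "synthesized") return "skip";

  const action = ACTION_BY_MODE.get(manifestEntry.mode);
  if (!action) throw new Error(`unknown mode: ${manifestEntry.mode}`);

  // metadata is the last-generated snapshot stored in the rule's frontmatter.
  const modeChanged = !metadata || metadata.mode !== manifestEntry.mode;
  const inputsChanged = !metadata || metadata.inputHash !== currentInputHash;
  return modeChanged || inputsChanged ? action : "skip";
}
```

Keeping the decision separate from the work makes the skip path cheap: a run whose hashes all match produces no diff and so opens no PR.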

What changed in approach

Phase 3 strategy — was "build a custom MDX→MD pipeline with per-component handlers"; now "adopt @signalwire/docusaurus-plugin-llms-txt" (MIT-licensed, ~19k weekly downloads, used by Cedar / MLflow / others; explicitly handles multi-instance docs via Docusaurus's postBuild route data). Backed by extensive evaluation summarized in facebook/docusaurus#10899. The custom-pipeline approach is preserved as a fallback in the same section.
Phase order — plugin adoption moved from Phase 3 to Phase 1. Since all source resolution now depends on the plugin's output, single-rule end-to-end (now Phase 2) can't run until the plugin is installed. Phase 0 (skills-side plumbing) and Phase 1 (docs-side plugin) can run in parallel.
Data flow is now offline-first — skills workflow checks out and builds the docs repo locally, reading from build/ directory. No network calls to fetch docs content. Local contributors point at a sibling docs checkout via --docs-path flag or DOCS_PATH env var. CI cost is a full Docusaurus build per sync; optimizations (caching) deferred.
Provenance moved from HTML comments to YAML frontmatter — replaces  HTML comments with a metadata block in the rule's frontmatter (mode, sources, sourceCommit, inputHash). Reasoning: frontmatter is structured, already parsed by gray-matter, and matches the existing SKILL.md metadata pattern.
Lock file removed — rules.manifest.lock.json is gone. Per-rule input hash now lives in the rule's own frontmatter, removing a sync-concern between two files.
Validation Layer restructured — was a flat list of checks; now four named layers (skill schema → manifest lint → manifest↔frontmatter reconciliation → per-mode body checks) with an explicit per-mode applicability matrix. Layer 3 is the gate that makes the manifest causally authoritative.
Open Questions — each question now annotated with the earliest phase it blocks (Phase 1, Phase 2, or Phase 4), replacing the previous flat "before Phase 1 begins" framing.
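
The Layer 3 reconciliation (manifest↔frontmatter) could be approximated as below. This is a sketch under assumptions: `sources` is treated as a list of path strings, only `mode` and `sources` are compared, and the function name and message wording are hypothetical.

```javascript
// Hypothetical Layer 3 check: the manifest is the declaration, the
// frontmatter metadata block is the last-generated snapshot.
function reconcileRule(manifestEntry, metadata) {
  const { rule, mode, sources = [] } = manifestEntry;
  if (!metadata) return [`${rule}: frontmatter has no metadata block`];

  const problems = [];
  if (metadata.mode !== mode) {
    problems.push(`${rule}: mode is "${metadata.mode}" in frontmatter but "${mode}" in manifest`);
  }

  // Compare sources as sets; ordering differences are not treated as drift.
  const declared = [...sources].sort();
  const recorded = [...(metadata.sources ?? [])].sort();
  const sameSources =
    declared.length === recorded.length && declared.every((source, i) => source === recorded[i]);
  if (!sameSources) {
    problems.push(`${rule}: sources differ between manifest and frontmatter`);
  }
  return problems;
}
```

An empty result means the rule file still reflects what the manifest declares. Any entry in the list would fail validation, which is what keeps a hand edit to frontmatter from silently overriding the manifest.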

What was removed

Alternative: Pointer Strategy section — entirely removed. direct mode subsumes its purpose (offline-safe, deterministic, no LLM call, no variance) without the operational complexity of submodules / sparse checkouts.

What's unchanged

Goal, Guiding Principle ("Humans own the rule taxonomy; automation owns keeping rule bodies in sync").
Rule / Skill / AGENTS.md / Manifest concepts and the rule-vs-skill distinction.
Stories 1–5 (still illustrative; lightly updated to reflect new data-flow and metadata locations).
Developer Documentation deliverable (expanding .github/CONTRIBUTING.MD).
Migration philosophy (Phase 0 lands as no-behavior-change; progressive flip from synthesized to source-backed modes per rule).

Pure formatting pass: oxfmt normalized markdown table alignment in the docs-driven-skills plan. No content changes. Surfaced when the Phase 0 work ran the full validate pipeline (which includes the oxfmt --check step). Landing this on the plan branch directly so stacked phase branches inherit clean baseline state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-26T17:56:10Z

🎉 This PR is included in version 1.4.8 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

The validate pipeline had a structural flaw that silently masked formatting drift in committed source files:

Before: validate = build && oxfmt --check && validators
        build = node scripts/build.mjs (which ran `npm run format` at the end)
Effect: every CI run invoked the formatter against source files before checking, so the --check stage always passed against the just-reformatted tree. Committed files with format drift would silently get reformatted in the CI working copy and the check would report green. The drift accumulated until someone noticed in a `git diff` locally.

This was discovered when phase-0 (#36) rebased onto main after the plan PR (#35) merged: `programmatic-table-requests.md` from the intervening PR #34 had oxfmt-incompliant markdown tables that were never caught, and `npm run build` kept reformatting it locally on every run.

Structural fix:

- scripts/build.mjs: remove the trailing `npm run format` call. dist/ is in .gitignore, so the formatter was actually skipping it (oxfmt honors gitignore); the only thing the call did in practice was the unintended source-file side effect described above. dist/ output is machine-generated and consumed by npm consumers; it doesn't need to match the formatter.
- package.json: add `format:check` script ("oxfmt --check"), and reorder `validate` to run `format:check` *first*, before build. The gate now sees the committed file content, not a freshly reformatted version of it.
  Before: validate = build && oxfmt --check && validate-skills && validate-generated
  After: validate = format:check && build && validate-skills && validate-generated
- .github/CONTRIBUTING.MD: document the npm scripts and the format-check-first ordering so contributors know to run `npm run format` before committing.

Also includes the one file that had drifted under the old behavior: harper-best-practices/rules/programmatic-table-requests.md, a pure formatter change (markdown table column alignment) with no content change.

Verified end-to-end:

- `npm run validate` exits 0 on the current tree.
- Introducing a deliberate format violation (trailing whitespace on a heading) causes `npm run validate` to exit 1 with a clear error, blocking the rest of the pipeline.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
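
The ordering flaw is easy to reproduce in a toy model. In the sketch below a "tree" is just an object mapping file paths to text, and the formatter only strips trailing whitespace. It stands in for oxfmt and is not how oxfmt works. All names are hypothetical.

```javascript
// Toy stand-in for the formatter: strips trailing whitespace per line.
const formatText = (text) =>
  text.split("\n").map((line) => line.trimEnd()).join("\n");

const formatTree = (tree) =>
  Object.fromEntries(Object.entries(tree).map(([path, text]) => [path, formatText(text)]));

// Equivalent of a --check pass: true only if formatting changes nothing.
const isFormatted = (tree) =>
  Object.values(tree).every((text) => text === formatText(text));

// Old order: build (which formatted the sources) ran before the check,
// so the check only ever saw an already-formatted working copy.
function validateOldOrder(committedTree) {
  const workingCopy = formatTree(committedTree);
  return isFormatted(workingCopy);
}

// New order: the check runs first, against the committed content.
function validateNewOrder(committedTree) {
  return isFormatted(committedTree);
}
```

With a committed heading that carries trailing whitespace, `validateOldOrder` still returns `true` while `validateNewOrder` returns `false`, which mirrors the deliberate violation used in the verification above.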

Ethan-Arrowood requested a review from a team as a code owner May 22, 2026 19:02

Ethan-Arrowood requested review from dawsontoth, heskew and kriszyp May 22, 2026 19:03

dawsontoth approved these changes May 22, 2026

cb1kenobi approved these changes May 22, 2026

kriszyp approved these changes May 23, 2026

Ethan-Arrowood mentioned this pull request May 26, 2026

feat: Phase 0 — rules manifest and validation plumbing #36

Merged

Ethan-Arrowood merged commit 32fd798 into main May 26, 2026
2 checks passed

Ethan-Arrowood deleted the plan/docs-driven-skills branch May 26, 2026 17:55

github-actions Bot added the released label May 26, 2026

Ethan-Arrowood mentioned this pull request May 26, 2026

chore: enforce formatting in CI and fix drifted file #37

Merged


