docs: add plan for docs-driven skill generation #35

Merged
Ethan-Arrowood merged 3 commits into main from plan/docs-driven-skills
May 26, 2026

@Ethan-Arrowood
Member

Summary

Adds docs/plans/docs-driven-skills.md — the design for auto-generating Harper skill rules from the documentation repo so they stop drifting from the source of truth.

This PR is the plan document only. No implementation, no behavior change. Merging it commits the team to the approach; the work itself is broken into phases inside the doc.

Background

Today the 20 rule files under harper-best-practices/rules/ are maintained by hand — sometimes with agent assistance, but a human still has to notice a docs change, prompt the rewrite, and open a PR. Drift example: the rest: true config prereq existed in reference/rest/overview.md long before the skill was patched to mention it.

What the plan covers

  • Concepts. Defines rule vs. skill vs. AGENTS.md vs. manifest, so the file purposes are explicit. Two generation modes: generate (auto-produced from docs) and synthesized (hand-authored, for content with no canonical docs source).
  • Guiding principle. Humans own the rule taxonomy; automation owns keeping rule bodies in sync with their declared sources.
  • User stories. Five workflows covering automated regen on docs prose changes, adding a new rule manually, authoring synthesized rules, fixing the manifest when docs structure changes, and adding a whole new skill.
  • Phased migration.
    • Phase 0 — manifest + lightweight validator, every existing rule mapped as synthesized (no behavior change today)
    • Phase 1 — one rule end-to-end (vector-indexing)
    • Phase 2 — expand to obvious .md-only rules
    • Phase 3 — flat-markdown export in HarperFast/documentation (Docusaurus plugin that flattens MDX components to plain markdown alongside the HTML build)
    • Phase 4 — MDX-sourced rules + observability
    • Phase 5 — steady state
  • Developer documentation. Substantially expands the existing .github/CONTRIBUTING.MD to explain repo anatomy, the generation pipeline, common contributor tasks, and what's automated vs. manual.
  • Validation layer. Spec for validate-generated.mjs (manifest completeness, provenance comments, must-cover assertions, MDX leakage, cross-link integrity, AGENTS.md round-trip).
  • Alternative: pointer strategy. Documented as a known fallback only — explicitly not in scope of this plan. If the generation approach disappoints, the team can revisit.

Reviewer notes

  • The plan does not lock down implementation specifics that should be debated during Phase 0 (manifest YAML shape is illustrative, not final).
  • Phase 3 is the largest cross-repo commitment and the one most worth a careful read. It assumes the docs team (same person, in this case) is willing to ship a flat-markdown export. If that's contentious, surface it now.
  • Phase 3 can run in parallel with Phase 2 — Phase 2 candidates all source from .md files.
  • The plan file itself is a planning artifact; once Phase 5 lands it can be archived. .github/CONTRIBUTING.MD is the long-lived companion.

🤖 Generated with Claude Code

Captures the design for auto-generating Harper skill rules from the
documentation repo. Covers concepts (rule/skill/manifest/modes), user
stories for automatic and manual workflows, a phased migration starting
from today's hand-authored rules, the validation layer, and a
documented (but out-of-scope) pointer-strategy fallback.

Phase 3 commits to a flat-markdown export from HarperFast/documentation
as the source of truth for MDX content, rather than parsing MDX
statically from the skills side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Ethan-Arrowood requested a review from a team as a code owner May 22, 2026 19:02

Member

@cb1kenobi left a comment

Fascinating!

Member

@kriszyp left a comment

This is fantastic and an excellent approach. Thank you for putting this together!
I left some comments to consider as you build this, but I certainly think you can move forward with it.

2. Skills `generate.yaml` runs: reads `rules.manifest.yaml`, fetches docs at that SHA, detects that the input hash for `querying-rest-apis` changed.
3. The generator calls Claude under the rule template, produces a new `rules/querying-rest-apis.md`, refreshes `AGENTS.md`, and updates the lock file.
4. Workflow opens a PR: `docs: regenerate rules from documentation@a1b2c3d`. The PR body lists which rules changed and links the upstream docs commit.
5. A maintainer reviews the diff — agent-facing prose still reads cleanly, the new edge case is mentioned. Merge.

Member

This is the right approach for the beginning, but the goal would be to hopefully eliminate this, based on your guiding principle, right?

> Anything that re-renders prose when docs change is automated

Member Author

You mean AI-reviewed (or just no review, if we trust it enough)?

2. Runs `npm run generate` locally. The script produces `rules/streaming-uploads.md`, rebuilds `AGENTS.md`, updates the lock file.
3. Opens a PR titled `feat: add streaming-uploads rule`. PR includes the manifest change _and_ the generated body so reviewers can see what the agent will read.
4. After review and merge, semantic-release publishes a minor version (because `feat:`).

Member

And likewise once we gain confidence in updates, hopefully additions would be fully automated (no human) as well per the guiding principle.

Comment thread docs/plans/docs-driven-skills.md Outdated

Work in `HarperFast/documentation`:

- Add a Docusaurus plugin (or remark/rehype pass in the existing build pipeline) that, for each MDX page, walks the AST and emits a flat-markdown rendering at `build/flat/<source-path>.md`. Component handling:

Member

Just to clarify, the analysis here is that our current MDX files have too much noise from JSX components/tags, and that the fix is stripping them down to cleaner MD files? And that simply offering agent guidance for the renderer ("please ignore JSX components") is likely to be less efficient (for agents/LLMs) than reading docs with AST cleansing?
I don't know if this influences the technique, but I believe our source files are much closer to what we want agents to read than our generated HTML. A technique that can directly translate source to "flat" markdown without dealing with the HTML seems ideal.
I think this also solves the long-standing question of providing agent-optimized Markdown for public AI crawlers, hopefully in an efficient manner.

Member Author

Yes, and I've started using MDX for reusing certain bits of info. See this example: https://github.com/HarperFast/documentation/blob/547bbfc679fd772f591c88d032be22ab5da67133/learn/getting-started/create-your-first-application.mdx?plain=1#L30

We don't do this in the reference materials yet, but we likely will.

While the <Tab> MDX element isn't much of an issue, the content ones in https://github.com/HarperFast/documentation/tree/main/src/components/learn do need to be rendered for their content to be properly included.

Comment thread docs/plans/docs-driven-skills.md Outdated

### Phase 4 — Awkward and MDX-sourced rules + observability

With flat-markdown available, take on the remaining rules — including those that source from `/learn` MDX content — and stand up the observability layer that catches automation failures.

Member

We are also going to consider regenerating existing skills content from the documentation source, right? I believe we should at least try that. Perhaps there are some existing skills (that would be considered "synthesized") that might be deemed high quality, but in general we want to actually replace our existing content in skills with the generated content (otherwise they are stuck in synthesized state hindering more automated regeneration).

Member Author

Yes - this is part of the plan somewhere. We start with everything synthesized, and then we migrate slowly once things start working.

Comment thread docs/plans/docs-driven-skills.md Outdated

These should be resolved before Phase 1 begins:

- Anthropic API key provisioning for the skills repo's Actions runner — who owns it.

Member

We have already gone down the path of acquiring an Anthropic API key for PR reviews, so hopefully the same process can be followed here. I believe the API key generation is easy; it is just a matter of making sure we have the secret set up.
It seems like it is also worth considering the use of Gemini, and maybe Claude can build an option to use either. Again, we have tons of credits, so if economics start to play into this, that could be helpful (although I suspect this should be relatively inexpensive).

Comment thread docs/plans/docs-driven-skills.md Outdated

This section documents a secondary strategy we may pivot to in the future. **It is not part of the implementation scope of this plan** — we are not building for it, designing flags around it, or constraining the generation work to accommodate it. It exists in this document so the team has a known fallback if the generation approach disappoints.

If after Phase 2 or 3 the team decides generation isn't pulling its weight — auto-PRs are too noisy, prompt tuning never converges, or reviewer fatigue sets in — we pivot to **pointer mode**: embed the docs source directly into the skills repo (git submodule, subtree, or sparse checkout of `HarperFast/documentation`), and have each rule become a thin pointer file (frontmatter + "when to use" + a link into the embedded docs).

Member

Is the hypothesis that generated skills/rules should be more succinct and conducive to LLMs remaining attentive to reading them, rather than LLMs starting to "skim" long embedded/linked documentation? Or is it partly the cleansing of JSX that benefits the skill? A "release-asset" (of cleansed flat markdown) as the source of skills could address that. Perhaps we might also want to consider more flexibility/hybrid-ness and offer "synthesized", "generated", and "flat" (or "direct") with the third option indicating that the source (flat) markdown file should be imported as-is without any LLM summarization.
I will say that I do believe the hypothesis that LLM summarization is likely to be better. But these might be good options to retain and compare.

Member Author

I love this idea. Really this is just based on the brief conversation in Slack. @dawsontoth seems to prefer we generate rather than use direct. I don't have any reason to back one way or another. I specifically included the direct method as a backup in case we want to go that way. I'll have the plan incorporate it as a third option instead so we can have the best of both worlds immediately.

Contributor

The particular word I used was "transform", not generate.

Member Author

If it's not direct, then what would be the difference between generate and transform?

Member

These are three modes and the (expanded) alternate names to consider?

  • synthesized/manual/human-crafted - Source of truth is in /skills
  • generated/transformed - Source of truth in docs (llms.txt plugin output), but summarized by AI into skills.
  • direct/flat/linked - Source of truth in docs (llms.txt plugin output), directly copied into skills

I know this is merged, and I am flexible on naming.

Member Author

Yeah, this is just tooling, so we can change it easily. I like what exists, but I'm open to changing it if others prefer some of the alternatives.

Significant revisions to the docs-driven skills plan based on review
discussion:

- Added formal Manifest Schema and Rule Frontmatter Schema sections
  with field reference tables. Manifest is the declarative source of
  truth; rule frontmatter (via `metadata` block) snapshots what was
  last generated.
- Added Generation Lifecycle section detailing the step-by-step regen
  flow: manifest lint, source resolution, hash skip-check, body
  production, validation, PR open.
- Introduced `direct` mode as a third generation mode alongside
  `generate` and `synthesized`. Verbatim flat-markdown import with no
  LLM call, for cases where docs prose is already agent-friendly.
- Restructured Validation Layer into four named layers (skill schema,
  manifest lint, manifest↔frontmatter reconciliation, per-mode body
  checks) with explicit applicability matrix.
- Removed the Alternative: Pointer Strategy section. `direct` mode
  subsumes its purpose without operational complexity.
- Phase 3 now commits to adopting `@signalwire/docusaurus-plugin-llms-txt`
  rather than building a custom MDX→MD pipeline. Multi-instance docs
  plugin support and rendered-HTML processing make it the right fit.
- Reordered phases so plugin adoption (Phase 1) comes before single-rule
  end-to-end (Phase 2), since all source resolution depends on the
  plugin's output.
- Offline-first data flow: skills workflow checks out and builds the
  docs repo locally, reads from build/ output. No network calls to
  fetch docs content. Local contributors point at a sibling docs
  checkout via --docs-path or DOCS_PATH.
- Developer documentation now targets the existing .github/CONTRIBUTING.MD
  rather than a new doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Ethan-Arrowood
Member Author

Update for previously-approving reviewers — second commit (4712e52) summary

This commit substantially revises the plan based on follow-up design discussion. The high-level goal and guiding principle are unchanged; the implementation strategy and structure are meaningfully different. Worth re-reviewing.

TL;DR: The plan grew formal schema sections, added a third generation mode (direct), switched Phase 3 from a custom MDX→MD pipeline to adopting an existing community plugin, reordered phases so plugin adoption comes first, and committed to an offline-first data flow (no network fetches at sync time).

What's new

  • Manifest Schema section — formal field reference table for rules.manifest.yaml with a worked example. Defines every field (rule, description, category, priority, order, mode, sources[], must_cover, cross_links), which are required, and their types/semantics.
  • Rule Frontmatter Schema section — formal field reference for the metadata block in each rules/*.md frontmatter. Makes the manifest↔frontmatter relationship explicit: manifest is the declaration; frontmatter is the last-generated snapshot.
  • Generation Lifecycle section — step-by-step description of what a regen run does: manifest lint → per-rule resolve/hash/skip-check/produce-body → AGENTS.md refresh → validate → diff check → PR open.
  • direct mode — third generation mode alongside generate and synthesized. Verbatim flat-markdown import with no LLM call, for cases where docs prose is already concise and agent-friendly. Per-rule mode is reversible at any time.
  • Story 6 — new user story showing direct mode picked for the automatic-apis rule.

What changed in approach

  • Phase 3 strategy — was "build a custom MDX→MD pipeline with per-component handlers"; now "adopt @signalwire/docusaurus-plugin-llms-txt" (MIT-licensed, ~19k weekly downloads, used by Cedar / MLflow / others; explicitly handles multi-instance docs via Docusaurus's postBuild route data). Backed by extensive evaluation summarized in facebook/docusaurus#10899. The custom-pipeline approach is preserved as a fallback in the same section.
  • Phase order — plugin adoption moved from Phase 3 to Phase 1. Since all source resolution now depends on the plugin's output, single-rule end-to-end (now Phase 2) can't run until the plugin is installed. Phase 0 (skills-side plumbing) and Phase 1 (docs-side plugin) can run in parallel.
  • Data flow is now offline-first — skills workflow checks out and builds the docs repo locally, reading from build/ directory. No network calls to fetch docs content. Local contributors point at a sibling docs checkout via --docs-path flag or DOCS_PATH env var. CI cost is a full Docusaurus build per sync; optimizations (caching) deferred.
  • Provenance moved from HTML comments to YAML frontmatter — replaces <!-- generated-from: ... --> HTML comments with a metadata block in the rule's frontmatter (mode, sources, sourceCommit, inputHash). Reasoning: frontmatter is structured, already parsed by gray-matter, and matches the existing SKILL.md metadata pattern.
  • Lock file removed: rules.manifest.lock.json is gone. Per-rule input hash now lives in the rule's own frontmatter, removing a sync concern between two files.
  • Validation Layer restructured — was a flat list of checks; now four named layers (skill schema → manifest lint → manifest↔frontmatter reconciliation → per-mode body checks) with an explicit per-mode applicability matrix. Layer 3 is the gate that makes the manifest causally authoritative.
  • Open Questions — each question now annotated with the earliest phase it blocks (Phase 1, Phase 2, or Phase 4), replacing the previous flat "before Phase 1 begins" framing.
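
A rough sketch of what the Layer 3 reconciliation gate could look like follows. The `reconcile` function, the example file names, and the assumption that both `sources` lists are plain arrays of path strings are all hypothetical; the plan's actual schema and checks may differ.

```javascript
// Hypothetical sketch of a manifest/frontmatter reconciliation check (Layer 3).
// Assumes manifest `sources` and frontmatter `metadata.sources` are arrays of path strings.
function reconcile(manifestEntry, metadata) {
  const problems = [];
  if (!metadata) {
    problems.push(`${manifestEntry.rule}: rule file has no metadata block`);
    return problems;
  }
  if (metadata.mode !== manifestEntry.mode) {
    problems.push(
      `${manifestEntry.rule}: mode is "${metadata.mode}" in frontmatter but "${manifestEntry.mode}" in manifest`
    );
  }
  // Compare sources order-insensitively by sorting copies of both lists.
  const declared = [...(manifestEntry.sources ?? [])].sort();
  const recorded = [...(metadata.sources ?? [])].sort();
  if (JSON.stringify(declared) !== JSON.stringify(recorded)) {
    problems.push(`${manifestEntry.rule}: sources differ between manifest and frontmatter`);
  }
  return problems;
}

// Made-up data: the manifest says `generate`, the rule file still says `synthesized`.
const manifestEntry = { rule: 'vector-indexing', mode: 'generate', sources: ['a.md', 'b.md'] };
const staleMetadata = { mode: 'synthesized', sources: ['a.md'] };
const problems = reconcile(manifestEntry, staleMetadata);
console.log(problems.length); // 2
```

Failing the run whenever this list is non-empty is one way to make the manifest authoritative: a rule file cannot quietly disagree with its declaration.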

What was removed

  • Alternative: Pointer Strategy section — entirely removed. direct mode subsumes its purpose (offline-safe, deterministic, no LLM call, no variance) without the operational complexity of submodules / sparse checkouts.

What's unchanged

  • Goal, Guiding Principle ("Humans own the rule taxonomy; automation owns keeping rule bodies in sync").
  • Rule / Skill / AGENTS.md / Manifest concepts and the rule-vs-skill distinction.
  • Stories 1–5 (still illustrative; lightly updated to reflect new data-flow and metadata locations).
  • Developer Documentation deliverable (expanding .github/CONTRIBUTING.MD).
  • Migration philosophy (Phase 0 lands as no-behavior-change; progressive flip from synthesized to source-backed modes per rule).

Pure formatting pass — oxfmt normalized markdown table alignment in
the docs-driven-skills plan. No content changes. Surfaced when the
Phase 0 work ran the full validate pipeline (which includes the
oxfmt --check step). Landing this on the plan branch directly so
stacked phase branches inherit clean baseline state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Ethan-Arrowood merged commit 32fd798 into main May 26, 2026
2 checks passed
@Ethan-Arrowood deleted the plan/docs-driven-skills branch May 26, 2026 17:55
@github-actions

🎉 This PR is included in version 1.4.8 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Ethan-Arrowood added a commit that referenced this pull request May 26, 2026
The validate pipeline had a structural flaw that silently masked
formatting drift in committed source files:

  Before:
    validate = build && oxfmt --check && validators
    build    = node scripts/build.mjs (which ran `npm run format` at the end)

  Effect: every CI run invoked the formatter against source files
  before checking, so the --check stage always passed against the
  just-reformatted tree. Committed files with format drift would
  silently get reformatted in the CI working copy and the check
  would report green. The drift accumulated until someone noticed
  in a `git diff` locally.

This was discovered when phase-0 (#36) rebased onto main after the
plan PR (#35) merged: `programmatic-table-requests.md` from the
intervening PR #34 had oxfmt-incompliant markdown tables that were
never caught, and `npm run build` kept reformatting it locally on
every run.

Structural fix:

- scripts/build.mjs: remove the trailing `npm run format` call. dist/
  is in .gitignore, so the formatter was actually skipping it (oxfmt
  honors gitignore); the only thing the call did in practice was the
  unintended source-file side effect described above. dist/ output is
  machine-generated and consumed by npm consumers; it doesn't need to
  match the formatter.

- package.json: add `format:check` script ("oxfmt --check"), and
  reorder `validate` to run `format:check` *first*, before build.
  The gate now sees the committed file content, not a freshly
  reformatted version of it.

  Before: validate = build && oxfmt --check && validate-skills && validate-generated
  After:  validate = format:check && build && validate-skills && validate-generated

- .github/CONTRIBUTING.MD: document the npm scripts and the
  format-check-first ordering so contributors know to run
  `npm run format` before committing.

Also includes the one file that had drifted under the old behavior:
harper-best-practices/rules/programmatic-table-requests.md — pure
formatter change (markdown table column alignment), no content change.

Verified end-to-end:
- `npm run validate` exits 0 on the current tree.
- Introducing a deliberate format violation (trailing whitespace on a
  heading) causes `npm run validate` to exit 1 with a clear error,
  blocking the rest of the pipeline.
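
The masking effect described above can be reproduced with a toy model. This is only a sketch: `steps`, `runPipeline`, and the `tree.formatted` flag are hypothetical stand-ins for the real npm scripts and working tree.

```javascript
// Toy model of the validate pipeline; not the real npm scripts.
// `tree.formatted` models whether committed source files match the formatter.
const steps = {
  // Old build: ran the formatter at the end, mutating the working copy.
  oldBuild: (tree) => { tree.formatted = true; return true; },
  // New build: no formatter side effect.
  build: () => true,
  // Format check fails when the tree has drifted.
  formatCheck: (tree) => tree.formatted,
  validators: () => true,
};

// Run steps in order, stopping at the first failure (like `a && b && c`).
function runPipeline(names, tree) {
  for (const name of names) {
    if (!steps[name](tree)) return { ok: false, failedAt: name };
  }
  return { ok: true, failedAt: null };
}

// Both runs start from a tree with format drift.
const before = runPipeline(['oldBuild', 'formatCheck', 'validators'], { formatted: false });
const after = runPipeline(['formatCheck', 'build', 'validators'], { formatted: false });
console.log(before.ok, after.ok, after.failedAt); // true false formatCheck
```

In the old ordering the build repairs the drift before the check sees it, so the run reports green; with the check first, the same drifted tree fails immediately.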

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 participants