diff --git a/plugins/compound-engineering/AGENTS.md b/plugins/compound-engineering/AGENTS.md index 4e813b111..f9e348157 100644 --- a/plugins/compound-engineering/AGENTS.md +++ b/plugins/compound-engineering/AGENTS.md @@ -12,7 +12,7 @@ They supplement the repo-root `AGENTS.md`. Consequences: - Behavioral rules that govern skill *runtime* behavior must live inside the skill itself — in `SKILL.md` or files under its `references/`. Guidance placed in this file is invisible at runtime. -- When two or more skills share a behavioral principle, duplicate the guidance into each skill (inline for short rules, `references/` for longer ones). There is no cross-skill shared-file mechanism (see "File References in Skills" below). +- When two or more skills share a behavioral principle, duplicate the guidance into each skill (inline for short rules, `references/` for longer ones). There is no cross-skill shared-file mechanism (see "File References in Skills" below). When a reference file is duplicated across skills (e.g., `concepts-vocabulary.md` in both `ce-compound/references/` and `ce-compound-refresh/references/`), edits must be applied to every copy in the same commit. Drift between copies produces inconsistent agent behavior depending on which skill loaded. - Do not propose that runtime guidance for ce-ideate, ce-brainstorm, ce-plan, or any other skill live in this AGENTS.md or in the repo-root AGENTS.md. Those files only shape how contributors edit the plugin. This is easy to miss because authoring feels like using: you edit the plugin while running inside this repo, and the repo's AGENTS.md is loaded — but that load does not follow the installed skill into a user's environment. 
diff --git a/plugins/compound-engineering/agents/ce-learnings-researcher.agent.md b/plugins/compound-engineering/agents/ce-learnings-researcher.agent.md index 845f92ef4..c1ff011a4 100644 --- a/plugins/compound-engineering/agents/ce-learnings-researcher.agent.md +++ b/plugins/compound-engineering/agents/ce-learnings-researcher.agent.md @@ -18,6 +18,12 @@ Past learnings span multiple shapes: Treat all of these as candidates. Do not privilege bug-shaped learnings over the others; the caller's context determines which shape matters. +## Step 0: Ground in CONCEPTS.md (if present) + +Before searching `docs/solutions/`, check whether `CONCEPTS.md` exists at the repo root. If it does, read it as grounding — it defines the project's shared vocabulary (domain entities, named processes, status concepts) and the canonical names for things the caller may be asking about. Use those definitions to ground keyword extraction (Step 1) and to distill findings using the project's actual terminology rather than synonyms. + +If `CONCEPTS.md` does not exist, skip this step entirely and proceed to Step 1. + ## Search Strategy (Grep-First Filtering) The `docs/solutions/` directory contains documented learnings with YAML frontmatter. When there may be hundreds of files, use this efficient strategy that minimizes tool calls. diff --git a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md index e45b5e0d6..4e2589d82 100644 --- a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md +++ b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md @@ -111,7 +111,7 @@ Scan the repo before substantive brainstorming. Match depth to scope: **Standard and Deep** — Two passes: -*Constraint Check* — Check project instruction files (`AGENTS.md`, and `CLAUDE.md` only if retained as compatibility context) for workflow, product, or scope constraints that affect the brainstorm. 
Also read `STRATEGY.md` if it exists — the product's target problem, approach, persona, and active tracks are direct input to what this brainstorm should deliver and should shape scope, success criteria, and which approaches are aligned vs out-of-scope. If these add nothing, move on. +*Constraint Check* — Check project instruction files (`AGENTS.md`, and `CLAUDE.md` only if retained as compatibility context) for workflow, product, or scope constraints that affect the brainstorm. Also read `STRATEGY.md` if it exists — the product's target problem, approach, persona, and active tracks are direct input to what this brainstorm should deliver and should shape scope, success criteria, and which approaches are aligned vs out-of-scope. Also read `CONCEPTS.md` at repo root if it exists — the project's authoritative vocabulary. Use these names in dialogue, approaches, and the requirements doc; map user-offered synonyms back. If any of these add nothing, move on. *Topic Scan* — Search for relevant terms. Read the most relevant existing artifact if one exists (brainstorm, plan, spec, skill, feature doc). Skim adjacent examples covering similar behavior. @@ -184,6 +184,18 @@ Follow the Interaction Rules above. Use the platform's blocking question tool wh **Exit condition:** Continue until the idea is clear AND no integration-check questions are pending, OR the user explicitly wants to proceed. +#### 1.4 Vocabulary Capture (only if CONCEPTS.md already exists) + +**Skip this sub-phase entirely if `CONCEPTS.md` does not exist at repo root** — creation is owned by ce-compound and ce-compound-refresh. + +If it exists, scan the dialogue for **resolved** domain terms — terms where the conversation actively pinned down a precise local meaning, not terms merely mentioned in passing. **Resolved means the dialogue is no longer questioning the definition.** Provisional terms that may still revise stay in the conversation only. 
+
+For each resolved term: if missing, add it; if present but the dialogue surfaced new precision, refine it; if already consistent, no action.
+
+**Domain entities, named processes, and status concepts with project-specific meaning only.** Not file paths, class names, function signatures, or implementation decisions — `CONCEPTS.md` is a glossary, not a spec or catch-all.
+
+Follow the format set by existing entries. Apply edits silently.
+
 ### Phase 2: Explore Approaches
 
 If multiple plausible directions remain, propose **2-3 concrete approaches** based on research and conversation. Otherwise state the recommended direction directly.
diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md
index cdb5933a9..79d2ebaa5 100644
--- a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md
@@ -165,6 +165,7 @@ A learning has several dimensions that can independently go stale. Surface-level
 - **Related docs** — are cross-referenced learnings and patterns still present and consistent?
 - **Auto memory** (Claude Code only) — does the injected auto-memory block in your system prompt contain entries in the same problem domain? Scan that block directly. If the block is absent, skip this dimension. A memory note describing a different approach than what the learning recommends is a supplementary drift signal.
 - **Overlap** — while investigating, note when another doc in scope covers the same problem domain, references the same files, or recommends a similar solution. For each overlap, record: the two file paths, which dimensions overlap (problem, solution, root cause, files, prevention), and which doc appears broader or more current. These signals feed Phase 1.75 (Document-Set Analysis).
+- **Vocabulary** — note domain terms the learning cites (entities, named processes, status concepts with project-specific meaning). For each term: does it appear in `CONCEPTS.md`? If yes, does the definition still match how the code uses the term? If no, flag the term for Phase 4.5 to add or bootstrap. Do not edit `CONCEPTS.md` during investigation — just collect the signal centrally. Match investigation depth to the learning's specificity — a learning referencing exact file paths and code snippets needs more verification than one describing a general principle. @@ -486,6 +487,31 @@ For each candidate, execute the flow that matches its classification from Phase Only one flow runs per candidate; the reference contains the per-action criteria, examples, and step-by-step instructions. +## Phase 4.5: Vocabulary Capture + +After the per-learning actions execute, aggregate the domain terms flagged across Phase 1's Vocabulary dimension and reconcile them with `CONCEPTS.md`. + +**First, read `references/concepts-vocabulary.md`.** This is unconditional. Do not pre-judge from memory which Phase 1 signals qualify — the reference's criteria are non-obvious and a "nothing qualifies" judgment without reading is a shortcut, not a result. + +**Procedure:** + +1. **Aggregate.** Collect qualifying terms surfaced across the learnings in scope, applying the reference's criteria. If the same term surfaced in multiple learnings with different shades of precision, **union the shades into one entry** — not three entries, not most-recent-wins. +2. **If `CONCEPTS.md` exists**, add missing terms and refine existing entries when the corpus surfaced new precision. Do not duplicate entries already present. +3. **If `CONCEPTS.md` does not exist** and at least one qualifying term was surfaced, **bootstrap it**. One term is enough — do not gate creation behind a minimum count, that creates an asymmetric trap where the file only ever gets created on the second eligible run. 
**At creation, hold the qualifying bar conservatively** — a borderline term or a class/table/file name dressed up as an entity should defer until a later run surfaces stronger signal. The conservatism is about quality, not count; updates to an existing file follow normal criteria. +4. **Scope discipline and citation hygiene.** Bootstrap reflects only the area in scope — do not expand to other categories, and do not retroactively inject `(see CONCEPTS.md)` pointers into existing learnings. The report should note that additional entries are likely from refresh runs on other scopes. +5. **Initial structure.** When bootstrapping, start the file with this preamble under the `# Concepts` heading: + + > Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or catch-all. + + Then add entries. Let term count drive shape: 1-4 terms → flat headings, more → cluster by domain relationship per the rules in `references/concepts-vocabulary.md`. +6. **Scrub violations.** Scan existing entries for content that violates `references/concepts-vocabulary.md` criteria — implementation specifics (file paths, class names, function signatures, code references), status/owner/date metadata, or duplicates of terms covered under a different name. Rewrite or consolidate. The full sweep is appropriate here because refresh is an audit; ce-compound's same-named phase scopes corrections to entries already being touched. + +If no Phase 1 signals qualified after applying the reference's criteria, record that outcome explicitly in the report's `CONCEPTS.md` line (e.g., "scanned, no qualifying terms"). Do not silently skip — the visible scan-and-no-result record is the audit signal that the reference was consulted. 
+ +Note: if this run **creates** `CONCEPTS.md` from scratch, the Discoverability Check below must also surface it in `AGENTS.md`/`CLAUDE.md` so future agents discover it. Subsequent runs skip this because the instruction file is already current. + +**Apply edits silently — no user prompt in any mode.** Vocabulary capture is a side effect of refreshing, not a decision the user makes per run. + ## Output Format **The full report MUST be printed as markdown output.** Do not summarize findings internally and then output a one-liner. The report is the deliverable — print every section in full, formatted as readable markdown with headers, tables, and bullet points. @@ -504,6 +530,8 @@ Replaced: Z Deleted: W Skipped: V Marked stale: S + +CONCEPTS.md: ``` Then for EVERY file processed, list: @@ -631,4 +659,12 @@ After the refresh report is generated, check whether the project's instruction f ``` c. In interactive mode, explain to the user why this matters — agents working in this repo (including fresh sessions, other tools, or collaborators without the plugin) won't know to check `docs/solutions/` unless the instruction file surfaces it. Show the proposed change and where it would go, then use the platform's blocking question tool to get consent before making the edit: `AskUserQuestion` in Claude Code (call `ToolSearch` with `select:AskUserQuestion` first if its schema isn't loaded), `request_user_input` in Codex, `ask_user` in Gemini, `ask_user` in Pi (requires the `pi-ask-user` extension). Fall back to presenting the proposal in chat only when no blocking tool exists in the harness or the call errors (e.g., Codex edit modes) — not because a schema load is required. Never silently skip the question. In headless mode, include it as a "Discoverability recommendation" line in the report — do not attempt to edit instruction files (headless scope is doc maintenance, not project config). -5. 
**Amend or create a follow-up commit when the check produces edits.** If step 4 resulted in an edit to an instruction file and Phase 5 already committed the refresh changes, stage the newly edited file and either amend the existing commit (if still on the same branch and no push has occurred) or create a small follow-up commit (e.g., `docs: add docs/solutions/ discoverability to AGENTS.md`). If Phase 5 already pushed the branch to a remote (e.g., the branch+PR path), push the follow-up commit as well so the open PR includes the discoverability change. This keeps the working tree clean and the remote in sync at the end of the run. If the user chose "Don't commit" in Phase 5, leave the instruction-file edit unstaged alongside the other uncommitted refresh changes — no separate commit logic needed. +5. **If `CONCEPTS.md` exists at repo root, run a parallel discoverability check for it.** Use the same workflow as the `docs/solutions/` check above: same target file, same edit-placement judgment, same consent-then-edit interaction shape per mode. Example calibration when a directory listing is present: + + ``` + CONCEPTS.md # shared domain vocabulary — read when orienting to the codebase or before discussing domain concepts + ``` + + **Skip this step entirely if `CONCEPTS.md` does not exist** — never nag for an artifact the project has not adopted. When skipped, this step produces no output and no edit. + +6. **Amend or create a follow-up commit when the check produces edits.** If step 4 or step 5 resulted in an edit to an instruction file and Phase 5 already committed the refresh changes, stage the newly edited file and either amend the existing commit (if still on the same branch and no push has occurred) or create a small follow-up commit (e.g., `docs: add docs/solutions/ discoverability to AGENTS.md`, or `docs: add CONCEPTS.md discoverability to AGENTS.md`, or a combined message when both edits landed). 
If Phase 5 already pushed the branch to a remote (e.g., the branch+PR path), push the follow-up commit as well so the open PR includes the discoverability change. This keeps the working tree clean and the remote in sync at the end of the run. If the user chose "Don't commit" in Phase 5, leave the instruction-file edits unstaged alongside the other uncommitted refresh changes — no separate commit logic needed. diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md b/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md new file mode 100644 index 000000000..27c1f5594 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md @@ -0,0 +1,62 @@ +# CONCEPTS.md vocabulary rules + +`CONCEPTS.md` defines the words that mean something specific in this codebase — substrate that `docs/solutions/` and AGENTS.md can cite without redefinition. Lives at the repo root, created lazily the first time a learning surfaces a qualifying term. + +## Be opinionated + +When the team uses several words for the same concept, pick the best one and retire the rest. Record retired synonyms as aliases on the entry (see "Per entry"). Settled distinctions go to the Flagged ambiguities tail. The glossary is not a record of all words the team has ever used — it is the team's agreed-upon vocabulary. + +## The file stands on its own + +Each entry teaches its concept to a reader with no access to anything else — no codebase, no PR history, no architecture meetings, no Slack. 
This rules out:
+
+- Implementation specifics (file paths, class names, function signatures, table names, library calls)
+- Status fields, dates, owners on the entries
+- Examples drawn from current code
+- Links to PRs, issues, channels, or roadmap milestones
+- Version-specific claims ("currently uses X; migrating to Y")
+
+Cross-references between entries within `CONCEPTS.md` are fine — they resolve internally. General programming vocabulary (caches, queues, jobs, sessions) and everyday domain English need no redefinition either.
+
+## What earns a slot
+
+A term qualifies when its meaning here is precise enough that a new engineer would need it defined to follow conversations, tickets, or code. General programming vocabulary does not belong, even when used heavily.
+
+## Per entry
+
+Definition is one sentence — what the term means in this domain, what makes it distinct from neighbors. A term with non-obvious behavioral rules (lifecycle, cancellation semantics, ownership invariants) earns a second paragraph for those rules — never for elaborating the definition itself.
+
+When retired synonyms exist, list them as an aliases line directly under the definition: *Avoid: Booking, appointment*. Entities typically need more depth than value types; status concepts may need transition notes.
+
+## Relationships (optional)
+
+When relationships between entries carry load-bearing meaning (ownership, cardinality, lifecycle dependencies that span entries), capture them in a `## Relationships` section near the top of the file or its cluster. Skip when entries stand on their own without structural context — relationships are a lift for domains where structure is part of what makes terms meaningful, not a routine section.
+
+## Organization
+
+Cluster concepts by domain relationship — entities with their states, processes with their stages — so a reader sees structure without effort. A flat list works when the file is small. Reshape as the file grows.
+ +## Flagged ambiguities (tail of file) + +When two terms were used interchangeably and the team settled on a distinction, record the resolution as a one-line note: *"'account' had been used for both Customer and User — these are distinct."* This section is the audit trail for opinions the team has formed. + +## One illustrative entry — the shape, not a template + +``` +## Booking + +### Reservation +A future commitment to seat a Party at a specified date and time. +*Avoid:* Booking, appointment + +A Reservation owns its Party but does not own a Table — Tables are acquired only when the Party arrives, through a Seating. Lifecycle: Booked, Seated, Completed, No-Show. Cancellation before a Seating is non-destructive; cancellation after a Seating is recorded as a No-Show. + +### Party +The guests committed to a Reservation. Each Reservation has exactly one Party. Party size is the count promised at booking, not the count who arrive. + +### Table +A physical seating unit with fixed capacity. Tables are shared resources — they do not belong to Reservations and are allocated only on the day-of through Seatings. + +### Seating +The act of placing a Party at a Table once the Party arrives. A Reservation has at most one Seating; a Table accumulates many Seatings across its lifetime. +``` diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index 4b1a93601..12d230957 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -1,6 +1,6 @@ --- name: ce-compound -description: Document a recently solved problem to compound your team's knowledge +description: Document a recently solved problem to compound your team's knowledge or CONCEPTS.md, the project's shared domain vocabulary. 
argument-hint: "[optional: brief context] [mode:headless] " --- @@ -23,6 +23,10 @@ Captures problem solutions while context is fresh, creating structured documenta /ce-compound mode:headless [context] # Non-interactive run with context hint ``` +## CONCEPTS.md bootstrap requests + +If invoked to create or bootstrap `CONCEPTS.md` rather than document a solved problem, do not run normal phases. Explain: the file accretes as ce-compound and ce-compound-refresh process real learnings; cold-start codebase scans aren't supported because the qualifying bar is judgmental. Redirect to ce-compound on a real learning, ce-compound-refresh on an existing corpus, or direct hand-editing. Then exit. + ## Mode Detection Check `$ARGUMENTS` for a `mode:headless` token. Tokens starting with `mode:` are flags, not context — strip `mode:headless` from arguments before treating the remainder as the brief context hint. @@ -46,6 +50,7 @@ These files are the durable contract for the workflow. Read them on-demand at th - `references/schema.yaml` — canonical frontmatter fields and enum values (read when validating YAML) - `references/yaml-schema.md` — category mapping from problem_type to directory (read when classifying) +- `references/concepts-vocabulary.md` — CONCEPTS.md format and inclusion rules (read in Phase 2.4 when domain terms surface) - `assets/resolution-template.md` — section structure for new docs (read when assembling) When spawning subagents, pass the relevant file contents into the task prompt so they have the contract without needing cross-skill paths. @@ -209,12 +214,13 @@ Launch research subagents. Each returns text data to the orchestrator. Do not append additional context blocks, exclusion lists, or topic-keyword bullets — verbose payloads give ce-sessions license to keep widening the search and rapidly compound wall time. If keyword search is needed, ce-sessions owns that decision internally based on the topic. 
- Returns: structured digest of findings from prior sessions, or "no relevant prior sessions" if none found. + - **ce-sessions is the final Phase 1 input, not a workflow stop.** When it returns, proceed directly to Phase 2 with its output as the last input — do not emit a summary and do not pause for the user. A "no relevant prior sessions" return is still a valid input; the documentation gets written without session context. ### Phase 2: Assembly & Write -**WAIT for all Phase 1 subagents to complete before proceeding.** +**WAIT for all Phase 1 inputs to complete before proceeding** — the three parallel subagents and, when the user opted in, the synchronous `ce-sessions` skill call. ce-sessions is a Phase 1 input even though it is a skill rather than a subagent. The orchestrating agent (main conversation) performs these steps: @@ -246,6 +252,24 @@ When creating a new doc, preserve the section order from `assets/resolution-temp +### Phase 2.4: Vocabulary Capture + +**First, read `references/concepts-vocabulary.md`.** This is unconditional. Do not pre-judge from memory that nothing qualifies — the reference's criteria are non-obvious and qualifying terms often live in the surrounding conversation rather than the new doc itself. Reading the reference is what makes the rest of the phase possible. + +Then, applying those criteria, scan the new doc **and** the surrounding conversation for qualifying domain terms. If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily. + +**At creation only, hold the qualifying bar conservatively.** A borderline term, or a class/table/file name dressed up as an entity, does not justify seeding a new file — defer until a later run surfaces stronger signal. This conservatism applies to creation quality only; updates to an existing file follow the normal criteria. 
+ +**When bootstrapping the file, start with this preamble under the `# Concepts` heading**, then add the qualifying entries below it: + +> Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or catch-all. + +**Opportunistically fix violations you notice while editing.** If an entry being added/refined or an adjacent existing entry contains implementation specifics (file paths, class names, function signatures, code references), rewrite to the glossary standard. Do not full-audit the file — confine corrections to entries near the ones already being touched. Broader audit is ce-compound-refresh's job. + +If no terms qualified after applying the reference's criteria, record that outcome explicitly in the success output (e.g., "Vocabulary capture: scanned, no qualifying terms"). Do not silently skip — the visible scan-and-no-result record is the audit signal that the reference was consulted. + +**Apply edits silently in every mode — no user prompt in interactive, lightweight, or headless.** Vocabulary capture is a side effect of compounding, not a decision the user makes per run. + ### Phase 2.5: Selective Refresh Check After writing the new learning, decide whether this new solution is evidence that older docs should be refreshed. @@ -329,6 +353,14 @@ After the learning is written and the refresh decision is made, check whether th ``` c. In full interactive mode, explain to the user why this matters — agents working in this repo (including fresh sessions, other tools, or collaborators without the plugin) won't know to check `docs/solutions/` unless the instruction file surfaces it. 
Show the proposed change and where it would go, then use the platform's blocking question tool to get consent before making the edit: `AskUserQuestion` in Claude Code (call `ToolSearch` with `select:AskUserQuestion` first if its schema isn't loaded), `request_user_input` in Codex, `ask_user` in Gemini, `ask_user` in Pi (requires the `pi-ask-user` extension). Fall back to presenting the proposal in chat only when no blocking tool exists in the harness or the call errors (e.g., Codex edit modes) — not because a schema load is required. Never silently skip the question. In lightweight mode, output a one-liner note and move on. In headless mode, apply the edit directly without prompting and surface it in the terminal report under "Instruction-file edit" +5. **If `CONCEPTS.md` exists at repo root, run a parallel discoverability check for it.** Assess whether the instruction file would lead an agent to discover the project's shared domain vocabulary. Use the same workflow as the `docs/solutions/` check above: same target file, same edit-placement judgment, same consent-then-edit interaction shape per mode. A line in an existing section is almost always better than a new headed section. Example calibration when nothing else fits: + + ``` + CONCEPTS.md # shared domain vocabulary (entities, named processes, status concepts) — relevant when orienting to the codebase or discussing domain concepts + ``` + + **Skip this step entirely if `CONCEPTS.md` does not exist** — never nag for an artifact the project has not adopted. When skipped, this step produces no output and no edit. 
+ ### Phase 3: Optional Enhancement **WAIT for Phase 2 to complete before proceeding.** @@ -469,6 +501,7 @@ Track: Category: Overlap: | high — existing doc updated> Instruction-file edit: | gap noted, not applied> +CONCEPTS.md: Refresh recommendation: Documentation complete @@ -503,8 +536,9 @@ Specialized Agent Reviews (Auto-Triggered): ✓ ce-kieran-rails-reviewer: Code examples meet Rails conventions ✓ ce-code-simplicity-reviewer: Solution is appropriately minimal -File created: -- docs/solutions/performance-issues/n-plus-one-brief-generation.md +Files written: +- docs/solutions/performance-issues/n-plus-one-brief-generation.md (created) +- CONCEPTS.md (created with 3 entries: BriefSystem, EmailQueue, Brief Status) This documentation will be searchable for future reference when similar issues occur in the Email Processing or Brief System modules. diff --git a/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md b/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md new file mode 100644 index 000000000..27c1f5594 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md @@ -0,0 +1,62 @@ +# CONCEPTS.md vocabulary rules + +`CONCEPTS.md` defines the words that mean something specific in this codebase — substrate that `docs/solutions/` and AGENTS.md can cite without redefinition. Lives at the repo root, created lazily the first time a learning surfaces a qualifying term. + +## Be opinionated + +When the team uses several words for the same concept, pick the best one and retire the rest. Record retired synonyms as aliases on the entry (see "Per entry"). Settled distinctions go to the Flagged ambiguities tail. The glossary is not a record of all words the team has ever used — it is the team's agreed-upon vocabulary. 
+
+## The file stands on its own
+
+Each entry teaches its concept to a reader with no access to anything else — no codebase, no PR history, no architecture meetings, no Slack. This rules out:
+
+- Implementation specifics (file paths, class names, function signatures, table names, library calls)
+- Status fields, dates, owners on the entries
+- Examples drawn from current code
+- Links to PRs, issues, channels, or roadmap milestones
+- Version-specific claims ("currently uses X; migrating to Y")
+
+Cross-references between entries within `CONCEPTS.md` are fine — they resolve internally. General programming vocabulary (caches, queues, jobs, sessions) and everyday domain English need no redefinition either.
+
+## What earns a slot
+
+A term qualifies when its meaning here is precise enough that a new engineer would need it defined to follow conversations, tickets, or code. General programming vocabulary does not belong, even when used heavily.
+
+## Per entry
+
+Definition is one sentence — what the term means in this domain, what makes it distinct from neighbors. A term with non-obvious behavioral rules (lifecycle, cancellation semantics, ownership invariants) earns a second paragraph for those rules — never for elaborating the definition itself.
+
+When retired synonyms exist, list them as an aliases line directly under the definition: *Avoid: Booking, appointment*. Entities typically need more depth than value types; status concepts may need transition notes.
+
+## Relationships (optional)
+
+When relationships between entries carry load-bearing meaning (ownership, cardinality, lifecycle dependencies that span entries), capture them in a `## Relationships` section near the top of the file or its cluster. Skip when entries stand on their own without structural context — relationships are a lift for domains where structure is part of what makes terms meaningful, not a routine section.
+
+## Organization
+
+Cluster concepts by domain relationship — entities with their states, processes with their stages — so a reader sees structure without effort. A flat list works when the file is small. Reshape as the file grows.
+
+## Flagged ambiguities (tail of file)
+
+When two terms were used interchangeably and the team settled on a distinction, record the resolution as a one-line note: *"'account' had been used for both Customer and User — these are distinct."* This section is the audit trail for opinions the team has formed.
+
+## One illustrative entry — the shape, not a template
+
+```
+## Booking
+
+### Reservation
+A future commitment to seat a Party at a specified date and time.
+*Avoid:* Booking, appointment
+
+A Reservation owns its Party but does not own a Table — Tables are acquired only when the Party arrives, through a Seating. Lifecycle: Booked, Seated, Completed, No-Show. Cancellation before a Seating is non-destructive; cancellation after a Seating is recorded as a No-Show.
+
+### Party
+The guests committed to a Reservation. Each Reservation has exactly one Party. Party size is the count promised at booking, not the count who arrive.
+
+### Table
+A physical seating unit with fixed capacity. Tables are shared resources — they do not belong to Reservations and are allocated only on the day-of through Seatings.
+
+### Seating
+The act of placing a Party at a Table once the Party arrives. A Reservation has at most one Seating; a Table accumulates many Seatings across its lifetime.
+``` diff --git a/plugins/compound-engineering/skills/ce-plan/SKILL.md b/plugins/compound-engineering/skills/ce-plan/SKILL.md index 2632384ce..c49619142 100644 --- a/plugins/compound-engineering/skills/ce-plan/SKILL.md +++ b/plugins/compound-engineering/skills/ce-plan/SKILL.md @@ -234,6 +234,7 @@ Prepare a concise planning context summary (a paragraph or two) to pass as input - If an origin document exists, summarize the problem frame, requirements, and key decisions from that document - Otherwise use the feature description directly - If `STRATEGY.md` exists, read it and include the relevant pieces (target problem, approach, active tracks) in the summary so downstream research and planning decisions are anchored to product strategy +- If `CONCEPTS.md` exists at repo root, read it — its definitions are the canonical names for domain entities, named processes, and status concepts. Plan with those terms rather than synonyms. Run these agents in parallel: @@ -639,6 +640,8 @@ Plan written to **Pipeline mode:** If invoked from an automated workflow such as LFG or any `disable-model-invocation` context, skip interactive questions. Make the needed choices automatically and proceed to writing the plan. +**CONCEPTS.md gap-fill (only if the file already exists):** If the plan body uses a domain term whose definition is missing from `CONCEPTS.md`, add the entry. **Domain entities, named processes, and status concepts with project-specific meaning only** — not file paths, class names, function signatures, or implementation decisions. `CONCEPTS.md` is a glossary, not a spec or catch-all. Follow the format set by existing entries. Apply silently. Skip entirely if `CONCEPTS.md` does not exist — creation is owned by ce-compound and ce-compound-refresh. + #### 5.3 Confidence Check and Deepening After writing the plan file, automatically evaluate whether the plan needs strengthening. 
diff --git a/plugins/compound-engineering/skills/ce-sessions/evals/README.md b/plugins/compound-engineering/skills/ce-sessions/evals/README.md new file mode 100644 index 000000000..8482d43b9 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-sessions/evals/README.md @@ -0,0 +1,72 @@ +# ce-sessions terminology-preservation eval suite + +## Purpose + +Validate a load-bearing assumption introduced by PR #838 (`feat(concepts): introduce CONCEPTS.md as shared vocabulary substrate`): that ce-sessions findings preserve enough terminology resolution context for ce-compound Phase 2.4's vocabulary capture to extract qualifying domain terms. + +If ce-sessions returns only high-level "here's what was discussed" summaries that drop the specific coined terms and resolution context, then wiring its output into ce-compound's vocabulary-capture scan is decorative. If it returns terms with the rationale around them, the wiring works as advertised. + +This suite is narrowly scoped to the terminology-preservation question. It does not evaluate ce-sessions's general search quality, response shape, or any other property. 
+ +## Files + +| File | Purpose | +|------|---------| +| `evals.json` | Test case definitions with prompts, expected terminology by criticality tier, expected context items, and ground-truth pointers (PR numbers + merge commits) | +| `grader.md` | Grading rubric — two-stage (programmatic substring + LLM context-preservation), per-run + aggregate metrics, risk attribution | +| `README.md` | This file | + +## Test cases at a glance + +| # | Name | Risk tested | Ground truth | +|---|------|-------------|--------------| +| 1 | synthesis-gate-recovery | Synthesis loss (distinctive term) | PR #822 (merged 2026-05-15) | +| 2 | mode-headless-semantic-alignment | Synthesis loss (multi-piece nuance) | PR #813 (merged 2026-05-10) | +| 3 | tangential-term-recovery | Indexing gap | PRs #822, #819, #829 | +| 4 | near-miss-false-positive | False positive on shared keyword | Anti-PR: #813 | + +## Design rationale + +**Why these four cases.** Each isolates a distinct failure mode of the load-bearing assumption: + +- **Eval 1** uses a single, distinctive coined term ("synthesis gate") so a failure is unambiguous evidence of synthesis loss. If ce-sessions cannot return this term verbatim when queried about its own work, the assumption is broken. +- **Eval 2** tests a multi-piece design decision (rename + cross-skill alignment + a principle refinement). A pass here demonstrates ce-sessions preserves nuance, not only flashy coined nouns. +- **Eval 3** is the indexing-gap test. The query mentions "ce-plan workflow improvements" without naming any of the synthesis-gate terminology. Phase 2.4's real-world use is broad-topic queries hoping to surface terminology — if eval 3 fails while eval 1 passes, ce-sessions only retrieves terms when queried by them, which means ce-compound's wiring is decorative for the actual use case. +- **Eval 4** is the discriminating-power test. 
If ce-sessions surfaces the ce-compound mode:headless feature work as relevant to a CI/CD server-deployment query, false-positive findings would feed wrong vocabulary into Phase 2.4. + +**Why two-stage grading.** Programmatic substring matching (Stage 1) cheaply catches the worst case: distinctive terms dropped entirely. LLM-graded context preservation (Stage 2) catches the subtler case where the term survives but the rationale around it is summarized away — which would let Phase 2.4 see the term but be unable to write a useful CONCEPTS.md entry because the context for *why* it qualifies is gone. + +**Why variance across runs.** ce-sessions involves an LLM synthesis step (the session-historian subagent). Single-run pass/fail is a misleading signal because the same prompt may produce different findings on different invocations. The 3-runs-per-eval protocol catches the case where the assumption holds on average but fails frequently enough in practice to be unreliable. + +## How to run (framework-driven) + +This suite is run via the `skill-creator` framework, not manually. The framework spawns subagents in parallel to invoke ce-sessions, captures findings to a workspace, grades them, aggregates, and opens a viewer. + +**Workspace location:** `/tmp/compound-engineering/ce-sessions/evals/iteration-/` (per repo AGENTS.md scratch conventions — `/tmp` for cross-invocation reusable scratch, accessible for grep/inspection). + +**One subagent dispatch per eval × per run.** Each dispatched subagent receives the eval prompt, invokes `/ce-sessions `, captures the findings text verbatim, and writes to `/iteration-/eval--/run-/findings.txt`. + +With the default `runs_per_eval: 3` and 4 evals, that's 12 with-skill subagent dispatches per run pass. + +**Baseline runs are optional and not part of the initial pass.** skill-creator's standard flow spawns a baseline subagent per eval (without the skill) to compare with-skill vs without-skill. 
For our use case, that comparison is weaker signal because the questions all require session access — a baseline agent will trivially fail to recover terminology because it has no session history at all. The grader's pass/fail comes from terminology-preservation grading against ground truth, not from with/without delta. If you want the baselines for a sanity-check control (confirming ce-sessions is the source of any recovered terms), they can be added by running 4 more dispatches without the skill path. + +**Grading.** After all with-skill runs return, dispatch a grader subagent that reads each `findings.txt` and applies `grader.md`'s two-stage rubric. The grader writes `grading.json` per run and aggregates to `summary.json` per eval. + +**Viewer.** After grading, run `python /eval-viewer/generate_review.py` against the workspace iteration directory. The viewer renders findings alongside expected terms and lets you eyeball context preservation per run. + +## Ground truth caveats + +- The eval suite assumes the user's session history contains the sessions that produced PRs #813 and #822. If those sessions were on a different machine or are no longer in session storage, eval 1 and 2 will fail for a reason that's NOT a ce-sessions defect. +- Before running, confirm the relevant sessions are reachable. Quick sanity check: `/ce-sessions "what did I do on 2026-05-10?"` — if ce-sessions returns content from around that date, history is present. +- If history is missing, treat eval results as inconclusive rather than as evidence against the assumption. + +## Interpreting outcomes + +| Outcome | Interpretation | Action | +|---------|----------------|--------| +| All 4 evals pass with low variance | Assumption holds. ce-compound Phase 2.4 wiring works as advertised. | Ship PR #838. | +| Eval 1 or 2 fails Stage 1 | Synthesis loss is severe — distinctive coined terms are being dropped. 
| Investigate ce-session-historian's synthesis prompt; consider tightening it to preserve verbatim terminology. Revise PR #838's claims accordingly. | +| Eval 1 or 2 passes Stage 1 but fails Stage 2 | Terms survive but rationale is lost. | Phase 2.4 will see terms but may not write good entries. Consider whether the wiring still delivers value, or whether the historian needs to preserve more context. | +| Eval 3 fails while 1 and 2 pass | Indexing gap — terms only retrievable when queried by name. | The Phase 2.4 wiring is decorative for the broad-topic use case. Reconsider whether to ship the session-search scan input, or change how Phase 2.4 queries ce-sessions. | +| High variance | Mechanism works but unreliably. | Multiple invocations within ce-compound's flow would help, or accept it as a best-effort enhancement rather than load-bearing. | +| Eval 4 fails | False-positive risk to vocabulary feed. | Tighten Phase 2.4 to score-rank findings before feeding them to the vocabulary scan, or accept that some noise enters the file. | diff --git a/plugins/compound-engineering/skills/ce-sessions/evals/evals.json b/plugins/compound-engineering/skills/ce-sessions/evals/evals.json new file mode 100644 index 000000000..addb02c6e --- /dev/null +++ b/plugins/compound-engineering/skills/ce-sessions/evals/evals.json @@ -0,0 +1,108 @@ +{ + "skill_name": "ce-sessions", + "purpose": "Validate that ce-sessions findings preserve enough terminology resolution context for downstream vocabulary capture (load-bearing assumption in PR #838 — ce-compound Phase 2.4 scans ce-sessions findings for qualifying domain terms).", + "non_purpose": "Not testing ce-sessions's general search quality or its ability to find sessions on arbitrary topics. 
The narrow assumption is about terminology-resolution preservation.", + "variance_protocol": { + "runs_per_eval": 3, + "stability_metric": "stddev of must-tier term recall across runs", + "pass_threshold": "must-tier recall >= 80% mean AND stddev < 20%" + }, + "grading_pipeline": { + "stage_1": "Programmatic substring match per criticality tier (must / should / may). Pass = all 'must' terms appear in findings.", + "stage_2": "LLM grader (see grader.md) — judges whether each 'expected_context' item is preserved WITH resolution rationale, not only as a keyword hit. Pass = all expected_context items receive 'preserved with context' verdict." + }, + "evals": [ + { + "id": 1, + "name": "synthesis-gate-recovery", + "tests_risk": "synthesis_loss", + "prompt": "What was the synthesis gate work in ce-plan about? I want to understand how it was designed and what problems it solved.", + "expected_terms": [ + {"term": "synthesis gate", "tier": "must"}, + {"term": "ce-plan", "tier": "must"}, + {"term": "Phase 0.7", "tier": "should"}, + {"term": "Phase 5.1.5", "tier": "should"}, + {"term": "Stated", "tier": "should"}, + {"term": "Inferred", "tier": "should"}, + {"term": "Out of scope", "tier": "should"}, + {"term": "call-outs", "tier": "may"}, + {"term": "synthesis-summary.md", "tier": "may"}, + {"term": "silent proceeding is not allowed", "tier": "may"} + ], + "expected_context": [ + "synthesis gate appears with its purpose (prevent silent proceed past synthesis without user check), not only as a keyword", + "Stated / Inferred / Out of scope appear as bucket categorization, not only as a phrase" + ], + "ground_truth": { + "primary_pr": 822, + "primary_merge_commit": "39cb9da3a1a90a7ce7418f7a64d7ff3c8f9a917c", + "related_prs": [819, 829], + "merged_at": "2026-05-15" + }, + "notes": "Distinctive coined term that should be near-impossible to ignore if ce-sessions touched the originating session. Failure here indicates strong synthesis loss." 
+ }, + { + "id": 2, + "name": "mode-headless-semantic-alignment", + "tests_risk": "synthesis_loss", + "prompt": "How was mode:headless aligned across the compound family of skills? Why was it added and what changed?", + "expected_terms": [ + {"term": "mode:headless", "tier": "must"}, + {"term": "ce-compound", "tier": "must"}, + {"term": "mode:autofix", "tier": "should"}, + {"term": "ce-compound-refresh", "tier": "should"}, + {"term": "sticky mode token", "tier": "should"}, + {"term": "Discoverability Check", "tier": "should"}, + {"term": "process exhaust", "tier": "may"}, + {"term": "audit content", "tier": "may"}, + {"term": "Compare per skill, not per mode", "tier": "may"}, + {"term": "Assumptions section", "tier": "may"} + ], + "expected_context": [ + "mode:autofix → mode:headless rename appears with reasoning (the compound family should speak the same word)", + "process exhaust vs audit content principle appears with the refined rule (compare per skill, not per mode — interactive ce-compound doesn't validate the same inferences headless skips)" + ], + "ground_truth": { + "primary_pr": 813, + "primary_merge_commit": "9b45a83d7ed2534669656fb3abf6a2c23e2e4f59", + "merged_at": "2026-05-10" + }, + "notes": "More nuanced than #1 — tests preservation of a multi-piece design decision (rename + cross-skill alignment + a principle refinement) rather than a single coined term." 
+ }, + { + "id": 3, + "name": "tangential-term-recovery", + "tests_risk": "indexing_gap", + "prompt": "I want to understand recent ce-plan workflow improvements — what's been done and why?", + "expected_terms": [ + {"term": "synthesis gate", "tier": "should"}, + {"term": "scoping synthesis", "tier": "may"}, + {"term": "Phase 0.7", "tier": "may"} + ], + "expected_context": [ + "synthesis gate surfaces despite the query being broader than the term itself — finding is not gated on the query containing 'synthesis' as a keyword" + ], + "ground_truth": { + "related_prs": [822, 819, 829] + }, + "notes": "Tests whether ce-sessions surfaces tangentially-relevant terminology when the query is broader than the terms themselves. The query intentionally does NOT mention 'synthesis' — if the term is only retrievable by querying its own name, ce-sessions has an indexing gap for the vocabulary-capture use case (because ce-compound's Phase 2.4 won't query by term, it'll query by topic)." + }, + { + "id": 4, + "name": "near-miss-false-positive", + "tests_risk": "false_positive", + "prompt": "Were there any sessions on running services in headless mode for CI/CD or headless server deployments?", + "expected_terms": [], + "must_not_contain_in_relevant_findings": [ + {"term": "mode:headless feature for ce-compound", "tier": "must_not"}, + {"term": "sticky mode token", "tier": "must_not"} + ], + "expected_response_shape": "either (a) 'no relevant sessions found' if the user has no sessions about server-headless topics, or (b) only sessions actually about CI/CD or server headless contexts — NOT the ce-compound mode:headless feature work", + "ground_truth": { + "anti_pr": 813, + "notes": "PR #813 is about the compound-engineering mode:headless feature, not about server deployments. Finding that PR's session as a relevant result for this query would be a false positive." + }, + "notes": "Weaker signal than #1-3. 
If the user has no sessions about server-headless deployments, ce-sessions correctly returns 'no relevant' and the test trivially passes. Test only has discriminating signal if relevant unrelated sessions exist. Run anyway — a trivially-passing test still confirms ce-sessions doesn't return the ce-compound PR work as 'relevant' to a CI/CD deployment query." + } + ] +} diff --git a/plugins/compound-engineering/skills/ce-sessions/evals/grader.md b/plugins/compound-engineering/skills/ce-sessions/evals/grader.md new file mode 100644 index 000000000..199fd2c9b --- /dev/null +++ b/plugins/compound-engineering/skills/ce-sessions/evals/grader.md @@ -0,0 +1,114 @@ +# ce-sessions terminology-preservation grader + +This grader evaluates whether ce-sessions findings preserve enough terminology resolution context to make downstream vocabulary capture (ce-compound Phase 2.4) work. It is NOT a general quality grader for ce-sessions; the narrow question is "would Phase 2.4 be able to extract qualifying domain terms from these findings?" + +## Inputs to the grader + +For each eval run, the grader receives: + +1. **The eval definition** from `evals.json` (terms, tiers, expected_context, notes). +2. **The findings text** that ce-sessions returned to the orchestrating agent. +3. **(Optional) The full agent transcript** for the ce-sessions invocation, if available — useful for distinguishing "ce-sessions returned this and the agent paraphrased it" from "ce-sessions returned this verbatim." + +## Two-stage grading + +### Stage 1 — Programmatic term recall (substring match) + +For each entry in `expected_terms`: +- Score 1 if the term (case-insensitive, substring match) appears anywhere in the findings text. +- Score 0 otherwise. 
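The scoring rule above, together with the per-tier recall ratios this grader uses, can be sketched as follows. This is a minimal illustration, not the grader's actual implementation: the function names `score_terms` and `recall_by_tier` are hypothetical, and returning `None` for a tier with no expected terms (as happens for the must tier of eval 3) is an assumption, since the rubric does not say how an empty tier should be scored.

```python
from collections import defaultdict
from typing import Optional


def score_terms(findings: str, expected_terms: list[dict]) -> dict[str, int]:
    """Score 1 if a term appears as a case-insensitive substring, else 0."""
    haystack = findings.casefold()
    return {
        entry["term"]: int(entry["term"].casefold() in haystack)
        for entry in expected_terms
    }


def recall_by_tier(
    findings: str, expected_terms: list[dict]
) -> dict[str, Optional[float]]:
    """Return matched/total per tier; None when a tier has no terms (assumption)."""
    scores = score_terms(findings, expected_terms)
    by_tier: dict[str, list[int]] = defaultdict(list)
    for entry in expected_terms:
        by_tier[entry["tier"]].append(scores[entry["term"]])
    return {
        tier: (sum(by_tier[tier]) / len(by_tier[tier]) if by_tier[tier] else None)
        for tier in ("must", "should", "may")
    }
```

Because the match is a plain substring test, a term such as "ce-plan" also matches inside longer tokens; that is acceptable for a cheap first-stage filter but is worth keeping in mind when reading the recall numbers.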
+ +Aggregate by tier: +- `must_recall` = (count of must-tier terms scored 1) / (total must-tier terms) +- `should_recall` = (count of should-tier terms scored 1) / (total should-tier terms) +- `may_recall` = (count of may-tier terms scored 1) / (total may-tier terms) + +**Stage 1 pass criterion:** `must_recall == 1.0` (every must-tier term appears). + +If Stage 1 fails, ce-sessions is dropping the most distinctive coined terms — synthesis loss is severe and Stage 2 is moot. Record the failure and stop. + +### Stage 2 — Context preservation (LLM-graded) + +For each entry in `expected_context`: + +Read the findings text. Decide whether the expected context item is **preserved with rationale** or **mentioned without context**. Apply this rubric: + +- **`preserved` (1.0)** — the finding text discusses the term AND its meaning, role, or the reasoning behind it. Example: "synthesis gate was introduced to prevent ce-plan from silently proceeding past synthesis without showing the user a Stated/Inferred/Out of scope summary." +- **`keyword_only` (0.0)** — the finding mentions the term but in a way that doesn't convey why it matters or what it means. Example: "the user worked on the synthesis gate." +- **`absent` (0.0)** — the term doesn't appear in the relevant section at all. + +**Stage 2 pass criterion:** every entry in `expected_context` scores `preserved`. + +For eval id #4 (near-miss-false-positive), Stage 2 instead checks `must_not_contain_in_relevant_findings`: +- For each `must_not` entry, search the findings. +- If the entry appears **as a relevant result** (not, e.g., as a "not relevant — different context" caveat), Stage 2 fails. +- "Not relevant" mentions are fine; surfacing the ce-compound feature PR work as if it answered a CI/CD deployment query is the failure mode. + +## Aggregating across runs (variance) + +For each eval, run the prompt N times (default 3 from `variance_protocol.runs_per_eval`). 
+ +Per run, capture: +- `must_recall`, `should_recall`, `may_recall` from Stage 1 +- `context_preservation_rate` from Stage 2 (count preserved / count expected_context) +- `stage_1_pass` (bool), `stage_2_pass` (bool) + +Per eval, compute: +- `mean_must_recall`, `stddev_must_recall` +- `mean_context_preservation`, `stddev_context_preservation` +- `runs_passed` (count where both stage_1_pass and stage_2_pass were true) + +**Eval-level pass criteria:** +- `mean_must_recall >= 0.80` +- `stddev_must_recall < 0.20` +- `runs_passed >= 2 of 3` (or proportionally for higher N) + +## Outputs + +Write per-run grades to `/iteration-N/eval-/grading.json`: + +```json +{ + "eval_id": 1, + "eval_name": "synthesis-gate-recovery", + "run_index": 0, + "stage_1": { + "must_recall": 1.0, + "should_recall": 1.0, + "may_recall": 0.33, + "passed": true, + "matched_terms_by_tier": { + "must": ["synthesis gate", "ce-plan"], + "should": ["Phase 0.7", "Stated", "Inferred", "Out of scope", "Phase 5.1.5"], + "may": ["synthesis-summary.md"] + }, + "missed_terms_by_tier": { + "should": [], + "may": ["call-outs", "silent proceeding is not allowed"] + } + }, + "stage_2": { + "context_results": [ + {"item": "synthesis gate purpose preserved", "verdict": "preserved", "evidence": ""}, + {"item": "Stated/Inferred/Out of scope as buckets", "verdict": "keyword_only", "evidence": ""} + ], + "context_preservation_rate": 0.5, + "passed": false + }, + "overall_passed": false +} +``` + +Then aggregate across runs to a per-eval summary at `/iteration-N/eval-/summary.json`. 
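The per-eval aggregation and the eval-level pass criteria described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the grader's actual code: the function name `summarize_eval` is hypothetical, the rubric does not say whether `stddev` is the population or sample form (population `pstdev` is assumed here), and "proportionally for higher N" is read as requiring at least two thirds of runs, rounded up.

```python
from statistics import mean, pstdev


def summarize_eval(runs: list[dict]) -> dict:
    """Aggregate per-run grades into a per-eval summary with a pass verdict."""
    if not runs:
        raise ValueError("at least one run is required")

    n = len(runs)
    must = [run["must_recall"] for run in runs]
    context = [run["context_preservation_rate"] for run in runs]
    runs_passed = sum(
        1 for run in runs if run["stage_1_pass"] and run["stage_2_pass"]
    )
    # Assumption: scale "2 of 3" to ceil(2N/3) for other run counts.
    required_passes = (2 * n + 2) // 3

    summary = {
        "mean_must_recall": mean(must),
        # Assumption: population standard deviation.
        "stddev_must_recall": pstdev(must),
        "mean_context_preservation": mean(context),
        "stddev_context_preservation": pstdev(context),
        "runs_passed": runs_passed,
        "required_passes": required_passes,
    }
    summary["passed"] = (
        summary["mean_must_recall"] >= 0.80
        and summary["stddev_must_recall"] < 0.20
        and runs_passed >= required_passes
    )
    return summary
```

The sketch assumes every run carries a numeric `must_recall`; an eval with no must-tier terms would need its own handling before aggregation.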
+ +## Surfacing the three risks separately + +The eval design separates signal so a failure points at one risk: + +| Risk | Signal | Where it surfaces | +|------|--------|-------------------| +| Synthesis loss (distinctive terms dropped) | Stage 1 must-tier fails on eval #1 or #2 | grading.json `stage_1.must_recall < 1.0` | +| Synthesis loss (nuance lost, term kept) | Stage 1 passes, Stage 2 fails on eval #1 or #2 | grading.json `stage_1.passed: true, stage_2.passed: false` | +| Indexing gap (tangential terminology not surfaced) | Eval #3 fails Stage 1 should-tier | grading.json eval-3 `should_recall == 0` despite related sessions existing | +| Variance | Same eval passes on some runs, fails on others | summary.json `stddev_must_recall >= 0.20` or `runs_passed < N` | +| False positive | Eval #4 surfaces the ce-compound mode:headless work as relevant to CI/CD deployment query | grading.json eval-4 `stage_2.passed: false` |