From 4225fa13d440e823f1fd6086f4ea40a6c4b941f5 Mon Sep 17 00:00:00 2001
From: Trevin Chow
Date: Sat, 16 May 2026 00:49:35 -0700
Subject: [PATCH 01/11] feat(concepts): introduce CONCEPTS.md as shared vocabulary substrate
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a domain-vocabulary artifact maintained as a side effect of
compounding. CONCEPTS.md is the substrate that learnings cite — entities,
named processes, and status concepts with project-specific precise
meaning. Lazy creation, opportunistic AGENTS.md discoverability, no user
prompts.

Ownership model:

- ce-compound and ce-compound-refresh create and maintain the file. Both
  also surface CONCEPTS.md to AGENTS.md/CLAUDE.md on first creation via
  the existing Discoverability Check, so future agents discover the file.
- ce-brainstorm and ce-plan are contributors only — they add to or refine
  CONCEPTS.md when terms surface, but skip writes entirely when the file
  doesn't exist. Avoids speculative bootstrapping from pre-implementation
  work.
- ce-learnings-researcher reads CONCEPTS.md as grounding before keyword
  extraction so result distillation uses canonical terminology.

ce-compound and ce-compound-refresh both bundle a concepts-vocabulary.md
reference with inclusion criteria, format rules, and an illustrative
example. ce-brainstorm and ce-plan intentionally do not — they learn
format from the existing file's contents. Plugin AGENTS.md gains a note
that the two reference copies must stay in sync.
--- plugins/compound-engineering/AGENTS.md | 2 +- .../agents/ce-learnings-researcher.agent.md | 6 +++ .../skills/ce-brainstorm/SKILL.md | 14 +++++- .../skills/ce-compound-refresh/SKILL.md | 31 ++++++++++++- .../references/concepts-vocabulary.md | 43 +++++++++++++++++++ .../skills/ce-compound/SKILL.md | 23 +++++++++- .../references/concepts-vocabulary.md | 43 +++++++++++++++++++ .../skills/ce-plan/SKILL.md | 3 ++ 8 files changed, 160 insertions(+), 5 deletions(-) create mode 100644 plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md create mode 100644 plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md diff --git a/plugins/compound-engineering/AGENTS.md b/plugins/compound-engineering/AGENTS.md index 4e813b111..f9e348157 100644 --- a/plugins/compound-engineering/AGENTS.md +++ b/plugins/compound-engineering/AGENTS.md @@ -12,7 +12,7 @@ They supplement the repo-root `AGENTS.md`. Consequences: - Behavioral rules that govern skill *runtime* behavior must live inside the skill itself — in `SKILL.md` or files under its `references/`. Guidance placed in this file is invisible at runtime. -- When two or more skills share a behavioral principle, duplicate the guidance into each skill (inline for short rules, `references/` for longer ones). There is no cross-skill shared-file mechanism (see "File References in Skills" below). +- When two or more skills share a behavioral principle, duplicate the guidance into each skill (inline for short rules, `references/` for longer ones). There is no cross-skill shared-file mechanism (see "File References in Skills" below). When a reference file is duplicated across skills (e.g., `concepts-vocabulary.md` in both `ce-compound/references/` and `ce-compound-refresh/references/`), edits must be applied to every copy in the same commit. Drift between copies produces inconsistent agent behavior depending on which skill loaded. 
- Do not propose that runtime guidance for ce-ideate, ce-brainstorm, ce-plan, or any other skill live in this AGENTS.md or in the repo-root AGENTS.md. Those files only shape how contributors edit the plugin. This is easy to miss because authoring feels like using: you edit the plugin while running inside this repo, and the repo's AGENTS.md is loaded — but that load does not follow the installed skill into a user's environment. diff --git a/plugins/compound-engineering/agents/ce-learnings-researcher.agent.md b/plugins/compound-engineering/agents/ce-learnings-researcher.agent.md index 845f92ef4..c1ff011a4 100644 --- a/plugins/compound-engineering/agents/ce-learnings-researcher.agent.md +++ b/plugins/compound-engineering/agents/ce-learnings-researcher.agent.md @@ -18,6 +18,12 @@ Past learnings span multiple shapes: Treat all of these as candidates. Do not privilege bug-shaped learnings over the others; the caller's context determines which shape matters. +## Step 0: Ground in CONCEPTS.md (if present) + +Before searching `docs/solutions/`, check whether `CONCEPTS.md` exists at the repo root. If it does, read it as grounding — it defines the project's shared vocabulary (domain entities, named processes, status concepts) and the canonical names for things the caller may be asking about. Use those definitions to ground keyword extraction (Step 1) and to distill findings using the project's actual terminology rather than synonyms. + +If `CONCEPTS.md` does not exist, skip this step entirely and proceed to Step 1. + ## Search Strategy (Grep-First Filtering) The `docs/solutions/` directory contains documented learnings with YAML frontmatter. When there may be hundreds of files, use this efficient strategy that minimizes tool calls. 
diff --git a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md index e45b5e0d6..f6c546a4a 100644 --- a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md +++ b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md @@ -111,7 +111,7 @@ Scan the repo before substantive brainstorming. Match depth to scope: **Standard and Deep** — Two passes: -*Constraint Check* — Check project instruction files (`AGENTS.md`, and `CLAUDE.md` only if retained as compatibility context) for workflow, product, or scope constraints that affect the brainstorm. Also read `STRATEGY.md` if it exists — the product's target problem, approach, persona, and active tracks are direct input to what this brainstorm should deliver and should shape scope, success criteria, and which approaches are aligned vs out-of-scope. If these add nothing, move on. +*Constraint Check* — Check project instruction files (`AGENTS.md`, and `CLAUDE.md` only if retained as compatibility context) for workflow, product, or scope constraints that affect the brainstorm. Also read `STRATEGY.md` if it exists — the product's target problem, approach, persona, and active tracks are direct input to what this brainstorm should deliver and should shape scope, success criteria, and which approaches are aligned vs out-of-scope. Also read `CONCEPTS.md` at repo root if it exists — the project's shared domain vocabulary anchors what terms mean here, and the dialogue should use those canonical names rather than synonyms. If any of these add nothing, move on. *Topic Scan* — Search for relevant terms. Read the most relevant existing artifact if one exists (brainstorm, plan, spec, skill, feature doc). Skim adjacent examples covering similar behavior. @@ -184,6 +184,18 @@ Follow the Interaction Rules above. 
Use the platform's blocking question tool wh **Exit condition:** Continue until the idea is clear AND no integration-check questions are pending, OR the user explicitly wants to proceed. +#### 1.4 Vocabulary Capture (only if CONCEPTS.md already exists) + +**Skip this sub-phase entirely if `CONCEPTS.md` does not exist at repo root.** ce-brainstorm is a contributor to existing vocabulary, not a creator of the file — creation is owned by ce-compound and ce-compound-refresh, which also handle the AGENTS.md discoverability surfacing in the same run. + +If `CONCEPTS.md` exists, scan the dialogue for **resolved** domain terms — terms where the conversation actively pinned down a precise local meaning, not terms merely mentioned in passing. For each resolved term: if missing, add it; if present but the dialogue surfaced new precision, refine it; if already consistent, no action. + +**Resolved means the dialogue is no longer questioning the definition.** Provisional terms that the conversation may still revise stay in the conversation only. + +Follow the format and quality bar set by the existing entries in `CONCEPTS.md` — heading levels, length norms, definition style. The file teaches its own conventions by example. + +Apply edits silently. Vocabulary capture is a side effect of converging on shared language, not a decision the user makes per session. + ### Phase 2: Explore Approaches If multiple plausible directions remain, propose **2-3 concrete approaches** based on research and conversation. Otherwise state the recommended direction directly. diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md index cdb5933a9..8204ee9aa 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md @@ -165,6 +165,7 @@ A learning has several dimensions that can independently go stale. 
Surface-level - **Related docs** — are cross-referenced learnings and patterns still present and consistent? - **Auto memory** (Claude Code only) — does the injected auto-memory block in your system prompt contain entries in the same problem domain? Scan that block directly. If the block is absent, skip this dimension. A memory note describing a different approach than what the learning recommends is a supplementary drift signal. - **Overlap** — while investigating, note when another doc in scope covers the same problem domain, references the same files, or recommends a similar solution. For each overlap, record: the two file paths, which dimensions overlap (problem, solution, root cause, files, prevention), and which doc appears broader or more current. These signals feed Phase 1.75 (Document-Set Analysis). +- **Vocabulary** — note domain terms the learning cites (entities, named processes, status concepts with project-specific meaning). For each term: does it appear in `CONCEPTS.md`? If yes, does the definition still match how the code uses the term? If no, flag the term for Phase 4.5 to add or bootstrap. Do not edit `CONCEPTS.md` during investigation — just collect the signal centrally. Match investigation depth to the learning's specificity — a learning referencing exact file paths and code snippets needs more verification than one describing a general principle. @@ -486,6 +487,24 @@ For each candidate, execute the flow that matches its classification from Phase Only one flow runs per candidate; the reference contains the per-action criteria, examples, and step-by-step instructions. +## Phase 4.5: Vocabulary Capture + +After the per-learning actions execute, aggregate the domain terms flagged across Phase 1's Vocabulary dimension and reconcile them with `CONCEPTS.md`. + +**Procedure:** + +1. **Aggregate.** Collect qualifying terms surfaced across the learnings in scope. 
If the same term surfaced in multiple learnings with different shades of precision, **union the shades into one entry** — not three entries, not most-recent-wins. +2. **If `CONCEPTS.md` exists**, add missing terms and refine existing entries when the corpus surfaced new precision. Do not duplicate entries already present. +3. **If `CONCEPTS.md` does not exist** and at least one qualifying term was surfaced, **bootstrap it**. One term is enough — do not gate creation behind a minimum count, that creates an asymmetric trap where the file only ever gets created on the second eligible run. +4. **Scope discipline and citation hygiene.** Bootstrap reflects only the area in scope — do not expand to other categories, and do not retroactively inject `(see CONCEPTS.md)` pointers into existing learnings. The report should note that additional entries are likely from refresh runs on other scopes. +5. **Initial structure.** When bootstrapping, let term count drive shape: 1-4 terms → flat headings, more → cluster by domain relationship per the rules in `references/concepts-vocabulary.md`. + +Note: if this run **creates** `CONCEPTS.md` from scratch, the Discoverability Check below must also surface it in `AGENTS.md`/`CLAUDE.md` so future agents discover it. Subsequent runs skip this because the instruction file is already current. + +**Apply edits silently — no user prompt in any mode.** Vocabulary capture is a side effect of refreshing, not a decision the user makes per run. + +Read `references/concepts-vocabulary.md` for the inclusion criteria, what never belongs, per-entry quality bar, organization principle, and an illustrative example. Do not infer the format from memory — read the reference each time vocabulary capture fires. + ## Output Format **The full report MUST be printed as markdown output.** Do not summarize findings internally and then output a one-liner. 
The report is the deliverable — print every section in full, formatted as readable markdown with headers, tables, and bullet points. @@ -504,6 +523,8 @@ Replaced: Z Deleted: W Skipped: V Marked stale: S + +CONCEPTS.md: ``` Then for EVERY file processed, list: @@ -631,4 +652,12 @@ After the refresh report is generated, check whether the project's instruction f ``` c. In interactive mode, explain to the user why this matters — agents working in this repo (including fresh sessions, other tools, or collaborators without the plugin) won't know to check `docs/solutions/` unless the instruction file surfaces it. Show the proposed change and where it would go, then use the platform's blocking question tool to get consent before making the edit: `AskUserQuestion` in Claude Code (call `ToolSearch` with `select:AskUserQuestion` first if its schema isn't loaded), `request_user_input` in Codex, `ask_user` in Gemini, `ask_user` in Pi (requires the `pi-ask-user` extension). Fall back to presenting the proposal in chat only when no blocking tool exists in the harness or the call errors (e.g., Codex edit modes) — not because a schema load is required. Never silently skip the question. In headless mode, include it as a "Discoverability recommendation" line in the report — do not attempt to edit instruction files (headless scope is doc maintenance, not project config). -5. **Amend or create a follow-up commit when the check produces edits.** If step 4 resulted in an edit to an instruction file and Phase 5 already committed the refresh changes, stage the newly edited file and either amend the existing commit (if still on the same branch and no push has occurred) or create a small follow-up commit (e.g., `docs: add docs/solutions/ discoverability to AGENTS.md`). If Phase 5 already pushed the branch to a remote (e.g., the branch+PR path), push the follow-up commit as well so the open PR includes the discoverability change. 
This keeps the working tree clean and the remote in sync at the end of the run. If the user chose "Don't commit" in Phase 5, leave the instruction-file edit unstaged alongside the other uncommitted refresh changes — no separate commit logic needed. +5. **If `CONCEPTS.md` exists at repo root, run a parallel discoverability check for it.** Use the same workflow as the `docs/solutions/` check above: same target file, same edit-placement judgment, same consent-then-edit interaction shape per mode. Example calibration when a directory listing is present: + + ``` + CONCEPTS.md # shared domain vocabulary — read when orienting to the codebase or before discussing domain concepts + ``` + + **Skip this step entirely if `CONCEPTS.md` does not exist** — never nag for an artifact the project has not adopted. When skipped, this step produces no output and no edit. + +6. **Amend or create a follow-up commit when the check produces edits.** If step 4 resulted in an edit to an instruction file and Phase 5 already committed the refresh changes, stage the newly edited file and either amend the existing commit (if still on the same branch and no push has occurred) or create a small follow-up commit (e.g., `docs: add docs/solutions/ discoverability to AGENTS.md`). If Phase 5 already pushed the branch to a remote (e.g., the branch+PR path), push the follow-up commit as well so the open PR includes the discoverability change. This keeps the working tree clean and the remote in sync at the end of the run. If the user chose "Don't commit" in Phase 5, leave the instruction-file edit unstaged alongside the other uncommitted refresh changes — no separate commit logic needed. 
diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md b/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md new file mode 100644 index 000000000..e986c1ad2 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md @@ -0,0 +1,43 @@ +# CONCEPTS.md vocabulary rules + +`CONCEPTS.md` defines the words that mean something specific in this codebase — substrate that `docs/solutions/` and AGENTS.md can cite without redefinition. Lives at the repo root, created lazily the first time a learning surfaces a qualifying term. + +## What earns a slot + +A term qualifies when its meaning here is precise enough that a new engineer would need it defined to follow conversations, tickets, or code. General programming vocabulary — caches, queues, jobs, sessions — does not belong, even when used heavily. + +## What never appears + +Implementation specifics — file paths, class names, table names, function signatures. Status fields, dates, owners on the entries. Examples drawn from current code. Anything git, the codebase, or the learnings store would tell you. + +## Per entry + +Lead with identity, not behavior — what kind of thing it is, what makes it distinct, what it stands in relation to. Length tracks complexity: most entries are one sentence; a term with non-obvious rules earns a second paragraph for them. Entities typically need more depth than value types; status concepts may need transition notes. When the team uses several words for the same concept, choose one and retire the rest. + +## Organization + +Cluster concepts by domain relationship — entities with their states, processes with their stages — so a reader sees structure without effort. A flat list works when the file is small. Reshape as the file grows. 
+ +## Resolved ambiguities (optional, tail of file) + +When two terms were used interchangeably and the team settled on a distinction, record the resolution as a one-line note: *"'account' had been used for both Customer and User — these are distinct."* + +## One illustrative entry — the shape, not a template + +``` +## Booking + +### Reservation +A future commitment to seat a Party at a specified date and time. A Reservation owns its Party but does not own a Table — Tables are acquired only when the Party arrives, through a Seating. Lifecycle: Booked, Seated, Completed, No-Show. + +Cancellation before a Seating is non-destructive. Cancellation after a Seating is recorded as a No-Show. + +### Party +The guests committed to a Reservation. Each Reservation has exactly one Party. Party size is the count promised at booking, not the count who arrive. + +### Table +A physical seating unit with fixed capacity. Tables are shared resources — they do not belong to Reservations and are allocated only on the day-of through Seatings. + +### Seating +The act of placing a Party at a Table once the Party arrives. A Reservation has at most one Seating; a Table accumulates many Seatings across its lifetime. +``` diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index 4b1a93601..e2c0c33ea 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -46,6 +46,7 @@ These files are the durable contract for the workflow. 
Read them on-demand at th - `references/schema.yaml` — canonical frontmatter fields and enum values (read when validating YAML) - `references/yaml-schema.md` — category mapping from problem_type to directory (read when classifying) +- `references/concepts-vocabulary.md` — CONCEPTS.md format and inclusion rules (read in Phase 2.4 when domain terms surface) - `assets/resolution-template.md` — section structure for new docs (read when assembling) When spawning subagents, pass the relevant file contents into the task prompt so they have the contract without needing cross-skill paths. @@ -246,6 +247,14 @@ When creating a new doc, preserve the section order from `assets/resolution-temp +### Phase 2.4: Vocabulary Capture + +After writing the learning, scan the new doc and the surrounding conversation for **domain terms** — words used with project-specific precise meaning (entities, named processes, status concepts). If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and the learning surfaced at least one qualifying term, create it lazily. If no terms qualify, skip this phase entirely. + +**Apply edits silently in every mode — no user prompt in interactive, lightweight, or headless.** Vocabulary capture is a side effect of compounding, not a decision the user makes per run. + +Read `references/concepts-vocabulary.md` for the inclusion criteria, what never belongs, per-entry quality bar, organization principle, and an illustrative example. Do not infer the format from memory — read the reference each time vocabulary capture fires, so the rules and example stay anchored. + ### Phase 2.5: Selective Refresh Check After writing the new learning, decide whether this new solution is evidence that older docs should be refreshed. @@ -329,6 +338,14 @@ After the learning is written and the refresh decision is made, check whether th ``` c. 
In full interactive mode, explain to the user why this matters — agents working in this repo (including fresh sessions, other tools, or collaborators without the plugin) won't know to check `docs/solutions/` unless the instruction file surfaces it. Show the proposed change and where it would go, then use the platform's blocking question tool to get consent before making the edit: `AskUserQuestion` in Claude Code (call `ToolSearch` with `select:AskUserQuestion` first if its schema isn't loaded), `request_user_input` in Codex, `ask_user` in Gemini, `ask_user` in Pi (requires the `pi-ask-user` extension). Fall back to presenting the proposal in chat only when no blocking tool exists in the harness or the call errors (e.g., Codex edit modes) — not because a schema load is required. Never silently skip the question. In lightweight mode, output a one-liner note and move on. In headless mode, apply the edit directly without prompting and surface it in the terminal report under "Instruction-file edit" +5. **If `CONCEPTS.md` exists at repo root, run a parallel discoverability check for it.** Assess whether the instruction file would lead an agent to discover the project's shared domain vocabulary. Use the same workflow as the `docs/solutions/` check above: same target file, same edit-placement judgment, same consent-then-edit interaction shape per mode. A line in an existing section is almost always better than a new headed section. Example calibration when nothing else fits: + + ``` + CONCEPTS.md # shared domain vocabulary (entities, named processes, status concepts) — relevant when orienting to the codebase or discussing domain concepts + ``` + + **Skip this step entirely if `CONCEPTS.md` does not exist** — never nag for an artifact the project has not adopted. When skipped, this step produces no output and no edit. 
+ ### Phase 3: Optional Enhancement **WAIT for Phase 2 to complete before proceeding.** @@ -469,6 +486,7 @@ Track: Category: Overlap: | high — existing doc updated> Instruction-file edit: | gap noted, not applied> +CONCEPTS.md: Refresh recommendation: Documentation complete @@ -503,8 +521,9 @@ Specialized Agent Reviews (Auto-Triggered): ✓ ce-kieran-rails-reviewer: Code examples meet Rails conventions ✓ ce-code-simplicity-reviewer: Solution is appropriately minimal -File created: -- docs/solutions/performance-issues/n-plus-one-brief-generation.md +Files written: +- docs/solutions/performance-issues/n-plus-one-brief-generation.md (created) +- CONCEPTS.md (created with 3 entries: BriefSystem, EmailQueue, Brief Status) This documentation will be searchable for future reference when similar issues occur in the Email Processing or Brief System modules. diff --git a/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md b/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md new file mode 100644 index 000000000..e986c1ad2 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md @@ -0,0 +1,43 @@ +# CONCEPTS.md vocabulary rules + +`CONCEPTS.md` defines the words that mean something specific in this codebase — substrate that `docs/solutions/` and AGENTS.md can cite without redefinition. Lives at the repo root, created lazily the first time a learning surfaces a qualifying term. + +## What earns a slot + +A term qualifies when its meaning here is precise enough that a new engineer would need it defined to follow conversations, tickets, or code. General programming vocabulary — caches, queues, jobs, sessions — does not belong, even when used heavily. + +## What never appears + +Implementation specifics — file paths, class names, table names, function signatures. Status fields, dates, owners on the entries. Examples drawn from current code. 
Anything git, the codebase, or the learnings store would tell you. + +## Per entry + +Lead with identity, not behavior — what kind of thing it is, what makes it distinct, what it stands in relation to. Length tracks complexity: most entries are one sentence; a term with non-obvious rules earns a second paragraph for them. Entities typically need more depth than value types; status concepts may need transition notes. When the team uses several words for the same concept, choose one and retire the rest. + +## Organization + +Cluster concepts by domain relationship — entities with their states, processes with their stages — so a reader sees structure without effort. A flat list works when the file is small. Reshape as the file grows. + +## Resolved ambiguities (optional, tail of file) + +When two terms were used interchangeably and the team settled on a distinction, record the resolution as a one-line note: *"'account' had been used for both Customer and User — these are distinct."* + +## One illustrative entry — the shape, not a template + +``` +## Booking + +### Reservation +A future commitment to seat a Party at a specified date and time. A Reservation owns its Party but does not own a Table — Tables are acquired only when the Party arrives, through a Seating. Lifecycle: Booked, Seated, Completed, No-Show. + +Cancellation before a Seating is non-destructive. Cancellation after a Seating is recorded as a No-Show. + +### Party +The guests committed to a Reservation. Each Reservation has exactly one Party. Party size is the count promised at booking, not the count who arrive. + +### Table +A physical seating unit with fixed capacity. Tables are shared resources — they do not belong to Reservations and are allocated only on the day-of through Seatings. + +### Seating +The act of placing a Party at a Table once the Party arrives. A Reservation has at most one Seating; a Table accumulates many Seatings across its lifetime. 
+``` diff --git a/plugins/compound-engineering/skills/ce-plan/SKILL.md b/plugins/compound-engineering/skills/ce-plan/SKILL.md index 2632384ce..0ee59fb0c 100644 --- a/plugins/compound-engineering/skills/ce-plan/SKILL.md +++ b/plugins/compound-engineering/skills/ce-plan/SKILL.md @@ -234,6 +234,7 @@ Prepare a concise planning context summary (a paragraph or two) to pass as input - If an origin document exists, summarize the problem frame, requirements, and key decisions from that document - Otherwise use the feature description directly - If `STRATEGY.md` exists, read it and include the relevant pieces (target problem, approach, active tracks) in the summary so downstream research and planning decisions are anchored to product strategy +- If `CONCEPTS.md` exists at repo root, read it directly — its definitions are the canonical names for domain entities, named processes, and status concepts. Plan with those terms rather than synonyms. (ce-learnings-researcher also reads it as grounding, but a direct read at plan-start ensures the plan body uses the exact vocabulary.) Run these agents in parallel: @@ -639,6 +640,8 @@ Plan written to **Pipeline mode:** If invoked from an automated workflow such as LFG or any `disable-model-invocation` context, skip interactive questions. Make the needed choices automatically and proceed to writing the plan. +**CONCEPTS.md gap-fill (only if the file already exists):** If `CONCEPTS.md` exists at repo root and the plan body uses a domain term whose definition is missing from `CONCEPTS.md`, add the entry. Follow the format and quality bar set by the existing entries — the file teaches its own conventions by example. Skip this step entirely if `CONCEPTS.md` does not exist; ce-plan is a contributor to existing vocabulary, not a creator of the file. Creation is owned by ce-compound and ce-compound-refresh. Apply silently — no user prompt. 
+
 #### 5.3 Confidence Check and Deepening
 After writing the plan file, automatically evaluate whether the plan needs strengthening.

From 10ddbe51ef42b900d538ae2dee84124da22e34d2 Mon Sep 17 00:00:00 2001
From: Trevin Chow
Date: Sat, 16 May 2026 01:13:34 -0700
Subject: [PATCH 02/11] fix(ce-compound-refresh): include step-5 edits in follow-up commit flow

Step 6's amend/follow-up commit logic only mentioned step 4
(docs/solutions discoverability edit). When step 4 produces no edit but
step 5 (the new CONCEPTS.md discoverability path) does, the new
instruction-file change would be left out of the commit sequence and end
up as a dirty worktree or an omitted edit. Cover both edit paths in
step 6.
---
 .../compound-engineering/skills/ce-compound-refresh/SKILL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md
index 8204ee9aa..4bfd4f58c 100644
--- a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md
@@ -660,4 +660,4 @@ After the refresh report is generated, check whether the project's instruction f
 **Skip this step entirely if `CONCEPTS.md` does not exist** — never nag for an artifact the project has not adopted. When skipped, this step produces no output and no edit.
-6. **Amend or create a follow-up commit when the check produces edits.** If step 4 resulted in an edit to an instruction file and Phase 5 already committed the refresh changes, stage the newly edited file and either amend the existing commit (if still on the same branch and no push has occurred) or create a small follow-up commit (e.g., `docs: add docs/solutions/ discoverability to AGENTS.md`). If Phase 5 already pushed the branch to a remote (e.g., the branch+PR path), push the follow-up commit as well so the open PR includes the discoverability change.
This keeps the working tree clean and the remote in sync at the end of the run. If the user chose "Don't commit" in Phase 5, leave the instruction-file edit unstaged alongside the other uncommitted refresh changes — no separate commit logic needed.
+6. **Amend or create a follow-up commit when the check produces edits.** If step 4 or step 5 resulted in an edit to an instruction file and Phase 5 already committed the refresh changes, stage the newly edited file and either amend the existing commit (if still on the same branch and no push has occurred) or create a small follow-up commit (e.g., `docs: add docs/solutions/ discoverability to AGENTS.md`, or `docs: add CONCEPTS.md discoverability to AGENTS.md`, or a combined message when both edits landed). If Phase 5 already pushed the branch to a remote (e.g., the branch+PR path), push the follow-up commit as well so the open PR includes the discoverability change. This keeps the working tree clean and the remote in sync at the end of the run. If the user chose "Don't commit" in Phase 5, leave the instruction-file edits unstaged alongside the other uncommitted refresh changes — no separate commit logic needed.

From eff84b790c8353c0fd85da604aedb2dccb7c914f Mon Sep 17 00:00:00 2001
From: Trevin Chow
Date: Sat, 16 May 2026 19:16:47 -0700
Subject: [PATCH 03/11] fix(ce-compound): close Phase 2 hand-off and vocab-capture loopholes

External test surfaced two structural failures in ce-compound that an
LLM orchestrator can hit even when following the skill text:

1. ce-sessions return read as a terminus. Phase 1's parallel block ended
   on three subagents, then ce-sessions ran synchronously as the final
   input. Phase 2 said "WAIT for all Phase 1 subagents" -- which an LLM
   could read as not including the skill call. The agent emitted
   ce-sessions's output to the user and stopped.
Fix: add a forward-edge sentence at the end of step 4 ("ce-sessions is the final Phase 1 input, not a workflow stop"), and broaden the Phase 2 WAIT line to "all Phase 1 inputs" with an explicit note that ce-sessions counts despite being a skill rather than a subagent. 2. Phase 2.4's "skip entirely if no terms qualify" let agents vibe-judge "nothing qualifies" from the inline criteria teaser and skip reading references/concepts-vocabulary.md entirely -- the opposite of the stated intent. Fix: invert the phase so "First, read the reference" is the unconditional opener, drop the inline criteria teaser (per the no-duplication-with-references principle), and replace the silent-skip path with a visible "Vocabulary capture: scanned, no qualifying terms" outcome the agent must record. Propagated the Phase 2.4 fix to ce-compound-refresh's Phase 4.5 -- same structural risk, same shared reference, both phases introduced on this branch. Tightened both success-output templates from the ambiguous "skipped (no qualifying terms)" to the unambiguous "scanned, no qualifying terms" so the audit signal cannot be confused with "didn't bother to check". --- .../skills/ce-compound-refresh/SKILL.md | 10 ++++++---- .../skills/ce-compound/SKILL.md | 13 ++++++++----- 2 files changed, 14 insertions(+), 9 deletions(-) diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md index 4bfd4f58c..5339832de 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md @@ -491,20 +491,22 @@ Only one flow runs per candidate; the reference contains the per-action criteria After the per-learning actions execute, aggregate the domain terms flagged across Phase 1's Vocabulary dimension and reconcile them with `CONCEPTS.md`. +**First, read `references/concepts-vocabulary.md`.** This is unconditional.
Do not pre-judge from memory which Phase 1 signals qualify — the reference's criteria are non-obvious and a "nothing qualifies" judgment without reading is a shortcut, not a result. + **Procedure:** -1. **Aggregate.** Collect qualifying terms surfaced across the learnings in scope. If the same term surfaced in multiple learnings with different shades of precision, **union the shades into one entry** — not three entries, not most-recent-wins. +1. **Aggregate.** Collect qualifying terms surfaced across the learnings in scope, applying the reference's criteria. If the same term surfaced in multiple learnings with different shades of precision, **union the shades into one entry** — not three entries, not most-recent-wins. 2. **If `CONCEPTS.md` exists**, add missing terms and refine existing entries when the corpus surfaced new precision. Do not duplicate entries already present. 3. **If `CONCEPTS.md` does not exist** and at least one qualifying term was surfaced, **bootstrap it**. One term is enough — do not gate creation behind a minimum count, that creates an asymmetric trap where the file only ever gets created on the second eligible run. 4. **Scope discipline and citation hygiene.** Bootstrap reflects only the area in scope — do not expand to other categories, and do not retroactively inject `(see CONCEPTS.md)` pointers into existing learnings. The report should note that additional entries are likely from refresh runs on other scopes. 5. **Initial structure.** When bootstrapping, let term count drive shape: 1-4 terms → flat headings, more → cluster by domain relationship per the rules in `references/concepts-vocabulary.md`. +If no Phase 1 signals qualified after applying the reference's criteria, record that outcome explicitly in the report's `CONCEPTS.md` line (e.g., "scanned, no qualifying terms"). Do not silently skip — the visible scan-and-no-result record is the audit signal that the reference was consulted. 
+ Note: if this run **creates** `CONCEPTS.md` from scratch, the Discoverability Check below must also surface it in `AGENTS.md`/`CLAUDE.md` so future agents discover it. Subsequent runs skip this because the instruction file is already current. **Apply edits silently — no user prompt in any mode.** Vocabulary capture is a side effect of refreshing, not a decision the user makes per run. -Read `references/concepts-vocabulary.md` for the inclusion criteria, what never belongs, per-entry quality bar, organization principle, and an illustrative example. Do not infer the format from memory — read the reference each time vocabulary capture fires. - ## Output Format **The full report MUST be printed as markdown output.** Do not summarize findings internally and then output a one-liner. The report is the deliverable — print every section in full, formatted as readable markdown with headers, tables, and bullet points. @@ -524,7 +526,7 @@ Deleted: W Skipped: V Marked stale: S -CONCEPTS.md: +CONCEPTS.md: ``` Then for EVERY file processed, list: diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index e2c0c33ea..fedfdf3a7 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -210,12 +210,13 @@ Launch research subagents. Each returns text data to the orchestrator. Do not append additional context blocks, exclusion lists, or topic-keyword bullets — verbose payloads give ce-sessions license to keep widening the search and rapidly compound wall time. If keyword search is needed, ce-sessions owns that decision internally based on the topic. - Returns: structured digest of findings from prior sessions, or "no relevant prior sessions" if none found. 
+ - **ce-sessions is the final Phase 1 input, not a workflow stop.** When it returns, proceed directly to Phase 2 with its output as the last input — do not emit a summary and do not pause for the user. A "no relevant prior sessions" return is still a valid input; the documentation gets written without session context. ### Phase 2: Assembly & Write -**WAIT for all Phase 1 subagents to complete before proceeding.** +**WAIT for all Phase 1 inputs to complete before proceeding** — the three parallel subagents and, when the user opted in, the synchronous `ce-sessions` skill call. ce-sessions is a Phase 1 input even though it is a skill rather than a subagent. The orchestrating agent (main conversation) performs these steps: @@ -249,11 +250,13 @@ When creating a new doc, preserve the section order from `assets/resolution-temp ### Phase 2.4: Vocabulary Capture -After writing the learning, scan the new doc and the surrounding conversation for **domain terms** — words used with project-specific precise meaning (entities, named processes, status concepts). If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and the learning surfaced at least one qualifying term, create it lazily. If no terms qualify, skip this phase entirely. +**First, read `references/concepts-vocabulary.md`.** This is unconditional. Do not pre-judge from memory that nothing qualifies — the reference's criteria are non-obvious and qualifying terms often live in the surrounding conversation rather than the new doc itself. Reading the reference is what makes the rest of the phase possible. -**Apply edits silently in every mode — no user prompt in interactive, lightweight, or headless.** Vocabulary capture is a side effect of compounding, not a decision the user makes per run. +Then, applying those criteria, scan the new doc **and** the surrounding conversation for qualifying domain terms. 
If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily. + +If no terms qualified after applying the reference's criteria, record that outcome explicitly in the success output (e.g., "Vocabulary capture: scanned, no qualifying terms"). Do not silently skip — the visible scan-and-no-result record is the audit signal that the reference was consulted. -Read `references/concepts-vocabulary.md` for the inclusion criteria, what never belongs, per-entry quality bar, organization principle, and an illustrative example. Do not infer the format from memory — read the reference each time vocabulary capture fires, so the rules and example stay anchored. +**Apply edits silently in every mode — no user prompt in interactive, lightweight, or headless.** Vocabulary capture is a side effect of compounding, not a decision the user makes per run. ### Phase 2.5: Selective Refresh Check @@ -486,7 +489,7 @@ Track: Category: Overlap: | high — existing doc updated> Instruction-file edit: | gap noted, not applied> -CONCEPTS.md: +CONCEPTS.md: Refresh recommendation: Documentation complete From f87d4c17586d4c34c76880f367f27f95b87b913f Mon Sep 17 00:00:00 2001 From: Trevin Chow Date: Sat, 16 May 2026 22:50:00 -0700 Subject: [PATCH 04/11] fix(concepts): add glossary-only boundary to contributor skills MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ce-brainstorm Phase 1.4 and ce-plan §5 gap-fill are contributors to CONCEPTS.md but neither loads concepts-vocabulary.md, so the criteria preventing implementation details from creeping in lived only where the contributors couldn't see them. Add an inline negative-framing line to both ("domain entities, named processes, and status concepts with project-specific meaning only — not file paths, class names, or implementation decisions"). 
Also drop rationale tails that did not change agent behavior at runtime. --- .../compound-engineering/skills/ce-brainstorm/SKILL.md | 10 +++++----- plugins/compound-engineering/skills/ce-plan/SKILL.md | 4 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md index f6c546a4a..74b322bbe 100644 --- a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md +++ b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md @@ -186,15 +186,15 @@ Follow the Interaction Rules above. Use the platform's blocking question tool wh #### 1.4 Vocabulary Capture (only if CONCEPTS.md already exists) -**Skip this sub-phase entirely if `CONCEPTS.md` does not exist at repo root.** ce-brainstorm is a contributor to existing vocabulary, not a creator of the file — creation is owned by ce-compound and ce-compound-refresh, which also handle the AGENTS.md discoverability surfacing in the same run. +**Skip this sub-phase entirely if `CONCEPTS.md` does not exist at repo root** — creation is owned by ce-compound and ce-compound-refresh. -If `CONCEPTS.md` exists, scan the dialogue for **resolved** domain terms — terms where the conversation actively pinned down a precise local meaning, not terms merely mentioned in passing. For each resolved term: if missing, add it; if present but the dialogue surfaced new precision, refine it; if already consistent, no action. +If it exists, scan the dialogue for **resolved** domain terms — terms where the conversation actively pinned down a precise local meaning, not terms merely mentioned in passing. **Resolved means the dialogue is no longer questioning the definition.** Provisional terms that may still revise stay in the conversation only. -**Resolved means the dialogue is no longer questioning the definition.** Provisional terms that the conversation may still revise stay in the conversation only. 
+For each resolved term: if missing, add it; if present but the dialogue surfaced new precision, refine it; if already consistent, no action. -Follow the format and quality bar set by the existing entries in `CONCEPTS.md` — heading levels, length norms, definition style. The file teaches its own conventions by example. +**Domain entities, named processes, and status concepts with project-specific meaning only.** Not file paths, class names, function signatures, or implementation decisions — `CONCEPTS.md` is a glossary, not a spec or scratch pad. -Apply edits silently. Vocabulary capture is a side effect of converging on shared language, not a decision the user makes per session. +Follow the format set by existing entries. Apply edits silently. ### Phase 2: Explore Approaches diff --git a/plugins/compound-engineering/skills/ce-plan/SKILL.md b/plugins/compound-engineering/skills/ce-plan/SKILL.md index 0ee59fb0c..008afccf2 100644 --- a/plugins/compound-engineering/skills/ce-plan/SKILL.md +++ b/plugins/compound-engineering/skills/ce-plan/SKILL.md @@ -234,7 +234,7 @@ Prepare a concise planning context summary (a paragraph or two) to pass as input - If an origin document exists, summarize the problem frame, requirements, and key decisions from that document - Otherwise use the feature description directly - If `STRATEGY.md` exists, read it and include the relevant pieces (target problem, approach, active tracks) in the summary so downstream research and planning decisions are anchored to product strategy -- If `CONCEPTS.md` exists at repo root, read it directly — its definitions are the canonical names for domain entities, named processes, and status concepts. Plan with those terms rather than synonyms. (ce-learnings-researcher also reads it as grounding, but a direct read at plan-start ensures the plan body uses the exact vocabulary.) 
+- If `CONCEPTS.md` exists at repo root, read it — its definitions are the canonical names for domain entities, named processes, and status concepts. Plan with those terms rather than synonyms. Run these agents in parallel: @@ -640,7 +640,7 @@ Plan written to **Pipeline mode:** If invoked from an automated workflow such as LFG or any `disable-model-invocation` context, skip interactive questions. Make the needed choices automatically and proceed to writing the plan. -**CONCEPTS.md gap-fill (only if the file already exists):** If `CONCEPTS.md` exists at repo root and the plan body uses a domain term whose definition is missing from `CONCEPTS.md`, add the entry. Follow the format and quality bar set by the existing entries — the file teaches its own conventions by example. Skip this step entirely if `CONCEPTS.md` does not exist; ce-plan is a contributor to existing vocabulary, not a creator of the file. Creation is owned by ce-compound and ce-compound-refresh. Apply silently — no user prompt. +**CONCEPTS.md gap-fill (only if the file already exists):** If the plan body uses a domain term whose definition is missing from `CONCEPTS.md`, add the entry. **Domain entities, named processes, and status concepts with project-specific meaning only** — not file paths, class names, function signatures, or implementation decisions. `CONCEPTS.md` is a glossary, not a spec or scratch pad. Follow the format set by existing entries. Apply silently. Skip entirely if `CONCEPTS.md` does not exist — creation is owned by ce-compound and ce-compound-refresh. 
#### 5.3 Confidence Check and Deepening From fba38cee030f400d98f31018d9c951f43e92e3d1 Mon Sep 17 00:00:00 2001 From: Trevin Chow Date: Sat, 16 May 2026 22:50:07 -0700 Subject: [PATCH 05/11] feat(ce-compound): intercept CONCEPTS.md bootstrap requests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Users may type "create my CONCEPTS.md" without an existing learning corpus, particularly in cold repos. Previously this had no clean routing path — ce-compound's description didn't match the request, so the main agent ad-hoc'd a response. Update ce-compound's description to declare CONCEPTS.md as a stated responsibility, and add a short intercept block near the top of the skill body. The block redirects without performing a bootstrap: explains the accretion model, notes that cold-start codebase scans are intentionally unsupported (the qualifying bar is judgmental), and offers three real next steps — run ce-compound on a real learning, ce-compound-refresh on an existing corpus, or hand-edit directly. --- plugins/compound-engineering/skills/ce-compound/SKILL.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index fedfdf3a7..75fc31595 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -1,6 +1,6 @@ --- name: ce-compound -description: Document a recently solved problem to compound your team's knowledge +description: Document a recently solved problem to compound your team's knowledge or CONCEPTS.md, the project's shared domain vocabulary. 
argument-hint: "[optional: brief context] [mode:headless] " --- @@ -23,6 +23,10 @@ Captures problem solutions while context is fresh, creating structured documenta /ce-compound mode:headless [context] # Non-interactive run with context hint ``` +## CONCEPTS.md bootstrap requests + +If invoked to create or bootstrap `CONCEPTS.md` rather than document a solved problem, do not run normal phases. Explain: the file accretes as ce-compound and ce-compound-refresh process real learnings; cold-start codebase scans aren't supported because the qualifying bar is judgmental. Redirect to ce-compound on a real learning, ce-compound-refresh on an existing corpus, or direct hand-editing. Then exit. + ## Mode Detection Check `$ARGUMENTS` for a `mode:headless` token. Tokens starting with `mode:` are flags, not context — strip `mode:headless` from arguments before treating the remainder as the brief context hint. From 43521396199ff7315fa99f1611b569c986bfe3f7 Mon Sep 17 00:00:00 2001 From: Trevin Chow Date: Sat, 16 May 2026 23:55:41 -0700 Subject: [PATCH 06/11] fix(concepts): self-correct violations during compound and refresh MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ce-compound Phase 2.4 and ce-compound-refresh Phase 4.5 establish the glossary-only rule for CONCEPTS.md but only apply it prospectively to new entries. Existing drift (file paths, class names, function signatures, status/owner metadata) survived every run. Add active correction at two scopes matched to each skill's character. ce-compound fixes opportunistically — only entries being touched or adjacent to them — because compound is not an audit. ce-compound-refresh runs a full sweep as Phase 4.5 step 6 because refresh is an audit. Extend the refresh report's CONCEPTS.md line to surface the scrubbed count alongside added and refined. 
--- .../compound-engineering/skills/ce-compound-refresh/SKILL.md | 3 ++- plugins/compound-engineering/skills/ce-compound/SKILL.md | 2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md index 5339832de..3adeac28d 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md @@ -500,6 +500,7 @@ After the per-learning actions execute, aggregate the domain terms flagged acros 3. **If `CONCEPTS.md` does not exist** and at least one qualifying term was surfaced, **bootstrap it**. One term is enough — do not gate creation behind a minimum count, that creates an asymmetric trap where the file only ever gets created on the second eligible run. 4. **Scope discipline and citation hygiene.** Bootstrap reflects only the area in scope — do not expand to other categories, and do not retroactively inject `(see CONCEPTS.md)` pointers into existing learnings. The report should note that additional entries are likely from refresh runs on other scopes. 5. **Initial structure.** When bootstrapping, let term count drive shape: 1-4 terms → flat headings, more → cluster by domain relationship per the rules in `references/concepts-vocabulary.md`. +6. **Scrub violations.** Scan existing entries for content that violates `references/concepts-vocabulary.md` criteria — implementation specifics (file paths, class names, function signatures, code references), status/owner/date metadata, or duplicates of terms covered under a different name. Rewrite or consolidate. The full sweep is appropriate here because refresh is an audit; ce-compound's same-named phase scopes corrections to entries already being touched. 
If no Phase 1 signals qualified after applying the reference's criteria, record that outcome explicitly in the report's `CONCEPTS.md` line (e.g., "scanned, no qualifying terms"). Do not silently skip — the visible scan-and-no-result record is the audit signal that the reference was consulted. @@ -526,7 +527,7 @@ Deleted: W Skipped: V Marked stale: S -CONCEPTS.md: +CONCEPTS.md: ``` Then for EVERY file processed, list: diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index 75fc31595..3aa94f7c7 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -258,6 +258,8 @@ When creating a new doc, preserve the section order from `assets/resolution-temp Then, applying those criteria, scan the new doc **and** the surrounding conversation for qualifying domain terms. If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily. +**Opportunistically fix violations you notice while editing.** If an entry being added/refined or an adjacent existing entry contains implementation specifics (file paths, class names, function signatures, code references), rewrite to the glossary standard. Do not full-audit the file — confine corrections to entries near the ones already being touched. Broader audit is ce-compound-refresh's job. + If no terms qualified after applying the reference's criteria, record that outcome explicitly in the success output (e.g., "Vocabulary capture: scanned, no qualifying terms"). Do not silently skip — the visible scan-and-no-result record is the audit signal that the reference was consulted. 
**Apply edits silently in every mode — no user prompt in interactive, lightweight, or headless.** Vocabulary capture is a side effect of compounding, not a decision the user makes per run. From 8a6e461b8810b42346a17043bd93647fb73fc38e Mon Sep 17 00:00:00 2001 From: Trevin Chow Date: Sun, 17 May 2026 00:55:22 -0700 Subject: [PATCH 07/11] fix(concepts): write visible preamble when bootstrapping CONCEPTS.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When ce-compound or ce-compound-refresh first creates CONCEPTS.md, write a short preamble at the top explaining what the file is, how it accretes, and what it isn't (glossary only, not a spec or scratchpad). Visible prose under the # Concepts heading so both humans browsing the rendered file and agents reading the raw file see the same framing — an HTML comment would have hidden the model from human readers on GitHub for no real gain. --- .../skills/ce-compound-refresh/SKILL.md | 6 +++++- plugins/compound-engineering/skills/ce-compound/SKILL.md | 4 ++++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md index 3adeac28d..30d8ea1cb 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md @@ -499,7 +499,11 @@ After the per-learning actions execute, aggregate the domain terms flagged acros 2. **If `CONCEPTS.md` exists**, add missing terms and refine existing entries when the corpus surfaced new precision. Do not duplicate entries already present. 3. **If `CONCEPTS.md` does not exist** and at least one qualifying term was surfaced, **bootstrap it**. One term is enough — do not gate creation behind a minimum count, that creates an asymmetric trap where the file only ever gets created on the second eligible run. 4. 
**Scope discipline and citation hygiene.** Bootstrap reflects only the area in scope — do not expand to other categories, and do not retroactively inject `(see CONCEPTS.md)` pointers into existing learnings. The report should note that additional entries are likely from refresh runs on other scopes. -5. **Initial structure.** When bootstrapping, let term count drive shape: 1-4 terms → flat headings, more → cluster by domain relationship per the rules in `references/concepts-vocabulary.md`. +5. **Initial structure.** When bootstrapping, start the file with this preamble under the `# Concepts` heading: + + > Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or scratchpad. + + Then add entries. Let term count drive shape: 1-4 terms → flat headings, more → cluster by domain relationship per the rules in `references/concepts-vocabulary.md`. 6. **Scrub violations.** Scan existing entries for content that violates `references/concepts-vocabulary.md` criteria — implementation specifics (file paths, class names, function signatures, code references), status/owner/date metadata, or duplicates of terms covered under a different name. Rewrite or consolidate. The full sweep is appropriate here because refresh is an audit; ce-compound's same-named phase scopes corrections to entries already being touched. If no Phase 1 signals qualified after applying the reference's criteria, record that outcome explicitly in the report's `CONCEPTS.md` line (e.g., "scanned, no qualifying terms"). Do not silently skip — the visible scan-and-no-result record is the audit signal that the reference was consulted. 
diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index 3aa94f7c7..742c5b345 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -258,6 +258,10 @@ When creating a new doc, preserve the section order from `assets/resolution-temp Then, applying those criteria, scan the new doc **and** the surrounding conversation for qualifying domain terms. If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily. +**When bootstrapping the file, start with this preamble under the `# Concepts` heading**, then add the qualifying entries below it: + +> Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or scratchpad. + **Opportunistically fix violations you notice while editing.** If an entry being added/refined or an adjacent existing entry contains implementation specifics (file paths, class names, function signatures, code references), rewrite to the glossary standard. Do not full-audit the file — confine corrections to entries near the ones already being touched. Broader audit is ce-compound-refresh's job. If no terms qualified after applying the reference's criteria, record that outcome explicitly in the success output (e.g., "Vocabulary capture: scanned, no qualifying terms"). Do not silently skip — the visible scan-and-no-result record is the audit signal that the reference was consulted. 
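Reviewer note (illustrative only, not part of the patch): the bootstrap behavior described in this commit is carried out by an agent following prose instructions, not by code. As a rough sketch of the intended file shape, the Python below assumes a hypothetical helper `bootstrap_concepts`, a hypothetical `entries` mapping of term to definition, and `##` as the flat heading level; none of these names or choices come from the skills themselves. The preamble is passed in as a parameter so the exact wording stays owned by the skill text.

```python
from pathlib import Path


def bootstrap_concepts(path, preamble, entries):
    """Create the vocabulary file only when it is missing and a term qualified.

    Hypothetical sketch: `entries` maps a term to its definition.
    Returns True when the file was created, False when nothing was written.
    """
    target = Path(path)
    if target.exists() or not entries:
        # An existing file is updated through a different path, and
        # zero qualifying terms means no file is created at all.
        return False

    # Heading first, then the visible preamble as a blockquote.
    lines = ["# Concepts", "", f"> {preamble}", ""]

    # Flat headings only; the "##" level is an assumption, and clustering
    # for larger term counts is left out of this sketch.
    for term, definition in entries.items():
        lines += [f"## {term}", "", definition, ""]

    target.write_text("\n".join(lines), encoding="utf-8")
    return True
```

Returning a boolean rather than raising keeps the "file already exists" and "nothing qualified" cases cheap to report, which matches the instruction to record a visible outcome instead of skipping silently.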
From dd00dd4609d627f15a6889b4b1fc028812df7c5f Mon Sep 17 00:00:00 2001 From: Trevin Chow Date: Sun, 17 May 2026 13:36:50 -0700 Subject: [PATCH 08/11] fix(concepts): tighten qualifying bar at CONCEPTS.md creation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The "at least one qualifying term" gate in ce-compound Phase 2.4 and ce-compound-refresh Phase 4.5 step 3 could allow a permissive agent to seed CONCEPTS.md from a routine bug fix that only surfaced class or table names dressed up as entities. The criteria in concepts-vocabulary.md are correct but judgmental, and lenience at the creation moment seeds a thin file the team didn't actually need. Add an explicit "hold the qualifying bar conservatively at creation" rule to both skills. Borderline terms defer to a later run with stronger signal. The conservatism is quality, not count — the asymmetric-trap defense against minimum-count gating is preserved. Updates to an existing file continue to follow normal criteria. --- .../compound-engineering/skills/ce-compound-refresh/SKILL.md | 2 +- plugins/compound-engineering/skills/ce-compound/SKILL.md | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md index 30d8ea1cb..28f84f97f 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md @@ -497,7 +497,7 @@ After the per-learning actions execute, aggregate the domain terms flagged acros 1. **Aggregate.** Collect qualifying terms surfaced across the learnings in scope, applying the reference's criteria. If the same term surfaced in multiple learnings with different shades of precision, **union the shades into one entry** — not three entries, not most-recent-wins. 2. 
**If `CONCEPTS.md` exists**, add missing terms and refine existing entries when the corpus surfaced new precision. Do not duplicate entries already present. -3. **If `CONCEPTS.md` does not exist** and at least one qualifying term was surfaced, **bootstrap it**. One term is enough — do not gate creation behind a minimum count, that creates an asymmetric trap where the file only ever gets created on the second eligible run. +3. **If `CONCEPTS.md` does not exist** and at least one qualifying term was surfaced, **bootstrap it**. One term is enough — do not gate creation behind a minimum count, that creates an asymmetric trap where the file only ever gets created on the second eligible run. **At creation, hold the qualifying bar conservatively** — a borderline term or a class/table/file name dressed up as an entity should defer until a later run surfaces stronger signal. The conservatism is about quality, not count; updates to an existing file follow normal criteria. 4. **Scope discipline and citation hygiene.** Bootstrap reflects only the area in scope — do not expand to other categories, and do not retroactively inject `(see CONCEPTS.md)` pointers into existing learnings. The report should note that additional entries are likely from refresh runs on other scopes. 5. **Initial structure.** When bootstrapping, start the file with this preamble under the `# Concepts` heading: diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index 742c5b345..accea0781 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -258,6 +258,8 @@ When creating a new doc, preserve the section order from `assets/resolution-temp Then, applying those criteria, scan the new doc **and** the surrounding conversation for qualifying domain terms. 
If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily. +**At creation only, hold the qualifying bar conservatively.** A borderline term, or a class/table/file name dressed up as an entity, does not justify seeding a new file — defer until a later run surfaces stronger signal. This conservatism applies to creation quality only; updates to an existing file follow the normal criteria. + **When bootstrapping the file, start with this preamble under the `# Concepts` heading**, then add the qualifying entries below it: > Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or scratchpad. From a200dea56926f9f09fe2ebe61514b38a2664ff41 Mon Sep 17 00:00:00 2001 From: Trevin Chow Date: Sun, 17 May 2026 17:04:44 -0700 Subject: [PATCH 09/11] fix(concepts): sharpen CONCEPTS.md framing and capture from sessions MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit After comparing against grill-with-docs (third-party skill for a similar artifact), sharpen how CONCEPTS.md is framed across the plugin and close a terminology-capture gap. In references/concepts-vocabulary.md (both copies): - Lead with "Be opinionated" as the file's stance. - Replace the enumerated "What never appears" list with the principle "The file stands on its own" — one mental test that subsumes the existing exclusions and extends to cases we hadn't enumerated. - Add aliases-per-entry format (*Avoid: X, Y*) so retired synonyms ride alongside their canonical term. - Tighten "Per entry" to one-sentence base definition; explicit second-paragraph allowance for non-obvious behavioral rules only. 
- Add optional Relationships section when structure is load-bearing. - Rename "Resolved ambiguities" to "Flagged ambiguities." In ce-brainstorm Phase 1.1: reframe CONCEPTS.md as the project's authoritative vocabulary (was: shared domain vocabulary that anchors terms here). Carries authority across the whole session without needing to restate "use canonical names" at every downstream phase. In ce-compound Phase 2.4: extend the vocabulary scan to include ce-sessions findings when Full mode runs. Session findings carry terminology resolution context from prior brainstorm, plan, and work dialogues; without this, that context was being pulled in for research but ignored at capture time. Also replace "scratchpad" with "catch-all" across four locations — clearer naming of the failure mode (dumping ground for things that don't fit elsewhere). --- .../skills/ce-brainstorm/SKILL.md | 4 +- .../skills/ce-compound-refresh/SKILL.md | 2 +- .../references/concepts-vocabulary.md | 37 ++++++++++++++----- .../skills/ce-compound/SKILL.md | 4 +- .../references/concepts-vocabulary.md | 37 ++++++++++++++----- .../skills/ce-plan/SKILL.md | 2 +- 6 files changed, 62 insertions(+), 24 deletions(-) diff --git a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md index 74b322bbe..4e2589d82 100644 --- a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md +++ b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md @@ -111,7 +111,7 @@ Scan the repo before substantive brainstorming. Match depth to scope: **Standard and Deep** — Two passes: -*Constraint Check* — Check project instruction files (`AGENTS.md`, and `CLAUDE.md` only if retained as compatibility context) for workflow, product, or scope constraints that affect the brainstorm. 
Also read `STRATEGY.md` if it exists — the product's target problem, approach, persona, and active tracks are direct input to what this brainstorm should deliver and should shape scope, success criteria, and which approaches are aligned vs out-of-scope. Also read `CONCEPTS.md` at repo root if it exists — the project's shared domain vocabulary anchors what terms mean here, and the dialogue should use those canonical names rather than synonyms. If any of these add nothing, move on. +*Constraint Check* — Check project instruction files (`AGENTS.md`, and `CLAUDE.md` only if retained as compatibility context) for workflow, product, or scope constraints that affect the brainstorm. Also read `STRATEGY.md` if it exists — the product's target problem, approach, persona, and active tracks are direct input to what this brainstorm should deliver and should shape scope, success criteria, and which approaches are aligned vs out-of-scope. Also read `CONCEPTS.md` at repo root if it exists — the project's authoritative vocabulary. Use these names in dialogue, approaches, and the requirements doc; map user-offered synonyms back. If any of these add nothing, move on. *Topic Scan* — Search for relevant terms. Read the most relevant existing artifact if one exists (brainstorm, plan, spec, skill, feature doc). Skim adjacent examples covering similar behavior. @@ -192,7 +192,7 @@ If it exists, scan the dialogue for **resolved** domain terms — terms where th For each resolved term: if missing, add it; if present but the dialogue surfaced new precision, refine it; if already consistent, no action. -**Domain entities, named processes, and status concepts with project-specific meaning only.** Not file paths, class names, function signatures, or implementation decisions — `CONCEPTS.md` is a glossary, not a spec or scratch pad. 
+**Domain entities, named processes, and status concepts with project-specific meaning only.** Not file paths, class names, function signatures, or implementation decisions — `CONCEPTS.md` is a glossary, not a spec or catch-all. Follow the format set by existing entries. Apply edits silently. diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md index 28f84f97f..79d2ebaa5 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md @@ -501,7 +501,7 @@ After the per-learning actions execute, aggregate the domain terms flagged acros 4. **Scope discipline and citation hygiene.** Bootstrap reflects only the area in scope — do not expand to other categories, and do not retroactively inject `(see CONCEPTS.md)` pointers into existing learnings. The report should note that additional entries are likely from refresh runs on other scopes. 5. **Initial structure.** When bootstrapping, start the file with this preamble under the `# Concepts` heading: - > Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or scratchpad. + > Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or catch-all. Then add entries. Let term count drive shape: 1-4 terms → flat headings, more → cluster by domain relationship per the rules in `references/concepts-vocabulary.md`. 6. 
**Scrub violations.** Scan existing entries for content that violates `references/concepts-vocabulary.md` criteria — implementation specifics (file paths, class names, function signatures, code references), status/owner/date metadata, or duplicates of terms covered under a different name. Rewrite or consolidate. The full sweep is appropriate here because refresh is an audit; ce-compound's same-named phase scopes corrections to entries already being touched. diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md b/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md index e986c1ad2..27c1f5594 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md +++ b/plugins/compound-engineering/skills/ce-compound-refresh/references/concepts-vocabulary.md @@ -2,25 +2,43 @@ `CONCEPTS.md` defines the words that mean something specific in this codebase — substrate that `docs/solutions/` and AGENTS.md can cite without redefinition. Lives at the repo root, created lazily the first time a learning surfaces a qualifying term. -## What earns a slot +## Be opinionated + +When the team uses several words for the same concept, pick the best one and retire the rest. Record retired synonyms as aliases on the entry (see "Per entry"). Settled distinctions go to the Flagged ambiguities tail. The glossary is not a record of all words the team has ever used — it is the team's agreed-upon vocabulary. + +## The file stands on its own -A term qualifies when its meaning here is precise enough that a new engineer would need it defined to follow conversations, tickets, or code. General programming vocabulary — caches, queues, jobs, sessions — does not belong, even when used heavily. +Each entry teaches its concept to a reader with no access to anything else — no codebase, no PR history, no architecture meetings, no Slack. 
This rules out: -## What never appears +- Implementation specifics (file paths, class names, function signatures, table names, library calls) +- Status fields, dates, owners on the entries +- Examples drawn from current code +- Links to PRs, issues, channels, or roadmap milestones +- Version-specific claims ("currently uses X; migrating to Y") -Implementation specifics — file paths, class names, table names, function signatures. Status fields, dates, owners on the entries. Examples drawn from current code. Anything git, the codebase, or the learnings store would tell you. +Cross-references between entries within `CONCEPTS.md` are fine — they resolve internally. General programming vocabulary (caches, queues, jobs, sessions) and everyday domain English need no redefinition either. + +## What earns a slot + +A term qualifies when its meaning here is precise enough that a new engineer would need it defined to follow conversations, tickets, or code. General programming vocabulary does not belong, even when used heavily. ## Per entry -Lead with identity, not behavior — what kind of thing it is, what makes it distinct, what it stands in relation to. Length tracks complexity: most entries are one sentence; a term with non-obvious rules earns a second paragraph for them. Entities typically need more depth than value types; status concepts may need transition notes. When the team uses several words for the same concept, choose one and retire the rest. +Definition is one sentence — what the term means in this domain, what makes it distinct from neighbors. A term with non-obvious behavioral rules (lifecycle, cancellation semantics, ownership invariants) earns a second paragraph for those rules — never for elaborating the definition itself. + +When retired synonyms exist, list them as an aliases line directly under the definition: *Avoid: Booking, appointment*. Entities typically need more depth than value types; status concepts may need transition notes. 
+ +## Relationships (optional) + +When relationships between entries carry load-bearing meaning (ownership, cardinality, lifecycle dependencies that span entries), capture them in a `## Relationships` section near the top of the file or its cluster. Skip when entries stand on their own without structural context — relationships are a lift for domains where structure is part of what makes terms meaningful, not a routine section. ## Organization Cluster concepts by domain relationship — entities with their states, processes with their stages — so a reader sees structure without effort. A flat list works when the file is small. Reshape as the file grows. -## Resolved ambiguities (optional, tail of file) +## Flagged ambiguities (tail of file) -When two terms were used interchangeably and the team settled on a distinction, record the resolution as a one-line note: *"'account' had been used for both Customer and User — these are distinct."* +When two terms were used interchangeably and the team settled on a distinction, record the resolution as a one-line note: *"'account' had been used for both Customer and User — these are distinct."* This section is the audit trail for opinions the team has formed. ## One illustrative entry — the shape, not a template @@ -28,9 +46,10 @@ When two terms were used interchangeably and the team settled on a distinction, ## Booking ### Reservation -A future commitment to seat a Party at a specified date and time. A Reservation owns its Party but does not own a Table — Tables are acquired only when the Party arrives, through a Seating. Lifecycle: Booked, Seated, Completed, No-Show. +A future commitment to seat a Party at a specified date and time. +*Avoid:* Booking, appointment -Cancellation before a Seating is non-destructive. Cancellation after a Seating is recorded as a No-Show. +A Reservation owns its Party but does not own a Table — Tables are acquired only when the Party arrives, through a Seating. 
Lifecycle: Booked, Seated, Completed, No-Show. Cancellation before a Seating is non-destructive; cancellation after a Seating is recorded as a No-Show. ### Party The guests committed to a Reservation. Each Reservation has exactly one Party. Party size is the count promised at booking, not the count who arrive. diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index accea0781..366059e1e 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -256,13 +256,13 @@ When creating a new doc, preserve the section order from `assets/resolution-temp **First, read `references/concepts-vocabulary.md`.** This is unconditional. Do not pre-judge from memory that nothing qualifies — the reference's criteria are non-obvious and qualifying terms often live in the surrounding conversation rather than the new doc itself. Reading the reference is what makes the rest of the phase possible. -Then, applying those criteria, scan the new doc **and** the surrounding conversation for qualifying domain terms. If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily. +Then, applying those criteria, scan three inputs for qualifying domain terms: the new doc, the surrounding conversation, and — in Full mode — the `ce-sessions` findings returned in Phase 1. Session findings carry terminology context from past brainstorm, plan, and work dialogues where terms were actively being resolved; include them in the scan so that resolution context isn't lost when the file is finally written. If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily. 
**At creation only, hold the qualifying bar conservatively.** A borderline term, or a class/table/file name dressed up as an entity, does not justify seeding a new file — defer until a later run surfaces stronger signal. This conservatism applies to creation quality only; updates to an existing file follow the normal criteria. **When bootstrapping the file, start with this preamble under the `# Concepts` heading**, then add the qualifying entries below it: -> Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or scratchpad. +> Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or catch-all. **Opportunistically fix violations you notice while editing.** If an entry being added/refined or an adjacent existing entry contains implementation specifics (file paths, class names, function signatures, code references), rewrite to the glossary standard. Do not full-audit the file — confine corrections to entries near the ones already being touched. Broader audit is ce-compound-refresh's job. diff --git a/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md b/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md index e986c1ad2..27c1f5594 100644 --- a/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md +++ b/plugins/compound-engineering/skills/ce-compound/references/concepts-vocabulary.md @@ -2,25 +2,43 @@ `CONCEPTS.md` defines the words that mean something specific in this codebase — substrate that `docs/solutions/` and AGENTS.md can cite without redefinition. 
Lives at the repo root, created lazily the first time a learning surfaces a qualifying term. -## What earns a slot +## Be opinionated + +When the team uses several words for the same concept, pick the best one and retire the rest. Record retired synonyms as aliases on the entry (see "Per entry"). Settled distinctions go to the Flagged ambiguities tail. The glossary is not a record of all words the team has ever used — it is the team's agreed-upon vocabulary. + +## The file stands on its own -A term qualifies when its meaning here is precise enough that a new engineer would need it defined to follow conversations, tickets, or code. General programming vocabulary — caches, queues, jobs, sessions — does not belong, even when used heavily. +Each entry teaches its concept to a reader with no access to anything else — no codebase, no PR history, no architecture meetings, no Slack. This rules out: -## What never appears +- Implementation specifics (file paths, class names, function signatures, table names, library calls) +- Status fields, dates, owners on the entries +- Examples drawn from current code +- Links to PRs, issues, channels, or roadmap milestones +- Version-specific claims ("currently uses X; migrating to Y") -Implementation specifics — file paths, class names, table names, function signatures. Status fields, dates, owners on the entries. Examples drawn from current code. Anything git, the codebase, or the learnings store would tell you. +Cross-references between entries within `CONCEPTS.md` are fine — they resolve internally. General programming vocabulary (caches, queues, jobs, sessions) and everyday domain English need no redefinition either. + +## What earns a slot + +A term qualifies when its meaning here is precise enough that a new engineer would need it defined to follow conversations, tickets, or code. General programming vocabulary does not belong, even when used heavily. 
## Per entry -Lead with identity, not behavior — what kind of thing it is, what makes it distinct, what it stands in relation to. Length tracks complexity: most entries are one sentence; a term with non-obvious rules earns a second paragraph for them. Entities typically need more depth than value types; status concepts may need transition notes. When the team uses several words for the same concept, choose one and retire the rest. +Definition is one sentence — what the term means in this domain, what makes it distinct from neighbors. A term with non-obvious behavioral rules (lifecycle, cancellation semantics, ownership invariants) earns a second paragraph for those rules — never for elaborating the definition itself. + +When retired synonyms exist, list them as an aliases line directly under the definition: *Avoid: Booking, appointment*. Entities typically need more depth than value types; status concepts may need transition notes. + +## Relationships (optional) + +When relationships between entries carry load-bearing meaning (ownership, cardinality, lifecycle dependencies that span entries), capture them in a `## Relationships` section near the top of the file or its cluster. Skip when entries stand on their own without structural context — relationships are a lift for domains where structure is part of what makes terms meaningful, not a routine section. ## Organization Cluster concepts by domain relationship — entities with their states, processes with their stages — so a reader sees structure without effort. A flat list works when the file is small. Reshape as the file grows. 
-## Resolved ambiguities (optional, tail of file) +## Flagged ambiguities (tail of file) -When two terms were used interchangeably and the team settled on a distinction, record the resolution as a one-line note: *"'account' had been used for both Customer and User — these are distinct."* +When two terms were used interchangeably and the team settled on a distinction, record the resolution as a one-line note: *"'account' had been used for both Customer and User — these are distinct."* This section is the audit trail for opinions the team has formed. ## One illustrative entry — the shape, not a template @@ -28,9 +46,10 @@ When two terms were used interchangeably and the team settled on a distinction, ## Booking ### Reservation -A future commitment to seat a Party at a specified date and time. A Reservation owns its Party but does not own a Table — Tables are acquired only when the Party arrives, through a Seating. Lifecycle: Booked, Seated, Completed, No-Show. +A future commitment to seat a Party at a specified date and time. +*Avoid:* Booking, appointment -Cancellation before a Seating is non-destructive. Cancellation after a Seating is recorded as a No-Show. +A Reservation owns its Party but does not own a Table — Tables are acquired only when the Party arrives, through a Seating. Lifecycle: Booked, Seated, Completed, No-Show. Cancellation before a Seating is non-destructive; cancellation after a Seating is recorded as a No-Show. ### Party The guests committed to a Reservation. Each Reservation has exactly one Party. Party size is the count promised at booking, not the count who arrive. 
diff --git a/plugins/compound-engineering/skills/ce-plan/SKILL.md b/plugins/compound-engineering/skills/ce-plan/SKILL.md index 008afccf2..c49619142 100644 --- a/plugins/compound-engineering/skills/ce-plan/SKILL.md +++ b/plugins/compound-engineering/skills/ce-plan/SKILL.md @@ -640,7 +640,7 @@ Plan written to **Pipeline mode:** If invoked from an automated workflow such as LFG or any `disable-model-invocation` context, skip interactive questions. Make the needed choices automatically and proceed to writing the plan. -**CONCEPTS.md gap-fill (only if the file already exists):** If the plan body uses a domain term whose definition is missing from `CONCEPTS.md`, add the entry. **Domain entities, named processes, and status concepts with project-specific meaning only** — not file paths, class names, function signatures, or implementation decisions. `CONCEPTS.md` is a glossary, not a spec or scratch pad. Follow the format set by existing entries. Apply silently. Skip entirely if `CONCEPTS.md` does not exist — creation is owned by ce-compound and ce-compound-refresh. +**CONCEPTS.md gap-fill (only if the file already exists):** If the plan body uses a domain term whose definition is missing from `CONCEPTS.md`, add the entry. **Domain entities, named processes, and status concepts with project-specific meaning only** — not file paths, class names, function signatures, or implementation decisions. `CONCEPTS.md` is a glossary, not a spec or catch-all. Follow the format set by existing entries. Apply silently. Skip entirely if `CONCEPTS.md` does not exist — creation is owned by ce-compound and ce-compound-refresh. 
 #### 5.3 Confidence Check and Deepening

From b5ee4ca5c5a9f6280a83ac65162ae65e71c2062d Mon Sep 17 00:00:00 2001
From: Trevin Chow
Date: Sun, 17 May 2026 20:14:48 -0700
Subject: [PATCH 10/11] fix(concepts): drop ce-sessions as scan input for Phase 2.4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Earlier in this branch, Phase 2.4's vocabulary scan was extended to include ce-sessions findings as a third input. Architectural review surfaced two problems with that wiring:

- ce-compound's payload to ce-sessions includes a "directly relevant to this specific problem; ignore unrelated work" filter rule, which actively suppresses the tangential context where vocabulary often lives. The filter is correct for fix-context retrieval but wrong for vocabulary capture — the two needs pull in opposite directions.
- Wiring named external sources into Phase 2.4 creates maintenance debt: every new research input (future Slack research, Linear context, etc.) requires updating the scan input list.

Revert to scanning only the new doc and the surrounding conversation. Both are always available to the orchestrating agent — no plumbing, no filter-rule mismatch. Conversation catches mid-dialogue vocabulary resolutions that didn't make the doc; the doc captures terms the writer judged worth recording. Terms that emerged only in non-conversation sources (research subagents, ce-sessions) flow into Phase 2.4 indirectly via the doc-writer's synthesis, which is the right level of curation.

If external-source vocabulary mining ever becomes a real need, design it as a dedicated dispatch with a vocabulary-tuned payload, not as a Phase 2.4 scan input.
---
 plugins/compound-engineering/skills/ce-compound/SKILL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md
index 366059e1e..12d230957 100644
--- a/plugins/compound-engineering/skills/ce-compound/SKILL.md
+++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md
@@ -256,7 +256,7 @@ When creating a new doc, preserve the section order from `assets/resolution-temp

 **First, read `references/concepts-vocabulary.md`.** This is unconditional. Do not pre-judge from memory that nothing qualifies — the reference's criteria are non-obvious and qualifying terms often live in the surrounding conversation rather than the new doc itself. Reading the reference is what makes the rest of the phase possible.

-Then, applying those criteria, scan three inputs for qualifying domain terms: the new doc, the surrounding conversation, and — in Full mode — the `ce-sessions` findings returned in Phase 1. Session findings carry terminology context from past brainstorm, plan, and work dialogues where terms were actively being resolved; include them in the scan so that resolution context isn't lost when the file is finally written. If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily.
+Then, applying those criteria, scan the new doc **and** the surrounding conversation for qualifying domain terms. If `CONCEPTS.md` exists at repo root, add missing qualifying terms and refine existing entries when new precision surfaced. If it does not exist and at least one qualifying term surfaced, create it lazily.

 **At creation only, hold the qualifying bar conservatively.** A borderline term, or a class/table/file name dressed up as an entity, does not justify seeding a new file — defer until a later run surfaces stronger signal. This conservatism applies to creation quality only; updates to an existing file follow the normal criteria.

From 9afd41088f093f35e4818992922693be21cb6647 Mon Sep 17 00:00:00 2001
From: Trevin Chow
Date: Sun, 17 May 2026 20:15:00 -0700
Subject: [PATCH 11/11] test(ce-sessions): add terminology-preservation eval suite
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds an eval suite that tests whether ce-sessions findings preserve terminology resolution context — specifically, whether distinctive coined terms and their resolution rationale survive the session-historian synthesis step intact.

Four test cases with ground truth from recently merged PRs:

- synthesis-gate-recovery (PR #822) — distinctive term recovery
- mode-headless-semantic-alignment (PR #813) — multi-piece nuance
- tangential-term-recovery — indexing-gap test
- near-miss-false-positive — discriminating-power test

Two-stage grader: programmatic substring match per criticality tier, plus LLM-graded context preservation. Variance protocol: 3 runs per eval.

This suite was built during PR #838's design exploration to validate a load-bearing assumption (that ce-sessions findings could feed ce-compound Phase 2.4's vocabulary scan). That assumption was ultimately retired in favor of doc-and-conversation-only scanning, so the suite is not load-bearing for PR #838. Kept as future infrastructure for validating ce-sessions's behavior as the skill evolves — e.g., when changing the session-historian synthesis prompt or adjusting scan-window defaults.
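
As a rough illustration of the programmatic stage and the variance protocol described above, the sketch below computes per-tier substring recall for one run and then aggregates one tier across runs. It is not the suite's actual grader. The function names, the dictionary shape for expected terms, the `should` tier, and the case-insensitive matching are all assumptions made for the example; only the `must` tier name and the substring-match idea come from this message.

```python
from statistics import mean, pstdev


def tier_recall(findings: str, expected_terms: dict[str, list[str]]) -> dict[str, float]:
    """Return the fraction of expected terms found in `findings`, per tier.

    Matching is a case-insensitive substring test (an assumption of this
    sketch). Tiers with no expected terms are skipped.
    """
    haystack = findings.lower()
    recall: dict[str, float] = {}
    for tier, terms in expected_terms.items():
        if not terms:
            continue
        hits = sum(1 for term in terms if term.lower() in haystack)
        recall[tier] = hits / len(terms)
    return recall


def aggregate_runs(
    run_findings: list[str],
    expected_terms: dict[str, list[str]],
    tier: str = "must",
) -> dict[str, float]:
    """Aggregate one tier's recall across repeated runs of the same eval."""
    recalls = [tier_recall(text, expected_terms)[tier] for text in run_findings]
    return {
        "mean": mean(recalls),
        "stddev": pstdev(recalls),  # population stddev over the runs
        "runs": len(recalls),
    }


if __name__ == "__main__":
    # Hypothetical expected terms and findings text, for illustration only.
    expected = {"must": ["synthesis gate"], "should": ["headless", "gate recovery"]}
    runs = [
        "The Synthesis Gate was added to ce-plan.",
        "We discussed the synthesis gate and headless mode.",
        "Nothing relevant was found.",
    ]
    print(aggregate_runs(runs, expected))
```

A stage like this only shows whether a term survived verbatim. Whether the rationale around the term survived is left to the LLM-graded second stage.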
Iteration-1 results (executed via skill-creator framework, captured to /tmp/compound-engineering/ce-sessions/evals/iteration-1/) showed ce-sessions preserved terminology strongly across all 4 evals with 100% must-tier recall and 0% stddev — but this is a capability test of the skill in isolation, not a test of any specific integration. --- .../skills/ce-sessions/evals/README.md | 72 +++++++++++ .../skills/ce-sessions/evals/evals.json | 108 +++++++++++++++++ .../skills/ce-sessions/evals/grader.md | 114 ++++++++++++++++++ 3 files changed, 294 insertions(+) create mode 100644 plugins/compound-engineering/skills/ce-sessions/evals/README.md create mode 100644 plugins/compound-engineering/skills/ce-sessions/evals/evals.json create mode 100644 plugins/compound-engineering/skills/ce-sessions/evals/grader.md diff --git a/plugins/compound-engineering/skills/ce-sessions/evals/README.md b/plugins/compound-engineering/skills/ce-sessions/evals/README.md new file mode 100644 index 000000000..8482d43b9 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-sessions/evals/README.md @@ -0,0 +1,72 @@ +# ce-sessions terminology-preservation eval suite + +## Purpose + +Validate a load-bearing assumption introduced by PR #838 (`feat(concepts): introduce CONCEPTS.md as shared vocabulary substrate`): that ce-sessions findings preserve enough terminology resolution context for ce-compound Phase 2.4's vocabulary capture to extract qualifying domain terms. + +If ce-sessions returns only high-level "here's what was discussed" summaries that drop the specific coined terms and resolution context, then wiring its output into ce-compound's vocabulary-capture scan is decorative. If it returns terms with the rationale around them, the wiring works as advertised. + +This suite is narrowly scoped to the terminology-preservation question. It does not evaluate ce-sessions's general search quality, response shape, or any other property. 
+ +## Files + +| File | Purpose | +|------|---------| +| `evals.json` | Test case definitions with prompts, expected terminology by criticality tier, expected context items, and ground-truth pointers (PR numbers + merge commits) | +| `grader.md` | Grading rubric — two-stage (programmatic substring + LLM context-preservation), per-run + aggregate metrics, risk attribution | +| `README.md` | This file | + +## Test cases at a glance + +| # | Name | Risk tested | Ground truth | +|---|------|-------------|--------------| +| 1 | synthesis-gate-recovery | Synthesis loss (distinctive term) | PR #822 (merged 2026-05-15) | +| 2 | mode-headless-semantic-alignment | Synthesis loss (multi-piece nuance) | PR #813 (merged 2026-05-10) | +| 3 | tangential-term-recovery | Indexing gap | PRs #822, #819, #829 | +| 4 | near-miss-false-positive | False positive on shared keyword | Anti-PR: #813 | + +## Design rationale + +**Why these four cases.** Each isolates a distinct failure mode of the load-bearing assumption: + +- **Eval 1** uses a single, distinctive coined term ("synthesis gate") so a failure is unambiguous evidence of synthesis loss. If ce-sessions cannot return this term verbatim when queried about its own work, the assumption is broken. +- **Eval 2** tests a multi-piece design decision (rename + cross-skill alignment + a principle refinement). A pass here demonstrates ce-sessions preserves nuance, not only flashy coined nouns. +- **Eval 3** is the indexing-gap test. The query mentions "ce-plan workflow improvements" without naming any of the synthesis-gate terminology. Phase 2.4's real-world use is broad-topic queries hoping to surface terminology — if eval 3 fails while eval 1 passes, ce-sessions only retrieves terms when queried by them, which means ce-compound's wiring is decorative for the actual use case. +- **Eval 4** is the discriminating-power test. 
If ce-sessions surfaces the ce-compound mode:headless feature work as relevant to a CI/CD server-deployment query, false-positive findings would feed wrong vocabulary into Phase 2.4. + +**Why two-stage grading.** Programmatic substring matching (Stage 1) cheaply catches the worst case: distinctive terms dropped entirely. LLM-graded context preservation (Stage 2) catches the subtler case where the term survives but the rationale around it is summarized away — which would let Phase 2.4 see the term but be unable to write a useful CONCEPTS.md entry because the context for *why* it qualifies is gone. + +**Why variance across runs.** ce-sessions involves an LLM synthesis step (the session-historian subagent). Single-run pass/fail is a misleading signal because the same prompt may produce different findings on different invocations. The 3-runs-per-eval protocol catches the case where the assumption holds on average but fails frequently enough in practice to be unreliable. + +## How to run (framework-driven) + +This suite is run via the `skill-creator` framework, not manually. The framework spawns subagents in parallel to invoke ce-sessions, captures findings to a workspace, grades them, aggregates, and opens a viewer. + +**Workspace location:** `/tmp/compound-engineering/ce-sessions/evals/iteration-/` (per repo AGENTS.md scratch conventions — `/tmp` for cross-invocation reusable scratch, accessible for grep/inspection). + +**One subagent dispatch per eval × per run.** Each dispatched subagent receives the eval prompt, invokes `/ce-sessions `, captures the findings text verbatim, and writes to `/iteration-/eval--/run-/findings.txt`. + +With the default `runs_per_eval: 3` and 4 evals, that's 12 with-skill subagent dispatches per run pass. + +**Baseline runs are optional and not part of the initial pass.** skill-creator's standard flow spawns a baseline subagent per eval (without the skill) to compare with-skill vs without-skill. 
For our use case, that comparison is weaker signal because the questions all require session access — a baseline agent will trivially fail to recover terminology because it has no session history at all. The grader's pass/fail comes from terminology-preservation grading against ground truth, not from with/without delta. If you want the baselines for a sanity-check control (confirming ce-sessions is the source of any recovered terms), they can be added by running 4 more dispatches without the skill path. + +**Grading.** After all with-skill runs return, dispatch a grader subagent that reads each `findings.txt` and applies `grader.md`'s two-stage rubric. The grader writes `grading.json` per run and aggregates to `summary.json` per eval. + +**Viewer.** After grading, run `python /eval-viewer/generate_review.py` against the workspace iteration directory. The viewer renders findings alongside expected terms and lets you eyeball context preservation per run. + +## Ground truth caveats + +- The eval suite assumes the user's session history contains the sessions that produced PRs #813 and #822. If those sessions were on a different machine or are no longer in session storage, eval 1 and 2 will fail for a reason that's NOT a ce-sessions defect. +- Before running, confirm the relevant sessions are reachable. Quick sanity check: `/ce-sessions "what did I do on 2026-05-10?"` — if ce-sessions returns content from around that date, history is present. +- If history is missing, treat eval results as inconclusive rather than as evidence against the assumption. + +## Interpreting outcomes + +| Outcome | Interpretation | Action | +|---------|----------------|--------| +| All 4 evals pass with low variance | Assumption holds. ce-compound Phase 2.4 wiring works as advertised. | Ship PR #838. | +| Eval 1 or 2 fails Stage 1 | Synthesis loss is severe — distinctive coined terms are being dropped. 
| Investigate ce-session-historian's synthesis prompt; consider tightening it to preserve verbatim terminology. Revise PR #838's claims accordingly. | +| Eval 1 or 2 passes Stage 1 but fails Stage 2 | Terms survive but rationale is lost. | Phase 2.4 will see terms but may not write good entries. Consider whether the wiring still delivers value, or whether the historian needs to preserve more context. | +| Eval 3 fails while 1 and 2 pass | Indexing gap — terms only retrievable when queried by name. | The Phase 2.4 wiring is decorative for the broad-topic use case. Reconsider whether to ship the session-search scan input, or change how Phase 2.4 queries ce-sessions. | +| High variance | Mechanism works but unreliably. | Multiple invocations within ce-compound's flow would help, or accept it as a best-effort enhancement rather than load-bearing. | +| Eval 4 fails | False-positive risk to vocabulary feed. | Tighten Phase 2.4 to score-rank findings before feeding them to the vocabulary scan, or accept that some noise enters the file. | diff --git a/plugins/compound-engineering/skills/ce-sessions/evals/evals.json b/plugins/compound-engineering/skills/ce-sessions/evals/evals.json new file mode 100644 index 000000000..addb02c6e --- /dev/null +++ b/plugins/compound-engineering/skills/ce-sessions/evals/evals.json @@ -0,0 +1,108 @@ +{ + "skill_name": "ce-sessions", + "purpose": "Validate that ce-sessions findings preserve enough terminology resolution context for downstream vocabulary capture (load-bearing assumption in PR #838 — ce-compound Phase 2.4 scans ce-sessions findings for qualifying domain terms).", + "non_purpose": "Not testing ce-sessions's general search quality or its ability to find sessions on arbitrary topics. 
The narrow assumption is about terminology-resolution preservation.", + "variance_protocol": { + "runs_per_eval": 3, + "stability_metric": "stddev of must-tier term recall across runs", + "pass_threshold": "must-tier recall >= 80% mean AND stddev < 20%" + }, + "grading_pipeline": { + "stage_1": "Programmatic substring match per criticality tier (must / should / may). Pass = all 'must' terms appear in findings.", + "stage_2": "LLM grader (see grader.md) — judges whether each 'expected_context' item is preserved WITH resolution rationale, not only as a keyword hit. Pass = all expected_context items receive 'preserved with context' verdict." + }, + "evals": [ + { + "id": 1, + "name": "synthesis-gate-recovery", + "tests_risk": "synthesis_loss", + "prompt": "What was the synthesis gate work in ce-plan about? I want to understand how it was designed and what problems it solved.", + "expected_terms": [ + {"term": "synthesis gate", "tier": "must"}, + {"term": "ce-plan", "tier": "must"}, + {"term": "Phase 0.7", "tier": "should"}, + {"term": "Phase 5.1.5", "tier": "should"}, + {"term": "Stated", "tier": "should"}, + {"term": "Inferred", "tier": "should"}, + {"term": "Out of scope", "tier": "should"}, + {"term": "call-outs", "tier": "may"}, + {"term": "synthesis-summary.md", "tier": "may"}, + {"term": "silent proceeding is not allowed", "tier": "may"} + ], + "expected_context": [ + "synthesis gate appears with its purpose (prevent silent proceed past synthesis without user check), not only as a keyword", + "Stated / Inferred / Out of scope appear as bucket categorization, not only as a phrase" + ], + "ground_truth": { + "primary_pr": 822, + "primary_merge_commit": "39cb9da3a1a90a7ce7418f7a64d7ff3c8f9a917c", + "related_prs": [819, 829], + "merged_at": "2026-05-15" + }, + "notes": "Distinctive coined term that should be near-impossible to ignore if ce-sessions touched the originating session. Failure here indicates strong synthesis loss." 
+ }, + { + "id": 2, + "name": "mode-headless-semantic-alignment", + "tests_risk": "synthesis_loss", + "prompt": "How was mode:headless aligned across the compound family of skills? Why was it added and what changed?", + "expected_terms": [ + {"term": "mode:headless", "tier": "must"}, + {"term": "ce-compound", "tier": "must"}, + {"term": "mode:autofix", "tier": "should"}, + {"term": "ce-compound-refresh", "tier": "should"}, + {"term": "sticky mode token", "tier": "should"}, + {"term": "Discoverability Check", "tier": "should"}, + {"term": "process exhaust", "tier": "may"}, + {"term": "audit content", "tier": "may"}, + {"term": "Compare per skill, not per mode", "tier": "may"}, + {"term": "Assumptions section", "tier": "may"} + ], + "expected_context": [ + "mode:autofix → mode:headless rename appears with reasoning (the compound family should speak the same word)", + "process exhaust vs audit content principle appears with the refined rule (compare per skill, not per mode — interactive ce-compound doesn't validate the same inferences headless skips)" + ], + "ground_truth": { + "primary_pr": 813, + "primary_merge_commit": "9b45a83d7ed2534669656fb3abf6a2c23e2e4f59", + "merged_at": "2026-05-10" + }, + "notes": "More nuanced than #1 — tests preservation of a multi-piece design decision (rename + cross-skill alignment + a principle refinement) rather than a single coined term." 
+ }, + { + "id": 3, + "name": "tangential-term-recovery", + "tests_risk": "indexing_gap", + "prompt": "I want to understand recent ce-plan workflow improvements — what's been done and why?", + "expected_terms": [ + {"term": "synthesis gate", "tier": "should"}, + {"term": "scoping synthesis", "tier": "may"}, + {"term": "Phase 0.7", "tier": "may"} + ], + "expected_context": [ + "synthesis gate surfaces despite the query being broader than the term itself — finding is not gated on the query containing 'synthesis' as a keyword" + ], + "ground_truth": { + "related_prs": [822, 819, 829] + }, + "notes": "Tests whether ce-sessions surfaces tangentially-relevant terminology when the query is broader than the terms themselves. The query intentionally does NOT mention 'synthesis' — if the term is only retrievable by querying its own name, ce-sessions has an indexing gap for the vocabulary-capture use case (because ce-compound's Phase 2.4 won't query by term, it'll query by topic)." + }, + { + "id": 4, + "name": "near-miss-false-positive", + "tests_risk": "false_positive", + "prompt": "Were there any sessions on running services in headless mode for CI/CD or headless server deployments?", + "expected_terms": [], + "must_not_contain_in_relevant_findings": [ + {"term": "mode:headless feature for ce-compound", "tier": "must_not"}, + {"term": "sticky mode token", "tier": "must_not"} + ], + "expected_response_shape": "either (a) 'no relevant sessions found' if the user has no sessions about server-headless topics, or (b) only sessions actually about CI/CD or server headless contexts — NOT the ce-compound mode:headless feature work", + "ground_truth": { + "anti_pr": 813, + "notes": "PR #813 is about the compound-engineering mode:headless feature, not about server deployments. Finding that PR's session as a relevant result for this query would be a false positive." + }, + "notes": "Weaker signal than #1-3. 
If the user has no sessions about server-headless deployments, ce-sessions correctly returns 'no relevant' and the test trivially passes. Test only has discriminating signal if relevant unrelated sessions exist. Run anyway — a trivially-passing test still confirms ce-sessions doesn't return the ce-compound PR work as 'relevant' to a CI/CD deployment query." + } + ] +} diff --git a/plugins/compound-engineering/skills/ce-sessions/evals/grader.md b/plugins/compound-engineering/skills/ce-sessions/evals/grader.md new file mode 100644 index 000000000..199fd2c9b --- /dev/null +++ b/plugins/compound-engineering/skills/ce-sessions/evals/grader.md @@ -0,0 +1,114 @@ +# ce-sessions terminology-preservation grader + +This grader evaluates whether ce-sessions findings preserve enough terminology resolution context to make downstream vocabulary capture (ce-compound Phase 2.4) work. It is NOT a general quality grader for ce-sessions; the narrow question is "would Phase 2.4 be able to extract qualifying domain terms from these findings?" + +## Inputs to the grader + +For each eval run, the grader receives: + +1. **The eval definition** from `evals.json` (terms, tiers, expected_context, notes). +2. **The findings text** that ce-sessions returned to the orchestrating agent. +3. **(Optional) The full agent transcript** for the ce-sessions invocation, if available — useful for distinguishing "ce-sessions returned this and the agent paraphrased it" from "ce-sessions returned this verbatim." + +## Two-stage grading + +### Stage 1 — Programmatic term recall (substring match) + +For each entry in `expected_terms`: +- Score 1 if the term (case-insensitive, substring match) appears anywhere in the findings text. +- Score 0 otherwise. 
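The per-term scoring rule above, grouped by the `tier` field from `evals.json`, can be sketched in a few lines of Python. This is a minimal illustration, not the grader's actual implementation: the function name `stage_1_recall` is hypothetical, and treating an eval with no must-tier terms as a vacuous pass is an assumption the rubric does not state.

```python
from collections import defaultdict


def stage_1_recall(expected_terms, findings_text):
    """Per-tier recall via case-insensitive substring match (hypothetical helper)."""
    haystack = findings_text.lower()
    hits = defaultdict(int)
    totals = defaultdict(int)
    for entry in expected_terms:
        tier = entry["tier"]
        totals[tier] += 1
        # Score 1 if the term appears anywhere in the findings, 0 otherwise.
        if entry["term"].lower() in haystack:
            hits[tier] += 1
    recall = {tier: hits[tier] / totals[tier] for tier in totals}
    # Assumption: with no must-tier terms, "every must-tier term appears" holds vacuously.
    passed = recall.get("must", 1.0) == 1.0
    return recall, passed
```

Plain substring matching keeps Stage 1 cheap and deterministic, which is why it cannot judge whether the rationale around a term survived; that is left to Stage 2.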
+ +Aggregate by tier: +- `must_recall` = (count of must-tier terms scored 1) / (total must-tier terms) +- `should_recall` = (count of should-tier terms scored 1) / (total should-tier terms) +- `may_recall` = (count of may-tier terms scored 1) / (total may-tier terms) + +**Stage 1 pass criterion:** `must_recall == 1.0` (every must-tier term appears). + +If Stage 1 fails, ce-sessions is dropping the most distinctive coined terms — synthesis loss is severe and Stage 2 is moot. Record the failure and stop. + +### Stage 2 — Context preservation (LLM-graded) + +For each entry in `expected_context`: + +Read the findings text. Decide whether the expected context item is **preserved with rationale** or **mentioned without context**. Apply this rubric: + +- **`preserved` (1.0)** — the finding text discusses the term AND its meaning, role, or the reasoning behind it. Example: "synthesis gate was introduced to prevent ce-plan from silently proceeding past synthesis without showing the user a Stated/Inferred/Out of scope summary." +- **`keyword_only` (0.0)** — the finding mentions the term but in a way that doesn't convey why it matters or what it means. Example: "the user worked on the synthesis gate." +- **`absent` (0.0)** — the term doesn't appear in the relevant section at all. + +**Stage 2 pass criterion:** every entry in `expected_context` scores `preserved`. + +For eval id #4 (near-miss-false-positive), Stage 2 instead checks `must_not_contain_in_relevant_findings`: +- For each `must_not` entry, search the findings. +- If the entry appears **as a relevant result** (not, e.g., as a "not relevant — different context" caveat), Stage 2 fails. +- "Not relevant" mentions are fine; surfacing the ce-compound feature PR work as if it answered a CI/CD deployment query is the failure mode. + +## Aggregating across runs (variance) + +For each eval, run the prompt N times (default 3 from `variance_protocol.runs_per_eval`). 
+ +Per run, capture: +- `must_recall`, `should_recall`, `may_recall` from Stage 1 +- `context_preservation_rate` from Stage 2 (count preserved / count expected_context) +- `stage_1_pass` (bool), `stage_2_pass` (bool) + +Per eval, compute: +- `mean_must_recall`, `stddev_must_recall` +- `mean_context_preservation`, `stddev_context_preservation` +- `runs_passed` (count where both stage_1_pass and stage_2_pass were true) + +**Eval-level pass criteria:** +- `mean_must_recall >= 0.80` +- `stddev_must_recall < 0.20` +- `runs_passed >= 2 of 3` (or proportionally for higher N) + +## Outputs + +Write per-run grades to `/iteration-N/eval-/grading.json`: + +```json +{ + "eval_id": 1, + "eval_name": "synthesis-gate-recovery", + "run_index": 0, + "stage_1": { + "must_recall": 1.0, + "should_recall": 1.0, + "may_recall": 0.33, + "passed": true, + "matched_terms_by_tier": { + "must": ["synthesis gate", "ce-plan"], + "should": ["Phase 0.7", "Stated", "Inferred", "Out of scope", "Phase 5.1.5"], + "may": ["synthesis-summary.md"] + }, + "missed_terms_by_tier": { + "should": [], + "may": ["call-outs", "silent proceeding is not allowed"] + } + }, + "stage_2": { + "context_results": [ + {"item": "synthesis gate purpose preserved", "verdict": "preserved", "evidence": ""}, + {"item": "Stated/Inferred/Out of scope as buckets", "verdict": "keyword_only", "evidence": ""} + ], + "context_preservation_rate": 0.5, + "passed": false + }, + "overall_passed": false +} +``` + +Then aggregate across runs to a per-eval summary at `/iteration-N/eval-/summary.json`.
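The cross-run aggregation and the eval-level pass criteria above can be sketched as follows. This is an illustration under stated assumptions rather than the grader's own code: `aggregate_eval` and the `eval_passed` key are hypothetical names, population standard deviation (`pstdev`) is assumed because the rubric does not say which estimator to use, and "proportionally for higher N" is read as requiring at least two thirds of runs, rounded up.

```python
from statistics import mean, pstdev


def aggregate_eval(run_grades):
    """Aggregate per-run grading dicts into a per-eval summary (hypothetical helper)."""
    n = len(run_grades)
    must = [g["stage_1"]["must_recall"] for g in run_grades]
    # Runs that stopped after a Stage 1 failure may carry no stage_2 block.
    ctx = [
        g["stage_2"]["context_preservation_rate"]
        for g in run_grades
        if "stage_2" in g
    ]
    runs_passed = sum(
        1
        for g in run_grades
        if g["stage_1"]["passed"] and g.get("stage_2", {}).get("passed", False)
    )
    # Assumption: "2 of 3, or proportionally" means ceil(2N / 3) runs.
    required = (2 * n + 2) // 3
    summary = {
        "mean_must_recall": mean(must),
        "stddev_must_recall": pstdev(must),
        "mean_context_preservation": mean(ctx) if ctx else None,
        "stddev_context_preservation": pstdev(ctx) if ctx else None,
        "runs_passed": runs_passed,
    }
    summary["eval_passed"] = (
        summary["mean_must_recall"] >= 0.80
        and summary["stddev_must_recall"] < 0.20
        and runs_passed >= required
    )
    return summary
```

With three runs whose must-tier recall is 1.0, 1.0, and 0.5, the mean clears 0.80 but the spread does not stay under 0.20, so the eval fails on variance even though two of three runs passed.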
+ +## Surfacing the three risks separately + +The eval design separates signal so a failure points at one risk: + +| Risk | Signal | Where it surfaces | +|------|--------|-------------------| +| Synthesis loss (distinctive terms dropped) | Stage 1 must-tier fails on eval #1 or #2 | grading.json `stage_1.must_recall < 1.0` | +| Synthesis loss (nuance lost, term kept) | Stage 1 passes, Stage 2 fails on eval #1 or #2 | grading.json `stage_1.passed: true, stage_2.passed: false` | +| Indexing gap (tangential terminology not surfaced) | Eval #3 fails Stage 1 should-tier | grading.json eval-3 `should_recall == 0` despite related sessions existing | +| Variance | Same eval passes on some runs, fails on others | summary.json `stddev_must_recall >= 0.20` or `runs_passed < N` | +| False positive | Eval #4 surfaces the ce-compound mode:headless work as relevant to CI/CD deployment query | grading.json eval-4 `stage_2.passed: false` |
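
The per-run rows of the table above can be read as a small decision procedure. The sketch below is hypothetical: the function name `attribute_risk` and the returned labels are illustrative, and it assumes a per-run grade shaped like the `grading.json` example. The variance row is left out because it depends on `summary.json` rather than on a single run.

```python
def attribute_risk(grade):
    """Map one per-run grade to the risk it signals, or None (hypothetical helper)."""
    eval_id = grade["eval_id"]
    s1 = grade.get("stage_1", {})
    s2 = grade.get("stage_2", {})
    if eval_id in (1, 2):
        if s1.get("must_recall", 0.0) < 1.0:
            # Distinctive coined terms were dropped entirely.
            return "synthesis_loss_terms_dropped"
        if s1.get("passed") and s2.get("passed") is False:
            # Terms survived but the rationale around them did not.
            return "synthesis_loss_nuance_lost"
    elif eval_id == 3:
        if s1.get("should_recall") == 0:
            return "indexing_gap"
    elif eval_id == 4:
        if s2.get("passed") is False:
            return "false_positive"
    return None
```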