8 changes: 8 additions & 0 deletions README.md
@@ -57,6 +57,14 @@ Each cycle compounds: brainstorms sharpen plans, plans inform future plans, revi
/plugin install compound-engineering
```

After installing, restart Claude and sync the reviewer personas:

```
/ce:refresh
```

This downloads reviewer persona files from the configured source repos. See [Reviewer Personas](plugins/compound-engineering/README.md#reviewer-personas) for how to customize your review team.

### Cursor

```text
168 changes: 168 additions & 0 deletions docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md

Large diffs are not rendered by default.

348 changes: 348 additions & 0 deletions docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md

Large diffs are not rendered by default.

107 changes: 107 additions & 0 deletions docs/solutions/2026-05-07-agent-tool-depth-2-spike.md
@@ -0,0 +1,107 @@
# Agent Tool Depth-2 Dispatch Spike — Findings

**Date:** 2026-05-07
**Plan reference:** [docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md](../plans/2026-05-07-001-feat-erin-phase-isolation-plan.md), Unit 1
**Harness:** [tests/spikes/depth-2-dispatch.md](../../tests/spikes/depth-2-dispatch.md)
**Verdict:** ⚠️ **CONDITIONAL: direct depth-2 Agent dispatch is not supported, but two workarounds were found to sidestep the constraint.**

## Summary

The spike was the hard gate for the Erin phase-isolation feature. It tested whether Claude Code's `Agent` tool supports depth-2 subagent dispatch — i.e., whether a subagent spawned via `Agent` from the main thread can itself spawn further subagents via `Agent`. **It does not.** Subagents in Claude Code receive `Bash`, `Read`, `Edit`, `Write`, `Skill`, and several utility tools, but **not** the `Agent` tool. They have no mechanism to dispatch sub-subagents.

The plan's entire wrapping architecture rests on the assumption that depth-2 dispatch works: Erin (main thread, depth-0) would dispatch `/ce:work` as a subagent (depth-1), and `/ce:work` would then spawn its internal reviewer panel via `Agent` (depth-2). Since depth-2 dispatch is unavailable, **the architecture as planned cannot ship.** All subsequent units (ce-run hook, erin.md update, dogfood) are gated on this finding and do not proceed.

The spike did its job: it surfaced the architectural blocker cheaply, before any of the dependent code was written. This is exactly the failure mode the hard-gate sequencing was designed to catch.

## What was tested

**Form chosen:** Manual reproducible scenario (markdown harness in `tests/spikes/depth-2-dispatch.md`), executed inline from a main-thread Claude Code session via 5 parallel Agent dispatches. A `claude -p` Bash harness was not pursued because an interactive main-thread session can attempt depth-2 dispatch directly, and whether the platform allows that is the actual question. The markdown form is durable: re-running it after any future Claude Code update is a paste-and-go operation.

**Synthetic task:** Each trial coordinator was instructed to spawn two parallel Sonnet leaf subagents via `Agent`. Each leaf would run a single `Bash` command counting markdown files in a deterministic directory (orchestrators or reviewers in `ce-reviewers-jsl`). Coordinator would sum the counts and return.
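
As a rough illustration of the leaf task and the coordinator's sum, here is a minimal Python sketch. It is not the command the spike ran (the leaves used a single `Bash` command that this document does not reproduce). The non-recursive count, the function names, and the throwaway fixture tree are all assumptions made so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def count_markdown(directory: Path) -> int:
    """Count the .md files directly inside `directory` (non-recursive, by assumption)."""
    return sum(1 for p in directory.iterdir() if p.is_file() and p.suffix == ".md")


def run_trial(root: Path) -> dict:
    """Mimic one coordinator: two leaf counts, then a sum."""
    orchestrators = count_markdown(root / "orchestrators")
    reviewers = count_markdown(root / "reviewers")
    return {
        "orchestrators": orchestrators,
        "reviewers": reviewers,
        "total": orchestrators + reviewers,
    }


def make_fixture(root: Path, orchestrators: int, reviewers: int) -> None:
    """Build a stand-in directory tree; this is NOT the real ce-reviewers-jsl repo."""
    for name, n in (("orchestrators", orchestrators), ("reviewers", reviewers)):
        (root / name).mkdir(parents=True)
        for i in range(n):
            (root / name / f"persona-{i}.md").write_text("# stub\n")
    # A non-markdown file that the count must ignore.
    (root / "reviewers" / "notes.txt").write_text("not counted\n")


with tempfile.TemporaryDirectory() as tmp:
    make_fixture(Path(tmp), orchestrators=10, reviewers=14)
    print(run_trial(Path(tmp)))
```

The fixture sizes mirror the totals reported in the results table only so the printed shape is familiar; the point of the spike is whether the two counts can be dispatched to sub-subagents, not the arithmetic.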

**Trials run:** 5 (in parallel from main thread).

## Results

| Trial | Coordinator type | Outcome | Reason |
|-------|------------------|---------|--------|
| 1 | Opus general-purpose subagent | ❌ Could not dispatch | Reported "Agent tool is not available in this environment - it's not in the deferred tools list and ToolSearch found no match" |
| 2 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fell back to direct Bash; reported orchestrators=10, reviewers=14, total=24 |
| 3 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |
| 4 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |
| 5 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |

**0 of 5 trials achieved depth-2 dispatch.** All 5 coordinators independently and consistently reported the same finding: their tool environment does not include `Agent`, even though the platform's general-purpose subagent description claims `(Tools: *)`. The wildcard description is misleading; in practice `Agent` is excluded from subagent contexts.

Four of the five coordinators (Trials 2 through 5) recovered gracefully by computing the leaf task directly via `Bash`. That is the wrong primitive for the spike's intent, but it at least proves leaf-task competence. All four returned the same correct totals (`orchestrators=10 reviewers=14 total=24`), which shows that leaf-level work is not the failure mode; only the depth-2 dispatch capability is.

## Pass/fail per gate criterion

- **(a) Main-thread token isolation:** Cannot be measured — there was no successful depth-2 dispatch to measure isolation against.
- **(b) Streaming fidelity:** Cannot be observed at depth-2 — the depth-2 layer doesn't exist.
- **(c) Reliability (5/5 success):** **FAIL.** 0 of 5 trials succeeded.

Per the plan: "If any of (a), (b), or (c) fail at thresholds OR sub-step (1) yields no measurement primitive, halt the rest of the plan and surface to Jeff with the findings doc." (c) failed unambiguously.
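
The halt rule quoted above can be sketched as a small decision function. This is an illustration only: the function name, the dictionary keys, and the choice to treat an unmeasurable criterion (`None`) the same as a failed one are assumptions, not something the plan specifies in this form.

```python
from typing import Optional


def evaluate_gate(
    trial_successes: list[bool],
    isolation_ok: Optional[bool],
    streaming_ok: Optional[bool],
) -> dict:
    """Return per-criterion status and whether the plan must halt.

    None means "could not be measured"; this sketch counts that as not passed.
    """
    # Criterion (c): every trial must succeed (5/5 in the spike).
    reliability_ok = len(trial_successes) > 0 and all(trial_successes)
    criteria = {
        "a_token_isolation": isolation_ok,
        "b_streaming_fidelity": streaming_ok,
        "c_reliability": reliability_ok,
    }
    # Halt if any criterion is anything other than a clear pass.
    halt = any(status is not True for status in criteria.values())
    return {"criteria": criteria, "halt": halt}


# The spike's outcome: 0 of 5 dispatches succeeded; (a) and (b) were unmeasurable.
result = evaluate_gate([False] * 5, isolation_ok=None, streaming_ok=None)
print(result["halt"])  # True
```

With these inputs the gate halts on (c) alone, which matches the finding that a single failed criterion is enough to stop the plan.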

## Why the spike caught what design review couldn't

The earlier adversarial-document review (F1, HIGH 0.85 confidence) flagged exactly this risk: *"The foundational Agent tool assumption is asserted, not verified... CLAUDE.md flags subagent-spawning-subagents as a known failure mode."* The plan responded by making R1 a hard gate with explicit numeric thresholds — but the gate's value turns out to be even simpler than calibrating thresholds: just running the dispatch once was enough. The platform doesn't permit it.

The 0-of-5 result is a binary platform constraint, not a calibration question. No amount of threshold-tuning, harness sophistication, or persona-prose precision would have changed the answer.

## Implications for the feature

1. **The current architecture cannot ship.** Wrapping `/ce:work` via `Agent` would create a subagent that cannot run `/ce:work`'s internal reviewer-panel dispatches (those are also `Agent` calls). The wrapped phase would either error or produce broken output.
2. **The plan's Units 2–4 do not execute.** ce-run hook, erin.md update, and dogfood all depended on a working depth-2 dispatch. Halted.
3. **Some narrower designs may still be viable** (see "Possible paths forward" below). The spike falsifies the *original* hypothesis but doesn't rule out other approaches to the same user pain.

## Workarounds verified after the initial fail

### Workaround A: Inline-only wrapped `/ce:work` ✅ Viable for v1

The platform constraint is "subagents cannot call Agent." But `/ce:work` doesn't *intrinsically* need Agent — most of its work is Bash + Read + Edit + Write, all of which subagents have. The Agent calls inside `/ce:work` are:

1. **"Choose Execution Strategy: Subagents"** — an *optional* optimization for large parallel-able plans. Default is inline.
2. **Invoking `/ce:review`** — but `/ce:review` is a *separate Erin phase* that runs after `work`, not inside it.

If `/ce:work-wrapped` is constrained to **inline execution only** AND **does not invoke `/ce:review` or other Agent-dispatching skills internally**, the depth-2 constraint never applies. Erin dispatches `/ce:work-wrapped` at depth-1 (works), and `/ce:work-wrapped` does its work using only the tools available to subagents.
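
One way to picture the constraint is as a set-difference check between the tools a phase needs and the tools a depth-1 subagent was observed to have. The sketch below is hypothetical: the tool set comes from this document's Summary (the unnamed "utility tools" are omitted), and the per-phase tool lists are illustrative rather than read from any skill definition.

```python
# Tools the spike observed inside subagents (unnamed utility tools omitted).
SUBAGENT_TOOLS = frozenset({"Bash", "Read", "Edit", "Write", "Skill"})


def blocked_tools(required: set[str]) -> set[str]:
    """Return the tools a phase needs that a depth-1 subagent lacks."""
    return set(required) - SUBAGENT_TOOLS


def can_wrap(required: set[str]) -> bool:
    """A phase can run at depth-1 only if none of its tools are blocked."""
    return not blocked_tools(required)


# Illustrative tool requirements, not taken from the real skill files.
inline_work = {"Bash", "Read", "Edit", "Write"}
fanout_work = inline_work | {"Agent"}

print(can_wrap(inline_work))       # True
print(blocked_tools(fanout_work))  # {'Agent'}
```

A static check like this cannot see indirect dispatch: a wrapped phase that calls an Agent-dispatching skill through `Skill` would pass the check and still hit the constraint, which is why Workaround A also forbids invoking such skills internally.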

Trade-off: large plans that would have benefited from parallel subagent execution lose that strategy when wrapped. Acceptable for v1; if a real plan needs it, run unwrapped or use Workaround B.

### Workaround B: Subprocess via `claude -p` ✅ Verified working

**Test setup:** Main thread → Agent subagent (no Agent tool, confirmed) → Bash invocation of `claude -p --dangerously-skip-permissions "..."` → fresh top-level Claude session (HAS Agent) → 2 parallel Sonnet sub-subagents performing the same file-counting task.

**Result:** Subprocess completed cleanly, exit code 0, ~16s wall-clock. SUBPROCESS_RESULT line returned correct counts (orchestrators=10, reviewers=14, total=24). No permission prompts. No errors.

**Mechanism:** The platform's depth-2 ban applies within a single Claude Code process. Spawning a fresh `claude -p` process resets the depth count from that subprocess's perspective. The subprocess can then dispatch its own Agent subagents normally.
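
A minimal sketch of how a depth-1 subagent's wrapper might launch and read such a subprocess, assuming Python's standard `subprocess` module. The `claude -p --dangerously-skip-permissions` invocation is the one used in the test setup; the exact format of the `SUBPROCESS_RESULT` line, the function names, and the timeout are assumptions for illustration.

```python
import re
import subprocess


def build_command(prompt: str) -> list[str]:
    """Argument list for a fresh top-level session, as used in the spike."""
    return ["claude", "-p", "--dangerously-skip-permissions", prompt]


# Hypothetical line format; the spike only records that a SUBPROCESS_RESULT
# line came back with the three counts.
RESULT_RE = re.compile(
    r"^SUBPROCESS_RESULT\s+orchestrators=(\d+)\s+reviewers=(\d+)\s+total=(\d+)\s*$",
    re.MULTILINE,
)


def parse_result(stdout: str) -> dict:
    """Extract the counts from captured output and sanity-check the sum."""
    match = RESULT_RE.search(stdout)
    if match is None:
        raise ValueError("no SUBPROCESS_RESULT line in subprocess output")
    orchestrators, reviewers, total = (int(g) for g in match.groups())
    if orchestrators + reviewers != total:
        raise ValueError("reported total does not match the leaf counts")
    return {"orchestrators": orchestrators, "reviewers": reviewers, "total": total}


def run_wrapped(prompt: str, timeout_s: float = 300.0) -> dict:
    """Run the subprocess to completion, then parse its captured stdout."""
    done = subprocess.run(
        build_command(prompt),
        capture_output=True,  # output is only available after the process exits
        text=True,
        timeout=timeout_s,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return parse_result(done.stdout)
```

Because `subprocess.run` returns only after the child exits, the parent sees the whole transcript as one block rather than as a live stream.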

**Trade-offs:**
- Latency overhead ~5-15s per subprocess (process startup, fresh session). Real fan-outs will be dominated by subagent work itself, so the relative cost is small.
- **Streaming fidelity regression.** The subprocess's tool calls return as a single block to the parent's Bash output rather than streaming during execution. The user terminal would see "subprocess running..." rather than live tool calls during the wrapped phase. This is a real UX regression vs. depth-1 streaming.
- Each subprocess is a fresh API session — system prompt re-cached, CLAUDE.md re-loaded. Modest cost.
- Permission model: `--dangerously-skip-permissions` was used. For wrapped phases inside an already-authorized workflow, this is probably fine, but worth flagging.
- Authentication inherits from environment (Anthropic API key, OAuth, etc.). Worked transparently.
- Recursion: nothing prevents subprocess-of-subprocess, but each level adds overhead.

**When to use B over A:** A is sufficient for v1's "wrap `/ce:work`" use case. B is the escape hatch when future wrapped phases genuinely need internal Agent fan-out (e.g., wrapping `/ce:review` or any phase that internally dispatches reviewer panels).

## Possible paths forward (for Jeff to decide)

These are sketches, not commitments. Each needs its own brainstorm/plan if pursued.

1. **Ship the run-state.md half only.** The brainstorm explicitly noted that pain #2 (loss of workflow thread across `/compact`) is fully solved by Erin writing a small persistent file — no subagent isolation needed. That half doesn't depend on depth-2 dispatch and could ship for all orchestrators in days. Pain #1 (context bloat in `/ce:work`) remains unaddressed.
2. **Wrap `/ce:work` only at the top level (no internal reviewer panels).** If `/ce:work` could be restructured so its internal Agent dispatches happen *outside* the wrapped boundary, depth-1 wrapping would be sufficient. This is a substantial `/ce:work` redesign with unclear feasibility.
3. **Trim `/ce:work`'s in-thread footprint instead of wrapping it.** Marty's plan-review point: the cheapest fix may be upstream — restructure `/ce:work` to produce less main-thread chatter (e.g., write tool transcripts to disk rather than echoing them, summarize rather than log). Doesn't require depth-2 dispatch.
4. **Accept the constraint and use external session boundaries.** Jeff already has the option of running `/ce:work` in a dedicated Claude Code session (or worktree). Tooling around starting/resuming such sessions cleanly would address the same pain without changing dispatch semantics.
5. **Petition the platform.** If depth-2 dispatch is a deliberate Claude Code constraint, working around it may be wrong; if it's an oversight, requesting the capability through proper channels may unblock the original architecture cleanly.

## Recommended next step

**Halt this plan.** Update `docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md` status to `blocked` (or close it as `failed`) with a pointer to this findings doc. Decide which of the paths above to pursue and start a new brainstorm for that direction. Do not attempt Units 2–4 of the current plan.

## Artifact integrity notes

- Trial output is deterministic on this codebase as of 2026-05-07: `orchestrators=10, reviewers=14, total=24`. If a future re-run of the harness produces different numbers, the codebase has changed (which is fine) — what matters is whether the *dispatch* succeeds, not the totals.
- The harness doc (`tests/spikes/depth-2-dispatch.md`) remains useful as a regression test: any future Claude Code update that DOES enable depth-2 dispatch will produce 5/5 successful trials when the harness is re-run, signaling that the architectural blocker has lifted.
- This findings doc is the canonical record of the spike's outcome. It is not superseded by future re-runs unless explicitly revised.
2 changes: 2 additions & 0 deletions mise.toml
@@ -0,0 +1,2 @@
[tools]
bun = "latest"
7 changes: 5 additions & 2 deletions plugins/compound-engineering/AGENTS.md
@@ -37,7 +37,9 @@ agents/
├── document-review/ # Plan and requirements document review agents
├── research/ # Research and analysis agents
├── design/ # Design and UI agents
└── docs/ # Documentation agents
├── docs/ # Documentation agents
├── user/ # User personas for scenario-based feature evaluation
└── workflow/ # Workflow agents

skills/
├── ce-*/ # Core workflow skills (ce:plan, ce:review, etc.)
@@ -157,7 +159,8 @@ grep -E '^description:' skills/*/SKILL.md
## Adding Components

- **New skill:** Create `skills/<name>/SKILL.md` with required YAML frontmatter (`name`, `description`). Reference files go in `skills/<name>/references/`. Add the skill to the appropriate category table in `README.md` and update the skill count.
- **New agent:** Create `agents/<category>/<name>.md` with frontmatter. Categories: `review`, `document-review`, `research`, `design`, `docs`, `workflow`. Add the agent to `README.md` and update the agent count.
- **New agent:** Create `agents/<category>/<name>.md` with frontmatter. Categories: `review`, `document-review`, `research`, `design`, `docs`, `user`, `workflow`. Add the agent to `README.md` and update the agent count.
- **New user persona:** User personas use `type: user-persona` in frontmatter with a `traits` section (pace, tech-comfort, frustration-trigger, usage-pattern). They produce narrative evaluations, not JSON findings. User personas are synced from external repos into `agents/user/` via `/ce:refresh`, just like reviewers. They are invoked by the `ce:user-scenarios` skill, not by `ce:review`.

## Upstream-Sourced Skills

92 changes: 60 additions & 32 deletions plugins/compound-engineering/README.md
@@ -92,41 +92,69 @@ The primary entry points for engineering work, invoked as slash commands:
| `/lfg` | Full autonomous engineering workflow |
| `/slfg` | Full autonomous workflow with swarm mode for parallel execution |

## Agents
## Reviewer Personas

Agents are specialized subagents invoked by skills — you typically don't call these directly.
Reviewer personas are **pluggable** — they live in external Git repos and are synced into the plugin via `/ce:refresh`. This lets you customize your review team without forking the plugin.

### Review
### Setup

| Agent | Description |
|-------|-------------|
| `agent-native-reviewer` | Verify features are agent-native (action + context parity) |
| `api-contract-reviewer` | Detect breaking API contract changes |
| `cli-agent-readiness-reviewer` | Evaluate CLI agent-friendliness against 7 core principles |
| `cli-readiness-reviewer` | CLI agent-readiness persona for ce:review (conditional, structured JSON) |
| `architecture-strategist` | Analyze architectural decisions and compliance |
| `code-simplicity-reviewer` | Final pass for simplicity and minimalism |
| `correctness-reviewer` | Logic errors, edge cases, state bugs |
| `data-integrity-guardian` | Database migrations and data integrity |
| `data-migration-expert` | Validate ID mappings match production, check for swapped values |
| `data-migrations-reviewer` | Migration safety with confidence calibration |
| `deployment-verification-agent` | Create Go/No-Go deployment checklists for risky data changes |
| `dhh-rails-reviewer` | Rails review from DHH's perspective |
| `julik-frontend-races-reviewer` | Review JavaScript/Stimulus code for race conditions |
| `kieran-rails-reviewer` | Rails code review with strict conventions |
| `kieran-python-reviewer` | Python code review with strict conventions |
| `kieran-typescript-reviewer` | TypeScript code review with strict conventions |
| `maintainability-reviewer` | Coupling, complexity, naming, dead code |
| `pattern-recognition-specialist` | Analyze code for patterns and anti-patterns |
| `performance-oracle` | Performance analysis and optimization |
| `performance-reviewer` | Runtime performance with confidence calibration |
| `reliability-reviewer` | Production reliability and failure modes |
| `schema-drift-detector` | Detect unrelated schema.rb changes in PRs |
| `security-reviewer` | Exploitable vulnerabilities with confidence calibration |
| `security-sentinel` | Security audits and vulnerability assessments |
| `testing-reviewer` | Test coverage gaps, weak assertions |
| `project-standards-reviewer` | CLAUDE.md and AGENTS.md compliance |
| `adversarial-reviewer` | Construct failure scenarios to break implementations across component boundaries |
```text
/ce:refresh
```

On first run, this creates `~/.config/compound-engineering/reviewer-sources.yaml` with a default source and syncs all reviewer files. Run it again anytime to pull updates.

### How it works

- Each reviewer is a self-contained `.md` file with frontmatter defining its `category` (always-on, conditional, stack, etc.) and `select_when` criteria
- The orchestrator reads frontmatter to decide which reviewers to spawn for a given diff
- A `_template-reviewer.md` ships with the plugin as a starting point for writing your own

### Configuring sources

Edit `~/.config/compound-engineering/reviewer-sources.yaml`:

```yaml
sources:
  # Your reviewers (higher priority -- listed first)
  - name: my-team
    repo: myorg/our-reviewers
    branch: main
    path: .

  # Default reviewers
  - name: ce-default
    repo: JumpstartLab/ce-reviewers
    branch: main
    path: .
    except:
      - kieran-python-reviewer
```

- **Sources listed first win** on filename conflicts
- **`except`** skips specific reviewers from a source
- **`branch`** lets one repo host multiple reviewer sets
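
The precedence rules above can be sketched in a few lines of Python. This is a hypothetical model, not the `/ce:refresh` implementation: the `Source` class, the `resolve` function, and the example file lists are invented for illustration, and matching `except` entries against the filename without its `.md` suffix is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    name: str
    files: list[str]  # reviewer filenames found in the source repo
    except_: set[str] = field(default_factory=set)  # reviewer names to skip


def resolve(sources: list[Source]) -> dict[str, str]:
    """Map each reviewer filename to the name of the source that supplies it."""
    chosen: dict[str, str] = {}
    for source in sources:  # sources are in priority order, highest first
        for filename in source.files:
            reviewer = filename.removesuffix(".md")
            if reviewer in source.except_:
                continue  # skipped via `except`
            # setdefault keeps the first (highest-priority) source on conflicts.
            chosen.setdefault(filename, source.name)
    return chosen


# Illustrative file lists; the real repos may contain different reviewers.
sources = [
    Source("my-team", ["security-reviewer.md"]),
    Source(
        "ce-default",
        ["security-reviewer.md", "correctness-reviewer.md", "kieran-python-reviewer.md"],
        except_={"kieran-python-reviewer"},
    ),
]
print(resolve(sources))
```

In this example `security-reviewer.md` comes from `my-team` because it is listed first, `correctness-reviewer.md` falls through to `ce-default`, and `kieran-python-reviewer.md` is not synced at all.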

### Creating a custom reviewer

1. Copy `_template-reviewer.md` from `agents/review/`
2. Fill in the persona, hunting targets, confidence calibration, and output format
3. Set `category` and `select_when` in frontmatter
4. Add to your reviewer repo and run `/ce:refresh`

### Categories

| Category | When spawned |
|----------|-------------|
| `always-on` | Every review |
| `conditional` | When the diff touches the reviewer's domain |
| `stack` | Like conditional, scoped to a language/framework |
| `plan-review` | During plan review phases |
| `synthesis` | After other reviewers, to merge findings |
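
To make the table concrete, here is a hedged sketch of how an orchestrator could pick reviewers for a code review. The real `select_when` format is not shown in this document, so treating it as glob patterns over changed paths is purely an assumption, as are the class, the function names, and the category assigned to each example reviewer.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class Reviewer:
    name: str
    category: str
    select_when: tuple[str, ...] = ()  # hypothetical: globs over changed paths


def touches(reviewer: Reviewer, changed_paths: list[str]) -> bool:
    """True if any changed path matches any of the reviewer's patterns."""
    return any(
        fnmatch(path, pattern)
        for path in changed_paths
        for pattern in reviewer.select_when
    )


def select_for_code_review(reviewers: list[Reviewer], changed_paths: list[str]):
    """Return (panel, synthesizers) for a code review of the given diff."""
    panel, synthesizers = [], []
    for r in reviewers:
        if r.category == "always-on":
            panel.append(r.name)
        elif r.category in ("conditional", "stack") and touches(r, changed_paths):
            panel.append(r.name)
        elif r.category == "synthesis":
            synthesizers.append(r.name)  # runs after the panel
        # plan-review reviewers are left for plan review phases
    return panel, synthesizers


# Categories and patterns below are assigned for illustration only.
REVIEWERS = [
    Reviewer("correctness-reviewer", "always-on"),
    Reviewer("data-migrations-reviewer", "conditional", ("db/migrate/*",)),
    Reviewer("kieran-python-reviewer", "stack", ("*.py",)),
    Reviewer("plan-critic", "plan-review"),
    Reviewer("findings-merger", "synthesis"),
]
CHANGED = ["db/migrate/001_add_users.rb", "app/models/user.rb"]

print(select_for_code_review(REVIEWERS, CHANGED))
```

For this diff the panel is the always-on reviewer plus the migration reviewer; the Python stack reviewer is skipped because no `.py` file changed.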

## Agents

Agents are specialized subagents invoked by skills — you typically don't call these directly.

### Document Review

6 changes: 6 additions & 0 deletions plugins/compound-engineering/agents/review/.gitignore
@@ -0,0 +1,6 @@
# Reviewer persona files are synced from external repos via /ce:refresh
# They should not be committed to this plugin repo
*.md

# Template reviewer ships with the plugin as an example and test fixture
!_template-reviewer.md
Empty file.
43 changes: 43 additions & 0 deletions plugins/compound-engineering/agents/review/_template-reviewer.md
@@ -0,0 +1,43 @@
---
name: template-reviewer
description: Template reviewer persona — copy this file to create your own. Not spawned during real reviews.
model: inherit
tools: Read, Grep, Glob, Bash
color: gray
---

# Template Reviewer

You are [Name], a reviewer focused on [domain]. You bring [perspective] to code reviews.

## What you're hunting for

- **[Category 1]** -- describe what patterns, risks, or smells you look for
- **[Category 2]** -- another area of focus
- **[Category 3]** -- a third dimension of review

## Confidence calibration

Your confidence should be **high (0.80+)** when you can point to a concrete defect, regression, or violation with evidence in the diff.

Your confidence should be **moderate (0.60-0.79)** when the issue is real but partly judgment-based — the right answer depends on context beyond the diff.

Your confidence should be **low (below 0.60)** when the criticism is mostly stylistic or speculative. Suppress these.
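
For readers wiring this calibration into tooling, the bands can be expressed as a small helper. This is a sketch under assumptions: the findings schema is not shown here, so the `confidence` key and the function names are hypothetical, and values between 0.79 and 0.80 are treated as moderate.

```python
def confidence_band(confidence: float) -> str:
    """Classify a confidence score into the template's three bands."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be within [0, 1], got {confidence}")
    if confidence >= 0.80:
        return "high"
    if confidence >= 0.60:
        return "moderate"
    return "low"


def reportable(findings: list[dict]) -> list[dict]:
    """Drop low-confidence findings, which the template says to suppress."""
    return [f for f in findings if confidence_band(f["confidence"]) != "low"]


# Illustrative findings; the `title` and `confidence` keys are assumed.
findings = [
    {"title": "nil dereference in handler", "confidence": 0.85},
    {"title": "naming could be clearer", "confidence": 0.40},
    {"title": "retry policy may be too eager", "confidence": 0.65},
]
print([f["title"] for f in reportable(findings)])
```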

## What you don't flag

- **[Exception 1]** -- things that look like issues but aren't in your domain
- **[Exception 2]** -- patterns you deliberately ignore to stay focused

## Output format

Return your findings as JSON matching the findings schema. No prose outside the JSON.

```json
{
  "reviewer": "[your-reviewer-name]",
  "findings": [],
  "residual_risks": [],
  "testing_gaps": []
}
```
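
A consumer of this output could check the top-level shape before trusting it. The sketch below only checks the four keys shown in the template; the full findings schema is not reproduced in this document, so the function name and the type checks are assumptions.

```python
import json

# Top-level keys from the template and the JSON type each is expected to hold.
REQUIRED_KEYS = {
    "reviewer": str,
    "findings": list,
    "residual_risks": list,
    "testing_gaps": list,
}


def parse_reviewer_output(raw: str) -> dict:
    """Parse a reviewer's reply and verify its top-level shape.

    json.loads rejects any prose before or after the JSON object, which
    enforces the "no prose outside the JSON" rule.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reviewer output must be a JSON object")
    for key, expected in REQUIRED_KEYS.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"{key} must be of type {expected.__name__}")
    return data


ok = parse_reviewer_output(
    '{"reviewer": "template-reviewer", "findings": [], '
    '"residual_risks": [], "testing_gaps": []}'
)
print(ok["reviewer"])  # template-reviewer
```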