EveryInc · jcasimir · Apr 3, 2026 · Apr 3, 2026 · Apr 3, 2026 · Apr 3, 2026
diff --git a/README.md b/README.md
@@ -57,6 +57,14 @@ Each cycle compounds: brainstorms sharpen plans, plans inform future plans, revi
 /plugin install compound-engineering
 ```
 
+After installing, restart Claude and sync the reviewer personas:
+
+```
+/ce:refresh
+```
+
+This downloads reviewer persona files from the configured source repos. See [Reviewer Personas](plugins/compound-engineering/README.md#reviewer-personas) for how to customize your review team.
+
 ### Cursor
 
 ```text

diff --git a/docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md b/docs/brainstorms/2026-05-07-erin-phase-isolation-requirements.md
diff --git a/docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md b/docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md
diff --git a/docs/solutions/2026-05-07-agent-tool-depth-2-spike.md b/docs/solutions/2026-05-07-agent-tool-depth-2-spike.md
@@ -0,0 +1,107 @@
+# Agent Tool Depth-2 Dispatch Spike — Findings
+
+**Date:** 2026-05-07
+**Plan reference:** [docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md](../plans/2026-05-07-001-feat-erin-phase-isolation-plan.md), Unit 1
+**Harness:** [tests/spikes/depth-2-dispatch.md](../../tests/spikes/depth-2-dispatch.md)
+**Verdict:** ⚠️ **CONDITIONAL — direct depth-2 Agent dispatch is not supported, but two workarounds verified to sidestep the constraint.**
+
+## Summary
+
+The spike was the hard gate for the Erin phase-isolation feature. It tested whether Claude Code's `Agent` tool supports depth-2 subagent dispatch — i.e., whether a subagent spawned via `Agent` from the main thread can itself spawn further subagents via `Agent`. **It does not.** Subagents in Claude Code receive `Bash`, `Read`, `Edit`, `Write`, `Skill`, and several utility tools, but **not** the `Agent` tool. They have no mechanism to dispatch sub-subagents.
+
+The plan's entire wrapping architecture rests on the assumption that depth-2 dispatch works: Erin (main thread, depth-0) would dispatch `/ce:work` as a subagent (depth-1), and `/ce:work` would then spawn its internal reviewer panel via `Agent` (depth-2). Since depth-2 dispatch is unavailable, **the architecture as planned cannot ship.** All subsequent units (ce-run hook, erin.md update, dogfood) are gated on this finding and do not proceed.
+
+The spike did its job: it surfaced the architectural blocker cheaply, before any of the dependent code was written. This is exactly the failure mode the hard-gate sequencing was designed to catch.
+
+## What was tested
+
+**Form chosen:** Manual reproducible scenario (markdown harness in `tests/spikes/depth-2-dispatch.md`), executed inline from a main-thread Claude Code session via 5 parallel Agent dispatches. A `claude -p` Bash harness was not pursued because main-thread interactive Claude can run depth-2 dispatch directly when the platform supports it — which is the actual question. The markdown form is durable; re-running it on any future Claude Code update is a paste-and-go operation.
+
+**Synthetic task:** Each trial coordinator was instructed to spawn two parallel Sonnet leaf subagents via `Agent`. Each leaf would run a single `Bash` command counting markdown files in a deterministic directory (orchestrators or reviewers in `ce-reviewers-jsl`). Coordinator would sum the counts and return.
+
+**Trials run:** 5 (in parallel from main thread).
+
+## Results
+
+| Trial | Coordinator type | Outcome | Reason |
+|-------|------------------|---------|--------|
+| 1 | Opus general-purpose subagent | ❌ Could not dispatch | Reported "Agent tool is not available in this environment - it's not in the deferred tools list and ToolSearch found no match" |
+| 2 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fell back to direct Bash; reported orchestrators=10, reviewers=14, total=24 |
+| 3 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |
+| 4 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |
+| 5 | Opus general-purpose subagent | ❌ Could not dispatch | Same — fallback Bash, same totals |
+
+**0 of 5 trials achieved depth-2 dispatch.** All 5 coordinators independently and consistently reported the same finding: their tool environment does not include `Agent`, even though the platform's general-purpose subagent description claims `(Tools: *)`. The wildcard description is misleading; in practice `Agent` is excluded from subagent contexts.
+
+Four of the five coordinators recovered gracefully by computing the leaf task directly via `Bash` (the wrong primitive for the spike's intent, but at least proves leaf-task competence). All four returned the same correct totals (`orchestrators=10 reviewers=14 total=24`), which validates that leaf-level work isn't itself the failure mode — only the depth-2 dispatch capability.
+
+## Pass/fail per gate criterion
+
+- **(a) Main-thread token isolation:** Cannot be measured — there was no successful depth-2 dispatch to measure isolation against.
+- **(b) Streaming fidelity:** Cannot be observed at depth-2 — the depth-2 layer doesn't exist.
+- **(c) Reliability (5/5 success):** **FAIL.** 0 of 5 trials succeeded.
+
+Per the plan: "If any of (a), (b), or (c) fail at thresholds OR sub-step (1) yields no measurement primitive, halt the rest of the plan and surface to Jeff with the findings doc." (c) failed unambiguously.
+
+## Why the spike caught what design review couldn't
+
+The earlier adversarial-document review (F1, HIGH 0.85 confidence) flagged exactly this risk: *"The foundational Agent tool assumption is asserted, not verified... CLAUDE.md flags subagent-spawning-subagents as a known failure mode."* The plan responded by making R1 a hard gate with explicit numeric thresholds — but the gate's value turns out to be even simpler than calibrating thresholds: just running the dispatch once was enough. The platform doesn't permit it.
+
+The 0-of-5 result is a binary platform constraint, not a calibration question. No amount of threshold-tuning, harness sophistication, or persona-prose precision would have changed the answer.
+
+## Implications for the feature
+
+1. **The current architecture cannot ship.** Wrapping `/ce:work` via `Agent` would create a subagent that cannot run `/ce:work`'s internal reviewer-panel dispatches (those are also `Agent` calls). The wrapped phase would either error or produce broken output.
+2. **The plan's Units 2–4 do not execute.** ce-run hook, erin.md update, and dogfood all depended on a working depth-2 dispatch. Halted.
+3. **Some narrower designs may still be viable** (see Possible Paths Forward). The spike fails the *original* hypothesis but doesn't preclude other approaches to the same user pain.
+
+## Workarounds verified after the initial fail
+
+### Workaround A: Inline-only wrapped `/ce:work` ✅ Viable for v1
+
+The platform constraint is "subagents cannot call Agent." But `/ce:work` doesn't *intrinsically* need Agent — most of its work is Bash + Read + Edit + Write, all of which subagents have. The Agent calls inside `/ce:work` are:
+
+1. **"Choose Execution Strategy: Subagents"** — an *optional* optimization for large parallel-able plans. Default is inline.
+2. **Invoking `/ce:review`** — but `/ce:review` is a *separate Erin phase* that runs after `work`, not inside it.
+
+If `/ce:work-wrapped` is constrained to **inline execution only** AND **does not invoke `/ce:review` or other Agent-dispatching skills internally**, the depth-2 constraint never applies. Erin dispatches `/ce:work-wrapped` at depth-1 (works), and `/ce:work-wrapped` does its work using only the tools available to subagents.
+
+Trade-off: large plans that would have benefited from parallel subagent execution lose that strategy when wrapped. Acceptable for v1; if a real plan needs it, run unwrapped or use Workaround B.
+
+### Workaround B: Subprocess via `claude -p` ✅ Verified working
+
+**Test setup:** Main thread → Agent subagent (no Agent tool, confirmed) → Bash invocation of `claude -p --dangerously-skip-permissions "..."` → fresh top-level Claude session (HAS Agent) → 2 parallel Sonnet sub-subagents performing the same file-counting task.
+
+**Result:** Subprocess completed cleanly, exit code 0, ~16s wall-clock. SUBPROCESS_RESULT line returned correct counts (orchestrators=10, reviewers=14, total=24). No permission prompts. No errors.
+
+**Mechanism:** The platform's depth-2 ban applies within a single Claude Code process. Spawning a fresh `claude -p` process resets the depth count from that subprocess's perspective. The subprocess can then dispatch its own Agent subagents normally.
+
+**Trade-offs:**
+- Latency overhead ~5-15s per subprocess (process startup, fresh session). Real fan-outs will be dominated by subagent work itself, so the relative cost is small.
+- **Streaming fidelity regression.** The subprocess's tool calls return as a single block to the parent's Bash output rather than streaming during execution. The user terminal would see "subprocess running..." rather than live tool calls during the wrapped phase. This is a real UX regression vs. depth-1 streaming.
+- Each subprocess is a fresh API session — system prompt re-cached, CLAUDE.md re-loaded. Modest cost.
+- Permission model: `--dangerously-skip-permissions` was used. For wrapped phases inside an already-authorized workflow, this is probably fine, but worth flagging.
+- Authentication inherits from environment (Anthropic API key, OAuth, etc.). Worked transparently.
+- Recursion: nothing prevents subprocess-of-subprocess, but each level adds overhead.
+
+**When to use B over A:** A is sufficient for v1's "wrap `/ce:work`" use case. B is the escape hatch when future wrapped phases genuinely need internal Agent fan-out (e.g., wrapping `/ce:review` or any phase that internally dispatches reviewer panels).
+
+## Possible paths forward (for Jeff to decide)
+
+These are sketches, not commitments. Each needs its own brainstorm/plan if pursued.
+
+1. **Ship the run-state.md half only.** The brainstorm explicitly noted that pain #2 (loss of workflow thread across `/compact`) is fully solved by Erin writing a small persistent file — no subagent isolation needed. That half doesn't depend on depth-2 dispatch and could ship for all orchestrators in days. Pain #1 (context bloat in `/ce:work`) remains unaddressed.
+2. **Wrap `/ce:work` only at the top level (no internal reviewer panels).** If `/ce:work` could be restructured so its internal Agent dispatches happen *outside* the wrapped boundary, depth-1 wrapping would be sufficient. This is a substantial `/ce:work` redesign with unclear feasibility.
+3. **Trim `/ce:work`'s in-thread footprint instead of wrapping it.** Marty's plan-review point: the cheapest fix may be upstream — restructure `/ce:work` to produce less main-thread chatter (e.g., write tool transcripts to disk rather than echoing them, summarize rather than log). Doesn't require depth-2 dispatch.
+4. **Accept the constraint and use external session boundaries.** Jeff already has the option of running `/ce:work` in a dedicated Claude Code session (or worktree). Tooling around starting/resuming such sessions cleanly would address the same pain without changing dispatch semantics.
+5. **Petition the platform.** If depth-2 dispatch is a deliberate Claude Code constraint, working around it may be wrong; if it's an oversight, requesting the capability through proper channels may unblock the original architecture cleanly.
+
+## Recommended next step
+
+**Halt this plan.** Update `docs/plans/2026-05-07-001-feat-erin-phase-isolation-plan.md` status to `blocked` (or close it as `failed`) with a pointer to this findings doc. Decide which of the paths above to pursue and start a new brainstorm for that direction. Do not attempt Units 2–4 of the current plan.
+
+## Artifact integrity notes
+
+- Trial output is deterministic on this codebase as of 2026-05-07: `orchestrators=10, reviewers=14, total=24`. If a future re-run of the harness produces different numbers, the codebase has changed (which is fine) — what matters is whether the *dispatch* succeeds, not the totals.
+- The harness doc (`tests/spikes/depth-2-dispatch.md`) remains useful as a regression test: any future Claude Code update that DOES enable depth-2 dispatch will produce 5/5 successful trials when the harness is re-run, signaling that the architectural blocker has lifted.
+- This findings doc is the canonical record of the spike's outcome. It is not superseded by future re-runs unless explicitly revised.
diff --git a/mise.toml b/mise.toml
@@ -0,0 +1,2 @@
+[tools]
+bun = "latest"
diff --git a/plugins/compound-engineering/AGENTS.md b/plugins/compound-engineering/AGENTS.md
@@ -37,7 +37,9 @@ agents/
 ├── document-review/  # Plan and requirements document review agents
 ├── research/         # Research and analysis agents
 ├── design/           # Design and UI agents
-└── docs/             # Documentation agents
+├── docs/             # Documentation agents
+├── user/             # User personas for scenario-based feature evaluation
+└── workflow/         # Workflow agents
 
 skills/
 ├── ce-*/          # Core workflow skills (ce:plan, ce:review, etc.)
@@ -157,7 +159,8 @@ grep -E '^description:' skills/*/SKILL.md
 ## Adding Components
 
 - **New skill:** Create `skills/<name>/SKILL.md` with required YAML frontmatter (`name`, `description`). Reference files go in `skills/<name>/references/`. Add the skill to the appropriate category table in `README.md` and update the skill count.
-- **New agent:** Create `agents/<category>/<name>.md` with frontmatter. Categories: `review`, `document-review`, `research`, `design`, `docs`, `workflow`. Add the agent to `README.md` and update the agent count.
+- **New agent:** Create `agents/<category>/<name>.md` with frontmatter. Categories: `review`, `document-review`, `research`, `design`, `docs`, `user`, `workflow`. Add the agent to `README.md` and update the agent count.
+- **New user persona:** User personas use `type: user-persona` in frontmatter with a `traits` section (pace, tech-comfort, frustration-trigger, usage-pattern). They produce narrative evaluations, not JSON findings. User personas are synced from external repos into `agents/user/` via `/ce:refresh`, just like reviewers. They are invoked by the `ce:user-scenarios` skill, not by `ce:review`.
 
 ## Upstream-Sourced Skills
 

diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md
@@ -92,41 +92,69 @@ The primary entry points for engineering work, invoked as slash commands:
 | `/lfg` | Full autonomous engineering workflow |
 | `/slfg` | Full autonomous workflow with swarm mode for parallel execution |
 
-## Agents
+## Reviewer Personas
 
-Agents are specialized subagents invoked by skills — you typically don't call these directly.
+Reviewer personas are **pluggable** — they live in external Git repos and are synced into the plugin via `/ce:refresh`. This lets you customize your review team without forking the plugin.
 
-### Review
+### Setup
 
-| Agent | Description |
-|-------|-------------|
-| `agent-native-reviewer` | Verify features are agent-native (action + context parity) |
-| `api-contract-reviewer` | Detect breaking API contract changes |
-| `cli-agent-readiness-reviewer` | Evaluate CLI agent-friendliness against 7 core principles |
-| `cli-readiness-reviewer` | CLI agent-readiness persona for ce:review (conditional, structured JSON) |
-| `architecture-strategist` | Analyze architectural decisions and compliance |
-| `code-simplicity-reviewer` | Final pass for simplicity and minimalism |
-| `correctness-reviewer` | Logic errors, edge cases, state bugs |
-| `data-integrity-guardian` | Database migrations and data integrity |
-| `data-migration-expert` | Validate ID mappings match production, check for swapped values |
-| `data-migrations-reviewer` | Migration safety with confidence calibration |
-| `deployment-verification-agent` | Create Go/No-Go deployment checklists for risky data changes |
-| `dhh-rails-reviewer` | Rails review from DHH's perspective |
-| `julik-frontend-races-reviewer` | Review JavaScript/Stimulus code for race conditions |
-| `kieran-rails-reviewer` | Rails code review with strict conventions |
-| `kieran-python-reviewer` | Python code review with strict conventions |
-| `kieran-typescript-reviewer` | TypeScript code review with strict conventions |
-| `maintainability-reviewer` | Coupling, complexity, naming, dead code |
-| `pattern-recognition-specialist` | Analyze code for patterns and anti-patterns |
-| `performance-oracle` | Performance analysis and optimization |
-| `performance-reviewer` | Runtime performance with confidence calibration |
-| `reliability-reviewer` | Production reliability and failure modes |
-| `schema-drift-detector` | Detect unrelated schema.rb changes in PRs |
-| `security-reviewer` | Exploitable vulnerabilities with confidence calibration |
-| `security-sentinel` | Security audits and vulnerability assessments |
-| `testing-reviewer` | Test coverage gaps, weak assertions |
-| `project-standards-reviewer` | CLAUDE.md and AGENTS.md compliance |
-| `adversarial-reviewer` | Construct failure scenarios to break implementations across component boundaries |
+```bash
+/ce:refresh
+```
+
+On first run, this creates `~/.config/compound-engineering/reviewer-sources.yaml` with a default source and syncs all reviewer files. Run it again anytime to pull updates.
+
+### How it works
+
+- Each reviewer is a self-contained `.md` file with frontmatter defining its `category` (always-on, conditional, stack, etc.) and `select_when` criteria
+- The orchestrator reads frontmatter to decide which reviewers to spawn for a given diff
+- A `_template-reviewer.md` ships with the plugin as a starting point for writing your own
+
+### Configuring sources
+
+Edit `~/.config/compound-engineering/reviewer-sources.yaml`:
+
+```yaml
+sources:
+  # Your reviewers (higher priority -- listed first)
+  - name: my-team
+    repo: myorg/our-reviewers
+    branch: main
+    path: .
+
+  # Default reviewers
+  - name: ce-default
+    repo: JumpstartLab/ce-reviewers
+    branch: main
+    path: .
+    except:
+      - kieran-python-reviewer
+```
+
+- **Sources listed first win** on filename conflicts
+- **`except`** skips specific reviewers from a source
+- **`branch`** lets one repo host multiple reviewer sets
+
+### Creating a custom reviewer
+
+1. Copy `_template-reviewer.md` from `agents/review/`
+2. Fill in the persona, hunting targets, confidence calibration, and output format
+3. Set `category` and `select_when` in frontmatter
+4. Add to your reviewer repo and run `/ce:refresh`
+
+### Categories
+
+| Category | When spawned |
+|----------|-------------|
+| `always-on` | Every review |
+| `conditional` | When the diff touches the reviewer's domain |
+| `stack` | Like conditional, scoped to a language/framework |
+| `plan-review` | During plan review phases |
+| `synthesis` | After other reviewers, to merge findings |
+
+## Agents
+
+Agents are specialized subagents invoked by skills — you typically don't call these directly.
 
 ### Document Review
 

diff --git a/plugins/compound-engineering/agents/review/.gitignore b/plugins/compound-engineering/agents/review/.gitignore
@@ -0,0 +1,6 @@
+# Reviewer persona files are synced from external repos via /ce:refresh
+# They should not be committed to this plugin repo
+*.md
+
+# Template reviewer ships with the plugin as an example and test fixture
+!_template-reviewer.md
diff --git a/plugins/compound-engineering/agents/review/.gitkeep b/plugins/compound-engineering/agents/review/.gitkeep
diff --git a/plugins/compound-engineering/agents/review/_template-reviewer.md b/plugins/compound-engineering/agents/review/_template-reviewer.md
@@ -0,0 +1,43 @@
+---
+name: template-reviewer
+description: Template reviewer persona — copy this file to create your own. Not spawned during real reviews.
+model: inherit
+tools: Read, Grep, Glob, Bash
+color: gray
+---
+
+# Template Reviewer
+
+You are [Name], a reviewer focused on [domain]. You bring [perspective] to code reviews.
+
+## What you're hunting for
+
+- **[Category 1]** -- describe what patterns, risks, or smells you look for
+- **[Category 2]** -- another area of focus
+- **[Category 3]** -- a third dimension of review
+
+## Confidence calibration
+
+Your confidence should be **high (0.80+)** when you can point to a concrete defect, regression, or violation with evidence in the diff.
+
+Your confidence should be **moderate (0.60-0.79)** when the issue is real but partly judgment-based — the right answer depends on context beyond the diff.
+
+Your confidence should be **low (below 0.60)** when the criticism is mostly stylistic or speculative. Suppress these.
+
+## What you don't flag
+
+- **[Exception 1]** -- things that look like issues but aren't in your domain
+- **[Exception 2]** -- patterns you deliberately ignore to stay focused
+
+## Output format
+
+Return your findings as JSON matching the findings schema. No prose outside the JSON.
+
+```json
+{
+  "reviewer": "[your-reviewer-name]",
+  "findings": [],
+  "residual_risks": [],
+  "testing_gaps": []
+}
+```