feat(batch-bug-shepherd): recommendation-fold loop + Copilot+CI gates + mergeability table #1518
Merged
… mergeability table

Refactor the batch-bug-shepherd skill into a single shepherd-driver convergence loop that closes four production gaps surfaced by the in-flight bug-queue sweep:

1. Recommendation-fold loop. Every panel CEO follow-up and Copilot inline review item is run through assets/fold-vs-defer-rubric.md and folded unless it crosses the PR's stated scope. Default is fold; defer is the scope-creep exception with a one-line scope_boundary_crossed note.
2. Copilot PR review address loop. Phase X.0 fetches copilot-pull-request-reviewer[bot] review per assets/copilot-classification-prompt.md, classifies each item LEGIT/NOT-LEGIT, and folds LEGIT into the same iteration. 2-round cap on Copilot fetches.
3. Post-push CI verification loop. gh pr checks --watch after every push, with assets/ci-recovery-checklist.md bucketing failures (lint / test / infra / unknown) under a 3-iteration cap.
4. Orchestrator ownership signal. Assigns the shepherd actor and applies status/shepherding on pickup; the label is cleared on terminal.

New asset assets/shepherd-driver-prompt.md replaces the old shepherd-prompt / completion-prompt split. New supporting assets: fold-vs-defer-rubric.md, copilot-classification-prompt.md, ci-recovery-checklist.md, strategic-alignment-prompt.md, conflict-resolution-prompt.md, progress-diagram.md. New references/ directory with mergeability-gate.md and strategic-alignment-gate.md. Genesis design record in design.md.

Mergeability status table (new in this commit). Shepherd-driver step X.8 captures a per-PR mergeability snapshot via gh pr view <n> --json mergeable,mergeStateStatus,statusCheckRollup immediately after the last push. The snapshot lands as a one-row table in the PR advisory comment (final-report-template.md PR ADVISORY COMMENT block) and is aggregated by the orchestrator at saga-end into a Mergeability status table in the FINAL REPORT block (PR, head SHA, CEO stance, outer iterations, folds, deferrals, Copilot rounds, CI status, mergeable, mergeStateStatus, notes).

verdict-schema.json grows four optional completion-return fields: head_sha, mergeable, merge_state_status, ci_status.

Validated on the wave-2 shepherd run that drove PRs #1472, #1512, #1513, #1514, #1515, #1516 to advisory-terminal. PR #1514 hit 4 outer iterations with 11 folds + 1 deferral, exercising the fold-by-default discipline at the cap.

CHANGELOG entry under [Unreleased] / Added.

Lint notes: this commit touches NO Python (.agents/ skill files are markdown + JSON + CHANGELOG markdown). The only applicable lint gates are the ASCII guard and bash scripts/lint-auth-signals.sh, both silent. ruff / pylint / ruff format are skipped per .apm/instructions/linting.instructions.md scope (src/ tests/ only).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f6f5ecb to 1d3c2d5
…ommendation-fold-loop # Conflicts: # CHANGELOG.md
Pull request overview
Refactors the batch-bug-shepherd skill from a two-phase shepherd+completion split into a single shepherd-driver convergence loop, adds new orchestrator-side gates (strategic-alignment, mergeability), and extends the verdict schema and final-report template to carry per-PR mergeability snapshots aggregated into a saga-end status table. All changes are confined to the skill's markdown assets, JSON schema, and a single CHANGELOG.md line.
Changes:
- New `shepherd-driver-prompt.md` (with fold-vs-defer rubric, Copilot classification, CI recovery checklist) replaces deleted `shepherd-prompt.md` / `completion-prompt.md`; `SKILL.md` rewritten around the new four-wave shape and `status/shepherding` ownership signal.
- `verdict-schema.json` extends `completion_return` with iteration/Copilot/CI caps, fold/defer arrays, and mergeability fields (`head_sha`, `mergeable`, `merge_state_status`, `ci_status`); new `advisory-with-deferred` status added.
- New supporting docs: `design.md`, `progress-diagram.md`, `references/{mergeability,strategic-alignment}-gate.md`, `conflict-resolution-prompt.md`, `strategic-alignment-prompt.md`, plus final-report-template updates with the saga-end mergeability table.
| File | Description |
|---|---|
| `CHANGELOG.md` | Adds one Unreleased/Added entry summarizing the refactor. |
| `.agents/skills/batch-bug-shepherd/SKILL.md` | Rewritten around the shepherd-driver loop + ownership signaling + fold-by-default invariants. |
| `.agents/skills/batch-bug-shepherd/design.md` | New genesis design record; counterpart to SKILL.md. |
| `.agents/skills/batch-bug-shepherd/assets/shepherd-driver-prompt.md` | New unified per-PR convergence loop with steps X.0..X.8. |
| `.agents/skills/batch-bug-shepherd/assets/fold-vs-defer-rubric.md` | New rubric defining scope-creep as the decision axis. |
| `.agents/skills/batch-bug-shepherd/assets/copilot-classification-prompt.md` | New LEGIT/NOT-LEGIT classification template for Copilot review. |
| `.agents/skills/batch-bug-shepherd/assets/ci-recovery-checklist.md` | New post-push watch + 4-bucket recovery loop (cap 3). |
| `.agents/skills/batch-bug-shepherd/assets/strategic-alignment-prompt.md` | New Phase 1.5 ceo-align spawn body; promises schema validation. |
| `.agents/skills/batch-bug-shepherd/assets/conflict-resolution-prompt.md` | New Phase 5b spawn body; references a comment block name not present in the template. |
| `.agents/skills/batch-bug-shepherd/assets/progress-diagram.md` | New operator-visibility mermaid; subgraph labels still use pre-refactor phase names. |
| `.agents/skills/batch-bug-shepherd/assets/verdict-schema.json` | Extends completion_return + new status enum; missing strategic_alignment_return definition the new prompts promise. |
| `.agents/skills/batch-bug-shepherd/assets/final-report-template.md` | Adds per-PR mergeability row + saga-end aggregated table + folded/deferred sections. |
| `.agents/skills/batch-bug-shepherd/assets/ground-truth-table.md` | Adds new shepherd-driver-iter-* and advisory-with-deferred statuses. |
| `.agents/skills/batch-bug-shepherd/assets/fix-prompt.md` | Notes the hand-off to shepherd-driver and CI checklist on first-push red. |
| `.agents/skills/batch-bug-shepherd/references/mergeability-gate.md` | New Phase 5 reference procedure (5a probe, 5b fan-out, 5c synthesis). |
| `.agents/skills/batch-bug-shepherd/references/strategic-alignment-gate.md` | New Phase 1.5 reference procedure with fail-open semantics. |
| `.agents/skills/batch-bug-shepherd/assets/shepherd-prompt.md` | Deleted (absorbed into shepherd-driver). |
| `.agents/skills/batch-bug-shepherd/assets/completion-prompt.md` | Deleted (absorbed into shepherd-driver). |
Copilot's findings
- Files reviewed: 18/18 changed files
- Comments generated: 3
Comment on lines +130 to +131

    resolved`. Render from the RESOLUTION CONFIRMATION COMMENT block
    in `final-report-template.md`. Include:
Comment on lines +53 to +60

    subgraph WAVE2[" "]
        direction LR
        P3a["Phase 3a<br/>shepherd<br/>k = <k> PRs in flight"]:::pending
        P3b["Phase 3b<br/>fix dispatch<br/>m = <m> rows without PR"]:::pending
    end

    P4["Phase 4<br/>completion<br/>F = <F> PRs needing follow-up"]:::pending
Comment on lines +119 to +123

    "panel_final_verdict": {
      "type": "string",
      "enum": ["ship_now", "ship_with_followups", "needs_discussion", "needs_rework"],
      "description": "CEO stance from the final panel pass in this run."
    },
…evidence

Backfills the genesis Step 6 EVALS PLAN and Step 8 EVALS GATE that the structural refactor (PR #1518) shipped without. Per genesis the EVALS GATE blocks any future shipping of skill-body changes.

Adds three structured-input content evals that exercise the load-bearing decision policies introduced by the v2 refactor:

* fold-vs-defer-panel -- Phase X.2 rubric application on a panel CEO ship_with_followups return (3 fold, 1 defer with scope_boundary_crossed).
* copilot-classification-and-fold -- Phase X.0 LEGIT/NOT-LEGIT classification on 5 inline comments (4 LEGIT folded, 1 NOT-LEGIT dismissed with rationale).
* ci-recovery-lint-bucket -- Phase X.6 bucket-1 lint recovery via ruff format + push + watch re-entry, cap 3.

Each scenario ships with_skill and without_skill fixtures and a regex rubric scored by the existing scripts/run_evals.py runner. All five content scenarios + the trigger val split pass:

    triggers val: 1.0 should-fire, 1.0 should-not-fire
    content delta_anchors: 5, 7, 7, 7, 8 (gate: >=1)

Adds evals/real-task-refinement.md capturing the wave-1 (v1 SKILL.md) vs wave-2 (v2 SKILL.md) comparison that drove the default-fold + Copilot-first-class + CI-recovery-first-class edits, plus the iter-4 #1513 CI-infra rollback as a positive trace of the new bucket-3 recovery path.

Adds an Evals section near the bottom of SKILL.md pointing at evals/ and stating the EVALS GATE for future skill-body changes.

ASCII guard clean. auth-signals lint clean.

Scope: skill-only; install.ps1 and tests/unit/install/test_windows_shim_template.py remain untracked (they belong to PR #1512, not here).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
feat(batch-bug-shepherd): recommendation-fold loop + Copilot+CI gates + mergeability table
TL;DR
Refactor `batch-bug-shepherd` into a single shepherd-driver convergence
loop that closes four production gaps surfaced by the in-flight bug-queue
sweep: (1) fold-by-default of every panel/Copilot recommendation,
(2) Copilot review address loop (2-round cap), (3) post-push CI watch +
recovery (3-iteration cap), (4) orchestrator ownership signal via
assign + `status/shepherding`. Plus a new per-PR mergeability snapshot
that aggregates into a saga-end status table the orchestrator emits at
terminal. Validated on the wave-2 run that drove PRs #1472, #1512,
#1513, #1514, #1515, #1516; PR #1514 exercised the cap (4 outer
iterations, 11 folds, 1 deferral).
Problem (WHY)
The previous skill split shepherd review and completion into two phases.
That seam hard-coded a "post advisory, address it later" pattern that
left foldable items as unbounded backlog and produced four observable
production gaps during the bug-queue sweep:
1. … review items were posted as advisory bullets and rarely folded
   into the same PR; severity, not scope, was the default decision
   axis, so reviewers re-read the same items across multiple
   iterations without convergence.
2. `copilot-pull-request-reviewer[bot]` comments were treated as
   "for the author" rather than as input the shepherd had to
   classify and either fold or decline with a recorded rationale.
3. … discovered only on the next human pass. The shepherd could close
   a session claiming "ready" while CI was still failing.
4. … driving carried no machine-readable indication, so concurrent
   sessions or community contributors could pick up the same item.
The mergeability table was missing too: orchestrator-side aggregation
of `gh pr view --json mergeable,mergeStateStatus,statusCheckRollup`
across every shepherded PR was done ad hoc, by hand, in the final
human report. There was no canonical schema for it.
Approach (WHAT)
Collapse shepherd + completion into one `shepherd-driver` subagent
that runs an iterative convergence loop per PR. The loop owns
classification, panel re-run, fold/defer decision, push, CI watch,
and terminal-state signalling. Hard caps bound the loop. A new
fold-vs-defer rubric defines the decision axis as SCOPE-CREEP, not
severity. The orchestrator owns label + assignee writes on pickup
and clears them on terminal.
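As a rough illustration of the fold-vs-defer axis described above, here is a minimal Python sketch. It assumes a hypothetical `Recommendation` record; only the `scope_boundary_crossed` note name comes from this PR, and the real rubric lives in `assets/fold-vs-defer-rubric.md`, which may weigh more than this. Severity is deliberately not an input.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Recommendation:
    """One panel follow-up or Copilot review item (hypothetical shape)."""
    summary: str
    crosses_pr_scope: bool                        # the only decision axis
    scope_boundary_crossed: Optional[str] = None  # one-line note, defer only


def fold_or_defer(item: Recommendation) -> str:
    """Default is fold; defer only when the item crosses the PR's stated scope."""
    if not item.crosses_pr_scope:
        return "fold"
    if not item.scope_boundary_crossed:
        # A deferral without a recorded boundary note is treated as an error.
        raise ValueError("a deferral needs a one-line scope_boundary_crossed note")
    return "defer"


def triage(items: List[Recommendation]) -> Tuple[list, list]:
    """Split items into (folds, deferrals) using the rule above."""
    folds, deferrals = [], []
    for item in items:
        (folds if fold_or_defer(item) == "fold" else deferrals).append(item)
    return folds, deferrals
```

Keeping the decision a function of scope alone is what makes "capacity is not a defer axis" checkable: there is no field through which effort or severity could change the outcome.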
- `gh pr checks --watch` + 4-bucket recovery (3-iter cap)
- `status/shepherding` on pickup, clear on terminal

Implementation (HOW)
| File | Change |
|---|---|
| `SKILL.md` | … `apm-review-panel` rather than re-implementing it; documents fold-by-default and ownership invariants. |
| `design.md` | |
| `assets/shepherd-driver-prompt.md` | … `shepherd-prompt.md` + `completion-prompt.md`. Defines steps X.0..X.8 of the per-PR loop. |
| `assets/fold-vs-defer-rubric.md` | |
| `assets/copilot-classification-prompt.md` | … `gh api`, classify LEGIT/NOT-LEGIT with rationale. |
| `assets/ci-recovery-checklist.md` | `gh pr checks --watch` contract + 4 failure buckets (lint / test / infra / unknown) + 3-iteration cap. |
| `assets/strategic-alignment-prompt.md`, `conflict-resolution-prompt.md`, `progress-diagram.md` | |
| `assets/verdict-schema.json` | … `completion_return` fields: `head_sha`, `mergeable`, `merge_state_status`, `ci_status`. |
| `assets/final-report-template.md` | |
| `assets/ground-truth-table.md`, `fix-prompt.md` | |
| `references/mergeability-gate.md`, `references/strategic-alignment-gate.md` | |
| `CHANGELOG.md` | … `## [Unreleased]` / `### Added` describing the refactor + 4 gaps + mergeability table; references wave-2 PR numbers. |

Mergeability snapshot mechanics
Shepherd-driver step X.8 (added in this PR) runs
`gh pr view <n> --json mergeable,mergeStateStatus,statusCheckRollup`
and projects into the return shape:
- `head_sha` -> `.headRefOid` (the sha actually pushed last)
- `mergeable` -> `MERGEABLE` | `CONFLICTING` | `UNKNOWN`
- `merge_state_status` -> `CLEAN` | `BLOCKED` | `BEHIND` | `DIRTY` | `UNSTABLE` | `HAS_HOOKS` | `UNKNOWN`
- `ci_status` -> coarse projection from `statusCheckRollup` (green / yellow / red / blocked)
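A minimal sketch of this projection, assuming the `gh pr view --json` output has already been parsed into a Python dict. The function names are hypothetical, and the green / yellow / red rule is an assumption: the PR only calls it a coarse projection. The `blocked` value is not produced here because its rule is not stated in this description.

```python
# Assumed groupings of check outcomes; the skill's real mapping may differ.
FAILED = {"FAILURE", "ERROR", "TIMED_OUT", "CANCELLED", "ACTION_REQUIRED", "STARTUP_FAILURE"}
PASSED = {"SUCCESS", "NEUTRAL", "SKIPPED"}


def coarse_ci_status(rollup: list) -> str:
    """Collapse statusCheckRollup entries into green / yellow / red."""
    outcomes = []
    for check in rollup:
        # Check runs carry status/conclusion; commit statuses carry state.
        outcome = check.get("conclusion") or check.get("state") or check.get("status") or ""
        outcomes.append(outcome.upper())
    if any(o in FAILED for o in outcomes):
        return "red"
    if outcomes and all(o in PASSED for o in outcomes):
        return "green"
    return "yellow"  # pending checks, or no checks reported yet (assumption)


def project_snapshot(view: dict) -> dict:
    """Project parsed `gh pr view --json ...` output into the four return fields."""
    return {
        "head_sha": view.get("headRefOid"),
        "mergeable": view.get("mergeable", "UNKNOWN"),
        "merge_state_status": view.get("mergeStateStatus", "UNKNOWN"),
        "ci_status": coarse_ci_status(view.get("statusCheckRollup") or []),
    }
```

Treating a missing field as `UNKNOWN` keeps the projection total, so a partially populated response never raises.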
`UNKNOWN` triggers one 5-second-delay retry (GitHub computes
mergeability asynchronously after a push). The shepherd-driver
emits a one-row table fragment into the PR advisory comment; the
orchestrator aggregates all rows into the FINAL REPORT
`Mergeability status table` at saga-end.

Diagram
    flowchart TD
        A[Orchestrator picks up PR<br/>assign + status/shepherding] --> B[Spawn shepherd-driver subagent]
        B --> C[X.0 Fetch + classify Copilot]
        C --> D[X.1 Run apm-review-panel]
        D --> E[X.2 Apply fold-vs-defer rubric]
        E --> F[X.3 Edit code, fold foldable items]
        F --> G[X.4 Lint chain silent]
        G --> H[X.5 Push to fork or superseding PR]
        H --> I[X.6 gh pr checks --watch<br/>cap 3 CI recovery iters]
        I --> J{Terminal?<br/>cap 4 outer iters}
        J -- No --> C
        J -- Yes --> K[X.8 Capture mergeability snapshot<br/>gh pr view --json mergeable,mergeStateStatus,statusCheckRollup]
        K --> L[Post final advisory comment<br/>incl. per-PR mergeability row]
        L --> M[Return completion_return JSON<br/>incl. head_sha, mergeable, ci_status]
        M --> N[Orchestrator clears status/shepherding<br/>aggregates saga-end Mergeability status table]

Trade-offs
- … 3 CI recovery iterations. Hitting a cap returns
  `advisory-with-deferred` or `blocked`, not silent continuation.
  Cost: some PRs terminate with unfolded items; benefit: bounded
  subagent budget and predictable convergence.
- … past their original diff with regression-trap tests, CHANGELOG
  entries, and doc-drift fixes. Cost: larger diffs; benefit: no
  follow-up issue backlog from the panel pass.
- … auto-merge or block on `mergeStateStatus`; the table is for the
  maintainer's situational awareness.
- … `completion_return` payloads still validate; the four new fields
  land non-required to preserve backwards compatibility during the
  wave-2 transition.
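The backwards-compatibility point can be illustrated with a small hand-rolled check. This is only a sketch: the enum values are the ones listed in this description, the `status` key in the usage is a hypothetical stand-in for whatever pre-refactor payloads carry, and the real contract is `assets/verdict-schema.json`.

```python
# Enum values as listed in this PR description; the schema file is authoritative.
OPTIONAL_ENUMS = {
    "mergeable": {"MERGEABLE", "CONFLICTING", "UNKNOWN"},
    "merge_state_status": {
        "CLEAN", "BLOCKED", "BEHIND", "DIRTY", "UNSTABLE", "HAS_HOOKS", "UNKNOWN",
    },
    "ci_status": {"green", "yellow", "red", "blocked"},
}


def check_mergeability_fields(payload: dict) -> list:
    """Return a list of problems; an empty list means the fields are acceptable."""
    problems = []
    for field, allowed in OPTIONAL_ENUMS.items():
        if field not in payload:
            continue  # optional: payloads without the field stay valid
        if payload[field] not in allowed:
            problems.append(f"{field}={payload[field]!r} not in {sorted(allowed)}")
    head_sha = payload.get("head_sha")
    if head_sha is not None and not isinstance(head_sha, str):
        problems.append("head_sha must be a string")
    return problems
```

For example, `check_mergeability_fields({"status": "advisory"})` returns an empty list because none of the new fields are present, while an out-of-enum `mergeable` value yields exactly one problem.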
Validation evidence
Wave-2 shepherd runs
The refactor was validated on a real bug-queue sweep that drove six
PRs through the new loop:
PR #1514 specifically exercised the cap: 4 outer iterations, 11
items folded into the same PR, 1 item deferred with an explicit
scope-boundary note. This demonstrates the rubric does what it
claims (defer is the exception; capacity is not a defer axis).
Lint chain
This PR touches NO Python; the only applicable repo lint gates are the
ASCII guard and `bash scripts/lint-auth-signals.sh`, both silent.
ruff / ruff format / pylint R0801 do not apply (their scope is
`src/` and `tests/`); see `.apm/instructions/linting.instructions.md`.

Schema
`assets/verdict-schema.json` validates as draft-07 (verified with
`python3 -c "import json; json.load(open('...'))"` -> JSON OK).

How to test
- `assets/shepherd-driver-prompt.md` step X.8 names the exact
  `gh pr view --json mergeable,mergeStateStatus,statusCheckRollup`
  command and the field projections.
- `assets/verdict-schema.json` has new `completion_return` properties
  `head_sha`, `mergeable`, `merge_state_status`, `ci_status`
  (all optional, all enum-constrained).
- `assets/final-report-template.md` PR ADVISORY block carries a
  one-row mergeability table; FINAL REPORT block carries the
  aggregated saga-end mergeability table.
- … OK.

Scope discipline
This PR touches ONLY `.agents/skills/batch-bug-shepherd/` and
`CHANGELOG.md`. No production code, no tests, no workflows. The
worktree contained unrelated in-flight edits to `install.ps1` and
an untracked `tests/unit/install/test_windows_shim_template.py`;
both were explicitly excluded from this commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Evals (genesis Step 6 / Step 8 backfill)
Commit `c8ddee45` backfills the evals plan that the structural
refactor shipped without. Per genesis the EVALS GATE blocks any
future skill-body change from shipping without a passing eval
suite.

Files added under `.agents/skills/batch-bug-shepherd/evals/`:

- `content/fold-vs-defer-panel.json` (+ fixtures) -- Phase X.2
  rubric application on a panel CEO `ship_with_followups` return:
  3 follow-ups folded, 1 deferred with `scope_boundary_crossed`.
- `content/copilot-classification-and-fold.json` (+ fixtures) --
  Phase X.0 LEGIT/NOT-LEGIT classification on 5 inline comments:
  4 LEGIT folded, 1 NOT-LEGIT dismissed with rationale logged.
- `content/ci-recovery-lint-bucket.json` (+ fixtures) -- Phase X.6
  bucket-1 lint recovery via `ruff format` + push + re-watch.
- `real-task-refinement.md` -- wave-1 vs wave-2 evidence (see
  table below) plus the iter-4 #1513 CI-infra rollback as a
  positive trace of the new bucket-3 recovery path.

A brief Evals section was added to `SKILL.md` near the bottom
pointing at `evals/` and stating the EVALS GATE.

Wave-1 (v1 SKILL.md) vs Wave-2 (v2 SKILL.md)
Eval suite results
Triggers val: 1.0 should-fire, 1.0 should-not-fire. Content
delta_anchors: 5, 7, 7, 7, 8 (gate >= 1).
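One way to read the gate, sketched in Python under a loudly stated assumption: that a scenario's anchor delta is the number of rubric regexes matched by the `with_skill` fixture minus the number matched by the `without_skill` fixture. That definition is a guess made for illustration; the actual scoring is whatever `scripts/run_evals.py` implements, and the function names below are hypothetical.

```python
import re
from typing import Iterable


def count_anchors(text: str, rubric: Iterable[str]) -> int:
    """Count how many rubric regexes match somewhere in the text."""
    return sum(1 for pattern in rubric if re.search(pattern, text))


def delta_anchors(with_skill: str, without_skill: str, rubric: list) -> int:
    """ASSUMED definition: anchors hit with the skill minus anchors hit without it."""
    return count_anchors(with_skill, rubric) - count_anchors(without_skill, rubric)


def passes_gate(delta: int, gate: int = 1) -> bool:
    """A scenario passes when its delta meets the gate (>= 1 in this PR)."""
    return delta >= gate
```

Under this reading, a scenario only passes when the skill-guided output hits at least one more rubric anchor than the baseline output.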
Why structured-input evals (not live-PR)
This skill takes 30+ minutes per real-PR shepherd-driver run and
composes against network-gated assets (panel, Copilot, CI). A true
live `with_skill` vs `without_skill` comparison is infeasible at CI
cadence. The structured-input evals exercise the LOAD-BEARING
decision policy (fold-vs-defer rubric, Copilot classification, CI
bucket routing) rather than the long-running orchestration. The
wave-1 -> wave-2 table above stands as the
real-task-refinement evidence; the structured-input evals stand as
the per-change regression guard.
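To make the CI bucket routing concrete, here is a minimal Python sketch of a bounded recovery loop over the four buckets (lint / test / infra / unknown) with the 3-iteration cap. The keyword hints, the `check_ci` and `fix` callables, and the `cap_reached` label are all hypothetical stand-ins; the real routing rules are in `assets/ci-recovery-checklist.md`.

```python
from typing import Callable, List, Optional, Tuple

MAX_CI_RECOVERY_ITERATIONS = 3  # cap stated in this PR

# Hypothetical keyword hints, checked in bucket order; not the checklist's rules.
HINTS = {
    "lint": ("lint", "ruff", "pylint", "format"),
    "test": ("test", "pytest"),
    "infra": ("timeout", "runner", "network", "rate limit"),
}


def bucket_for(failure_text: str) -> str:
    """Route a failure description to lint / test / infra, else unknown."""
    lowered = failure_text.lower()
    for bucket in ("lint", "test", "infra"):
        if any(hint in lowered for hint in HINTS[bucket]):
            return bucket
    return "unknown"


def recover(
    check_ci: Callable[[], Optional[str]],
    fix: Callable[[str], None],
    max_iters: int = MAX_CI_RECOVERY_ITERATIONS,
) -> Tuple[str, List[str]]:
    """check_ci() returns None when green, else a failure description."""
    attempts: List[str] = []
    for _ in range(max_iters):
        failure = check_ci()
        if failure is None:
            return "green", attempts
        bucket = bucket_for(failure)
        attempts.append(bucket)
        fix(bucket)  # e.g. re-format, repair a test, or re-run a flaky job
    # One last look after the final fix; no further recovery past the cap.
    return ("green" if check_ci() is None else "cap_reached"), attempts
```

Returning the attempted buckets alongside the outcome gives the caller something to report when the cap is hit, rather than continuing silently.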