DO NOT MERGE — Verification autopilot and routing policy engine (experimental) #15

Open
johnlindquist wants to merge 30 commits into main from explore/ideas

@johnlindquist (Collaborator) commented Mar 28, 2026

DO NOT MERGE — Experimental exploration branch

Summary

Closed-loop learning system for skill routing: observe → verify → promote → recall.

Core subsystems

  • Verification ledger — append-only observation log tracking what Claude actually verifies across 4 boundary types (UI renders, client requests, server handlers, environment)
  • Verification plan + directives — computes "next action" suggestions from ledger state and hands them to subagents as environment-variable directives
  • Closed-loop feedback — PostToolUse observer matches executed commands against directives, persists observations, and exposes adherence via verify-plan --json
  • Verification signal broadening — observes Read, Grep, Glob, and Fetch results (not just Bash), with local provenance gating to block external fetch contamination
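
A minimal sketch of how an append-only ledger could drive a "next action" suggestion. The type names, the boundary identifiers, and the suggestion order are assumptions for illustration; the PR does not show the real ledger schema.

```typescript
// Hypothetical shapes: the PR does not show the real ledger schema.
type Boundary = "uiRender" | "clientRequest" | "serverHandler" | "environment";

interface Observation {
  readonly story: string; // verification thread the observation belongs to
  readonly boundary: Boundary;
  readonly command: string; // what was actually executed
}

// Order in which uncovered boundaries are suggested (an assumption).
const BOUNDARY_ORDER: readonly Boundary[] = [
  "clientRequest",
  "serverHandler",
  "uiRender",
  "environment",
];

class VerificationLedger {
  private readonly entries: Observation[] = [];

  // Append-only: entries are stored as frozen copies and never edited or removed.
  append(observation: Observation): void {
    this.entries.push(Object.freeze({ ...observation }));
  }

  forStory(story: string): readonly Observation[] {
    return this.entries.filter((entry) => entry.story === story);
  }
}

// Suggest the first boundary that has no observation yet for this story.
function nextBoundary(ledger: VerificationLedger, story: string): Boundary | null {
  const covered = new Set(ledger.forStory(story).map((entry) => entry.boundary));
  const next = BOUNDARY_ORDER.find((boundary) => !covered.has(boundary));
  return next ?? null;
}
```

Filtering by story before computing coverage keeps one story's observations from satisfying another story's plan.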

Routing policy engine

  • Policy ledger + compiler — learns from verification outcomes to boost/suppress skill routing decisions over time
  • Policy recall + attribution — route-scoped policy recall for prompt and pretooluse hooks, with precise causal credit attribution
  • Decision capsules + causality — first-class causal evidence on every routing decision (why matched, boosted, recalled, dropped)
  • Closure diagnostics — append-only capsules recording why observations did/didn't close policy gates
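
One way the boost/suppress step could look, as a sketch only. The `minExposures` and `maxBoost` parameters and the linear mapping are assumptions; the PR does not state the actual scoring formula.

```typescript
// Hypothetical per-skill counters accumulated from verification outcomes.
interface PolicyStats {
  exposures: number; // times the skill was injected
  wins: number; // times a verified observation followed the injection
}

// Map a success rate to a bounded ranking adjustment.
// Below minExposures there is too little evidence, so the result is neutral.
function policyBoost(stats: PolicyStats, minExposures = 3, maxBoost = 8): number {
  if (stats.exposures < minExposures) return 0;
  const successRate = stats.wins / stats.exposures;
  // A rate of 0.5 is neutral, 1.0 gives +maxBoost, 0.0 gives -maxBoost.
  return Math.round((successRate - 0.5) * 2 * maxBoost);
}
```

Keeping the adjustment bounded means learned policy can nudge ranking without overwhelming static pattern weights.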

Learning & memory

  • Verified rule learning — learn CLI replays verified routing decisions and promotes them into project-scoped JSON rulebooks
  • Companion learning — persists companion skill pairs that consistently close verification gaps together
  • Playbook recall — multi-step verified workflows promoted and recalled as reusable procedures
  • Learned routing rulebooks — runtime ranking applies stable learned guidance from canonical project-scoped artifacts

Observability

  • Session diagnostics CLI — session-explain, routing-explain, decision-cat commands for inspecting routing decisions
  • Skill exclusion policy — unified exclusion rules with manifest parity tests

Stats

  • ~43,250 lines added across 146 files
  • ~50 new test suites (80+ total test files)
  • 30 commits on explore/ideas

Test plan

  • bun test — full suite passes
  • bun run build — hooks + manifest + from-skills compile cleanly
  • bun run doctor — self-diagnosis passes
  • Review verification plan output: bun run src/commands/verify-plan.ts --json
  • Review learned rulebook output: bun run src/cli/learn.ts --help

Preserve a deterministic verification trail so troubleshooting guidance can
advance from observed evidence instead of repeating generic next steps.

Surface one ranked verification action and scoped context to the CLI,
hooks, and subagents so agents can continue the same investigation with
clear boundary coverage and less duplicated probing.

Ploop-Iter: 1
Carry verification intent across subagent boundaries so agents can
receive a deterministic next step and downstream hooks can confirm
whether the requested verification actually happened.

Keep verification planning resilient when cached state is missing or
stale by recomputing from ledger data, and log fallback failures so
runtime proof paths do not fail silently during debugging.

Ploop-Iter: 2
Persist verification observations and expose adherence snapshots so the plugin can replan from real execution evidence instead of only static intent.

This gives downstream agents a stable machine-readable view of whether the last verification action followed guidance, which makes the autopilot loop more reliable and easier to regress-test.

Ploop-Iter: 3
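A sketch of what an adherence snapshot could contain. The field names and the whitespace-insensitive exact match are assumptions; the real matcher may be looser or stricter.

```typescript
// Hypothetical machine-readable record of whether the last action followed guidance.
interface AdherenceSnapshot {
  directive: string | null; // the command the plan asked for, if any
  executed: string; // the command that actually ran
  followed: boolean;
}

// Collapse whitespace so cosmetic differences do not count as deviations.
function normalizeCommand(command: string): string {
  return command.trim().replace(/\s+/g, " ");
}

function snapshotAdherence(directive: string | null, executed: string): AdherenceSnapshot {
  const normalizedExecuted = normalizeCommand(executed);
  const normalizedDirective = directive === null ? null : normalizeCommand(directive);
  return {
    directive: normalizedDirective,
    executed: normalizedExecuted,
    followed: normalizedDirective !== null && normalizedDirective === normalizedExecuted,
  };
}
```

Serializing the result with `JSON.stringify` gives downstream agents a stable shape to assert against in regression tests.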
Persist routing outcomes and exposures so skill selection can learn from
verified boundary observations instead of relying only on static pattern
weights.

Align explain output and regression coverage with the runtime injector so
the adaptive policy remains inspectable and deterministic as the routing
ledger evolves.

Ploop-Iter: 1
Prevent routing-policy learning from over-crediting skills across unrelated verification stories or routes. This keeps adaptive ranking tied to the active verification thread so future injections learn from relevant evidence instead of noisy session-wide matches.

Harden session-scoped exposure storage so unusual session identifiers cannot leak into tmp filenames, and keep generated artifacts free of stray orphan test fixtures that would pollute validation and manifest state.

Ploop-Iter: 2
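A sketch of one way to keep unusual session identifiers out of tmp filenames. The escaping scheme and the `exposures-….jsonl` name are hypothetical; the PR does not show the actual hardening.

```typescript
// Escape every character outside [A-Za-z0-9-] as _<hex code unit>_.
// Underscore itself is escaped, so distinct ids always map to distinct segments.
function safeSessionSegment(sessionId: string): string {
  const escaped = sessionId.replace(
    /[^A-Za-z0-9-]/g,
    (ch) => "_" + ch.charCodeAt(0).toString(16) + "_",
  );
  return escaped.length > 0 ? escaped : "_empty_";
}

// Hypothetical file name for a session-scoped exposure log.
function exposureFileName(sessionId: string): string {
  return `exposures-${safeSessionSegment(sessionId)}.jsonl`;
}
```

Because path separators and dots are escaped, an identifier such as `../x` cannot climb out of the tmp directory.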
Automated checkpoint commit.

Ploop-Iter: 3
Make routing learning trustworthy enough to drive policy updates from replayed evidence instead of loosely correlated observations.

This keeps route and story attribution honest, preserves enough trace detail to reconstruct routing decisions deterministically, and enables bounded policy tuning from session outcomes without teaching on ambiguous signals.

Ploop-Iter: 4
Move verification handoff into a shared directive contract so top-level hooks and subagents resolve the same story, route, and next action across tool boundaries.

Export deterministic verification env clearing and route-aware fallback behavior to prevent stale state from leaking between calls and to let PostToolUse close policy exposures even when command inference is incomplete.

Add focused tests around banner export and directive-win closure so the verification loop stays stable as routing policy logic evolves.

Ploop-Iter: 1
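A sketch of a shared directive contract with deterministic clearing. The variable names `VERIFY_STORY`, `VERIFY_ROUTE`, and `VERIFY_ACTION` are hypothetical; the PR only says directives travel as environment variables.

```typescript
interface VerificationDirective {
  story: string;
  route: string;
  action: string;
}

type Env = Record<string, string | undefined>;

// Hypothetical variable names, not taken from the PR.
const DIRECTIVE_KEYS = {
  story: "VERIFY_STORY",
  route: "VERIFY_ROUTE",
  action: "VERIFY_ACTION",
} as const;

// Always clear the directive keys first, so a stale directive from a previous
// call cannot leak through when the new directive is null.
function exportDirective(base: Env, directive: VerificationDirective | null): Env {
  const env: Env = { ...base };
  delete env[DIRECTIVE_KEYS.story];
  delete env[DIRECTIVE_KEYS.route];
  delete env[DIRECTIVE_KEYS.action];
  if (directive !== null) {
    env[DIRECTIVE_KEYS.story] = directive.story;
    env[DIRECTIVE_KEYS.route] = directive.route;
    env[DIRECTIVE_KEYS.action] = directive.action;
  }
  return env;
}

// A partially populated directive is treated as absent.
function readDirective(env: Env): VerificationDirective | null {
  const story = env[DIRECTIVE_KEYS.story];
  const route = env[DIRECTIVE_KEYS.route];
  const action = env[DIRECTIVE_KEYS.action];
  if (!story || !route || !action) return null;
  return { story, route, action };
}
```

Writing and reading through the same key table is what lets top-level hooks and subagents resolve the same story, route, and next action.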
Keep runtime routing evidence trustworthy by preventing test-only skills
from leaking into the generated manifest and by locking the
verification directive handoff to a stable env contract.

These checks reduce false routing attribution, catch manifest/live-scan
drift before it reaches hook behavior, and make end-to-end verification
resolution deterministic across SubagentStart and PostToolUse.

Ploop-Iter: 2
Align control-plane diagnostics with manifest generation so operators and
agents see the same runtime truth instead of divergent live-scan results.
This preserves trust in routing and doctor output when fixture skills exist
and gives one place to inspect session state during verification work.

Harden the new snapshot path so broken generated state degrades into
actionable diagnostics instead of a command crash, which keeps debugging
workflows usable when manifests drift or become malformed.

Ploop-Iter: 3
Route-aware recall lets the injector reuse historically verified winners when pattern matching misses them, improving skill selection without weakening strict story and route attribution.

Persisting exact-route, wildcard, and legacy policy buckets preserves backward compatibility while creating enough evidence to boost and recall skills conservatively across similar flows.

Ploop-Iter: 1
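A sketch of a most-specific-first lookup across the three bucket kinds. The `skill|story|route` key format is an assumption made for illustration, not the stored format.

```typescript
interface PolicyStats {
  exposures: number;
  wins: number;
}

type BucketKind = "exact" | "wildcard" | "legacy";

interface PolicyHit {
  kind: BucketKind;
  key: string;
  stats: PolicyStats;
}

// Try the exact-route bucket, then the story-wide wildcard, then the legacy
// bucket that predates route scoping. The first bucket with data wins.
function lookupPolicy(
  store: ReadonlyMap<string, PolicyStats>,
  skill: string,
  story: string,
  route: string,
): PolicyHit | null {
  const candidates: Array<{ kind: BucketKind; key: string }> = [
    { kind: "exact", key: `${skill}|${story}|${route}` },
    { kind: "wildcard", key: `${skill}|${story}|*` },
    { kind: "legacy", key: skill },
  ];
  for (const candidate of candidates) {
    const stats = store.get(candidate.key);
    if (stats !== undefined) return { ...candidate, stats };
  }
  return null;
}
```

Returning the bucket kind alongside the stats lets diagnostics report which level of evidence drove a recall.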
Keep historically verified recall from overriding stronger live matches while preserving traceability and repeatable rebuild behavior.

Restore the excluded test-skill provenance so manifest-backed diagnostics and session explain output continue to reflect the repo's exclusion policy instead of silently drifting after regeneration.

Ploop-Iter: 2
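A sketch of merging recalled skills without displacing live matches. The `Candidate` shape and the single `recallPriority` value are assumptions; the real precedence rules are not shown in the PR.

```typescript
interface Candidate {
  skill: string;
  priority: number;
  source: "pattern" | "recall";
}

// Recalled skills are only added when no live match already covers them,
// and they enter the ranking at recallPriority. On equal priority the live
// match stays ahead because it comes first in the input and the sort is stable.
function mergeRecall(
  live: readonly Candidate[],
  recalledSkills: readonly string[],
  recallPriority: number,
): Candidate[] {
  const liveSkills = new Set(live.map((candidate) => candidate.skill));
  const recalled = recalledSkills
    .filter((skill) => !liveSkills.has(skill))
    .map((skill): Candidate => ({ skill, priority: recallPriority, source: "recall" }));
  return [...live, ...recalled].sort((a, b) => b.priority - a.priority);
}
```

Keeping the `source` field on each candidate preserves traceability: a trace can show whether a skill arrived by pattern match or by recall.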
Expose deterministic routing-diagnosis data at the decision edge so
operators and downstream agents can understand why route-scoped
recall did or did not fire without changing routing behavior.

Extend session-explain with the same additive diagnosis surface to
make recent routing outcomes inspectable in CI and local debugging,
which reduces guesswork when policy history, precedence, or missing
signal affect recall.

Ploop-Iter: 3
Keep verification evidence and next-action planning isolated per story so
multi-route sessions do not contaminate each other.

Align downstream consumers with the active-story projection and make
session auto-detection prefer the freshest ledger activity so routing
policy recall, directives, and CLI output stay coherent.

Ploop-Iter: 1
Prevent co-injected helper skills from distorting long-term routing policy so policy learning stays tied to the skill that actually drove an injection. This keeps verification outcomes and replay data fully observable while making future ranking decisions more causally accurate and stable.

Also surface manifest exclusion parity and directive-env fallback details in session diagnostics so operators can understand why a skill is absent and why a verification closure did or did not resolve.

Ploop-Iter: 2
Automated checkpoint commit.

Ploop-Iter: 3
Prompt-time routing only becomes useful training data when it can be
resolved against a concrete verification boundary. Binding prompt
decisions to the plan's predicted boundary avoids stale exposures and
prevents policy learning from synthetic none|none scenarios.

Ploop-Iter: 4
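A sketch of binding a prompt-time decision to the plan's predicted boundary. The type and field names are hypothetical; the point is only that an exposure is not recorded when there is nothing concrete to resolve it against.

```typescript
type Boundary = "uiRender" | "clientRequest" | "serverHandler" | "environment";

// Hypothetical view of the current verification plan.
interface VerificationPlan {
  story: string | null;
  route: string | null;
  predictedBoundary: Boundary | null;
}

interface Exposure {
  skill: string;
  story: string;
  route: string;
  boundary: Boundary;
}

// Record an exposure only when story, route, and boundary are all known.
// Returning null avoids writing a synthetic none|none exposure that could
// never be closed by a later observation.
function bindPromptExposure(skill: string, plan: VerificationPlan | null): Exposure | null {
  if (plan === null) return null;
  const { story, route, predictedBoundary } = plan;
  if (story === null || route === null || predictedBoundary === null) return null;
  return { skill, story, route, boundary: predictedBoundary };
}
```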
Recover historically successful skills during prompt submission when direct prompt matching misses, so active verification flows can still surface the right help instead of failing closed on zero-match paths.

Expose the recalled path in doctor, trace, and attribution output so operators can understand why a skill was injected and trust the policy loop when it promotes a proven route-scoped winner.

Ploop-Iter: 5
@johnlindquist added the "experimental research" label (Experimental research and exploration) Mar 28, 2026
Enable routing improvements to be promoted from verified evidence instead of relying on manual policy tuning.

Deterministic replay gating keeps learned rules safe to adopt, and project-scoped session discovery prevents unrelated tmp artifacts from contaminating learning results across worktrees.

Ploop-Iter: 1
Keep routing-policy as an observational ledger so learned promotion runs do not fabricate wins, exposures, or demotions into the evidence store. This preserves ground truth for later analysis while still producing a deterministic promotion artifact that downstream tooling can inspect and apply safely.

Tighten replay gating around observed verification outcomes so pending placeholder traces from PreToolUse do not count as verified success. That keeps learn-mode promotion decisions aligned with the real verification lifecycle and prevents false positive promotions or regressions.

Ploop-Iter: 2
Promote replay-verified routing decisions into a canonical project-scoped rulebook so runtime ranking can apply stable learned guidance and decision capsules can expose provenance without re-deriving routing state.

Normalize demotion boosts to stored magnitudes so compiler-produced rulebooks preserve runtime precedence semantics and cannot invert a demotion into an accidental promotion.

Ploop-Iter: 3
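A sketch of why storing magnitudes matters. The `LearnedRule` shape is hypothetical; the idea is that direction comes from `action` alone, so a negative stored boost on a demotion cannot double-negate into a promotion.

```typescript
// Hypothetical rulebook entry.
interface LearnedRule {
  skill: string;
  action: "promote" | "demote";
  boost: number;
}

// Compiler side: always store the magnitude, never a signed value.
function normalizeRule(rule: LearnedRule): LearnedRule {
  return { ...rule, boost: Math.abs(rule.boost) };
}

// Runtime side: the sign is derived from the action, so the result is the
// same whether or not the stored boost was already normalized.
function applyRule(score: number, rule: LearnedRule): number {
  const magnitude = Math.abs(rule.boost);
  return rule.action === "promote" ? score + magnitude : score - magnitude;
}
```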
Verification guidance was too dependent on Bash-only evidence, which left useful observations from reads, grep, glob, and fetch operations invisible to the planner. Normalizing those signals lets the verification loop keep state from more of the agent's actual behavior.

The routing policy also needed stronger attribution discipline so soft evidence can inform plan progress without being over-credited as a successful verification outcome. Closing that gap protects route-scoped learning, prevents stale exposures from lingering silently, and gives the policy ledger cleaner data to learn from over time.

Ploop-Iter: 1
Prevent external fetch observations from training routing policy and keep verification attribution tied to the story that actually owns the observed route. This makes oracle feedback more trustworthy and keeps gate telemetry machine-readable for regression coverage.

Ploop-Iter: 2
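A sketch of a local provenance gate for fetch observations. Treating only loopback hosts as local is an assumption; the PR does not list which origins count as local.

```typescript
// Returns true only for loopback targets. Anything unparsable is rejected,
// so a malformed URL can never train the policy.
function isLocalFetch(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false;
  }
  const host = url.hostname;
  return (
    host === "localhost" ||
    host.endsWith(".localhost") ||
    host === "127.0.0.1" ||
    host === "[::1]" // URL.hostname keeps the brackets for IPv6 literals
  );
}
```

Matching on the parsed hostname rather than a string prefix means a host such as `localhost.evil.com` is not mistaken for a local target.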
Capture why PostToolUse verification observations do or do not close
routing policy so agents and developers can distinguish gate failures
from zero-match exposure misses without replaying ledger state by hand.

Persisting append-only closure capsules preserves negative-path receipts
and makes the verification loop auditable, deterministic to test, and
safer to evolve as routing resolution logic becomes more nuanced.

Ploop-Iter: 3
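A sketch of a closure capsule that records a reason on every path, including the ones where nothing closed. The reason names and shapes are hypothetical.

```typescript
type ClosureReason =
  | "closed"
  | "no-pending-exposure"
  | "story-mismatch"
  | "route-mismatch";

interface PendingExposure {
  skill: string;
  story: string;
  route: string;
}

interface VerifiedObservation {
  story: string;
  route: string;
}

interface ClosureCapsule {
  story: string;
  route: string;
  reason: ClosureReason;
  closedSkills: string[];
}

// Evaluate the gates in order and name the first one that failed.
function evaluateClosure(
  observation: VerifiedObservation,
  pending: readonly PendingExposure[],
): ClosureCapsule {
  const base = { story: observation.story, route: observation.route };
  if (pending.length === 0) {
    return { ...base, reason: "no-pending-exposure", closedSkills: [] };
  }
  const sameStory = pending.filter((e) => e.story === observation.story);
  if (sameStory.length === 0) {
    return { ...base, reason: "story-mismatch", closedSkills: [] };
  }
  const sameRoute = sameStory.filter((e) => e.route === observation.route);
  if (sameRoute.length === 0) {
    return { ...base, reason: "route-mismatch", closedSkills: [] };
  }
  return { ...base, reason: "closed", closedSkills: sameRoute.map((e) => e.skill) };
}

// Append-only receipt log: one JSON line per evaluation, negative paths included.
function appendCapsule(log: string[], capsule: ClosureCapsule): void {
  log.push(JSON.stringify(capsule));
}
```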
Improve routing by preserving companion skill pairs that consistently close verification gaps, so repeated scenarios can surface complementary skills automatically instead of relearning the pairing every session.

Keep companion evidence separate from single-skill policy credit and expose the learned artifact through the learn CLI and routing diagnostics so promotion decisions stay explainable, replayable, and safe to inspect.

Also record recalled companions explicitly in decision traces so routing-explain and other observability surfaces can identify companion injections reliably.

Ploop-Iter: 1
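A sketch of promoting companion pairs by how often they appear together in verified wins. The `minSupport` threshold and the input shape are assumptions; the PR does not give the real promotion criteria.

```typescript
// Each inner array lists the skills that were injected together when a
// verification gap was closed. Pairs seen in at least minSupport wins are
// returned, in first-seen order, with each pair sorted alphabetically.
function learnCompanions(
  verifiedWins: ReadonlyArray<readonly string[]>,
  minSupport = 3,
): Array<[string, string]> {
  const counts = new Map<string, { pair: [string, string]; count: number }>();
  for (const skills of verifiedWins) {
    const unique = Array.from(new Set(skills)).sort();
    unique.forEach((first, index) => {
      unique.slice(index + 1).forEach((second) => {
        const key = JSON.stringify([first, second]);
        const entry = counts.get(key) ?? { pair: [first, second], count: 0 };
        entry.count += 1;
        counts.set(key, entry);
      });
    });
  }
  const promoted: Array<[string, string]> = [];
  counts.forEach((entry) => {
    if (entry.count >= minSupport) promoted.push(entry.pair);
  });
  return promoted;
}
```

The counts live in their own map, separate from any single-skill statistics, which mirrors the commit's point about keeping companion evidence apart from single-skill policy credit.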
Expose companion recall in session diagnostics so operators can verify that verified-companion routing remains causal, synthetic, and subordinate to stronger direct or policy-driven matches.

Locking these expectations into regression coverage reduces the risk of attribution drift, hidden policy contamination, and debugging blind spots as routing logic evolves.

Ploop-Iter: 2
Persist first-class causal evidence in routing traces so diagnosis can explain why skills were matched, boosted, recalled, linked, or dropped without inferring intent from ranked order alone.

Teach session-explain to prefer explicit causes and edges so operator output stays correct as routing grows more synthetic and route-scoped, while preserving safe fallback behavior for older traces.

Ploop-Iter: 3
@Melkeydev (Collaborator) commented

Before reviewing: we need a better solution than only supporting Claude in this repository. With Cursor and other harnesses in use, we need a way to support agents generically, not just Claude.

Single-skill and pairwise memory were not preserving proven multi-step workflows, so repeated verification wins could not compound into reusable procedure. Persisting promoted playbooks and recalling them during injection lets the plugin reuse validated sequences, keeps learn output deterministic across JSON/text/write flows, and adds regression coverage so procedural memory stays accretive instead of speculative.

Ploop-Iter: 1
Keep verified playbook recall deterministic and credit-safe so learned
procedures improve guidance without distorting long-term routing policy.

Threading the playbook banner and reason metadata through the hook output
makes the contract inspectable, while forcing inserted steps to inherit the
anchor skill for exposure attribution prevents context helpers from stealing
policy wins or stale-miss credit from the originating skill.

Ploop-Iter: 2
Ensure verified playbook metadata only reflects steps that were actually
applied so banner output, exposure attribution, and causality stay
consistent with runtime behavior.

Add deterministic verification coverage for apply versus no-op paths and
align the focused tests with the stricter contract to prevent regressions.

Ploop-Iter: 3
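A sketch of the playbook contract described in the two commits above: inserted steps inherit the anchor skill for attribution, and the reported metadata lists only the steps that were actually applied. The shapes and skill names are hypothetical.

```typescript
interface Playbook {
  anchor: string; // skill whose presence triggers the playbook
  steps: readonly string[];
}

interface InjectedSkill {
  skill: string;
  attributedTo: string; // skill that receives policy credit for this injection
  reason: "direct" | "playbook-step";
}

interface PlaybookResult {
  skills: InjectedSkill[];
  appliedSteps: string[]; // only steps that were really inserted
}

function applyPlaybook(directSkills: readonly string[], playbook: Playbook): PlaybookResult {
  const skills = directSkills.map(
    (skill): InjectedSkill => ({ skill, attributedTo: skill, reason: "direct" }),
  );
  const appliedSteps: string[] = [];
  // No-op path: the anchor is not part of this injection.
  if (!skills.some((entry) => entry.skill === playbook.anchor)) {
    return { skills, appliedSteps };
  }
  for (const step of playbook.steps) {
    if (skills.some((entry) => entry.skill === step)) continue; // already injected
    // Inserted steps credit the anchor, not themselves.
    skills.push({ skill: step, attributedTo: playbook.anchor, reason: "playbook-step" });
    appliedSteps.push(step);
  }
  return { skills, appliedSteps };
}
```

An empty `appliedSteps` is the signal that no banner or playbook metadata should be emitted for that injection.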
@johnlindquist changed the title from "Verification autopilot and routing policy engine" to "DO NOT MERGE — Verification autopilot and routing policy engine (experimental)" Mar 30, 2026