Deterministic, language-agnostic propagate via inventory + translator + consensus #36

Open
panayotovk wants to merge 2 commits into juxt:main from panayotovk:feat/deterministic-propagate


@panayotovk

Follow-up to #35 (deterministic distill). This PR applies the same inventory + translator + consensus architecture to propagate, and adds a backend-dispatch layer so the methodology is language-agnostic. Two reference backends ship in this PR: pytest+hypothesis (Python) and jest+fastcheck (TypeScript). Adding a third backend is a documented exercise against skills/propagate/references/backend-authoring-guide.md and requires no changes to the translator core.

Why

propagate today is fully LLM-mediated and inherits all of pre-deterministic distill's variance, with worse consequences — its output (test files) gets committed to source and re-run forever. Every run produces a different test suite, compounding CI churn, review noise, and reviewer fatigue.

Pipeline

allium plan / allium model              (deterministic external inputs)
   ↓
K subagents (Agent tool, default K=3)
   ↓  each produces obligation-bridge-i.json
scripts/canonicalize-obligations.mjs    (per inventory — multiset validation against `allium plan`)
   ↓
scripts/merge-obligations.mjs           (K-vote consensus; bridge ambiguity surfaced as low-confidence stubs)
   ↓
scripts/obligations-to-tests.mjs        (translator core + per-backend manifest/templates)
   ↓ N test files
scripts/run-suite.mjs                   (Stage C: runs backend's runner; produces propagation-report.md)

The translator is byte-deterministic given fixed inputs (verified during development by repeatedly re-translating the same inputs and comparing the output). LLM judgement enters only at Stage A; everything downstream is pure functions over JSON.
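A re-translation check of the kind described above could look roughly like the sketch below. It is not the code in scripts/obligations-to-tests.mjs: `translate`, `toBytes`, `isByteDeterministic`, the inventory shape and the `deny_claim` symbol are all hypothetical stand-ins, used only to show how "same input, same bytes" might be asserted.

```javascript
// Hypothetical stand-in for the translator: a pure function from a merged
// inventory (plain JSON) to a map of file name -> file contents.
function translate(inventory) {
  const files = {};
  // Sort by id so the output never depends on the input array order.
  const obligations = [...inventory.obligations].sort((a, b) =>
    a.id < b.id ? -1 : a.id > b.id ? 1 : 0
  );
  for (const ob of obligations) {
    files[`test_${ob.id}.py`] = `# obligation: ${ob.id}\n# witness: ${ob.witness}\n`;
  }
  return files;
}

// Flatten a file map into one buffer, in sorted file-name order, so two
// runs can be compared byte for byte.
function toBytes(files) {
  const names = Object.keys(files).sort();
  return Buffer.from(names.map((n) => `${n}\0${files[n]}`).join("\0"), "utf8");
}

// Re-run the translator several times on the same input and compare bytes.
function isByteDeterministic(translateFn, inventory, runs = 5) {
  const first = toBytes(translateFn(inventory));
  for (let i = 1; i < runs; i++) {
    if (!first.equals(toBytes(translateFn(inventory)))) return false;
  }
  return true;
}

// Illustrative inventory; ids and the second witness are made up.
const inventory = {
  obligations: [
    { id: "rule_b", witness: "app/services.py::approve_claim" },
    { id: "rule_a", witness: "app/services.py::deny_claim" },
  ],
};

console.log(isByteDeterministic(translate, inventory));
```

A check like this only shows repeatability on the inputs tried; it is evidence for determinism rather than a proof of it.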

A/B results (real K=3 LLM runs, both fixtures)

| Metric | Python: insurance-claims (Baseline → Experimental) | TypeScript: build-pipeline (Baseline → Experimental) |
| --- | --- | --- |
| Test files per run | 10 / 10 / 11 → 46 / 46 / 46 | 9 / 11 / 9 → 31 / 31 / 31 |
| 3-way file-name set agreement | 6 of 15 (40%) → 46 of 46 (100%) | 3 of 21 (14%) → 31 of 31 (100%) |
| Byte-identity of in-common files (per pair) | 0 of 6 in-common → 27 / 46 (59%) | 0 of 3 in-common → avg 73% |
| Stage C report present | 0 / 3 → 3 / 3 | 0 / 3 → 3 / 3 |
| Wall-clock per sample | 577 s → 517 s | 758 s → 554 s |

Baseline samples produced 15 unique file names across 3 runs against insurance-claims, with only 6 appearing in all three. Even files with matching names differed by 50–100% in size (e.g. test_rules.py is 10.5 KB vs 17.3 KB across samples). On build-pipeline the agreement dropped to 3 of 21 unique names, with 0 byte-identical matches.

Experimental produces the same file set every run (100% set agreement on both fixtures). Byte-identity across two independent K=3 runs reflects pure consensus variance: 73% on TS where all three runs picked the same code_root framing; 59% on Python where one run picked a different framing (now fixed via a SKILL.md tightening that mechanically locks code_root).
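For readers checking the set-agreement figures: they read as "names present in all three runs" over "unique names across the three runs", so 6 of 15 is 40% and 3 of 21 is about 14%. The sketch below shows that calculation; it is not the eval harness's code, and every file name other than test_rules.py is invented for illustration.

```javascript
// Set agreement across runs: names present in every run, divided by the
// number of unique names seen in any run.
function setAgreement(runs) {
  const union = new Set(runs.flat());
  const common = [...union].filter((name) =>
    runs.every((run) => run.includes(name))
  );
  return {
    common: common.length,
    unique: union.size,
    ratio: common.length / union.size,
  };
}

// Toy example with made-up file names: 2 shared names out of 5 unique.
const runs = [
  ["test_rules.py", "test_entities.py", "test_claims.py"],
  ["test_rules.py", "test_entities.py", "test_state.py"],
  ["test_rules.py", "test_entities.py", "test_invariants.py"],
];

const result = setAgreement(runs);
console.log(`${result.common} of ${result.unique} (${Math.round(result.ratio * 100)}%)`);
```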

How bridge ambiguity is surfaced

Where the K-vote can't converge on a single witnessing symbol, the merged inventory keeps the candidates and the translator emits a backend-idiomatic stub:

  • pytest: pytest.skip("bridge-unresolved") with the candidates in the docstring.
  • jest: test.skip("name [bridge-unresolved]", () => { ... }) with the candidates in a comment block.

Reviewers see ambiguity, not silence. On the two real A/B runs, 3 of 96 obligations on insurance-claims and 4 of 63 on build-pipeline landed in bridge-unresolved — all genuinely-ambiguous cases (an invariant enforced across multiple rules, a Routes surface aggregating many handlers, etc.).
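As a rough illustration of the behaviour described in this section, the sketch below votes on a witness symbol and falls back to a jest-style stub when there is no winner. It is not the code in scripts/merge-obligations.mjs or the jest template: the strict-majority threshold, the function names `voteWitness` and `renderJestStub`, the result shape and the example symbols are all assumptions.

```javascript
// Modal vote over the witness each subagent proposed. A strict majority is
// assumed here as the convergence rule; the real merger may differ.
function voteWitness(votes) {
  if (votes.length === 0) throw new Error("no votes to merge");
  const counts = new Map();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  // Rank by count (descending), then lexicographically for stable output.
  const ranked = [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0)
  );
  const [top, topCount] = ranked[0];
  if (topCount * 2 > votes.length) {
    return { status: "resolved", witness: top };
  }
  // No convergence: keep every candidate so the reviewer can see them.
  return { status: "bridge-unresolved", candidates: ranked.map(([sym]) => sym) };
}

// Render a skipped jest test that lists the candidates in a comment block.
function renderJestStub(name, candidates) {
  return [
    `test.skip(${JSON.stringify(`${name} [bridge-unresolved]`)}, () => {`,
    "  // Candidate witnesses (K-vote did not converge):",
    ...candidates.map((c) => `  //   - ${c}`),
    "});",
    "",
  ].join("\n");
}

// K=3 with three different answers: surfaced as a stub, not silently dropped.
const merged = voteWitness([
  "src/routes.ts::handleBuild",
  "src/routes.ts::handleDeploy",
  "src/routes.ts::handleStatus",
]);
if (merged.status === "bridge-unresolved") {
  console.log(renderJestStub("routes surface", merged.candidates));
}
```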

Wrong bridges that K-vote can't fix (all subagents agree on the wrong primary) are caught downstream by the type checker / test runner — e.g. on build-pipeline, all three TS subagents agreed ReceiveGithubPushEvent lives in src/routes.ts when it's actually in src/webhooks.ts; tsc flags it as a missing export. The safety net works as designed.

Adding a new backend

A backend is skills/propagate/backends/<id>/ with:

  • manifest.json — language, file extension, runner command, report format, per-test_kind imports, named bridge_import.transform.
  • name-policy.json — casing rules and file/test name patterns.
  • conventions.md — Stage-A subagent guidance for the symbol convention.
  • templates/ — six placeholder-driven templates.

The translator has a small registry of named bridge_import transforms (python_module, typescript_relative, noop); adding proptest+cargo-test, rapid+go-test, etc. is a manifest + name-policy + 6 templates plus one transform entry. No edits to the translator core, the canonicaliser, or the merger are required.
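To make the registry idea concrete, here is a minimal sketch of what named bridge_import transforms might look like. Only the transform names and the two example outputs (`from app.services import approve_claim` and `"../src/services/claim"`) come from this PR; the function signatures, the `testDir` context field, the exact TypeScript import form, the `approveClaim` symbol and the behaviour of `noop` are assumptions.

```javascript
// Split "<path>::<symbol>" into its two halves.
function splitBridge(bridge) {
  const idx = bridge.lastIndexOf("::");
  if (idx < 0) throw new Error(`not a <path>::<symbol> bridge: ${bridge}`);
  return { path: bridge.slice(0, idx), symbol: bridge.slice(idx + 2) };
}

// Minimal POSIX-style relative path from a directory to a target path.
function relativePosix(fromDir, toPath) {
  const from = fromDir.split("/").filter(Boolean);
  const to = toPath.split("/").filter(Boolean);
  let i = 0;
  while (i < from.length && i < to.length && from[i] === to[i]) i++;
  const rel = [...from.slice(i).map(() => ".."), ...to.slice(i)].join("/");
  return rel.startsWith(".") ? rel : `./${rel}`;
}

// Hypothetical registry keyed by the transform name a manifest declares.
const BRIDGE_IMPORT_TRANSFORMS = {
  // app/services.py::approve_claim -> from app.services import approve_claim
  python_module(bridge) {
    const { path, symbol } = splitBridge(bridge);
    const moduleName = path.replace(/\.py$/, "").split("/").join(".");
    return `from ${moduleName} import ${symbol}`;
  },
  // Rewrites <path> relative to the test file's directory.
  typescript_relative(bridge, ctx) {
    const { path, symbol } = splitBridge(bridge);
    const target = relativePosix(ctx.testDir, path.replace(/\.tsx?$/, ""));
    return `import { ${symbol} } from "${target}";`;
  },
  // Assumed to emit no import line at all.
  noop() {
    return "";
  },
};

function applyTransform(name, bridge, ctx = {}) {
  const transform = BRIDGE_IMPORT_TRANSFORMS[name];
  if (!transform) throw new Error(`unknown bridge_import transform: ${name}`);
  return transform(bridge, ctx);
}

console.log(applyTransform("python_module", "app/services.py::approve_claim"));
console.log(
  applyTransform("typescript_relative", "src/services/claim.ts::approveClaim", {
    testDir: "tests",
  })
);
```

Under this shape, a new backend that needs a different import idiom would add one entry to the registry and name it in its manifest, leaving the dispatch function untouched.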

See skills/propagate/references/backend-authoring-guide.md for the full contract.

Reproducing the A/B numbers

(Eval harness scripts live in the sandbox repo, not in this plugin. They're available if helpful; see referenced commit messages.)

# Within the sandbox repo:
node eval/run-propagate.mjs --variants baseline,experimental \
    --samples 3 --fixtures insurance-claims,build-pipeline \
    --backends pytest+hypothesis,jest+fastcheck --parallel
node eval/compare-propagate.mjs <results-dir>

Known caveats

  • Cross-orchestration byte-identity is 59–73%, not 100%. This is genuine K=3 consensus variance on medium-confidence bridges; the file set is 100% deterministic and the pipeline (canonicalize → merge → translate) is byte-identical given fixed inputs. A higher K (e.g. K=5) should narrow this further; reviewers may prefer that as a default.
  • Stage C for TS is partial in our test env (jest is not installed, so npx jest cannot run); reports are generated but don't include runtime outcomes. tsc --noEmit was used as a compile-time check on the hand-validated subset and passes after a one-line `: unknown` annotation fix in the PBT template.
  • allium plan emits duplicate obligation_ids when the spec has overloaded rules (e.g. build-pipeline's two rule ReceiveGithubPushEvent declarations). The canonicaliser handles this via deterministic disambiguation (__1/__2 suffixes); a cleaner fix lives in allium plan itself but isn't blocking.
  • Two backends, not three. Rust/Go/Elixir are explicit non-goals here — the v1 deliverable is the methodology and one extra-language proof point. The interface contract is what should be reviewed; new backends are documented exercises.
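The duplicate-ID caveat above combines two ideas: multiset validation against `allium plan` and deterministic suffixing. The sketch below shows one way they could fit together. It is not the code in scripts/canonicalize-obligations.mjs; the function names, the ID format, and the assumption that suffixes follow the order in which the plan lists the IDs are all illustrative.

```javascript
// Count occurrences of each id (a multiset as a Map).
function countIds(ids) {
  const counts = new Map();
  for (const id of ids) counts.set(id, (counts.get(id) ?? 0) + 1);
  return counts;
}

// Multiset validation: every id must occur in the inventory exactly as many
// times as in the plan. Returns the mismatches; an empty list means valid.
function multisetDiff(planIds, inventoryIds) {
  const plan = countIds(planIds);
  const inv = countIds(inventoryIds);
  const problems = [];
  for (const id of [...new Set([...plan.keys(), ...inv.keys()])].sort()) {
    const expected = plan.get(id) ?? 0;
    const actual = inv.get(id) ?? 0;
    if (expected !== actual) problems.push({ id, expected, actual });
  }
  return problems;
}

// Deterministic disambiguation: ids that occur once are left alone; repeated
// ids get __1, __2, ... in input order (assumed stable across runs).
function disambiguate(ids) {
  const totals = countIds(ids);
  const seen = new Map();
  return ids.map((id) => {
    if (totals.get(id) === 1) return id;
    const n = (seen.get(id) ?? 0) + 1;
    seen.set(id, n);
    return `${id}__${n}`;
  });
}

// Made-up ids modelled on the overloaded ReceiveGithubPushEvent rule.
const planIds = [
  "rule.ReceiveGithubPushEvent",
  "rule.StartBuild",
  "rule.ReceiveGithubPushEvent",
];

console.log(multisetDiff(planIds, ["rule.StartBuild", "rule.ReceiveGithubPushEvent"]));
console.log(disambiguate(planIds));
```

Comparing multisets rather than sets matters here: an inventory that mentions the overloaded rule only once would pass a plain set comparison but is reported as a mismatch (expected 2, actual 1) by the count-based check.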

Files changed (25 files, +2812 / -180)

Two commits — review them independently if it helps:

  1. propagate: byte-deterministic pipeline (schema, scripts, orchestrator) — 4 scripts, the rewritten SKILL.md, and two reference docs.
  2. propagate: pytest+hypothesis and jest+fastcheck reference backends — two backend directories with their templates.

🤖 Generated with Claude Code

Yavor Panayotov added 2 commits May 17, 2026 21:26

propagate: byte-deterministic pipeline (schema, scripts, orchestrator)

Rewrites the propagate skill to use the same inventory + translator +
consensus architecture as distill: K subagents produce structured
obligation-bridge inventories, language-agnostic scripts canonicalise,
merge by K-vote, and dispatch to a per-language backend (manifest +
name-policy + templates) which is loaded from the skill's backends/
directory.

The translator is byte-deterministic given fixed inputs. Bridge
ambiguity (where K subagents cannot converge on a single witness) is
surfaced as low-confidence stubs with candidate symbols, not silenced.

Stage C runs the backend's runner command (e.g. pytest, jest) and
emits a categorised propagation-report.md (pass / fail / error /
bridge-unresolved / infrastructure-gap).

Adds:
  - scripts/canonicalize-obligations.mjs  (multiset validation against
    allium plan; deterministic disambiguation of duplicate obligation
    IDs from overloaded spec rules)
  - scripts/merge-obligations.mjs         (K-vote consensus; per-field
    modal voting; bridge-ambiguity surfaced as low confidence)
  - scripts/obligations-to-tests.mjs      (translator core + named
    bridge_import transforms + 4-construct template renderer)
  - scripts/run-suite.mjs                 (Stage C with pluggable
    per-format adapters)
  - skills/propagate/SKILL.md             (rewritten as orchestrator;
    code_root and spec_path are mechanically locked from the user
    invocation so two runs on the same project use identical framing)
  - skills/propagate/references/obligation-bridge-schema.md
  - skills/propagate/references/backend-authoring-guide.md

propagate: pytest+hypothesis and jest+fastcheck reference backends

Two reference backends prove the dispatcher works across (language ×
test framework × PBT framework) combinations. Each backend is a
self-contained directory under skills/propagate/backends/<id>/ with
no translator-side code changes required to add it.

Each backend consists of:
  - manifest.json   declares language, file extension, runner command,
                    report format, imports lists per test_kind, and the
                    named bridge_import transform that turns
                    <path>::<symbol> into an idiomatic import line.
  - name-policy.json declares casing rules and the file/test name
                    patterns the canonicaliser applies.
  - conventions.md  human-readable guidance for Stage A subagents on
                    how to populate the bridge field for this language.
  - templates/      six placeholder-driven templates (test-file,
                    assertion, pbt-property, state-machine,
                    stub-unresolved, fixture).

pytest+hypothesis:
  - python_module bridge import (app/services.py::approve_claim ->
    "from app.services import approve_claim")
  - conftest fixture style
  - pytest-junitxml runner adapter

jest+fastcheck:
  - typescript_relative bridge import (rewrites <path> relative to the
    test file's location, e.g. "../src/services/claim")
  - in-file fixture style (factories declared next to tests)
  - jest-json runner adapter

A third backend is the documented exercise in
references/backend-authoring-guide.md (no translator changes
required).