Deterministic, language-agnostic propagate via inventory + translator + consensus by panayotovk · Pull Request #36 · juxt/allium

panayotovk · 2026-05-17T18:30:19Z

Follow-up to #35 (deterministic distill). Same inventory + translator + consensus architecture, applied to propagate, with a backend-dispatch layer so the methodology lands language-agnostically. Two reference backends ship in this PR: pytest+hypothesis (Python) and jest+fastcheck (TypeScript). Adding a third backend is a documented exercise against skills/propagate/references/backend-authoring-guide.md with no translator changes required.

Why

propagate today is fully LLM-mediated and inherits all of pre-deterministic distill's variance, with worse consequences — its output (test files) gets committed to source and re-run forever. Every run produces a different test suite, compounding CI churn, review noise, and reviewer fatigue.

Pipeline

allium plan / allium model              (deterministic external inputs)
   ↓
K subagents (Agent tool, default K=3)
   ↓  each produces obligation-bridge-i.json
scripts/canonicalize-obligations.mjs    (per inventory — multiset validation against `allium plan`)
   ↓
scripts/merge-obligations.mjs           (K-vote consensus; bridge ambiguity surfaced as low-confidence stubs)
   ↓
scripts/obligations-to-tests.mjs        (translator core + per-backend manifest/templates)
   ↓ N test files
scripts/run-suite.mjs                   (Stage C: runs backend's runner; produces propagation-report.md)

The translator is byte-deterministic given fixed inputs (checked by repeatedly re-translating the same inventory during development and comparing the output bytes). LLM judgement enters only at Stage A; everything downstream is pure functions over JSON.
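One common ingredient of byte-determinism over JSON is a serialiser that does not depend on key insertion order. The sketch below is illustrative only: `stableStringify` is a hypothetical helper, not a function from this PR, and the record shape (`obligation_id`, `bridge.path`, `bridge.symbol`) is assumed from the names used in this description.

```javascript
// Hypothetical helper (not from the PR): serialise plain JSON data with
// object keys sorted recursively, so logically equal values yield equal bytes.
// Assumes the input contains only JSON-compatible values.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    const body = keys.map((k) => JSON.stringify(k) + ":" + stableStringify(value[k]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Two records with the same content but different key order (assumed shape).
const recordA = {
  obligation_id: "rule.Approve",
  bridge: { symbol: "approve_claim", path: "app/services.py" },
};
const recordB = {
  bridge: { path: "app/services.py", symbol: "approve_claim" },
  obligation_id: "rule.Approve",
};

console.log(stableStringify(recordA) === stableStringify(recordB)); // true
```

Whether the actual scripts sort keys, or rely on constructing objects in a fixed order, is not stated here; either approach gives the same property as long as it is applied consistently.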

A/B results (real K=3 LLM runs, both fixtures)

| Metric | Python (`insurance-claims`), baseline → experimental | TypeScript (`build-pipeline`), baseline → experimental |
| --- | --- | --- |
| Test files per run | 10 / 10 / 11 → 46 / 46 / 46 | 9 / 11 / 9 → 31 / 31 / 31 |
| 3-way file-name set agreement | 6 of 15 (40%) → 46 of 46 (100%) | 3 of 21 (14%) → 31 of 31 (100%) |
| Byte-identity of in-common files (per pair) | 0 of 6 in-common → 27 / 46 (59%) | 0 of 3 in-common → avg 73% |
| Stage C report present | 0 / 3 → 3 / 3 | 0 / 3 → 3 / 3 |
| Wall-clock per sample | 577 s → 517 s | 758 s → 554 s |

Baseline samples produced 15 unique file names across 3 runs against insurance-claims, with only 6 appearing in all three. Even files with matching names differed by 50–100% in size (e.g. test_rules.py is 10.5 KB vs 17.3 KB across samples). On build-pipeline the agreement dropped to 3 of 21 unique names, with 0 byte-identical matches.
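The set-agreement figure is the number of file names present in every run divided by the number of distinct names across all runs. The sketch below reproduces the 6-of-15 arithmetic for the Python baseline. The file names are made up for illustration; only the counts (10 / 10 / 11 files, 15 unique, 6 shared) come from the table. `setAgreement` is a hypothetical helper and not necessarily how `eval/compare-propagate.mjs` computes it.

```javascript
// Hypothetical metric: names present in every run / names present in any run.
function setAgreement(runs) {
  const union = new Set(runs.flat());
  const inAll = [...union].filter((name) => runs.every((run) => run.includes(name)));
  return { inAll: inAll.length, unique: union.size, ratio: inAll.length / union.size };
}

// Placeholder file names; only the set sizes mirror the reported baseline.
const shared = Array.from({ length: 6 }, (_, i) => `test_shared_${i + 1}.py`);
const runA = [...shared, "x1.py", "x2.py", "x4.py", "x5.py"];          // 10 files
const runB = [...shared, "x1.py", "x3.py", "x6.py", "x7.py"];          // 10 files
const runC = [...shared, "x2.py", "x3.py", "x4.py", "x8.py", "x9.py"]; // 11 files

const agreement = setAgreement([runA, runB, runC]);
console.log(`${agreement.inAll} of ${agreement.unique} (${Math.round(agreement.ratio * 100)}%)`);
// prints "6 of 15 (40%)"
```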

Experimental produces the same file set every run (100% set agreement on both fixtures). Byte-identity across two independent K=3 runs reflects pure consensus variance: 73% on TS where all three runs picked the same code_root framing; 59% on Python where one run picked a different framing (now fixed via a SKILL.md tightening that mechanically locks code_root).

How bridge ambiguity is surfaced

Where the K-vote can't converge on a single witnessing symbol, the merged inventory keeps the candidates and the translator emits a backend-idiomatic stub:

- pytest: pytest.skip("bridge-unresolved") with the candidates in the docstring.
- jest: test.skip("name [bridge-unresolved]", () => { ... }) with the candidates in a comment block.
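The convergence rule above can be sketched as a vote over the `<path>::<symbol>` witnesses proposed by the K subagents. This is a minimal sketch under stated assumptions: the strict-majority threshold, the `confidence` ratio, and the name `mergeBridge` are all hypothetical, since this description does not spell out the exact rule used by `scripts/merge-obligations.mjs`.

```javascript
// Hypothetical merge rule (assumption): a bridge is resolved only when a
// strict majority of subagents named the same "<path>::<symbol>" witness.
function mergeBridge(votes) {
  if (votes.length === 0) throw new Error("no votes to merge");
  const counts = new Map();
  for (const vote of votes) counts.set(vote, (counts.get(vote) ?? 0) + 1);

  // Rank by vote count, then lexicographically, so ties order deterministically.
  const ranked = [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0)
  );
  const [top, topCount] = ranked[0];

  if (topCount * 2 > votes.length) {
    return { status: "resolved", bridge: top, confidence: topCount / votes.length };
  }
  // No majority: keep every candidate so the translator can emit a stub.
  return { status: "bridge-unresolved", candidates: ranked.map(([bridge]) => bridge) };
}

// Two of three subagents agree, so the bridge resolves.
const agreed = mergeBridge([
  "app/services.py::approve_claim",
  "app/rules.py::approve",
  "app/services.py::approve_claim",
]);

// Three different witnesses: surfaced as unresolved with all candidates kept.
const split = mergeBridge(["c.py::z", "a.py::x", "b.py::y"]);

console.log(agreed.status, split.status);
```

Note that a rule like this cannot detect a unanimous but wrong vote; that case is left to the type checker or test runner.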

Reviewers see ambiguity, not silence. On the two real A/B runs, 3 of 96 obligations on insurance-claims and 4 of 63 on build-pipeline landed in bridge-unresolved. All were genuinely ambiguous cases (an invariant enforced across multiple rules, a Routes surface aggregating many handlers, etc.).

Wrong bridges that K-vote can't fix (all subagents agree on the wrong primary) are caught downstream by the type checker / test runner — e.g. on build-pipeline, all three TS subagents agreed ReceiveGithubPushEvent lives in src/routes.ts when it's actually in src/webhooks.ts; tsc flags it as a missing export. The safety net works as designed.

Adding a new backend

A backend is skills/propagate/backends/<id>/ with:

- manifest.json: language, file extension, runner command, report format, per-test_kind imports, named bridge_import.transform.
- name-policy.json: casing rules and file/test name patterns.
- conventions.md: Stage-A subagent guidance for the symbol convention.
- templates/: six placeholder-driven templates.

The translator has a small registry of named bridge_import transforms (python_module, typescript_relative, noop); adding proptest+cargo-test, rapid+go-test, etc. is a manifest + name-policy + 6 templates plus one transform entry. No edits to the translator core, the canonicaliser, or the merger are required.

See skills/propagate/references/backend-authoring-guide.md for the full contract.
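To make the transform registry concrete, here is a minimal sketch of how named transforms could turn a `<path>::<symbol>` bridge into an import line. The transform names (`python_module`, `typescript_relative`, `noop`) and the two example outputs come from this PR; the function names, the registry shape, the exact TypeScript import syntax, and the empty string returned by `noop` are assumptions.

```javascript
// Split "<path>::<symbol>" into its two parts.
function splitBridge(bridge) {
  const idx = bridge.lastIndexOf("::");
  if (idx < 0) throw new Error(`malformed bridge: ${bridge}`);
  return { path: bridge.slice(0, idx), symbol: bridge.slice(idx + 2) };
}

// Module specifier for toFile as seen from fromFile's directory (POSIX-style paths).
function relativeModule(fromFile, toFile) {
  const fromDir = fromFile.split("/").slice(0, -1);
  const to = toFile.replace(/\.tsx?$/, "").split("/");
  let shared = 0;
  while (shared < fromDir.length && shared < to.length - 1 && fromDir[shared] === to[shared]) {
    shared++;
  }
  const ups = fromDir.length - shared;
  return (ups === 0 ? "./" : "../".repeat(ups)) + to.slice(shared).join("/");
}

// Hypothetical registry keyed by the transform names mentioned in the PR.
const transforms = new Map([
  ["python_module", ({ path, symbol }) =>
    `from ${path.replace(/\.py$/, "").split("/").join(".")} import ${symbol}`],
  ["typescript_relative", ({ path, symbol }, testFile) =>
    `import { ${symbol} } from "${relativeModule(testFile, path)}";`],
  ["noop", () => ""], // assumption: emits no import line
]);

function bridgeImport(name, bridge, testFile) {
  const transform = transforms.get(name);
  if (!transform) throw new Error(`unknown bridge_import transform: ${name}`);
  return transform(splitBridge(bridge), testFile);
}

console.log(bridgeImport("python_module", "app/services.py::approve_claim", "tests/test_claims.py"));
// from app.services import approve_claim
console.log(bridgeImport("typescript_relative", "src/services/claim.ts::approveClaim", "tests/claim.test.ts"));
// import { approveClaim } from "../src/services/claim";
```

Under this shape, a new language would contribute one more entry to the map while the dispatch function stays untouched, which matches the "one transform entry" cost described above.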

Reproducing the A/B numbers

(Eval harness scripts live in the sandbox repo, not in this plugin. They're available if helpful; see referenced commit messages.)

# Within the sandbox repo:
node eval/run-propagate.mjs --variants baseline,experimental \
    --samples 3 --fixtures insurance-claims,build-pipeline \
    --backends pytest+hypothesis,jest+fastcheck --parallel
node eval/compare-propagate.mjs <results-dir>

Known caveats

- Cross-orchestration byte-identity is 59–73%, not 100%. This is genuine K=3 consensus variance on medium-confidence bridges; the file set is 100% deterministic and the pipeline (canonicalize → merge → translate) is byte-identical given fixed inputs. A higher K (for example K=5) should narrow this further; reviewers may prefer that as a default.
- Stage C for TS is partial in our test env (jest is not installed there); reports are generated but don't include runtime outcomes. tsc --noEmit was used as a compile-time check on the hand-validated subset and passes after a one-line `: unknown` annotation fix in the PBT template.
- allium plan emits duplicate obligation_ids when the spec has overloaded rules (e.g. build-pipeline's two rule ReceiveGithubPushEvent declarations). The canonicaliser handles this via deterministic disambiguation (__1/__2 suffixes); a cleaner fix belongs in allium plan itself but isn't blocking.
- Two backends, not three. Rust/Go/Elixir are explicit non-goals here: the v1 deliverable is the methodology and one extra-language proof point. The interface contract is what should be reviewed; new backends are documented exercises.
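The duplicate-ID caveat and the multiset validation step can be sketched together. This is an illustration under assumptions: the function names, the ID format (`rule.<Name>`), and the choice to suffix by order of appearance are hypothetical. Only the `__1`/`__2` suffix form and the idea of checking an inventory against `allium plan` as a multiset come from this PR.

```javascript
// Count occurrences of each ID (a multiset represented as a Map).
function countIds(ids) {
  const counts = new Map();
  for (const id of ids) counts.set(id, (counts.get(id) ?? 0) + 1);
  return counts;
}

// True when both lists hold the same IDs with the same multiplicities.
function sameMultiset(planIds, inventoryIds) {
  const plan = countIds(planIds);
  const inventory = countIds(inventoryIds);
  if (plan.size !== inventory.size) return false;
  for (const [id, n] of plan) {
    if (inventory.get(id) !== n) return false;
  }
  return true;
}

// Suffix duplicated IDs in order of appearance; unique IDs stay unchanged.
// Deterministic as long as the input order is stable (assumption).
function disambiguate(ids) {
  const totals = countIds(ids);
  const seen = new Map();
  return ids.map((id) => {
    if (totals.get(id) === 1) return id;
    const n = (seen.get(id) ?? 0) + 1;
    seen.set(id, n);
    return `${id}__${n}`;
  });
}

// Hypothetical IDs: an overloaded rule appears twice in the plan.
const planIds = ["rule.ReceiveGithubPushEvent", "rule.TriggerBuild", "rule.ReceiveGithubPushEvent"];
const inventoryIds = ["rule.TriggerBuild", "rule.ReceiveGithubPushEvent", "rule.ReceiveGithubPushEvent"];

console.log(sameMultiset(planIds, inventoryIds)); // true
console.log(disambiguate(planIds));
// [ 'rule.ReceiveGithubPushEvent__1', 'rule.TriggerBuild', 'rule.ReceiveGithubPushEvent__2' ]
```

Comparing as multisets rather than sets matters here: a subagent that drops one of the two overloaded obligations would pass a set comparison but fail this one.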

Files changed (25 files, +2812 / -180)

Two commits — review them independently if it helps:

propagate: byte-deterministic pipeline (schema, scripts, orchestrator) — 4 scripts, the rewritten SKILL.md, and two reference docs.
propagate: pytest+hypothesis and jest+fastcheck reference backends — two backend directories with their templates.

🤖 Generated with Claude Code

Rewrites the propagate skill to use the same inventory + translator + consensus architecture as distill: K subagents produce structured obligation-bridge inventories, language-agnostic scripts canonicalise, merge by K-vote, and dispatch to a per-language backend (manifest + name-policy + templates) which is loaded from the skill's backends/ directory.

The translator is byte-deterministic given fixed inputs. Bridge ambiguity (where K subagents cannot converge on a single witness) is surfaced as low-confidence stubs with candidate symbols, not silenced. Stage C runs the backend's runner command (e.g. pytest, jest) and emits a categorised propagation-report.md (pass / fail / error / bridge-unresolved / infrastructure-gap).

Adds:
- scripts/canonicalize-obligations.mjs (multiset validation against allium plan; deterministic disambiguation of duplicate obligation IDs from overloaded spec rules)
- scripts/merge-obligations.mjs (K-vote consensus; per-field modal voting; bridge-ambiguity surfaced as low confidence)
- scripts/obligations-to-tests.mjs (translator core + named bridge_import transforms + 4-construct template renderer)
- scripts/run-suite.mjs (Stage C with pluggable per-format adapters)
- skills/propagate/SKILL.md (rewritten as orchestrator; code_root and spec_path are mechanically locked from the user invocation so two runs on the same project use identical framing)
- skills/propagate/references/obligation-bridge-schema.md
- skills/propagate/references/backend-authoring-guide.md

Two reference backends prove the dispatcher works across (language × test framework × PBT framework) combinations. Each backend is a self-contained directory under skills/propagate/backends/<id>/ with no translator-side code changes required to add it.

Each backend consists of:
- manifest.json declares language, file extension, runner command, report format, imports lists per test_kind, and the named bridge_import transform that turns <path>::<symbol> into an idiomatic import line.
- name-policy.json declares casing rules and the file/test name patterns the canonicaliser applies.
- conventions.md: human-readable guidance for Stage A subagents on how to populate the bridge field for this language.
- templates/: six placeholder-driven templates (test-file, assertion, pbt-property, state-machine, stub-unresolved, fixture).

pytest+hypothesis:
- python_module bridge import (app/services.py::approve_claim -> "from app.services import approve_claim")
- conftest fixture style
- pytest-junitxml runner adapter

jest+fastcheck:
- typescript_relative bridge import (rewrites <path> relative to the test file's location, e.g. "../src/services/claim")
- in-file fixture style (factories declared next to tests)
- jest-json runner adapter

A third backend is the documented exercise in references/backend-authoring-guide.md (no translator changes required).

Yavor Panayotov added 2 commits May 17, 2026 21:26

panayotovk wants to merge 2 commits into juxt:main from panayotovk:feat/deterministic-propagate
