From fab5c59d68496a5d273561f2047e335e8920c37f Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Tue, 12 May 2026 01:40:52 +0900
Subject: [PATCH 01/15] add plan

---
 .../4-acceptance-tests.md                     | 382 ++++++++++++++++++
 .../observations-on-acceptance-tests.md       | 169 ++++++++
 2 files changed, 551 insertions(+)
 create mode 100644 agent-team-architecture-plan/4-acceptance-tests.md
 create mode 100644 agent-team-architecture-plan/observations-on-acceptance-tests.md
diff --git a/agent-team-architecture-plan/4-acceptance-tests.md b/agent-team-architecture-plan/4-acceptance-tests.md
new file mode 100644
index 0000000..7b9d2fc
--- /dev/null
+++ b/agent-team-architecture-plan/4-acceptance-tests.md
@@ -0,0 +1,382 @@
+# Section 4 — Acceptance Tests Architecture
+
+> Scope: quality-, voice-, and style-focused evaluation of Apollo chat services (`global_chat`, `workflow_chat`, `job_chat`) against product-owner-authored hero questions. Judged by an LLM-as-judge, reviewable by a human (Joe or Brandon), optionally logged to Langfuse for trend analysis.
+
+**Non-goals (other tiers):**
+
+- Unit tests of pure functions → Section 1.
+- Mocked-LLM `main()` invocations → Section 2.
+- Functional flow with regex content assertions → Section 3.
+
+Acceptance answers a different question than integration: not "does the system function end-to-end?" but "does the answer sound like us, read well, satisfy the user's intent, and not regress in voice as we bump model versions?"
+
+This section is written against the unit-test architecture that landed in #486 (commit `e249b86`) — those folder names, that `conftest.py`, that `services/testing/` package. Where the older draft of this section disagreed with what shipped, the shipped reality wins.
+
+---
+
+## 1. Guiding principles
+
+1. **Specs are markdown.** A PO edits a text file, not Python. YAML frontmatter + markdown sections.
+2. **HTTP is internal plumbing.** Specs never mention ports, payload shapes, or service internals.
+3. **Live models.** The whole point is to audit the real production path after model upgrades.
+4. **LLM-as-judge with receipts.** Every evaluation records the judge's reasoning so a human can spot-check.
+5. **pytest is the runner.** Same as every other tier. Spec files are collected via a tiny `pytest_collect_file` hook. No custom CLI, no `bless`/`differ`/`migrate-questions` subcommands.
+6. **Human-triggered only.** Never on every push, never on a schedule. Humans decide when a change is big enough to warrant an acceptance run — via PR label or manual `workflow_dispatch`.
+7. **No `test_hooks`.** Acceptance dispatches over real HTTP through the running bun server; the bridge strips anything beyond the JSON payload, so the service-tier `test_hooks` second arg is invisible here. Every call hits the production code path with real LLM responses.
+
+---
+
+## 2. Directory layout
+
+Acceptance specs live alongside the service they test, in an `acceptance/` subfolder of `services/<svc>/tests/`. This is consistent with the unit-tests merge — tier directories (`unit/`, `integration/`, and now `acceptance/`) sit flat under `services/<svc>/tests/` with no top-level `tests/` tree.
+
+```
+services/<svc>/tests/
+  acceptance/
+    *.md                          # one spec per file
+    _template.md                  # copy-paste starter (underscore = skipped by collector)
+```
+
+Concretely:
+
+```
+services/global_chat/tests/acceptance/
+  hero-patient-sync.md
+  voice-concise-answers.md
+  refuse-non-openfn-questions.md
+services/workflow_chat/tests/acceptance/
+  *.md
+services/job_chat/tests/acceptance/
+  *.md
+```
+
+**Cross-service specs** (refusals, safety, "hero" questions that exercise the orchestrator end-to-end) live under `services/global_chat/tests/acceptance/` since `global_chat` is the entry point everyone hits.
+
+**Tool services** (e.g. `services/tools/search_documentation/`) inherit the same pattern — if a tool ever needs its own acceptance specs, drop them in `services/tools/<tool>/tests/acceptance/`. The shared collector picks them up with no additional config. In practice tools are judged through the chat services that call them; standalone tool acceptance specs are unlikely on day one.
+
+The judge helper and the markdown-spec pytest collector live in the **shared `services/testing/` package**:
+
+```
+services/testing/
+  __init__.py                     # already shipped
+  README.md                       # already shipped
+  yaml_assertions.py              # already shipped
+  judge.py                        # new — LLM-as-judge helper (~150 lines)
+  spec_collector.py               # new — SpecFile / SpecItem / parse_spec
+```
+
+`services/testing/` is on the import path via `pyproject.toml`'s `pythonpath = ["services"]`, so `from testing.judge import evaluate` works without path-munging. (The older draft of this section talked about an `apollo/testing/` peer of `services/` — that never shipped. The actual location is `services/testing/`.)
+
+The `pytest_collect_file` hook is wired into the **root `conftest.py`** (at the repo root — that's where the unit-tests merge put the tier auto-marker) so it discovers `*.md` under any `acceptance/` folder anywhere in `services/`.
+
+**No `golden/` tree, no `reports/` folder in git.** Langfuse is the trend / comparison backend. Local test output (pass/fail + judge reasoning) comes from pytest stdout and `--junitxml`. If a run needs an HTML report, generate it with `pytest-html` when someone asks for it — not preemptively.
+
+**No `services/llm_evaluator/` service.** The judge is a helper module that calls Anthropic directly via the SDK. Promote to a service only when a non-test caller needs to invoke it.
+
+---
+
+## 3. Spec format
+
+One spec per markdown file. YAML frontmatter + named markdown sections.
+
+### 3.1 Frontmatter
+
+```yaml
+---
+id: global-chat.hero.patient-sync
+title: "Build a CommCare to DHIS2 sync"
+service: global_chat                     # global_chat | workflow_chat | job_chat
+tags: [hero, voice, multi-turn]
+runs: 3                                  # default 1 — number of times to run the same spec
+judge_model: claude-sonnet-4-6           # optional; default is set in the root conftest
+---
+```
+
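+In the frontmatter, only `id` and `service` are required. Everything else inherits sensible defaults from the root `conftest.py`. In the body, `# assertions` plus one of `# conversation` / `# question` are required (§3.2).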
+
+### 3.2 Body sections (top-level markdown headers, case-insensitive)
+
+| Section | Required | Purpose |
+|---|---|---|
+| `# conversation` | one of conversation/question | `- user:` / `- assistant:` list. Last user line is tested; earlier lines become `history`. |
+| `# question` | one of conversation/question | Shorthand for a single-turn conversation. |
+| `# context` | optional | YAML block merged into payload: `workflow_yaml`, `page`, `context`, `attachments`, etc. |
+| `# must_include` | optional | Substrings or `/regex/` that must appear in `response`. Deterministic; failure short-circuits before the judge runs. |
+| `# must_not_include` | optional | Inverse of `must_include`: substrings or `/regex/` that must not appear in `response`. |
+| `# assertions` | required | Natural-language criteria, one per bullet — each passed to the LLM judge. |
+| `# notes` | optional | Reviewer context, not sent to the judge. |
+
+### 3.3 Example
+
+```markdown
+---
+id: global-chat.hero.patient-sync
+title: "Build a CommCare to DHIS2 sync"
+service: global_chat
+tags: [hero, planner]
+runs: 3
+---
+
+# conversation
+
+- user: "I want to create a workflow that fetches new patient registrations from CommCare every hour and creates matching tracked entities in DHIS2."
+
+# must_include
+- /commcare/i
+- /dhis2/i
+
+# must_not_include
+- "I cannot help with that"
+
+# assertions
+- The response proposes a workflow with at least two jobs.
+- The tone is warm and collaborative, not clinical.
+- An attached workflow_yaml is present and syntactically valid.
+- The response does not leak the user's api_key or any secret-looking string.
+```
+
+---
+
+## 4. Collection: spec → pytest item
+
+The root `conftest.py` already auto-applies a tier marker by walking `item.path.parts` for the tier directory name (see `_TIER_DIRS = ("unit", "service", "integration", "acceptance")` in the shipped file). That mechanism works for any pytest item whose path lives under an `acceptance/` folder — including `SpecItem`s collected from markdown — so **no extra marker plumbing is needed**. Acceptance specs become `pytest -m acceptance`-selectable for free.
+
+The collector itself is a standard `pytest_collect_file` hook in the root `conftest.py`:
+
+```python
+def pytest_collect_file(parent, file_path):
+    if (
+        file_path.suffix == ".md"
+        and not file_path.name.startswith("_")
+        and file_path.parent.name == "acceptance"
+    ):
+        return SpecFile.from_parent(parent, path=file_path)
+
+class SpecFile(pytest.File):
+    def collect(self):
+        spec = parse_spec(self.path)
+        for run_index in range(spec.runs):
+            name = f"{spec.id}[run={run_index}]"
+            yield SpecItem.from_parent(self, name=name, spec=spec, run_index=run_index)
+
+class SpecItem(pytest.Item):
+    def __init__(self, *, spec, run_index, **kwargs):
+        super().__init__(**kwargs)                    # from_parent() forwards spec/run_index here
+        self.spec = spec
+        self.run_index = run_index
+
+    def runtest(self):
+        payload = build_payload(self.spec, self.run_index)
+        response = self.client.call(self.spec.service, payload)   # ApolloClient, see §12
+        check_must_include(self.spec, response)       # hard precondition; raises on fail
+        verdict = judge.evaluate(self.spec, response, model=self.spec.judge_model)
+        if not verdict.passed:
+            raise AssertionError(verdict.reasoning_summary)   # field name per §5.1
+
+`SpecFile`, `SpecItem`, and `parse_spec` live in `services/testing/spec_collector.py`; the hook in `conftest.py` is a four-line import + dispatch.
+
+Each run is a separate pytest item. Benefits: `pytest -m acceptance` works, `--junitxml` works, `pytest-xdist` works, filtering with `-k hero` works. No new runner.
+
+A tiny `pytest_sessionfinish` hook counts, per spec, how many of the N runs the judge marked `passed=True` and prints `spec-id: 2/3 passed` to stdout. No pass/fail policy is applied — the count is raw output for humans to read. (The pytest exit code still reflects individual item pass/fail in the usual way.)
+
+**Sequencing note.** The collector hook is small enough to land in the scaffolding PR (overview §9 step 1) alongside the existing root `conftest.py` work. The first acceptance *spec* doesn't ship until later, but having the collector in place from day one means adding `acceptance/*.md` is purely additive.
+
+---
+
+## 5. The judge
+
+`services/testing/judge.py` is a single module (~150 lines). Not an Apollo service.
+
+### 5.1 Interface
+
+```python
+@dataclass
+class CriterionResult:                  # defined first: Verdict's annotations reference it
+    criterion: str
+    passed: bool
+    reasoning: str
+    evidence: str                       # verbatim span from candidate
+
+@dataclass
+class Verdict:
+    passed: bool
+    score: float                        # 0..1 — fraction of criteria passed
+    criteria: list[CriterionResult]
+    reasoning_summary: str              # shown on pytest failure
+    judge_usage: dict                   # input/output tokens
+
+def evaluate(spec: Spec, response: dict, *, model: str) -> Verdict: ...
+```
+
+### 5.2 Prompt strategy
+
+Judge prompt forces JSON via prefilled `{` and demands per-criterion verdict + verbatim evidence; bad JSON or refusal → `Verdict(passed=False, reasoning_summary="judge_error: ...")` surfaced loudly. Full prompt text lives in the judge module's docstring; this section deliberately doesn't repeat it.
+
+### 5.3 Why a helper and not a service
+
+The judge only has one caller today (this tier). A whole Apollo service + `/services/llm_evaluator` HTTP endpoint + per-service test directory is overkill for that. If future callers appear (a ranker for `search_docsite`, a sanity-check step in a generator), promote `services/testing/judge.py` to `services/llm_evaluator/llm_evaluator.py` — it's a ~50-line reshape, not a redesign.
+
+### 5.4 Self-tests for the judge
+
+The judge module itself needs tests. Per the unit-tier I/O policy (root `conftest.py` blocks `anthropic.Anthropic.__init__` for any test marked `unit`), the parts of the judge that touch a mocked Anthropic transport are **service-tier**, not unit-tier:
+
+- **Unit.** Pure helpers: prompt builder, JSON parser, frontmatter parser, `_format_criteria()`. No Anthropic construction. Lives at `services/testing/tests/unit/test_judge_*.py` (or wherever testing-package tests end up — see below).
+- **Service.** End-to-end `evaluate()` against a `MockAnthropic` transport. Lives at `services/testing/tests/service/test_judge_evaluate.py`.
+
+Open detail to settle when the service tier lands: tests for code in `services/testing/` could live alongside the module (`services/testing/tests/`) or under a shared `services/tests/` umbrella (the precedent the unit-tests doc set for `services/util.py` helpers). Either works; just commit to one before the judge ships.
+
+---
+
+## 6. Langfuse integration
+
+Langfuse is already wired on `add-langfuse` — acceptance leans on it lightly for cross-run comparison. The runner does NOT rebuild Langfuse's dataset / score UI.
+
+### 6.1 Already in place (we reuse)
+
+- `services/langfuse_util.py::should_track()` gates trace export. Payloads set `user.employee=True` to stay inside the employee window.
+- `@observe` on each chat service's `main()` — every acceptance run is automatically traced when `LANGFUSE_TRACING=true`.
+
+### 6.2 What we add
+
+1. **Session tagging.** Each run sets `session_id = f"acceptance-{spec.id}-run{i}"` and `tags = ["acceptance", spec.id, *spec.tags]`. Done via Langfuse's `propagate_attributes`.
+2. **Score push.** After the judge returns, write one score per run: `acceptance_pass` (0/1) and `acceptance_score` (0..1). Use Langfuse's Scores API directly from `services/testing/judge.py` — no `langfuse_sink.py` wrapper.
+3. **Cross-version comparison.** Native Langfuse dataset-runs view does this. The collector surfaces the URL in stdout.
+
+### 6.3 What we don't do via Langfuse
+
+- No Langfuse-hosted eval (we own the prompt).
+- No hard dependency — acceptance runs offline if `LANGFUSE_PUBLIC_KEY` is unset OR `LANGFUSE_TRACING=false`; scores are skipped, runs still complete.
+
+---
+
+## 7. Multi-run sampling
+
+Specs declare `runs: N` in frontmatter; default is `1`. Each of the N runs becomes a separate pytest item, named `<spec-id>[run=0]` ... `<spec-id>[run=N-1]`, judged independently. `pytest_sessionfinish` prints `<spec-id>: <k>/<N> passed` to stdout. The pytest exit code reflects individual item pass/fail; the count is raw output for humans to read — no `2-of-3` policy, no aggregator. Whoever reads the output decides whether the ratio is acceptable for that spec.
+
+Per-run pass/fail comes from the LLM judge (`Verdict.passed`, §5.1).
+
+---
+
+## 8. Human review loop
+
+**Primary: Langfuse UI.** Joe / Brandon open the dashboard, filter by `tags:acceptance`, review candidate + judge reasoning + score, override with a human annotation if they disagree.
+
+**Secondary: pytest stdout / JUnit.** CI logs show `FAIL global-chat.hero.patient-sync[run=1]` with the judge's reasoning summary as the pytest message. Enough for a quick triage.
+
+No dedicated HTML report on day one. Add `pytest-html` the first time someone asks for it.
+
+---
+
+## 9. Triggers
+
+Acceptance is **never triggered automatically**. A human decides when a change is big enough to warrant spending the money on a run.
+
+| Trigger | Mechanism |
+|---|---|
+| Local manual | `poetry run pytest -m acceptance` |
+| CI manual (any branch) | GH Actions `workflow_dispatch` on the acceptance workflow |
+| PR label | Apply `run-acceptance` label to a PR |
+
+Explicitly excluded: no cron, no nightly, no push-to-main, no tag-push, no scheduled runs of any kind. If the team later decides they want continuous drift monitoring, that's a deliberate policy change — not a default.
+
+---
+
+## 10. CI workflow
+
+A second GH Actions workflow alongside the shipped `unit-tests.yaml`. Two reasonable shapes; pick one when the integration tier's workflow lands so the choice is consistent:
+
+- **Option A — shared `llm-tests.yaml`.** One file with two jobs (`integration` and `acceptance`), each gated by its own label / `workflow_dispatch` condition. Matches the original overview §6 design. Cleaner if integration and acceptance share env wiring.
+- **Option B — dedicated `acceptance-tests.yaml`.** Mirrors the existing `unit-tests.yaml` naming (one file per tier). Easier to grep, easier to point a human at "the acceptance workflow."
+
+Either way, the acceptance job's shape mirrors the integration job (see `3-integration-tests.md` §9) with three differences:
+
+1. **Triggers:** `run-acceptance` label or `workflow_dispatch` only — no `push`, no `schedule`.
+2. **Env:** `LANGFUSE_TRACING=true` (acceptance runs always trace; that's the point) plus the standard `*_TEST` secrets.
+3. **Timeout:** 45 minutes as a hard ceiling.
+
+The run command is `poetry run pytest -m acceptance -v --junitxml=tmp/test-logs/acceptance-junit.xml`. Note this is marker-filtered, not path-filtered like `unit-tests.yaml` (`pytest services/*/tests/unit`) — markdown specs need the collector hook to fire, which means crawling the test tree, which means filtering by marker. Deliberate divergence from the unit pattern, not an oversight.
+
+JUnit XML uploads as an artifact for 14 days.
+
+---
+
+## 11. Cost control
+
+Day-one approach is human-gated triggering + sensible defaults, not elaborate budget code:
+
+- Never automatic — every run is a deliberate human action.
+- `runs: 1` default — specs must opt into sampling.
+- Judge defaults to `claude-sonnet-4-6` (not opus).
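+- Prompt caching on candidate calls: the N runs of one spec send the same prompt prefix, which is what Anthropic's prompt cache matches on. (Each run gets its own `session_id` per §6.2, so cache hits do not depend on it.)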
+- 45-minute workflow timeout as a hard ceiling.
+
+A budget env + soft circuit breaker can be added the first time a manual run surprises someone. Not on day one.
+
+---
+
+## 12. Dependencies on the integration tier
+
+Acceptance and integration both dispatch through a live bun server, so they share infrastructure that the integration tier owns:
+
+- `services/testing/server.py` — session-scoped `apollo_server` fixture (spawn bun, poll `GET /` until ready, SIGTERM on teardown, honour `APOLLO_TEST_BASE_URL` to reuse a running staging server).
+- `services/testing/server.py::ApolloClient` — wraps `.call()`, `.stream()`, `.ws()`. The collector's `SpecItem.runtest` uses `.call()`.
+
+Neither exists today. Acceptance can't ship before integration's server fixture lands. The collector hook itself can ship earlier (scaffolding PR) — markdown specs would just have no runner attached until `ApolloClient` arrives.
+
+---
+
+## 13. Extensibility
+
+Adding a new sub-agent or tool — no Python required:
+
+1. Ensure the new service exposes `main()` at `services/<name>/<name>.py` (auto-mounts via `describe-modules.ts`).
+2. Create `services/<name>/tests/acceptance/` and drop markdown specs.
+
+Adding a new judge model: set `judge_model:` in the spec's frontmatter. Either list the allowed models in the root `conftest.py` or leave the field free-form (it is just a string).
+
+---
+
+## 14. Relationship to integration
+
+| Concern | Integration | Acceptance |
+|---|---|---|
+| Goal | Functional correctness | Quality, voice, style |
+| Assertions | Regex + shape | Natural-language criteria + LLM judge |
+| Trigger | PR label / push to main / manual | PR label / manual — **never automatic** |
+| Stability | Deterministic | Probabilistic (N runs) |
+| Runner | pytest | pytest |
+| Marker | `@pytest.mark.integration` (auto-applied by root conftest) | `@pytest.mark.acceptance` (auto-applied by root conftest) |
+| Location | `services/<svc>/tests/integration/test_*.py` | `services/<svc>/tests/acceptance/*.md` |
+| `test_hooks` | not used (real HTTP) | not used (real HTTP) |
+
+**Overlap rule:** a test lives in exactly one tier. An "acceptance" spec that merely asserts a YAML attachment exists belongs in integration. An integration test that checks "the tone feels terse enough" belongs in acceptance.
+
+---
+
+## 15. Migration of existing artefacts
+
+- `services/job_chat/evaluation/questions.md` — mostly-compatible format. One-time manual conversion (split per entry, add frontmatter, drop into `services/job_chat/tests/acceptance/`). No migration CLI needed — it's a one-shot editor task.
+- `services/global_chat/tests/test_workflow_chat_qualitative.py`, `services/global_chat/tests/test_job_chat_qualitative.py`, `services/workflow_chat/tests/test_qualitative.py`, `services/job_chat/tests/test_qualitative.py` — the prose at the top of each test (in `print()` statements) becomes `# notes`; `content`/`context` become spec sections; qualitative asserts become `# assertions` bullets. Drop the resulting markdown files into the relevant service's `acceptance/` folder (use `services/global_chat/tests/acceptance/` for cross-service tests targeting the orchestrator). Any machine-checkable asserts move to integration.
+- `temp_test_empty_response_guard.py` (recently added in #487 to all three chat services) — these are service-tier candidates (mocked LLM, structural assertion), not acceptance. Flagged here so they don't accidentally end up in `acceptance/` during migration.
+
+Migration is opt-in, one file at a time.
+
+---
+
+## 16. What this tier deliberately does NOT do
+
+- **No top-level `tests/` tree.** Specs live under their service.
+- **No `apollo/testing/` peer of `services/`.** The shared package is `services/testing/` — that's what shipped in #486.
+- **No `services/llm_evaluator/` service.** Judge is a helper module.
+- **No custom acceptance runner.** Pytest collects specs; that's it.
+- **No `bless` / `differ` / `migrate-questions` / `review` subcommands.** The first two make sense if we adopt golden-file diffing; we don't on day one (Langfuse's dataset-runs comparison is the primitive). The last two are one-off editor tasks.
+- **No `golden/` git tree.** Model drift is tracked in Langfuse.
+- **No HTML reporter.** `pytest-html` is a line in `pyproject.toml` the day we want it.
+- **No per-spec cost caps, budget estimator, `list`/`lint` commands, skip-on-no-change mode.** Defer until bills say otherwise.
+- **No `criteria_mode: weighted` with per-criterion weights.** `all` or `any` across criteria. Add weighting when a spec genuinely needs it.
+
+---
+
+## Summary
+
+Acceptance = markdown specs in `services/<svc>/tests/acceptance/` + a `pytest_collect_file` hook in the existing root `conftest.py` + a tiny judge helper in `services/testing/judge.py` + Langfuse scores. The `acceptance` marker is already declared in `pyproject.toml` and auto-applied by the existing path-based hook — no new marker plumbing. No new Apollo service, no custom runner, no golden tree, no top-level `tests/` directory. Runs via the standard pytest mechanism under a label-gated GH Actions workflow. Adding a sub-agent or tool means dropping markdown files under that service's `acceptance/` folder. The judge promotes to a service the day it has a second caller; nothing else changes. Acceptance depends on the integration tier shipping `services/testing/server.py` (`apollo_server` fixture + `ApolloClient`) — it can't run before that lands, though the collector itself can ship earlier in the scaffolding PR.
diff --git a/agent-team-architecture-plan/observations-on-acceptance-tests.md b/agent-team-architecture-plan/observations-on-acceptance-tests.md
new file mode 100644
index 0000000..d058687
--- /dev/null
+++ b/agent-team-architecture-plan/observations-on-acceptance-tests.md
@@ -0,0 +1,169 @@
+# Observations: re-reading `4-acceptance-tests.md` against the unit-tests / service-tests / overview decisions
+
+**Branches consulted**
+
+- `testing-architecture-unit-tests` — most polished, taken as authoritative.
+- `testing-architecture-service-tests` — `2-service-tests.md` only.
+- `5-overview.md` — identical on both branches; treated as the contract every section must obey.
+- `4-acceptance-tests.md` — identical on both branches; the draft under review.
+
+**TL;DR.** The acceptance draft is *mostly* aligned and not overly complex. It predates a couple of the unit-tests-branch decisions, so a small handful of details have drifted. None of it warrants a rewrite — call it five surgical edits plus one mild trim.
+
+---
+
+## 1. What the draft already gets right
+
+These match the authoritative architecture without change:
+
+- **Specs live under `services/<svc>/tests/acceptance/`** as a subfolder of the per-service flat `tests/` folder. Consistent with overview §2 and unit-tests-branch §2.
+- **No top-level `tests/` tree, no `services/llm_evaluator/` service, no `golden/` git tree, no custom CLI runner.** All match overview §10 and unit-tests-branch §10.
+- **Markdown specs collected by `pytest_collect_file`** registered in the root `apollo/conftest.py`. Matches overview §5.
+- **Judge as a helper module, not a service.** Matches overview decisions and unit-tests-branch §3.
+- **CI: human-triggered only, shared `llm-tests.yaml` workflow.** Matches overview §6.
+- **Langfuse used lightly (sessions + scores), not as the runner.** Sane.
+
+The draft also correctly *doesn't* lean on the `test_hooks` second-arg pattern — acceptance goes via HTTP, which strips it. This is implicit and could be made one-line explicit (see §2.4 below) but isn't a defect.
+
+---
+
+## 2. What needs to change (five surgical edits)
+
+### 2.1 Make the location of `testing/` explicit
+
+**Why this matters.** There are now two competing readings of where the shared test package lives:
+
+- **Unit-tests branch + overview:** `apollo/testing/` — peer of `services/` and `platform/`, not a service. `pythonpath = ["services", "."]` lets `from testing.fixtures import ...` resolve.
+- **Service-tests branch §7:** `services/testing/` — under `services/`, on the import path via `pythonpath = ["services"]`.
+
+Per the user, the unit-tests branch is authoritative, so `apollo/testing/` (root-level peer) wins. The acceptance draft writes "`testing/`" in §2 and §5 without disambiguating; a casual reader of the service-tests branch would parse it the other way.
+
+**Fix.** Replace every bare `testing/...` reference with `apollo/testing/...` once at the top of §2, then keep using the short form. Add a one-line note clarifying the location.
+
+Affected lines: §2 (the `testing/` tree block, and the sentence "the judge helper and the markdown-spec pytest collector live in **`testing/`**"), §5 (header "`testing/judge.py`"), §6.2 step 2 (Scores API "directly from `judge.py`"), §15 ("Judge is a helper module").
+
+### 2.2 The judge self-test in §5.4 is misclassified as `_unit`
+
+The draft says:
+
+> `services/tests/test_judge_unit.py` exercises the prompt builder and JSON parser with canned inputs. Uses `MockAnthropicClient` from the service tier's shared helpers.
+
+That violates unit-tests-branch §1 rule 3:
+
+> **Zero LLM calls**, not even through a mock HTTP client. If the test needs a mocked Anthropic response to make sense, it's a service test.
+
+**Fix — pick one.** Either:
+
+- **(a)** Split the self-tests by what they touch:
+  - `apollo/testing/tests/test_judge_unit.py` — pure prompt builder + JSON parser, no mocks.
+  - `apollo/testing/tests/test_judge_service.py` — end-to-end `evaluate()` against `MockAnthropic`.
+- **(b)** Drop the unit/service distinction and call them `apollo/testing/tests/test_judge_service.py` only — the prompt builder is small enough that exercising it via the mock is fine.
+
+Either way, also resolve the second drift: where do tests for `apollo/testing/` itself live? The unit-tests branch §3 reserves `services/tests/test_<helper>_unit.py` for shared `services/util.py` helpers, which is a reasonable analogue but lives under `services/`. Tests for code in `apollo/testing/` arguably belong next to that code — `apollo/testing/tests/`. This is a one-paragraph decision the draft currently glosses.
+
+### 2.3 How does the `acceptance` marker get applied?
+
+The unit/service/integration tiers all rely on a `pytest_collection_modifyitems` hook that auto-marks by filename suffix (`_unit.py`, `_service.py`, `_integration.py`). Acceptance has no such suffix — items come from `.md` files in an `acceptance/` folder via the custom `SpecFile`/`SpecItem` collector.
+
+The draft mentions `@pytest.mark.acceptance` in the comparison table (§13) but never says how it gets applied to the synthesised `SpecItem`s. Right now the only mechanism is implicit — relying on `pytest -m acceptance` working off… nothing.
+
+**Fix.** Add one sentence to §4 (right after the `SpecItem` definition):
+
+> `SpecItem.__init__` calls `self.add_marker(pytest.mark.acceptance)` so `pytest -m acceptance` selects every spec item.
+
+Or, equivalently, fold acceptance into the directory-based marker hook from overview §7 ("`acceptance/` folder name → `acceptance` marker"). Either path works; the draft just shouldn't leave it implicit.
+
+### 2.4 Spell out that acceptance never uses `test_hooks`
+
+`test_hooks` is *the* big architectural change in the service-tests branch. A reader landing on the acceptance plan after reading the others will look for it. One sentence in §1 or §13 closes the loop:
+
+> Acceptance dispatches via HTTP through the running bun server, so `test_hooks` (service-tier only) is not threaded — every call hits the real production code path with real LLM responses.
+
+This is a free clarification — the draft already implies it but never names it.
+
+### 2.5 The `pyproject` markers need acceptance's marker declared
+
+Unit-tests-branch §4 lists four markers:
+
+```toml
+markers = [
+  "unit: ...",
+  "service: ...",
+  "integration: ...",
+  "acceptance: LLM-judged quality/voice tests",
+]
+```
+
+The acceptance draft doesn't restate this — fine — but with `--strict-markers` set in service-tests-branch §8's pytest config, an undeclared `acceptance` marker would fail collection. Worth a one-line cross-reference: "the `acceptance` marker is declared in `pyproject.toml`'s `[tool.pytest.ini_options].markers` block (see overview §5 / unit-tests §4)."
+
+---
+
+## 3. Where the plan could be lighter (mild trim)
+
+The draft is 377 lines across 15 sections. Most of it earns its keep, but a few places over-describe:
+
+- **§5.2 Prompt strategy** — the four bullets (system prompt, user prompt, prefilled `{`, malformed-JSON handling) are implementation details that belong in the judge's docstring or a follow-up implementation PR description. Cut to one line: "Judge prompt forces JSON via prefill and demands per-criterion verdict + verbatim evidence; bad JSON → `passed=False, reason='judge_error'`."
+- **§7 Multi-run sampling** — the "no policies, no `2-of-3`" paragraph repeats §1 principle 6 ("humans decide"). One sentence: "Each of the N runs is a separate pytest item, judged independently; `pytest_sessionfinish` prints `<spec-id>: <k>/<N> passed` and exits with the standard pass/fail code per item." Lose the rest.
+- **§10 Acceptance CI YAML** — fully duplicates the integration tier's YAML structure from `3-integration-tests.md` §9. Replace with: "Same shape as integration job (see `3-integration-tests.md` §9); differences: trigger condition (`run-acceptance` label / `workflow_dispatch` only — no `push`), `LANGFUSE_TRACING=true` always, 45-min timeout."
+- **§13 Relationship to integration** vs **§15 What this tier does NOT do** — the "no top-level `tests/`" / "no llm_evaluator service" / "no custom runner" bullets in §15 partially duplicate §13 and the per-tier comparison in overview §1. Could lose ~3 bullets.
+
+Together these are ~50-70 lines of trim. Optional — none of it is wrong, just verbose against the now-tighter tone of the unit-tests and overview docs.
+
+---
+
+## 4. What does NOT need changing
+
+I deliberately don't think these need touching:
+
+- **Spec markdown format (§3).** Frontmatter + named sections is the right shape. Worth keeping all of `# question`, `# context`, `# must_include`, `# must_not_include`, `# assertions`, `# notes` — each pulls weight. Don't cut.
+- **Judge `Verdict` dataclass (§5.1).** Necessary for the human review loop and Langfuse score push.
+- **Multi-run sampling mechanism (§7) — the *mechanism*.** Just trim the discussion, not the feature.
+- **Langfuse integration (§6) — the scope as drawn.** Score push + session tagging is the right minimum; the §6.3 "what we don't do" guardrail is genuinely useful.
+- **Trigger model (§9).** "Never automatic, human-gated only" is the most important policy in the document and is well placed.
+
+---
+
+## 5. Folder-structure check against the unit-tests branch
+
+The unit-tests-branch decision is:
+
+```
+services/<svc>/tests/                      # flat, one folder per service
+  __init__.py
+  conftest.py                              # auto-marker by filename suffix
+  test_<module>_unit.py                    # tier 1
+  test_<module>_service.py                 # tier 2
+  test_<svc>_integration.py                # tier 3
+  acceptance/                              # tier 4 — markdown specs only
+    *.md
+  fixtures/                                # optional per-service
+```
+
+The acceptance draft's §2 layout matches this exactly — `acceptance/` is the only deviation from "flat folder + filename suffix", which is the correct deviation because markdown specs aren't `test_*.py` files. **No restructuring required.**
+
+(For reference: the *service-tests-branch* §6 shows tier subfolders — `unit/`, `service/`, `integration/`, `acceptance/`. That's an inconsistency on that branch, not in the acceptance draft. Per the user's brief, the unit-tests-branch wins; the acceptance draft is on the right side of that disagreement already.)
+
+---
+
+## 6. Sequencing observation (small)
+
+Overview §9 lists the implementation order: scaffolding → unit → service → integration → acceptance. The acceptance plan §2 says the markdown collector hook is registered in the root `apollo/conftest.py` "from day one". Because the root `conftest.py` lands in the **scaffolding** PR (overview §9 step 1), the spec collector should also land in step 1 even though the *first acceptance spec* doesn't ship until step 5. A one-line note in §4 of the acceptance plan would make this explicit:
+
+> The `pytest_collect_file` hook ships with the scaffolding PR (overview §9 step 1) so adding `acceptance/*.md` later is purely additive.
+
+Optional but cheap.
+
+---
+
+## 7. Summary of the punch list
+
+| # | Edit | Where | Size |
+|---|------|-------|------|
+| 1 | Disambiguate `testing/` → `apollo/testing/` | §2, §5, §6.2, §15 | 4 line edits |
+| 2 | Fix judge self-test classification (`_unit` vs `_service`); decide where tests for `apollo/testing/` live | §5.4 | 1 paragraph |
+| 3 | State how the `acceptance` marker gets applied to `SpecItem`s | §4 | 1 sentence |
+| 4 | One sentence: acceptance does not use `test_hooks` | §1 or §13 | 1 sentence |
+| 5 | Cross-reference the marker declaration in `pyproject.toml` | §1 or §15 | 1 sentence |
+| 6 *(optional)* | Trim §5.2, §7, §10, §13/§15 overlap | §§5.2, 7, 10, 15 | -50 lines |
+| 7 *(optional)* | Note collector hook lands in scaffolding PR | §4 | 1 sentence |
+
+**Verdict.** The draft survives the unit-tests / service-tests / overview decisions intact. The folder structure is already aligned. Five small clarifications close the remaining gaps; an optional trim brings tone in line with the tighter sibling docs. No restructure needed.

From d37c579c19bb6563142576ea5c1e49cd87587d4d Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Wed, 13 May 2026 17:11:37 +0900
Subject: [PATCH 02/15] add base acceptance architecture

---
 .../4-acceptance-tests.md                     | 615 +++++++++++-------
 conftest.py                                   |   5 +
 .../global_chat/tests/acceptance/__init__.py  |   0
 .../test_commcare_to_dhis2_with_job_code.py   |  58 ++
 .../job_chat/tests/acceptance/__init__.py     |   0
 .../test_adaptor_context_switching.py         |  60 ++
 services/testing/README.md                    |  16 +
 services/testing/apollo_client.py             |  71 ++
 services/testing/fixtures.py                  |  20 +
 services/testing/judge.py                     | 257 ++++++++
 services/testing/judge_rules.md               |   0
 services/testing/payloads.py                  | 188 ++++++
 services/testing/responses.py                 |  90 +++
 .../tests/acceptance/__init__.py              |   0
 .../test_navigation_job_to_workflow.py        | 110 ++++
 15 files changed, 1264 insertions(+), 226 deletions(-)
 create mode 100644 services/global_chat/tests/acceptance/__init__.py
 create mode 100644 services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.py
 create mode 100644 services/job_chat/tests/acceptance/__init__.py
 create mode 100644 services/job_chat/tests/acceptance/test_adaptor_context_switching.py
 create mode 100644 services/testing/apollo_client.py
 create mode 100644 services/testing/fixtures.py
 create mode 100644 services/testing/judge.py
 create mode 100644 services/testing/judge_rules.md
 create mode 100644 services/testing/payloads.py
 create mode 100644 services/testing/responses.py
 create mode 100644 services/workflow_chat/tests/acceptance/__init__.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.py

diff --git a/agent-team-architecture-plan/4-acceptance-tests.md b/agent-team-architecture-plan/4-acceptance-tests.md
index 7b9d2fc..950b5d1 100644
--- a/agent-team-architecture-plan/4-acceptance-tests.md
+++ b/agent-team-architecture-plan/4-acceptance-tests.md
@@ -1,6 +1,6 @@
 # Section 4 — Acceptance Tests Architecture
 
-> Scope: quality-, voice-, and style-focused evaluation of Apollo chat services (`global_chat`, `workflow_chat`, `job_chat`) against product-owner-authored hero questions. Judged by an LLM-as-judge, reviewable by a human (Joe or Brandon), optionally logged to Langfuse for trend analysis.
+> Scope: quality-, voice-, and style-focused evaluation of Apollo chat services (`global_chat`, `workflow_chat`, `job_chat`) against hero questions. Each test combines deterministic structural assertions (that already exist today) with an LLM judge that evaluates a small list of natural-language criteria.
 
 **Non-goals (other tiers):**
 
@@ -8,331 +8,505 @@
 - Mocked-LLM `main()` invocations → Section 2.
 - Functional flow with regex content assertions → Section 3.
 
-Acceptance answers a different question than integration: not "does the system function end-to-end?" but "does the answer sound like us, read well, satisfy the user's intent, and not regress in voice as we bump model versions?"
+Acceptance answers a different question than integration: not "does the system function end-to-end?" but "does the answer sound like us, read well, satisfy the user's intent, and not regress in voice as we bump model versions?" It asks that question **on top of** all the structural checks the team already trusts, not instead of them.
 
-This section is written against the unit-test architecture that landed in #486 (commit `e249b86`) — those folder names, that `conftest.py`, that `services/testing/` package. Where the older draft of this section disagreed with what shipped, the shipped reality wins.
+This section is written against the unit-test architecture that landed in #486 — those folder names, that `conftest.py`, that `services/testing/` package. It is also written against what the existing `test_qualitative.py` (in `workflow_chat` and `job_chat`) and `test_planner_multistep.py` (in `global_chat`) actually need to express, because those tests are the migration target.
 
 ---
 
 ## 1. Guiding principles
 
-1. **Specs are markdown.** A PO edits a text file, not Python. YAML frontmatter + markdown sections.
-2. **HTTP is internal plumbing.** Specs never mention ports, payload shapes, or service internals.
-3. **Live models.** The whole point is to audit the real production path after model upgrades.
-4. **LLM-as-judge with receipts.** Every evaluation records the judge's reasoning so a human can spot-check.
-5. **pytest is the runner.** Same as every other tier. Spec files are collected via a tiny `pytest_collect_file` hook. No custom CLI, no `bless`/`differ`/`migrate-questions` subcommands.
-6. **Human-triggered only.** Never on every push, never on a schedule. Humans decide when a change is big enough to warrant an acceptance run — via PR label or manual `workflow_dispatch`.
-7. **No `test_hooks`.** Acceptance dispatches over real HTTP through the running bun server; the bridge strips anything beyond the JSON payload, so the service-tier `test_hooks` second arg is invisible here. Every call hits the production code path with real LLM responses.
+1. **Tests are Python.** Same shape as the existing `test_qualitative.py` files — full control over the rich, service-specific payload (`existing_yaml`, `expression`, `adaptor`, `page_name`, `meta.last_page`, `meta.rag`, `suggest_code`, `errors`, …). No custom DSL to learn.
+2. **Two assertion layers per test.**
+   - **Structural** — deterministic Python asserts: yaml-shape helpers, `[pg:…]` prefix checks, attachment shape, tool-call sequencing, "response must not contain 'yaml'", etc. These are what the existing qualitative tests already do, and they stay exactly as they are.
+   - **Quality** — one `judge.evaluate(...)` call near the end of the test. Takes a small list of natural-language criteria; returns a verdict the test asserts on.
+3. **The criteria field is obvious.** Each test has a clearly named module-level constant — `QUALITY_CRITERIA` — that anyone, technical or not, can read and edit without touching the rest of the test. Adding a new criterion is a one-line edit. The constant is fed verbatim into the judge prompt.
+4. **The judge flags more than just the listed criteria.** The judge prompt instructs the model to:
+   1. Give a verdict on each listed criterion (pass/fail + verbatim evidence).
+   2. *Also* flag anything else notable about the response — tone drift, hedging, hallucinated facts, leaked secrets, anything that looks off — even if no criterion covers it. These show up in the verdict as `general_flags`.
+   This means a criteria list never has to be exhaustive: POs write what they care about most, and the judge surfaces surprises.
+5. **Payload building uses intuitive names.** A builder like `build_job_chat_payload(...)` exposes named kwargs that map to user-facing concepts: `current_job_code`, `current_adaptor`, `current_page`, `previous_page`, `rag_results`, `suggest_code`. The builder translates to the underlying JSON keys. Lowers the barrier for anyone not steeped in the payload spec.
+6. **Live HTTP, live models.** Tests dispatch through the running bun server via `ApolloClient` (from the integration tier). Same path a user hits in production. No mocks. No `test_hooks`; that's service-tier only.
+7. **Run on demand only.** Never in an automated pipeline. Humans decide when a change is big enough to warrant a run — typically before a release, after a prompt/model bump, or when investigating a quality regression.
+8. **pytest is the runner.** Tier-marker auto-applied by the existing path-based hook in root `conftest.py` (folder name `acceptance` → `pytest.mark.acceptance`). No custom collector.
 
 ---
 
 ## 2. Directory layout
 
-Acceptance specs live alongside the service they test, in an `acceptance/` subfolder of `services/<svc>/tests/`. This is consistent with the unit-tests merge — tier directories (`unit/`, `integration/`, and now `acceptance/`) sit flat under `services/<svc>/tests/` with no top-level `tests/` tree.
-
-```
-services/<svc>/tests/
-  acceptance/
-    *.md                          # one spec per file
-    _template.md                  # copy-paste starter (underscore = skipped by collector)
-```
-
-Concretely:
+Acceptance tests are Python files in an `acceptance/` subfolder of `services/<svc>/tests/`. Consistent with the unit-tests merge (tier directories sit flat under `services/<svc>/tests/`).
 
 ```
 services/global_chat/tests/acceptance/
-  hero-patient-sync.md
-  voice-concise-answers.md
-  refuse-non-openfn-questions.md
+  __init__.py
+  test_planner_multistep.py             # migrated from tests/test_planner_multistep.py
+  test_planner_underspecified.py        # vague-request specs from same file
+  test_routing_voice.py                 # cross-service voice / refusal specs
 services/workflow_chat/tests/acceptance/
-  *.md
+  __init__.py
+  test_first_turn.py                    # migrated from test_qualitative.py
+  test_conversational_turn.py
+  test_long_yaml.py
+  test_navigation_job_to_workflow.py
 services/job_chat/tests/acceptance/
-  *.md
+  __init__.py
+  test_basic_input.py
+  test_contextualised_input.py
+  test_adaptor_context_switching.py
+  test_navigation_workflow_to_job.py
 ```
 
-**Cross-service specs** (refusals, safety, "hero" questions that exercise the orchestrator end-to-end) live under `services/global_chat/tests/acceptance/` since `global_chat` is the entry point everyone hits.
+**Cross-service specs** (planner orchestration, refusals, safety) live under `services/global_chat/tests/acceptance/` since `global_chat` is the entry point.
 
-**Tool services** (e.g. `services/tools/search_documentation/`) inherit the same pattern — if a tool ever needs its own acceptance specs, drop them in `services/tools/<tool>/tests/acceptance/`. The shared collector picks them up with no additional config. In practice tools are judged through the chat services that call them; standalone tool acceptance specs are unlikely on day one.
+**Tool services** (e.g. `services/tools/search_documentation/`) inherit the same pattern — if a tool ever needs its own acceptance tests, drop them in `services/tools/<tool>/tests/acceptance/`. The existing path-based marker hook picks them up with no config. In practice tools are judged through the chat services that call them.
 
-The judge helper and the markdown-spec pytest collector live in the **shared `services/testing/` package**:
+The judge, payload builders, response helpers, and migrated yaml-assertions all live in the **shared `services/testing/` package**:
 
 ```
 services/testing/
   __init__.py                     # already shipped
   README.md                       # already shipped
-  yaml_assertions.py              # already shipped
-  judge.py                        # new — LLM-as-judge helper (~150 lines)
-  spec_collector.py               # new — SpecFile / SpecItem / parse_spec
+  yaml_assertions.py              # already shipped — kept as-is
+  judge.py                        # new — LLM-as-judge helper (§4)
+  payloads.py                     # new — build_*_chat_payload() builders (§5)
+  responses.py                    # new — get_attachment, assert_routed_to, assert_agent_calls (§7)
+  apollo_client.py                # new — owned by integration tier; reused here
 ```
 
-`services/testing/` is on the import path via `pyproject.toml`'s `pythonpath = ["services"]`, so `from testing.judge import evaluate` works without path-munging. (The older draft of this section talked about an `apollo/testing/` peer of `services/` — that never shipped. The actual location is `services/testing/`.)
-
-The `pytest_collect_file` hook is wired into the **root `conftest.py`** (at the repo root — that's where the unit-tests merge put the tier auto-marker) so it discovers `*.md` under any `acceptance/` folder anywhere in `services/`.
+(`apollo_client.py` and the `apollo_server` session fixture are integration-tier deliverables. Acceptance is a strict consumer — it can't run before they land. See §13.)
 
-**No `golden/` tree, no `reports/` folder in git.** Langfuse is the trend / comparison backend. Local test output (pass/fail + judge reasoning) comes from pytest stdout and `--junitxml`. If a run needs an HTML report, generate it with `pytest-html` when someone asks for it — not preemptively.
+**No `golden/` tree, no `reports/` folder in git.** Langfuse is the trend / comparison backend. Local output is pytest stdout + `--junitxml`. Add `pytest-html` the day someone asks.
 
-**No `services/llm_evaluator/` service.** The judge is a helper module that calls Anthropic directly via the SDK. Promote to a service only when a non-test caller needs to invoke it.
+**No `services/llm_evaluator/` service.** Judge is a helper module. Promote to a service the day a non-test caller appears.
 
 ---
 
-## 3. Spec format
+## 3. What an acceptance test looks like
 
-One spec per markdown file. YAML frontmatter + named markdown sections.
+A single self-contained file. Roughly the same shape as today's `test_qualitative.py` entries, with three deliberate additions:
 
-### 3.1 Frontmatter
+1. A module-level `QUALITY_CRITERIA` constant (the "edit me" surface for POs).
+2. A payload built via `build_<svc>_chat_payload(...)` with intuitive named kwargs instead of nested dicts.
+3. A `judge.evaluate(...)` call at the end alongside the existing structural asserts.
 
-```yaml
----
-id: global-chat.hero.patient-sync
-title: "Build a CommCare to DHIS2 sync"
-service: global_chat                     # global_chat | workflow_chat | job_chat
-tags: [hero, voice, multi-turn]
-runs: 3                                  # default 1 — number of times to run the same spec
-judge_model: claude-sonnet-4-6           # defaults to the same in the root conftest
----
-```
+### 3.1 Example: workflow_chat navigation test
 
-Only `id` and `service` are required. Everything else inherits sensible defaults from the root `conftest.py`.
+(Migration target: `services/workflow_chat/tests/test_qualitative.py::test_navigation_job_to_workflow`)
 
-### 3.2 Body sections (top-level markdown headers, case-insensitive)
+```python
+"""User just navigated from a job-code page to a workflow editor and asks for
+a new step. The model should infer the context switch and respond about the
+workflow, not the job code's error handling."""
+
+import yaml
+
+from testing import judge
+from testing.apollo_client import ApolloClient
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import assert_no_special_chars
+
+
+QUALITY_CRITERIA = [
+    "The response talks about the workflow as a structure (jobs, edges, triggers), not about job-code-level error handling.",
+    "The tone is warm and collaborative, not clinical or terse.",
+    "If the response proposes a new email step, the rationale is plausible (e.g. mentions notification, summary, or alerting).",
+]
+
+
+def test_navigation_job_to_workflow(apollo_client: ApolloClient):
+    existing_yaml = """..."""  # the pipeline yaml from the original test
+
+    payload = build_workflow_chat_payload(
+        existing_yaml=existing_yaml,
+        history=[
+            {"role": "user",      "content": "[pg:job_code/transform-data/http] Can you add error handling to this HTTP request?"},
+            {"role": "assistant", "content": "I'll add try-catch error handling…"},
+            {"role": "user",      "content": "[pg:job_code/transform-data/http] Also add retry logic with backoff"},
+            {"role": "assistant", "content": "I'll add exponential backoff retry logic…"},
+        ],
+        user_message="Add a step to send the results via email",
+        current_page="data-pipeline",
+        previous_page={"type": "job_code", "name": "transform-data", "adaptor": "http"},
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions (deterministic, same as today) -------------
+    assert response["response_yaml"], "Model should have generated YAML"
+    yaml_obj = yaml.safe_load(response["response_yaml"])
+    job_names = [j.get("name", "").lower() for j in yaml_obj["jobs"].values()]
+    assert any("email" in n or "mail" in n or "send" in n for n in job_names), \
+        "Email job not found in workflow"
+    assert len(yaml_obj["jobs"]) > 3, "Expected a new job to be added"
+    assert_no_special_chars(response["response_yaml"])
+
+    response_text = response["response"].lower()
+    assert not any(p in response_text for p in ["try", "catch", "retry", "backoff"]), \
+        "Response should be about workflow structure, not job-code error handling"
+
+    # ---- Quality assertions (LLM-judged) ----------------------------------
+    verdict = judge.evaluate(
+        criteria=QUALITY_CRITERIA,
+        candidate=response,
+        test_notes=__doc__,
+    )
+    assert verdict.passed, verdict.summary
+```
 
-| Section | Required | Purpose |
-|---|---|---|
-| `# conversation` | one of conversation/question | `- user:` / `- assistant:` list. Last user line is tested; earlier lines become `history`. |
-| `# question` | one of conversation/question | Shorthand for a single-turn conversation. |
-| `# context` | optional | YAML block merged into payload: `workflow_yaml`, `page`, `context`, `attachments`, etc. |
-| `# must_include` | optional | Substrings or `/regex/` that must appear in `response`. Deterministic; failure short-circuits before the judge runs. |
-| `# must_not_include` | optional | Opposite. |
-| `# assertions` | required | Natural-language criteria, one per bullet — each passed to the LLM judge. |
-| `# notes` | optional | Reviewer context, not sent to the judge. |
+### 3.2 What `QUALITY_CRITERIA` looks like to a non-technical contributor
 
-### 3.3 Example
+The constant is a plain Python list of strings — visible at the top of every acceptance test file, named the same way every time. Editing it does not require touching anything else. Examples of additions a PO might make:
 
-```markdown
----
-id: global-chat.hero.patient-sync
-title: "Build a CommCare to DHIS2 sync"
-service: global_chat
-tags: [hero, planner]
-runs: 3
----
+```python
+QUALITY_CRITERIA = [
+    "The response talks about the workflow as a structure…",
+    "The tone is warm and collaborative, not clinical or terse.",
+    "If the response proposes a new email step, the rationale is plausible.",
+    # PO adds:
+    "The response uses British English spelling.",
+    "The response does not start with 'Certainly!' or 'Of course!'.",
+]
+```
 
-# conversation
+The judge sees these verbatim plus the open-ended "flag anything else notable" instruction baked into the prompt, so a contributor never has to enumerate the full universe of things that could go wrong.
 
-- user: "I want to create a workflow that fetches new patient registrations from CommCare every hour and creates matching tracked entities in DHIS2."
+### 3.3 Example: job_chat adaptor context switching
 
-# must_include
-- /commcare/i
-- /dhis2/i
+```python
+"""User was on a Salesforce page and asked 'how do I get data?'; assistant
+answered with SOQL. User has now navigated to a DHIS2 page and asks the
+same question. The model should switch context."""
+
+from testing import judge
+from testing.payloads import build_job_chat_payload
+
+
+QUALITY_CRITERIA = [
+    "The response is specifically about fetching data from DHIS2 — not from Salesforce.",
+    "The response references DHIS2 concepts (tracker, data values, events, programs, etc.) rather than SQL/SOQL.",
+    "The response does not assume the previous Salesforce context still applies.",
+]
+
+
+def test_adaptor_context_switching(apollo_client):
+    payload = build_job_chat_payload(
+        user_message="How do I get data?",
+        history=[
+            {"role": "user",      "content": "[pg:job_code/fetch-records/salesforce@9.0.3] How do I get data?"},
+            {"role": "assistant", "content": "To get data from Salesforce, you can use `query()` with SOQL…"},
+        ],
+        current_job_code="fn(state => { return state; });",
+        current_adaptor="@openfn/language-dhis2@8.0.7",
+        current_page="fetch-data",
+        suggest_code=False,
+    )
+
+    response = apollo_client.call("job_chat", payload)
+
+    # Structural — history was correctly prefixed with the new page tag
+    assert response["history"][2]["role"] == "user"
+    assert "[pg:job_code/fetch-data/dhis2@8.0.7]" in response["history"][2]["content"]
+
+    # Quality
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
+```
 
-# must_not_include
-- "I cannot help with that"
+### 3.4 Example: global_chat planner orchestration
 
-# assertions
-- The response proposes a workflow with at least two jobs.
-- The tone is warm and collaborative, not clinical.
-- An attached workflow_yaml is present and syntactically valid.
-- The response does not leak the user's api_key or any secret-looking string.
+```python
+"""From-scratch CommCare→DHIS2 workflow with job code for both steps.
+Expects planner to call workflow_agent then job_code_agent at least twice."""
+
+import yaml
+
+from testing import judge
+from testing.payloads import build_global_chat_payload
+from testing.responses import assert_routed_to, assert_agent_calls, get_attachment
+from testing.yaml_assertions import assert_yaml_has_ids, assert_yaml_jobs_have_body
+
+
+QUALITY_CRITERIA = [
+    "The response explains the workflow's purpose in plain language a non-engineer can follow.",
+    "The job code for the CommCare step uses CommCare adaptor functions, not generic JS.",
+    "The job code for the DHIS2 step uses DHIS2 adaptor functions, not generic JS.",
+    "The response does not leak the user's api_key or any secret-looking string.",
+]
+
+
+def test_commcare_to_dhis2_with_job_code(apollo_client):
+    payload = build_global_chat_payload(
+        user_message="Create a workflow that fetches patient cases from CommCare and registers them in DHIS2.",
+        history=[],
+    )
+    response = apollo_client.call("global_chat", payload)
+
+    # Structural — routing + orchestration
+    assert_routed_to(response, "planner")
+    assert_agent_calls(
+        response["meta"],
+        expected_agents=["planner", "workflow_agent", "job_agent"],
+        min_job_code_calls=2,
+    )
+
+    # Structural — attached workflow yaml shape
+    yaml_str = get_attachment(response, "workflow_yaml")
+    assert yaml_str, "Expected a workflow_yaml attachment"
+    parsed = yaml.safe_load(yaml_str)
+    assert len(parsed["jobs"]) >= 2
+    assert_yaml_has_ids(yaml_str)
+    assert_yaml_jobs_have_body(yaml_str)
+
+    # Quality
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
 ```
 
 ---
 
-## 4. Collection: spec → pytest item
+## 4. The judge
 
-The root `conftest.py` already auto-applies a tier marker by walking `item.path.parts` for the tier directory name (see `_TIER_DIRS = ("unit", "service", "integration", "acceptance")` in the shipped file). That mechanism works for any pytest item whose path lives under an `acceptance/` folder — including `SpecItem`s collected from markdown — so **no extra marker plumbing is needed**. Acceptance specs become `pytest -m acceptance`-selectable for free.
+`services/testing/judge.py` is one module (~200 lines). Not an Apollo service.
 
-The collector itself is a standard `pytest_collect_file` hook in the root `conftest.py`:
+### 4.1 Interface
 
 ```python
-def pytest_collect_file(parent, file_path):
-    if (
-        file_path.suffix == ".md"
-        and not file_path.name.startswith("_")
-        and file_path.parent.name == "acceptance"
-    ):
-        return SpecFile.from_parent(parent, path=file_path)
-
-class SpecFile(pytest.File):
-    def collect(self):
-        spec = parse_spec(self.path)
-        for run_index in range(spec.runs):
-            yield SpecItem.from_parent(
-                self,
-                name=f"{spec.id}[run={run_index}]",
-                spec=spec,
-                run_index=run_index,
-            )
-
-class SpecItem(pytest.Item):
-    def runtest(self):
-        payload = build_payload(self.spec, self.run_index)
-        response = self.client.call(self.spec.service, payload)
-        check_must_include(self.spec, response)       # hard precondition; raises on fail
-        verdict = judge.evaluate(self.spec, response, model=self.spec.judge_model)
-        if not verdict.passed:
-            raise AssertionError(verdict.summary)
+@dataclass
+class CriterionResult:
+    criterion: str
+    passed: bool
+    reasoning: str
+
+@dataclass
+class Verdict:
+    passed: bool                        # all listed criteria passed AND no general_flags marked "regression"
+    score: float                        # 0..1 — fraction of listed criteria passed
+    criteria: list[CriterionResult]
+    general_flags: list[str]            # judge-surfaced concerns not covered by criteria
+    summary: str                        # multi-line; shown on pytest failure
+    judge_usage: dict                   # input/output tokens
+
+def evaluate(
+    *,
+    criteria: list[str],
+    candidate: dict,                    # full chat-service response dict (response, response_yaml, attachments, meta, history)
+    test_notes: str | None = None,      # usually __doc__ — context the judge sees but doesn't grade against
+    model: str = "claude-sonnet-4-6",
+) -> Verdict: ...
 ```
 
-`SpecFile`, `SpecItem`, and `parse_spec` live in `services/testing/spec_collector.py`; the hook in `conftest.py` is a four-line import + dispatch.
+### 4.2 Prompt strategy
 
-Each run is a separate pytest item. Benefits: `pytest -m acceptance` works, `--junitxml` works, `pytest-xdist` works, filtering with `-k hero` works. No new runner.
+The judge prompt has two parts:
 
-A tiny `pytest_sessionfinish` hook counts, per spec, how many of the N runs the judge marked `passed=True` and prints `spec-id: 2/3 passed` to stdout. No pass/fail policy is applied — the count is raw output for humans to read. (The pytest exit code still reflects individual item pass/fail in the usual way.)
+1. **Listed criteria.** For each criterion, return a JSON object with `passed` and `reasoning`.
+2. **General flags.** A separate instruction: *"Additionally, flag anything else in the response that looks like a problem — tone drift, hedging, hallucinated facts, leaked secrets, broken formatting, or anything else that would make a reviewer pause — even if no criterion covers it. Return these as a `general_flags` array. Mark each flag with a severity: `note` (minor, informational) or `regression` (would surprise a reviewer or hurt a user). The verdict passes only if all listed criteria pass and there are no `regression`-severity flags."*
 
-**Sequencing note.** The collector hook is small enough to land in the scaffolding PR (overview §9 step 1) alongside the existing root `conftest.py` work. The first acceptance *spec* doesn't ship until later, but having the collector in place from day one means adding `acceptance/*.md` is purely additive.
+The judge sees `test_notes` (usually the test's docstring) as context but is instructed not to grade against it directly. JSON output is forced by prefilling `{`. Malformed JSON or judge refusal → `Verdict(passed=False, summary="judge_error: …")` surfaced loudly.
 
----
+The full prompt text lives in the judge module's docstring; this section deliberately doesn't repeat it.
 
-## 5. The judge
+### 4.3 Why a helper and not a service
 
-`services/testing/judge.py` is a single module (~150 lines). Not an Apollo service.
+The judge has one caller today. A whole Apollo service + HTTP endpoint is overkill. If future callers appear (a ranker for `search_docsite`, a sanity-check step in a generator), promote `services/testing/judge.py` to `services/llm_evaluator/llm_evaluator.py` — a ~50-line reshape, not a redesign.
 
-### 5.1 Interface
+### 4.4 Self-tests for the judge
 
-```python
-@dataclass
-class Verdict:
-    passed: bool
-    score: float                        # 0..1 — fraction of criteria passed
-    criteria: list[CriterionResult]
-    reasoning_summary: str              # shown on pytest failure
-    judge_usage: dict                   # input/output tokens
+Per the unit-tier I/O policy (root `conftest.py` blocks `anthropic.Anthropic.__init__` for tests marked `unit`), tests that touch a mocked Anthropic transport are **service-tier**, not unit-tier:
 
-@dataclass
-class CriterionResult:
-    criterion: str
-    passed: bool
-    reasoning: str
-    evidence: str                       # verbatim span from candidate
+- **Unit.** Pure helpers: prompt builder, JSON parser, criteria formatter. No Anthropic construction.
+- **Service.** End-to-end `evaluate()` against a `MockAnthropic` transport: verifies that criteria are threaded through, `general_flags` are parsed, and the `judge_error` path works.
 
-def evaluate(spec: Spec, response: dict, *, model: str) -> Verdict: ...
-```
+Open detail when the service tier lands: tests for code in `services/testing/` could live alongside the module (`services/testing/tests/`) or under a shared `services/tests/` umbrella (the precedent set for `services/util.py` helpers). Commit to one before the judge ships.
 
-### 5.2 Prompt strategy
+---
+
+## 5. Payload builders
 
-Judge prompt forces JSON via prefilled `{` and demands per-criterion verdict + verbatim evidence; bad JSON or refusal → `Verdict(passed=False, reasoning_summary="judge_error: ...")` surfaced loudly. Full prompt text lives in the judge module's docstring; this section deliberately doesn't repeat it.
+`services/testing/payloads.py` exposes one builder per chat service. The builders translate intuitive kwargs into the underlying JSON keys that each service expects today — so a contributor doesn't need to remember whether the field is `expression` or `current_code`, or whether navigation lives under `meta.last_page.adaptor` or `context.previous_adaptor`.
 
-### 5.3 Why a helper and not a service
+### 5.1 `build_workflow_chat_payload`
 
-The judge only has one caller today (this tier). A whole Apollo service + `/services/llm_evaluator` HTTP endpoint + per-service test directory is overkill for that. If future callers appear (a ranker for `search_docsite`, a sanity-check step in a generator), promote `services/testing/judge.py` to `services/llm_evaluator/llm_evaluator.py` — it's a ~50-line reshape, not a redesign.
+```python
+def build_workflow_chat_payload(
+    *,
+    user_message: str | None = None,
+    existing_yaml: str = "",
+    history: list[dict] | None = None,
+    errors: str | None = None,                     # alternative to user_message
+    current_page: str | None = None,               # → context.page_name
+    previous_page: dict | None = None,             # → meta.last_page
+    api_key: str | None = None,
+) -> dict: ...
+```
 
-### 5.4 Self-tests for the judge
+### 5.2 `build_job_chat_payload`
 
-The judge module itself needs tests. Per the unit-tier I/O policy (root `conftest.py` blocks `anthropic.Anthropic.__init__` for any test marked `unit`), the parts of the judge that touch a mocked Anthropic transport are **service-tier**, not unit-tier:
+```python
+def build_job_chat_payload(
+    *,
+    user_message: str,
+    history: list[dict] | None = None,
+    current_job_code: str | None = None,           # → context.expression
+    current_adaptor: str | None = None,            # → context.adaptor
+    project_adaptors: list[str] | None = None,     # → context.adaptors
+    current_page: str | None = None,               # → context.page_name
+    project_id: str | None = None,                 # → context.projectId
+    job_id: str | None = None,                     # → context.jobId
+    input_data: Any = None,                        # → context.input
+    output_data: Any = None,                       # → context.output
+    log_data: Any = None,                          # → context.log
+    rag_results: list[dict] | None = None,         # → meta.rag.search_results
+    rag_queries: list[str] | None = None,          # → meta.rag.search_queries
+    previous_page: dict | None = None,             # → meta.last_page
+    suggest_code: bool | None = None,
+    api_key: str | None = None,
+    stream: bool | None = None,
+) -> dict: ...
+```
 
-- **Unit.** Pure helpers: prompt builder, JSON parser, frontmatter parser, `_format_criteria()`. No Anthropic construction. Lives at `services/testing/tests/unit/test_judge_*.py` (or wherever testing-package tests end up — see below).
-- **Service.** End-to-end `evaluate()` against a `MockAnthropic` transport. Lives at `services/testing/tests/service/test_judge_evaluate.py`.
+### 5.3 `build_global_chat_payload`
 
-Open detail to settle when the service tier lands: tests for code in `services/testing/` could live alongside the module (`services/testing/tests/`) or under a shared `services/tests/` umbrella (the precedent the unit-tests doc set for `services/util.py` helpers). Either works; just commit to one before the judge ships.
+```python
+def build_global_chat_payload(
+    *,
+    user_message: str,
+    history: list[dict] | None = None,
+    workflow_yaml: str | None = None,
+    current_page: str | None = None,
+    previous_page: dict | None = None,
+    api_key: str | None = None,
+) -> dict: ...
+```
 
----
+Each builder is roughly 30 lines of "set the key only if the kwarg is not None". The point is that the builder signature **is** the documentation: a contributor can read it once and know what's available without grepping the service code.
 
-## 6. Langfuse integration
+Adding a new payload field: one line in the builder + a one-line docstring entry. Removing one: same.
 
-Langfuse is already wired on `add-langfuse` — acceptance leans on it lightly for cross-run comparison. The runner does NOT rebuild Langfuse's dataset / score UI.
+---
 
-### 6.1 Already in place (we reuse)
+## 6. Response helpers
 
-- `services/langfuse_util.py::should_track()` gates trace export. Payloads set `user.employee=True` to stay inside the employee window.
-- `@observe` on each chat service's `main()` — every acceptance run is automatically traced when `LANGFUSE_TRACING=true`.
+`services/testing/responses.py` lifts the cross-service helpers that today live duplicated in `services/global_chat/tests/test_utils.py`:
 
-### 6.2 What we add
+```python
+def get_attachment(response: dict, name: str) -> str | None: ...
+    # walks response["attachments"] for an entry with attachment_type == name
 
-1. **Session tagging.** Each run sets `session_id = f"acceptance-{spec.id}-run{i}"` and `tags = ["acceptance", spec.id, ...spec.tags]`. Done via Langfuse's `propagate_attributes`.
-2. **Score push.** After the judge returns, write one score per run: `acceptance_pass` (0/1) and `acceptance_score` (0..1). Use Langfuse's Scores API directly from `services/testing/judge.py` — no `langfuse_sink.py` wrapper.
-3. **Cross-version comparison.** Native Langfuse dataset-runs view does this. The collector surfaces the URL in stdout.
+def assert_routed_to(response: dict, agent: str, *, context: str = "") -> None: ...
+    # checks response["meta"]["router"]["agent"] (or wherever the router stamps it)
 
-### 6.3 What we don't do via Langfuse
+def assert_agent_calls(meta: dict, expected_agents: list[str], min_job_code_calls: int = 0, *, context: str = "") -> None: ...
+    # the planner-chain assertion from test_planner_multistep.py — verifies
+    # workflow_agent appears before any job_code_agent in meta["tool_calls"]
+```
 
-- No Langfuse-hosted eval (we own the prompt).
-- No hard dependency — acceptance runs offline if `LANGFUSE_PUBLIC_KEY` is unset OR `LANGFUSE_TRACING=false`; scores are skipped, runs still complete.
+`testing.yaml_assertions` is already shipped in #486 and unchanged here.
 
 ---
 
 ## 7. Multi-run sampling
 
-Specs declare `runs: N` in frontmatter; default is `1`. Each of the N runs becomes a separate pytest item, named `<spec-id>[run=0]` ... `<spec-id>[run=N-1]`, judged independently. `pytest_sessionfinish` prints `<spec-id>: <k>/<N> passed` to stdout. The pytest exit code reflects individual item pass/fail; the count is raw output for humans to read — no `2-of-3` policy, no aggregator. Whoever reads the output decides whether the ratio is acceptable for that spec.
+A test that benefits from sampling (tone, voice, anything where the LLM varies between runs) uses built-in pytest parametrization — nothing custom:
 
-Per-run pass/fail comes from the LLM judge (`Verdict.passed`, §5.1).
+```python
+@pytest.mark.parametrize("_run", range(3))
+def test_navigation_job_to_workflow(apollo_client, _run):
+    ...
+```
 
----
+Each value of `_run` becomes a separate pytest item. The arg is unused — the underscore signals that. Pytest's default output handles everything: per-item pass/fail in the run log, totals in the summary line, integration with `-k`, `pytest-xdist`, `--junitxml`. No `pytest_sessionfinish` tally, no custom marker, no policy — humans read the counts off pytest's normal output.
 
-## 8. Human review loop
+---
 
-**Primary: Langfuse UI.** Joe / Brandon open the dashboard, filter by `tags:acceptance`, review candidate + judge reasoning + score, override with a human annotation if they disagree.
+## 8. Langfuse integration
 
-**Secondary: pytest stdout / JUnit.** CI logs show `FAIL global-chat.hero.patient-sync[run=1]` with the judge's reasoning summary as the pytest message. Enough for a quick triage.
+Langfuse is already wired on `add-langfuse`. Acceptance leans on it lightly:
 
-No dedicated HTML report on day one. Add `pytest-html` the first time someone asks for it.
+### 8.1 Already in place (reused)
 
----
+- `services/langfuse_util.py::should_track()` gates trace export. Payloads set `user.employee=True`.
+- `@observe` on each chat service's `main()` — every acceptance call is auto-traced when `LANGFUSE_TRACING=true`.
 
-## 9. Triggers
+### 8.2 What's new
 
-Acceptance is **never triggered automatically**. A human decides when a change is big enough to warrant spending the money on a run.
+1. **Session tagging.** A `langfuse_session` fixture in `services/testing/` reads the current test id + run index from the pytest item and sets `session_id = f"acceptance-{test_id}-run{run}"` and `tags = ["acceptance", test_id]` via `propagate_attributes`.
+2. **Score push.** After the judge returns, write three scores per run: `acceptance_pass` (0/1), `acceptance_score` (0..1), and `acceptance_general_flag_count`. Use Langfuse's Scores API directly from `services/testing/judge.py`.
+3. **Cross-version comparison.** Native Langfuse dataset-runs view. The runner prints the URL on stdout at session end.
 
-| Trigger | Mechanism |
-|---|---|
-| Local manual | `poetry run pytest -m acceptance` |
-| CI manual (any branch) | GH Actions `workflow_dispatch` on the acceptance workflow |
-| PR label | Apply `run-acceptance` label to a PR |
+### 8.3 What's not built
 
-Explicitly excluded: no cron, no nightly, no push-to-main, no tag-push, no scheduled runs of any kind. If the team later decides they want continuous drift monitoring, that's a deliberate policy change — not a default.
+- No Langfuse-hosted eval (we own the prompt).
+- No hard dependency — acceptance runs offline if `LANGFUSE_PUBLIC_KEY` is unset OR `LANGFUSE_TRACING=false`; scores are skipped, tests still run.
 
 ---
 
-## 10. CI workflow
-
-A second GH Actions workflow alongside the shipped `unit-tests.yaml`. Two reasonable shapes; pick one when the integration tier's workflow lands so the choice is consistent:
+## 9. How to run
 
-- **Option A — shared `llm-tests.yaml`.** One file with two jobs (`integration` and `acceptance`), each gated by its own label / `workflow_dispatch` condition. Matches the original overview §6 design. Cleaner if integration and acceptance share env wiring.
-- **Option B — dedicated `acceptance-tests.yaml`.** Mirrors the existing `unit-tests.yaml` naming (one file per tier). Easier to grep, easier to point a human at "the acceptance workflow."
+```bash
+# Run everything marked acceptance
+poetry run pytest -m acceptance
 
-Either way, the acceptance job's shape mirrors the integration job (see `3-integration-tests.md` §9) with three differences:
+# Run one service's acceptance tests
+poetry run pytest services/workflow_chat/tests/acceptance
 
-1. **Triggers:** `run-acceptance` label or `workflow_dispatch` only — no `push`, no `schedule`.
-2. **Env:** `LANGFUSE_TRACING=true` (acceptance runs always trace; that's the point) plus the standard `*_TEST` secrets.
-3. **Timeout:** 45 minutes as a hard ceiling.
+# Run one test, with output
+poetry run pytest services/global_chat/tests/acceptance/test_planner_multistep.py -v
 
-The run command is `poetry run pytest -m acceptance -v --junitxml=tmp/test-logs/acceptance-junit.xml`. Note this is marker-filtered, not path-filtered like `unit-tests.yaml` (`pytest services/*/tests/unit`) — markdown specs need the collector hook to fire, which means crawling the test tree, which means filtering by marker. Deliberate divergence from the unit pattern, not an oversight.
+# Run against a staging server instead of spawning a local bun server
+APOLLO_TEST_BASE_URL=https://staging.apollo.openfn.org poetry run pytest -m acceptance
+```
 
-JUnit XML uploads as an artifact for 14 days.
+Requires the real `*_API_KEY` env vars (Anthropic, OpenAI, Pinecone) and — if Langfuse score push is wired — `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` / `LANGFUSE_BASE_URL`. Without Langfuse vars, tests still pass/fail normally; only the score push is skipped.
 
 ---
 
-## 11. Cost control
+## 10. Cost control
+
+- Run on demand only — every run is a deliberate human action.
+- Default to no parametrization; tests opt into multi-run with `@pytest.mark.parametrize("_run", range(N))`.
+- Judge defaults to `claude-sonnet-4-6`, not opus.
+- Prompt caching on candidate calls.
 
-Day-one approach is human-gated triggering + sensible defaults, not elaborate budget code:
+Add a budget env or circuit breaker the first time a run surprises someone. Not on day one.
 
-- Never automatic — every run is a deliberate human action.
-- `runs: 1` default — specs must opt into sampling.
-- Judge defaults to `claude-sonnet-4-6` (not opus).
-- Prompt caching on candidate calls — preserved across the N runs of one spec by shared `session_id`.
-- 45-minute workflow timeout as a hard ceiling.
+---
+
+## 11. Migration of existing artefacts
 
-A budget env + soft circuit breaker can be added the first time a manual run surprises someone. Not on day one.
+| Existing | Migration target | Notes |
+|---|---|---|
+| `services/workflow_chat/tests/test_qualitative.py` | one file per `def test_*` under `services/workflow_chat/tests/acceptance/` | Each test keeps its existing structural asserts (`assert_yaml_*`, deep-equality, response negative-substring). Add `QUALITY_CRITERIA` derived from the test's `print(...)` description + reviewer instincts. Swap `subprocess.run` for `apollo_client.call("workflow_chat", payload)`. Swap nested-dict payload construction for `build_workflow_chat_payload(...)`. Add `judge.evaluate(...)`. |
+| `services/job_chat/tests/test_qualitative.py` | one file per `def test_*` under `services/job_chat/tests/acceptance/` | Same pattern. `meta.rag` and `meta.last_page` map onto the builder's `rag_results` / `rag_queries` / `previous_page` kwargs. |
+| `services/global_chat/tests/test_planner_multistep.py` | one file per `def test_*` under `services/global_chat/tests/acceptance/` | Keep `assert_routed_to`, `assert_agent_calls`, `get_attachment` — they move to `services/testing/responses.py` and import from there. The "vague request" tests (`test_vague_gmail_to_database`, `test_gsheets_transform_salesforce_with_cron`) currently have no quality asserts — exactly the case `QUALITY_CRITERIA` was designed for. |
+| `services/job_chat/evaluation/questions.md` | optional — convert by hand into acceptance test files | Each `## question / ## adaptor / ## code` triple becomes a `build_job_chat_payload(...)` call. Quality criteria added by hand. |
+| `temp_test_empty_response_guard.py` (in all three services, from #487) | service tier, not acceptance | These are structural assertions against a mocked LLM and belong in `tests/service/` once that tier lands. Flagged here so they don't accidentally get pulled into `acceptance/` during migration. |
+
+Migration is opt-in, one test at a time. A test can ship to `acceptance/` before its `QUALITY_CRITERIA` are written: the structural assertions alone are valuable, and the judge call can run with an empty list (`criteria=[]`) and still surface `general_flags`.
 
 ---
 
 ## 12. Dependencies on the integration tier
 
-Acceptance and integration both dispatch through a live bun server, so they share infrastructure that the integration tier owns:
+Acceptance dispatches through a live bun server, so it shares infrastructure that the integration tier owns and ships first:
 
-- `services/testing/server.py` — session-scoped `apollo_server` fixture (spawn bun, poll `GET /` until ready, SIGTERM on teardown, honour `APOLLO_TEST_BASE_URL` to reuse a running staging server).
-- `services/testing/server.py::ApolloClient` — wraps `.call()`, `.stream()`, `.ws()`. The collector's `SpecItem.runtest` uses `.call()`.
+- `services/testing/apollo_client.py::ApolloClient` — wraps `.call()`, `.stream()`, `.ws()`.
+- `services/testing/server.py::apollo_server` — session-scoped pytest fixture; spawns `bun run start`, polls `GET /` until ready, SIGTERM on teardown, honours `APOLLO_TEST_BASE_URL` for staging reuse.
 
-Neither exists today. Acceptance can't ship before integration's server fixture lands. The collector hook itself can ship earlier (scaffolding PR) — markdown specs would just have no runner attached until `ApolloClient` arrives.
+Neither exists today. Acceptance can't run before they land. The judge module and payload builders **can** ship earlier — they're useful in any context that wants to evaluate an LLM response, not just acceptance.
 
 ---
 
 ## 13. Extensibility
 
-Adding a new sub-agent or tool — no Python required:
+Adding a new sub-agent or tool:
 
-1. Ensure the new service exposes `main()` at `services/<name>/<name>.py` (auto-mounts via `describe-modules.ts`).
-2. Create `services/<name>/tests/acceptance/` and drop markdown specs.
+1. Ensure the service exposes `main()` at `services/<name>/<name>.py` (auto-mounts via `describe-modules.ts`).
+2. Create `services/<name>/tests/acceptance/`.
+3. If the payload shape is meaningfully different from existing chat services, add `build_<name>_payload(...)` to `services/testing/payloads.py`. Otherwise reuse an existing builder.
+4. Drop test files in. The marker is auto-applied; pytest selects them via `-m acceptance`.
 
-Adding a new judge model: list it in the root `conftest.py` (or let it be free-form — strings all the way). `judge_model:` in frontmatter.
+Adding a new judge model: pass `model="…"` to `judge.evaluate()`. No config file needed.
 
 ---
 
@@ -340,43 +514,32 @@ Adding a new judge model: list it in the root `conftest.py` (or let it be free-f
 
 | Concern | Integration | Acceptance |
 |---|---|---|
-| Goal | Functional correctness | Quality, voice, style |
-| Assertions | Regex + shape | Natural-language criteria + LLM judge |
-| Trigger | PR label / push to main / manual | PR label / manual — **never automatic** |
-| Stability | Deterministic | Probabilistic (N runs) |
-| Runner | pytest | pytest |
+| Goal | Functional correctness | Quality, voice, style — *on top of* the structural checks |
+| Assertions | Regex + shape | Same structural shape + `judge.evaluate(QUALITY_CRITERIA, ...)` |
+| When run | Whenever the integration tier dictates | On demand only — never in an automated pipeline |
+| Stability | Deterministic | Probabilistic (optional `@pytest.mark.parametrize("_run", …)`) |
 | Marker | `@pytest.mark.integration` (auto-applied by root conftest) | `@pytest.mark.acceptance` (auto-applied by root conftest) |
-| Location | `services/<svc>/tests/integration/test_*.py` | `services/<svc>/tests/acceptance/*.md` |
+| Location | `services/<svc>/tests/integration/test_*.py` | `services/<svc>/tests/acceptance/test_*.py` |
 | `test_hooks` | not used (real HTTP) | not used (real HTTP) |
 
-**Overlap rule:** a test lives in exactly one tier. An "acceptance" spec that merely asserts a YAML attachment exists belongs in integration. An integration test that checks "the tone feels terse enough" belongs in acceptance.
-
----
-
-## 15. Migration of existing artefacts
-
-- `services/job_chat/evaluation/questions.md` — mostly-compatible format. One-time manual conversion (split per entry, add frontmatter, drop into `services/job_chat/tests/acceptance/`). No migration CLI needed — it's a one-shot editor task.
-- `services/global_chat/tests/test_workflow_chat_qualitative.py`, `services/global_chat/tests/test_job_chat_qualitative.py`, `services/workflow_chat/tests/test_qualitative.py`, `services/job_chat/tests/test_qualitative.py` — the prose at the top of each test (in `print()` statements) becomes `# notes`; `content`/`context` become spec sections; qualitative asserts become `# assertions` bullets. Drop the resulting markdown files into the relevant service's `acceptance/` folder (use `services/global_chat/tests/acceptance/` for cross-service tests targeting the orchestrator). Any machine-checkable asserts move to integration.
-- `temp_test_empty_response_guard.py` (recently added in #487 to all three chat services) — these are service-tier candidates (mocked LLM, structural assertion), not acceptance. Flagged here so they don't accidentally end up in `acceptance/` during migration.
-
-Migration is opt-in, one file at a time.
+**Overlap rule:** a test lives in exactly one tier. A test whose only purpose is structural is integration. A test that has *any* quality criterion the LLM judge should evaluate is acceptance — even if it also has structural asserts.
 
 ---
 
-## 16. What this tier deliberately does NOT do
+## 15. What this tier deliberately does NOT do
 
-- **No top-level `tests/` tree.** Specs live under their service.
-- **No `apollo/testing/` peer of `services/`.** The shared package is `services/testing/` — that's what shipped in #486.
+- **No top-level `tests/` tree.** Tests live under their service.
+- **No `apollo/testing/` peer of `services/`.** The shared package is `services/testing/` — what shipped in #486.
 - **No `services/llm_evaluator/` service.** Judge is a helper module.
-- **No custom acceptance runner.** Pytest collects specs; that's it.
-- **No `bless` / `differ` / `migrate-questions` / `review` subcommands.** The first two make sense if we adopt golden-file diffing; we don't on day one (Langfuse's dataset-runs comparison is the primitive). The last two are one-off editor tasks.
+- **No custom acceptance runner, no markdown-spec DSL.** The criteria field is a plain Python list; everything else is normal pytest.
+- **No `bless` / `differ` / `migrate-questions` / `review` subcommands.**
 - **No `golden/` git tree.** Model drift is tracked in Langfuse.
-- **No HTML reporter.** `pytest-html` is a line in `pyproject.toml` the day we want it.
-- **No per-spec cost caps, budget estimator, `list`/`lint` commands, skip-on-no-change mode.** Defer until bills say otherwise.
-- **No `criteria_mode: weighted` with per-criterion weights.** `all` or `any` across criteria. Add weighting when a spec genuinely needs it.
+- **No HTML reporter.** `pytest-html` the day it's asked for.
+- **No per-spec cost caps, budget estimator, `list`/`lint` commands.** Defer until bills say otherwise.
+- **No exhaustive criteria.** The open-ended general-flag instruction in the judge prompt is the safety net.
 
 ---
 
 ## Summary
 
-Acceptance = markdown specs in `services/<svc>/tests/acceptance/` + a `pytest_collect_file` hook in the existing root `conftest.py` + a tiny judge helper in `services/testing/judge.py` + Langfuse scores. The `acceptance` marker is already declared in `pyproject.toml` and auto-applied by the existing path-based hook — no new marker plumbing. No new Apollo service, no custom runner, no golden tree, no top-level `tests/` directory. Runs via the standard pytest mechanism under a label-gated GH Actions workflow. Adding a sub-agent or tool means dropping markdown files under that service's `acceptance/` folder. The judge promotes to a service the day it has a second caller; nothing else changes. Acceptance depends on the integration tier shipping `services/testing/server.py` (`apollo_server` fixture + `ApolloClient`) — it can't run before that lands, though the collector itself can ship earlier in the scaffolding PR.
+Acceptance tests are **Python files** in `services/<svc>/tests/acceptance/` — same shape as today's `test_qualitative.py`, with three additions: a `QUALITY_CRITERIA = [...]` module constant (the "edit me" surface for non-technical contributors), `build_<svc>_chat_payload(...)` for intuitive payload construction, and a `judge.evaluate(...)` call alongside the existing structural assertions. The judge grades the listed criteria *and* flags anything else notable, so the criteria list never has to be exhaustive. All existing structural assertions are preserved verbatim; nothing about how the current tests reason about yaml shape, page prefixes, tool-call sequencing, or attachments changes. The judge, payload builders, and response helpers live in the shared `services/testing/` package. Acceptance depends on the integration tier's `apollo_server` fixture + `ApolloClient`; it can't run before those land, though the judge and payload builders can ship earlier.
diff --git a/conftest.py b/conftest.py
index dd03bd8..2d70b9f 100644
--- a/conftest.py
+++ b/conftest.py
@@ -11,6 +11,11 @@
 import pytest
 
 
+# Shared fixtures (apollo_client, ...) live in services/testing/fixtures.py.
+# Registered globally so any test can request them by name.
+pytest_plugins = ["testing.fixtures"]
+
+
 _TIER_DIRS = ("unit", "service", "integration", "acceptance")
 
 _BLOCKED_TARGETS = (
diff --git a/services/global_chat/tests/acceptance/__init__.py b/services/global_chat/tests/acceptance/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.py b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.py
new file mode 100644
index 0000000..7274a1e
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.py
@@ -0,0 +1,58 @@
+"""From-scratch CommCare→DHIS2 workflow with job code for both steps. No
+existing YAML, no history. The planner should be invoked, call the workflow
+agent to produce a two-job workflow, then call the job code agent at least
+twice to fill in the bodies."""
+
+import yaml
+
+from testing import judge
+from testing.payloads import build_global_chat_payload
+from testing.responses import assert_routed_to, assert_agent_calls, get_attachment
+from testing.yaml_assertions import assert_yaml_has_ids, assert_yaml_jobs_have_body
+
+
+QUALITY_CRITERIA = [
+    "The response explains the workflow's purpose in plain language a non-engineer can follow.",
+    "The job code for the CommCare step calls CommCare adaptor functions (e.g. submissions, forms, cases), not generic JavaScript.",
+    "The job code for the DHIS2 step calls DHIS2 adaptor functions (e.g. create, upsert, trackedEntities), not generic JavaScript.",
+    "The response does not leak an api_key or any value that looks like a secret.",
+]
+
+
+def test_commcare_to_dhis2_with_job_code(apollo_client):
+    payload = build_global_chat_payload(
+        user_message="Create a workflow that fetches patient cases from CommCare and registers them in DHIS2.",
+        history=[],
+    )
+
+    response = apollo_client.call("global_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    assert_routed_to(response, "planner")
+
+    yaml_str = get_attachment(response, "workflow_yaml")
+    assert yaml_str, "Expected a workflow_yaml attachment"
+
+    parsed = yaml.safe_load(yaml_str)
+    assert "jobs" in parsed, "YAML must have a jobs section"
+    assert len(parsed["jobs"]) >= 2, f"Expected at least 2 jobs, got {len(parsed['jobs'])}"
+    assert "triggers" in parsed, "YAML must have a triggers section"
+
+    assert_yaml_has_ids(yaml_str)
+    assert_yaml_jobs_have_body(yaml_str)
+
+    assert_agent_calls(
+        response.get("meta") or {},
+        expected_agents=["planner", "workflow_agent", "job_agent"],
+        min_job_code_calls=2,
+    )
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(
+        criteria=QUALITY_CRITERIA,
+        candidate=response,
+        test_notes=__doc__,
+    )
+    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/__init__.py b/services/job_chat/tests/acceptance/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/services/job_chat/tests/acceptance/test_adaptor_context_switching.py b/services/job_chat/tests/acceptance/test_adaptor_context_switching.py
new file mode 100644
index 0000000..7e94174
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_adaptor_context_switching.py
@@ -0,0 +1,60 @@
+"""User was on a Salesforce job page and asked "How do I get data?"; the
+assistant answered with SOQL. The user has now navigated to a DHIS2 job page
+and asks the same question again. The model should notice the page-prefix
+change and switch context to DHIS2-specific guidance."""
+
+from testing import judge
+from testing.payloads import build_job_chat_payload
+from testing.responses import latest_user_message
+
+
+QUALITY_CRITERIA = [
+    "The response is specifically about fetching data from DHIS2 — not from Salesforce.",
+    "The response references DHIS2 concepts (tracker, data values, events, programs, etc.) rather than SOQL or SQL.",
+    "The response does not assume the previous Salesforce context still applies.",
+]
+
+
+def test_adaptor_context_switching(apollo_client):
+    payload = build_job_chat_payload(
+        user_message="How do I get data?",
+        history=[
+            {"role": "user", "content": "[pg:job_code/fetch-records/salesforce@9.0.3] How do I get data?"},
+            {"role": "assistant", "content": (
+                "To get data from Salesforce, you can use the `query()` operation with SOQL "
+                "(Salesforce Object Query Language). For example:\n\n"
+                "```js\nquery('SELECT Id, Name FROM Account WHERE Status = \"Active\"');\n```\n\n"
+                "This will fetch records from Salesforce and store them in `state.data`."
+            )},
+        ],
+        current_job_code="fn(state => {\n  return state;\n});",
+        current_adaptor="@openfn/language-dhis2@8.0.7",
+        current_page="fetch-data",
+        suggest_code=False,
+    )
+
+    response = apollo_client.call("job_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert "response" in response
+
+    # Page prefix was applied to the latest user turn in updated history
+    latest_user = latest_user_message(response)
+    assert latest_user is not None, "Expected at least one user message in updated history"
+    assert "[pg:job_code/fetch-data/dhis2@8.0.7]" in latest_user["content"], (
+        f"Expected DHIS2 page prefix in latest user message, got: {latest_user['content'][:200]}"
+    )
+
+    # DHIS2 mentioned in response text
+    assert "dhis" in response["response"].lower(), (
+        f"Expected DHIS2 to be mentioned in response. Got: {response['response'][:300]}"
+    )
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(
+        criteria=QUALITY_CRITERIA,
+        candidate=response,
+        test_notes=__doc__,
+    )
+    assert verdict.passed, verdict.summary
diff --git a/services/testing/README.md b/services/testing/README.md
index 8b30f9c..dcb1786 100644
--- a/services/testing/README.md
+++ b/services/testing/README.md
@@ -10,6 +10,22 @@ This directory is on the Python path via `pyproject.toml`
 
 - `yaml_assertions.py` — pure-function YAML structural assertions, safe for
   every tier (unit included).
+- `judge.py` — LLM-as-judge helper for acceptance tests. Evaluates chat-service
+  responses against natural-language criteria. Loads universal rules from
+  `judge_rules.md` at evaluation time.
+- `judge_rules.md` — universal rules appended to the judge's system prompt for
+  every acceptance evaluation. Edit this file to add project-wide standards
+  (voice, style, refusal handling, etc.). One rule per bullet. Empty file = no rules.
+- `payloads.py` — `build_global_chat_payload`, `build_workflow_chat_payload`,
+  `build_job_chat_payload`. Intuitive named kwargs that translate to the JSON
+  shape each service expects.
+- `responses.py` — `get_attachment`, `assert_routed_to`, `assert_agent_calls`,
+  `latest_user_message`. Shared response helpers used across acceptance tests.
+- `apollo_client.py` — `ApolloClient` for dispatching to a chat service.
+  Currently a subprocess-based stub; the integration tier will replace its
+  internals with a real HTTP client.
+- `fixtures.py` — pytest fixtures (`apollo_client`). Registered via
+  `pytest_plugins = ["testing.fixtures"]` in the repo-root `conftest.py`.
 
 ## Why under `services/` and not a top-level `tests/`?
 
diff --git a/services/testing/apollo_client.py b/services/testing/apollo_client.py
new file mode 100644
index 0000000..cd0b01f
--- /dev/null
+++ b/services/testing/apollo_client.py
@@ -0,0 +1,71 @@
+"""Client for dispatching to chat services in acceptance tests.
+
+This is a **stub** — it shells out to `services/entry.py` via subprocess, the
+same pattern the existing `tests/test_utils.py` files use. The integration
+tier will replace the implementation with a real HTTP client backed by a
+session-scoped `apollo_server` fixture (`bun run start`), keeping the same
+`.call()` signature so acceptance tests don't need to change.
+
+The stub is deliberately minimal:
+- One method: `.call(service_name, payload)`.
+- No streaming, no WebSocket support — acceptance dispatches synchronous JSON.
+- No retry, no timeout config — failures surface as `RuntimeError` for the
+  test to handle.
+"""
+
+import json
+import subprocess
+import sys
+import tempfile
+from pathlib import Path
+from typing import Any
+
+
+_ENTRY_PY = Path(__file__).parent.parent / "entry.py"
+_SERVICES_DIR = Path(__file__).parent.parent
+
+
+class ApolloClient:
+    """Dispatches a JSON payload to a chat service and returns the response dict.
+
+    Today: spawns `python entry.py <service>` per call.
+    Future (integration tier): POSTs to a long-lived bun server.
+    """
+
+    def call(self, service_name: str, payload: dict) -> dict[str, Any]:
+        """Invoke `service_name` with `payload`. Returns the parsed JSON response.
+
+        Raises RuntimeError if the service exits non-zero.
+        """
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as input_file:
+            json.dump(payload, input_file, indent=2)
+            input_path = input_file.name
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as output_file:
+            output_path = output_file.name
+
+        try:
+            result = subprocess.run(
+                [
+                    sys.executable,
+                    str(_ENTRY_PY),
+                    service_name,
+                    "--input", input_path,
+                    "--output", output_path,
+                ],
+                capture_output=True,
+                text=True,
+                cwd=_SERVICES_DIR,
+            )
+            if result.returncode != 0:
+                raise RuntimeError(
+                    f"{service_name} exited {result.returncode}.\n"
+                    f"stderr:\n{result.stderr}"
+                )
+            with open(output_path) as f:
+                return json.load(f)
+        finally:
+            for path in (input_path, output_path):
+                try:
+                    Path(path).unlink()
+                except OSError:
+                    pass
diff --git a/services/testing/fixtures.py b/services/testing/fixtures.py
new file mode 100644
index 0000000..fb15f72
--- /dev/null
+++ b/services/testing/fixtures.py
@@ -0,0 +1,20 @@
+"""Pytest fixtures shared across acceptance and integration tests.
+
+Registered via `pytest_plugins = ["testing.fixtures"]` in the repo-root
+`conftest.py` so any test can request these fixtures by name.
+"""
+
+import pytest
+
+from testing.apollo_client import ApolloClient
+
+
+@pytest.fixture(scope="session")
+def apollo_client() -> ApolloClient:
+    """Session-scoped client for dispatching to chat services.
+
+    Today: subprocess-based stub. The integration tier will swap the
+    underlying implementation for a real HTTP client backed by a long-lived
+    bun server, without changing this fixture's interface.
+    """
+    return ApolloClient()
diff --git a/services/testing/judge.py b/services/testing/judge.py
new file mode 100644
index 0000000..cd618b0
--- /dev/null
+++ b/services/testing/judge.py
@@ -0,0 +1,257 @@
+"""LLM-as-judge for acceptance tests.
+
+Evaluates a chat service's response against a small list of natural-language
+criteria, returning a structured verdict.
+
+Three layers feed the judge prompt:
+1. Universal rules — loaded from `judge_rules.md` next to this file. Apply to
+   every evaluation. Edit the markdown file to change them; no Python touched.
+2. Per-test criteria — passed in via `evaluate(criteria=[...])`.
+3. Open-ended "flag anything else notable" — hardcoded in the system prompt.
+   Means the criteria list never has to be exhaustive.
+
+Usage:
+    from testing import judge
+
+    verdict = judge.evaluate(
+        criteria=["The response uses British English spelling.", ...],
+        candidate=response_dict,
+        test_notes=__doc__,
+    )
+    assert verdict.passed, verdict.summary
+"""
+
+import json
+import os
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+
+from anthropic import Anthropic
+
+
+DEFAULT_MODEL = "claude-sonnet-4-5"
+_RULES_PATH = Path(__file__).parent / "judge_rules.md"
+
+
+@dataclass
+class CriterionResult:
+    criterion: str
+    passed: bool
+    reasoning: str
+
+
+@dataclass
+class GeneralFlag:
+    description: str
+    severity: str  # "note" or "regression"
+
+
+@dataclass
+class Verdict:
+    passed: bool
+    score: float
+    criteria: list[CriterionResult]
+    general_flags: list[GeneralFlag]
+    summary: str
+    judge_usage: dict = field(default_factory=dict)
+
+
+def _load_universal_rules() -> str:
+    """Read the universal-rules markdown file. Empty string if absent or empty."""
+    if not _RULES_PATH.exists():
+        return ""
+    text = _RULES_PATH.read_text().strip()
+    return text
+
+
+def _build_system_prompt() -> str:
+    universal = _load_universal_rules()
+    parts = [
+        "You are a strict but fair quality reviewer for an AI assistant's responses.",
+        "You will be given (a) optional universal rules that apply to every response, "
+        "(b) a list of test-specific criteria, and (c) the AI assistant's full response "
+        "to evaluate.",
+        "",
+        "Return JSON with this exact shape:",
+        "{",
+        '  "criteria": [{"criterion": str, "passed": bool, "reasoning": str}, ...],',
+        '  "general_flags": [{"description": str, "severity": "note" | "regression"}, ...]',
+        "}",
+        "",
+        "For each listed criterion, return one object in the same order as given.",
+        "",
+        "Additionally, flag anything else in the response that looks like a problem — "
+        "tone drift, hedging, hallucinated facts, leaked secrets or API keys, broken "
+        "formatting, factual errors, or anything else that would make a reviewer "
+        "pause — even if no criterion covers it. Mark each flag with severity:",
+        "  - 'note': minor or informational",
+        "  - 'regression': would surprise a reviewer or hurt a user",
+        "If nothing is notable, return an empty general_flags array.",
+    ]
+    if universal:
+        parts += [
+            "",
+            "UNIVERSAL RULES (apply to every response):",
+            universal,
+        ]
+    return "\n".join(parts)
+
+
+def _build_user_prompt(
+    criteria: list[str],
+    candidate: dict,
+    test_notes: Optional[str],
+) -> str:
+    parts = []
+    if test_notes:
+        parts += [
+            "TEST CONTEXT (background; do not grade against this directly):",
+            test_notes.strip(),
+            "",
+        ]
+    parts += ["CRITERIA TO EVALUATE:"]
+    if criteria:
+        for i, c in enumerate(criteria, 1):
+            parts.append(f"  {i}. {c}")
+    else:
+        parts.append("  (none — rely on universal rules and general_flags only)")
+    parts += [
+        "",
+        "CANDIDATE RESPONSE (the AI assistant's full output, as JSON):",
+        json.dumps(candidate, indent=2, default=str),
+    ]
+    return "\n".join(parts)
+
+
+def _parse_verdict(
+    raw_text: str,
+    criteria: list[str],
+    usage: dict,
+) -> Verdict:
+    """Parse JSON judge output into a Verdict. Lenient: missing fields → defaults."""
+    try:
+        data = json.loads(raw_text)
+    except json.JSONDecodeError as e:
+        return Verdict(
+            passed=False,
+            score=0.0,
+            criteria=[],
+            general_flags=[GeneralFlag(description=f"judge_error: {e}", severity="regression")],
+            summary=f"judge_error: failed to parse JSON output\n---\n{raw_text[:500]}",
+            judge_usage=usage,
+        )
+
+    raw_criteria = data.get("criteria") or []  # tolerate a missing or null list
+    parsed_criteria = []
+    for i, criterion_text in enumerate(criteria):
+        if i < len(raw_criteria):
+            entry = raw_criteria[i]
+            parsed_criteria.append(CriterionResult(
+                criterion=criterion_text,
+                passed=bool(entry.get("passed", False)),
+                reasoning=str(entry.get("reasoning", "")),
+            ))
+        else:
+            parsed_criteria.append(CriterionResult(
+                criterion=criterion_text,
+                passed=False,
+                reasoning="judge_error: no verdict returned for this criterion",
+            ))
+
+    raw_flags = data.get("general_flags", []) or []
+    parsed_flags = [
+        GeneralFlag(
+            description=str(f.get("description", "")),
+            severity=str(f.get("severity", "note")),
+        )
+        for f in raw_flags
+    ]
+
+    all_criteria_passed = all(c.passed for c in parsed_criteria) if parsed_criteria else True
+    has_regression = any(f.severity == "regression" for f in parsed_flags)
+    passed = all_criteria_passed and not has_regression
+    score = (sum(1 for c in parsed_criteria if c.passed) / len(parsed_criteria)) if parsed_criteria else 1.0
+
+    summary = _format_summary(parsed_criteria, parsed_flags, passed)
+
+    return Verdict(
+        passed=passed,
+        score=score,
+        criteria=parsed_criteria,
+        general_flags=parsed_flags,
+        summary=summary,
+        judge_usage=usage,
+    )
+
+
+def _format_summary(criteria: list[CriterionResult], flags: list[GeneralFlag], passed: bool) -> str:
+    lines = [f"Verdict: {'PASS' if passed else 'FAIL'}"]
+    if criteria:
+        lines.append("")
+        lines.append("Criteria:")
+        for c in criteria:
+            mark = "✓" if c.passed else "✗"
+            lines.append(f"  {mark} {c.criterion}")
+            if c.reasoning:
+                lines.append(f"      → {c.reasoning}")
+    if flags:
+        lines.append("")
+        lines.append("General flags:")
+        for f in flags:
+            lines.append(f"  [{f.severity}] {f.description}")
+    return "\n".join(lines)
+
+
+def evaluate(
+    *,
+    criteria: list[str],
+    candidate: dict,
+    test_notes: Optional[str] = None,
+    model: str = DEFAULT_MODEL,
+    client: Optional[Anthropic] = None,
+) -> Verdict:
+    """Evaluate a candidate response against criteria using an LLM judge.
+
+    Args:
+        criteria: List of natural-language criteria. Can be empty — universal
+            rules and general_flags still apply.
+        candidate: Full response dict from the chat service. Whatever it
+            contains is shown verbatim to the judge.
+        test_notes: Optional background context (typically the test's __doc__).
+            Shown to the judge but not graded against directly.
+        model: Judge model. Defaults to `DEFAULT_MODEL` (a Claude Sonnet model).
+        client: Optional Anthropic client. Constructed from env if not given.
+
+    Returns:
+        A Verdict. Test code typically asserts on verdict.passed and uses
+        verdict.summary as the failure message.
+    """
+    if client is None:
+        api_key = os.environ.get("ANTHROPIC_API_KEY")
+        if not api_key:
+            raise RuntimeError(
+                "judge.evaluate requires ANTHROPIC_API_KEY. "
+                "Acceptance tests are run on demand against real models."
+            )
+        client = Anthropic(api_key=api_key)
+
+    system_prompt = _build_system_prompt()
+    user_prompt = _build_user_prompt(criteria, candidate, test_notes)
+
+    response = client.messages.create(
+        model=model,
+        max_tokens=4096,
+        system=system_prompt,
+        messages=[
+            {"role": "user", "content": user_prompt},
+            {"role": "assistant", "content": "{"},  # prefill to force JSON
+        ],
+    )
+
+    raw_text = "{" + response.content[0].text
+    usage = {
+        "input_tokens": response.usage.input_tokens,
+        "output_tokens": response.usage.output_tokens,
+    }
+    return _parse_verdict(raw_text, criteria, usage)
diff --git a/services/testing/judge_rules.md b/services/testing/judge_rules.md
new file mode 100644
index 0000000..e69de29
diff --git a/services/testing/payloads.py b/services/testing/payloads.py
new file mode 100644
index 0000000..25576ac
--- /dev/null
+++ b/services/testing/payloads.py
@@ -0,0 +1,188 @@
+"""Payload builders for acceptance tests.
+
+The function signatures are deliberately closer to user-facing concepts
+(`current_job_code`, `current_adaptor`, `previous_page`) than the underlying
+JSON shape (`context.expression`, `context.adaptor`, `meta.last_page`). The
+goal is that a contributor who has not memorised every service's payload spec
+can still author a test from these signatures alone.
+
+Each builder is "set if not None" — pass only the fields a test cares about.
+"""
+
+from typing import Any, Optional
+
+
+def build_global_chat_payload(
+    *,
+    user_message: str,
+    history: Optional[list[dict]] = None,
+    workflow_yaml: Optional[str] = None,
+    page: Optional[str] = None,
+    attachments: Optional[list[dict]] = None,
+    api_key: Optional[str] = None,
+    stream: bool = False,
+) -> dict:
+    """Build a global_chat service payload.
+
+    global_chat is the orchestrator entry point — it routes to workflow_chat,
+    job_chat, or the planner depending on context. For workflow scenarios pass
+    `workflow_yaml`; for job-code scenarios the planner will extract the
+    relevant job from the YAML using `page`.
+    """
+    payload: dict[str, Any] = {
+        "content": user_message,
+        "history": history or [],
+    }
+    if workflow_yaml is not None:
+        payload["workflow_yaml"] = workflow_yaml
+    if page is not None:
+        payload["page"] = page
+    if attachments is not None:
+        payload["attachments"] = attachments
+    if api_key is not None:
+        payload["api_key"] = api_key
+    if stream:
+        payload["options"] = {"stream": True}
+    return payload
+
+
+def build_workflow_chat_payload(
+    *,
+    user_message: Optional[str] = None,
+    existing_yaml: str = "",
+    history: Optional[list[dict]] = None,
+    errors: Optional[str] = None,
+    current_page: Optional[str] = None,
+    previous_page: Optional[dict] = None,
+    api_key: Optional[str] = None,
+) -> dict:
+    """Build a workflow_chat service payload.
+
+    Args:
+        user_message: The user's latest message. Required unless `errors` is set.
+        existing_yaml: Current workflow YAML the user is editing.
+        history: Chat history as a list of {role, content} dicts.
+        errors: An error string, sent as `errors`. Puts the service in
+            error-correction mode; `user_message` may then be omitted.
+        current_page: The page the user is currently on (e.g. workflow name).
+            Threaded into `context.page_name`.
+        previous_page: Where the user navigated from. Threaded into
+            `meta.last_page`. Shape: {"type": "job_code" | "workflow", "name": str, "adaptor": str}.
+        api_key: Optional Anthropic API key override.
+    """
+    payload: dict[str, Any] = {
+        "existing_yaml": existing_yaml,
+        "history": history or [],
+    }
+    if user_message is not None:
+        payload["content"] = user_message
+    if errors is not None:
+        payload["errors"] = errors
+
+    context: dict[str, Any] = {}
+    if current_page is not None:
+        context["page_name"] = current_page
+    if context:
+        payload["context"] = context
+
+    meta: dict[str, Any] = {}
+    if previous_page is not None:
+        meta["last_page"] = previous_page
+    if meta:
+        payload["meta"] = meta
+
+    if api_key is not None:
+        payload["api_key"] = api_key
+    return payload
+
+
+def build_job_chat_payload(
+    *,
+    user_message: str,
+    history: Optional[list[dict]] = None,
+    current_job_code: Optional[str] = None,
+    current_adaptor: Optional[str] = None,
+    project_adaptors: Optional[list[str]] = None,
+    current_page: Optional[str] = None,
+    project_id: Optional[str] = None,
+    job_id: Optional[str] = None,
+    input_data: Any = None,
+    output_data: Any = None,
+    log_data: Any = None,
+    rag_results: Optional[list[dict]] = None,
+    rag_queries: Optional[list[str]] = None,
+    previous_page: Optional[dict] = None,
+    suggest_code: Optional[bool] = None,
+    api_key: Optional[str] = None,
+    stream: Optional[bool] = None,
+    download_adaptor_docs: Optional[bool] = None,
+) -> dict:
+    """Build a job_chat service payload.
+
+    Args:
+        user_message: The user's latest message.
+        history: Chat history as a list of {role, content} dicts.
+        current_job_code: The job code currently in the editor. → context.expression
+        current_adaptor: The adaptor specifier (e.g. "@openfn/language-http@6.5.4"). → context.adaptor
+        project_adaptors: Other adaptors used in the project. → context.adaptors
+        current_page: Current page / job name. → context.page_name
+        project_id: → context.projectId
+        job_id: → context.jobId
+        input_data / output_data / log_data: Sample data the user has available.
+        rag_results: Pre-injected RAG search results. → meta.rag.search_results
+        rag_queries: Pre-injected RAG search queries. → meta.rag.search_queries
+        previous_page: Where the user navigated from. → meta.last_page
+        suggest_code: Enable code-suggestion mode (returns suggested_code in response).
+        api_key: Optional Anthropic API key override.
+        stream: Enable streaming.
+        download_adaptor_docs: Whether to load adaptor docs (default True in service).
+    """
+    payload: dict[str, Any] = {
+        "content": user_message,
+        "history": history or [],
+    }
+
+    context: dict[str, Any] = {}
+    if current_job_code is not None:
+        context["expression"] = current_job_code
+    if current_adaptor is not None:
+        context["adaptor"] = current_adaptor
+    if project_adaptors is not None:
+        context["adaptors"] = project_adaptors
+    if current_page is not None:
+        context["page_name"] = current_page
+    if project_id is not None:
+        context["projectId"] = project_id
+    if job_id is not None:
+        context["jobId"] = job_id
+    if input_data is not None:
+        context["input"] = input_data
+    if output_data is not None:
+        context["output"] = output_data
+    if log_data is not None:
+        context["log"] = log_data
+    if context:
+        payload["context"] = context
+
+    meta: dict[str, Any] = {}
+    if rag_results is not None or rag_queries is not None:
+        rag: dict[str, Any] = {}
+        if rag_results is not None:
+            rag["search_results"] = rag_results
+        if rag_queries is not None:
+            rag["search_queries"] = rag_queries
+        meta["rag"] = rag
+    if previous_page is not None:
+        meta["last_page"] = previous_page
+    if meta:
+        payload["meta"] = meta
+
+    if api_key is not None:
+        payload["api_key"] = api_key
+    if suggest_code is not None:
+        payload["suggest_code"] = suggest_code
+    if stream is not None:
+        payload["stream"] = stream
+    if download_adaptor_docs is not None:
+        payload["download_adaptor_docs"] = download_adaptor_docs
+    return payload
diff --git a/services/testing/responses.py b/services/testing/responses.py
new file mode 100644
index 0000000..07ded4a
--- /dev/null
+++ b/services/testing/responses.py
@@ -0,0 +1,90 @@
+"""Shared response helpers for acceptance tests.
+
+Originally lived as duplicate code in `services/global_chat/tests/test_utils.py`
+and the equivalent for other services. Centralised here so all acceptance tests
+share one implementation.
+"""
+
+from typing import Any, Optional
+
+
+def get_attachment(response: dict, attachment_type: str) -> Optional[str]:
+    """Return the `content` of the first attachment matching `attachment_type`.
+
+    Works for any service that returns `attachments: [{"type": ..., "content": ...}]`
+    (today: global_chat). Returns None if no matching attachment.
+    """
+    for attachment in response.get("attachments", []) or []:
+        if attachment.get("type") == attachment_type:
+            return attachment.get("content")
+    return None
+
+
+def assert_routed_to(response: dict, expected_agent: str, *, context: str = "") -> None:
+    """Assert `expected_agent` is present in `response["meta"]["agents"]`.
+
+    `meta.agents` is a list like ["router", "planner"] or ["router", "workflow_agent"].
+    Use this to verify the orchestrator routed the request the way you expected.
+    """
+    assert response is not None, f"{context}: response is None"
+    assert isinstance(response, dict), f"{context}: response is not a dict ({type(response).__name__})"
+
+    meta = response.get("meta") or {}
+    agents = meta.get("agents") or []
+    assert expected_agent in agents, (
+        f"{context}: expected '{expected_agent}' in meta.agents, got {agents}"
+    )
+
+
+def assert_agent_calls(
+    meta: dict,
+    *,
+    expected_agents: list[str],
+    min_job_code_calls: int = 0,
+    context: str = "",
+) -> None:
+    """Assert the planner orchestrated the expected sub-agents in the right order.
+
+    Checks:
+    - Every agent in `expected_agents` appears in `meta["agents"]`.
+    - `meta["tool_calls"]` contains at least one `call_workflow_agent`.
+    - `meta["tool_calls"]` contains at least `min_job_code_calls` of `call_job_code_agent`.
+    - Every `call_job_code_agent` comes after the first `call_workflow_agent`.
+
+    Used by global_chat planner-chain tests.
+    """
+    agents = meta.get("agents") or []
+    for agent in expected_agents:
+        assert agent in agents, f"{context}: expected '{agent}' in agents, got {agents}"
+
+    tool_calls = meta.get("tool_calls") or []
+    tool_names = [call.get("tool") for call in tool_calls]
+
+    assert "call_workflow_agent" in tool_names, (
+        f"{context}: expected call_workflow_agent in tool_calls, got {tool_names}"
+    )
+
+    job_code_indices = [i for i, name in enumerate(tool_names) if name == "call_job_code_agent"]
+    assert len(job_code_indices) >= min_job_code_calls, (
+        f"{context}: expected at least {min_job_code_calls} call_job_code_agent calls, "
+        f"got {len(job_code_indices)}. Tool calls: {tool_names}"
+    )
+
+    workflow_idx = tool_names.index("call_workflow_agent")
+    for j in job_code_indices:
+        assert j > workflow_idx, (
+            f"{context}: call_job_code_agent at index {j} came before "
+            f"call_workflow_agent at index {workflow_idx}. Tool calls: {tool_names}"
+        )
+
+
+def latest_user_message(response: dict) -> Optional[dict]:
+    """Return the most recent `role=user` message from `response["history"]`.
+
+    Useful for verifying page-prefix tagging applied to the user's input.
+    """
+    history = response.get("history") or []
+    for entry in reversed(history):
+        if entry.get("role") == "user":
+            return entry
+    return None
diff --git a/services/workflow_chat/tests/acceptance/__init__.py b/services/workflow_chat/tests/acceptance/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.py b/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.py
new file mode 100644
index 0000000..80a6ba4
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.py
@@ -0,0 +1,110 @@
+"""User has just navigated from a job-code editor (where they were discussing
+HTTP error handling) to a workflow editor and asks to add a new step. The
+model should infer the context switch and respond about the workflow as a
+structure, not continue talking about job-code error handling."""
+
+import yaml
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import assert_no_special_chars
+
+
+QUALITY_CRITERIA = [
+    "The response talks about the workflow as a structure (jobs, edges, triggers), not about job-code-level error handling.",
+    "The tone is warm and collaborative, not clinical or terse.",
+    "If the response proposes a new email step, the rationale is plausible (e.g. mentions notification, summary, or alerting).",
+]
+
+
+EXISTING_YAML = """name: data-pipeline
+jobs:
+  fetch-source-data:
+    id: job-fetch-id
+    name: Fetch Source Data
+    adaptor: '@openfn/language-http@6.5.4'
+    body: 'get("https://source.api/data");'
+  transform-data:
+    id: job-transform-id
+    name: Transform Data
+    adaptor: '@openfn/language-common@latest'
+    body: 'fn(state => { return { ...state, transformed: true }; });'
+  save-to-database:
+    id: job-save-id
+    name: Save to Database
+    adaptor: '@openfn/language-http@6.5.4'
+    body: 'post("https://db.api/save", state => state.data);'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->fetch-source-data:
+    id: edge-webhook-fetch-id
+    source_trigger: webhook
+    target_job: fetch-source-data
+    condition_type: always
+    enabled: true
+  fetch-source-data->transform-data:
+    id: edge-fetch-transform-id
+    source_job: fetch-source-data
+    target_job: transform-data
+    condition_type: on_job_success
+    enabled: true
+  transform-data->save-to-database:
+    id: edge-transform-save-id
+    source_job: transform-data
+    target_job: save-to-database
+    condition_type: on_job_success
+    enabled: true
+"""
+
+
+def test_navigation_job_to_workflow(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml=EXISTING_YAML,
+        history=[
+            {"role": "user", "content": "[pg:job_code/transform-data/http] Can you add error handling to this HTTP request?"},
+            {"role": "assistant", "content": "I'll add try-catch error handling to catch any request failures in your HTTP job."},
+            {"role": "user", "content": "[pg:job_code/transform-data/http] Also add retry logic with backoff"},
+            {"role": "assistant", "content": "I'll add exponential backoff retry logic to handle transient failures."},
+        ],
+        user_message="Add a step to send the results via email",
+        current_page="data-pipeline",
+        previous_page={"type": "job_code", "name": "transform-data", "adaptor": "http"},
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response["response_yaml"], "Model should have generated YAML"
+
+    yaml_obj = yaml.safe_load(response["response_yaml"])
+    assert "jobs" in yaml_obj
+
+    job_names = [job.get("name", "").lower() for job in yaml_obj["jobs"].values()]
+    assert any("email" in n or "mail" in n or "send" in n for n in job_names), (
+        f"Email job not found in workflow. Job names: {job_names}"
+    )
+
+    orig_yaml_obj = yaml.safe_load(EXISTING_YAML)
+    assert len(yaml_obj["jobs"]) > len(orig_yaml_obj["jobs"]), (
+        f"Expected a new job to be added. Original: {len(orig_yaml_obj['jobs'])}, "
+        f"new: {len(yaml_obj['jobs'])}"
+    )
+
+    response_text = response["response"].lower()
+    assert not any(phrase in response_text for phrase in ["try", "catch", "retry", "backoff", "error handling in the code"]), (
+        "Response should be about workflow structure, not job-code error handling"
+    )
+
+    assert_no_special_chars(response["response_yaml"])
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(
+        criteria=QUALITY_CRITERIA,
+        candidate=response,
+        test_notes=__doc__,
+    )
+    assert verdict.passed, verdict.summary

From 3f7711ec6fcf94095009558b71cd940887b3fd63 Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Wed, 13 May 2026 17:21:10 +0900
Subject: [PATCH 03/15] move qualitative tests

---
 ...o_dhis2_tracker_with_specific_functions.py |  44 ++
 ..._gsheets_transform_salesforce_with_cron.py |  35 +
 ...to_salesforce_three_steps_with_job_code.py |  52 ++
 .../test_vague_gmail_to_database.py           |  32 +
 .../tests/test_planner_multistep.py           | 218 -------
 .../tests/acceptance/test_basic_input.py      |  51 ++
 .../acceptance/test_contextualised_input.py   |  94 +++
 .../acceptance/test_duplicate_sections.py     |  92 +++
 .../test_duplicate_sections_additional.py     |  60 ++
 .../test_navigation_workflow_to_job.py        |  65 ++
 services/job_chat/tests/test_qualitative.py   | 330 ----------
 .../tests/acceptance/test_basic_input.py      |  40 ++
 .../acceptance/test_conversational_turn.py    |  88 +++
 .../tests/acceptance/test_edit_job_code.py    |  83 +++
 .../tests/acceptance/test_error_field.py      |  82 +++
 .../acceptance/test_input_second_turn.py      | 107 ++++
 .../tests/acceptance/test_long_yaml.py        | 173 +++++
 .../tests/acceptance/test_simple_lang_bug.py  |  45 ++
 .../acceptance/test_single_trigger_node.py    |  83 +++
 .../workflow_chat/tests/test_qualitative.py   | 597 ------------------
 20 files changed, 1226 insertions(+), 1145 deletions(-)
 create mode 100644 services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.py
 create mode 100644 services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.py
 create mode 100644 services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.py
 create mode 100644 services/global_chat/tests/acceptance/test_vague_gmail_to_database.py
 delete mode 100644 services/global_chat/tests/test_planner_multistep.py
 create mode 100644 services/job_chat/tests/acceptance/test_basic_input.py
 create mode 100644 services/job_chat/tests/acceptance/test_contextualised_input.py
 create mode 100644 services/job_chat/tests/acceptance/test_duplicate_sections.py
 create mode 100644 services/job_chat/tests/acceptance/test_duplicate_sections_additional.py
 create mode 100644 services/job_chat/tests/acceptance/test_navigation_workflow_to_job.py
 delete mode 100644 services/job_chat/tests/test_qualitative.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_basic_input.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_conversational_turn.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_edit_job_code.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_error_field.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_input_second_turn.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_long_yaml.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_simple_lang_bug.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_single_trigger_node.py
 delete mode 100644 services/workflow_chat/tests/test_qualitative.py

diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.py b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.py
new file mode 100644
index 0000000..fc7fe97
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.py
@@ -0,0 +1,44 @@
+"""User provides specific function-level instructions (each, fields, field,
+tracker.import with strategy CREATE_AND_UPDATE, fn for logging). The planner
+should pass that detail through to the job code agent and the generated code
+should use those exact functions — not generic substitutes."""
+
+from testing import judge
+from testing.payloads import build_global_chat_payload
+
+
+QUALITY_CRITERIA = [
+    "The generated job code uses the specific functions the user named: each, fields, field, and tracker.import with strategy CREATE_AND_UPDATE.",
+    "The generated job code includes a fn() step that logs the import summary (state.data.stats).",
+    "The mapping correctly maps case_id to trackedEntity, owner_name to a DHIS2 attribute, and date_modified to enrollmentDate as specified.",
+]
+
+
+def test_commcare_to_dhis2_tracker_with_specific_functions(apollo_client):
+    payload = build_global_chat_payload(
+        user_message=(
+            "Can you build a workflow that runs daily at 6am and syncs cases from "
+            "CommCare to DHIS2 Tracker? It should have 4 steps: "
+            "1. Fetch closed cases from CommCare from the last 24 hours. "
+            "2. Use each() to iterate over the cases and use fields() and field() to "
+            "map each case to a DHIS2 tracked entity instance — map case_id to "
+            "trackedEntity, owner_name to a DHIS2 attribute, and date_modified to "
+            "enrollmentDate. "
+            "3. Import the mapped entities to DHIS2 using tracker.import() with "
+            "strategy CREATE_AND_UPDATE and the async option set to false. "
+            "4. Use fn() to log the import summary from state.data.stats to the console."
+        ),
+        history=[],
+    )
+
+    response = apollo_client.call("global_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    assert "response" in response
+    assert len(response["response"]) > 0
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.py b/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.py
new file mode 100644
index 0000000..9078a57
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.py
@@ -0,0 +1,35 @@
+"""Semi-specific request: cron trigger at midnight, fetch from Google Sheets,
+transform, upsert to Salesforce. The Salesforce upsert step requires field
+mapping decisions the user hasn't provided. The planner should acknowledge
+the missing details rather than inventing field mappings silently."""
+
+from testing import judge
+from testing.payloads import build_global_chat_payload
+
+
+QUALITY_CRITERIA = [
+    "The response acknowledges that the Salesforce upsert needs field-mapping details from the user (object type, key fields, source-to-target mapping).",
+    "If the response generates job code or YAML, it does not silently fabricate field mappings the user did not provide.",
+]
+
+
+def test_gsheets_transform_salesforce_with_cron(apollo_client):
+    payload = build_global_chat_payload(
+        user_message=(
+            "Can you make a workflow that triggers at midnight, fetches data from "
+            "Google Sheets, transforms it, and upserts it into Salesforce?"
+        ),
+        history=[],
+    )
+
+    response = apollo_client.call("global_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    assert "response" in response
+    assert len(response["response"]) > 0
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.py b/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.py
new file mode 100644
index 0000000..243f92e
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.py
@@ -0,0 +1,52 @@
+"""From-scratch three-step HTTP→transform→Salesforce workflow with job code
+for all three steps. The planner should call the workflow agent to produce a
+three-job workflow, then call the job code agent at least three times to fill
+in the bodies."""
+
+import yaml
+
+from testing import judge
+from testing.payloads import build_global_chat_payload
+from testing.responses import assert_routed_to, assert_agent_calls, get_attachment
+from testing.yaml_assertions import assert_yaml_has_ids, assert_yaml_jobs_have_body
+
+
+QUALITY_CRITERIA = [
+    "Each job's body uses functions appropriate to its adaptor (HTTP get/post for the fetch step, JS for transform, Salesforce upsert for the destination).",
+]
+
+
+def test_http_to_salesforce_three_steps_with_job_code(apollo_client):
+    payload = build_global_chat_payload(
+        user_message=(
+            "Build a workflow that can fetch records from an HTTP endpoint, "
+            "transform the data, and upsert contacts to Salesforce."
+        ),
+        history=[],
+    )
+
+    response = apollo_client.call("global_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    assert_routed_to(response, "planner")
+
+    yaml_str = get_attachment(response, "workflow_yaml")
+    assert yaml_str, "Expected a workflow_yaml attachment"
+
+    parsed = yaml.safe_load(yaml_str)
+    assert isinstance(parsed, dict) and "jobs" in parsed, "YAML must have a jobs section"
+    assert len(parsed["jobs"]) >= 3, f"Expected at least 3 jobs, got {len(parsed['jobs'])}"
+    assert_yaml_has_ids(yaml_str)
+    assert_yaml_jobs_have_body(yaml_str)
+
+    assert_agent_calls(
+        response.get("meta") or {},
+        expected_agents=["planner", "workflow_agent", "job_agent"],
+        min_job_code_calls=3,
+    )
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/global_chat/tests/acceptance/test_vague_gmail_to_database.py b/services/global_chat/tests/acceptance/test_vague_gmail_to_database.py
new file mode 100644
index 0000000..d73f2ea
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_vague_gmail_to_database.py
@@ -0,0 +1,32 @@
+"""Vague request: "fetch my data from gmail and send it to my database". No
+specifics on which gmail data, which database, how to map between them. The
+planner should surface the ambiguity (or ask clarifying questions) rather
+than silently inventing details."""
+
+from testing import judge
+from testing.payloads import build_global_chat_payload
+
+
+QUALITY_CRITERIA = [
+    "The response surfaces the ambiguities in the user's request (e.g. which gmail data, which database, how to authenticate) rather than silently inventing unstated requirements.",
+    "If the response asks clarifying questions, they are concrete and answerable, not generic.",
+]
+
+
+def test_vague_gmail_to_database(apollo_client):
+    payload = build_global_chat_payload(
+        user_message="I want to fetch my data from gmail and send it to my database",
+        history=[],
+    )
+
+    response = apollo_client.call("global_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    assert "response" in response, "Expected a text response"
+    assert len(response["response"]) > 0, "Expected non-empty response"
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/global_chat/tests/test_planner_multistep.py b/services/global_chat/tests/test_planner_multistep.py
deleted file mode 100644
index 9022cba..0000000
--- a/services/global_chat/tests/test_planner_multistep.py
+++ /dev/null
@@ -1,218 +0,0 @@
-import pytest
-import yaml
-from testing.yaml_assertions import (
-    assert_yaml_has_ids,
-    assert_yaml_jobs_have_body,
-)
-from .test_utils import (
-    assert_routed_to,
-    call_global_chat_service,
-    get_attachment,
-    make_service_input,
-    print_response_details,
-)
-
-
-def assert_agent_calls(meta, expected_agents, min_job_code_calls, context=""):
-    """
-    Assert that the planner called the expected sub-agents in the correct order.
-
-    Checks:
-    - workflow_agent and job_code_agent appear in meta["agents"]
-    - tool_calls show workflow_agent before any job_code_agent calls
-    - job_code_agent was called at least min_job_code_calls times
-    """
-    agents = meta.get("agents", [])
-    for agent in expected_agents:
-        assert agent in agents, (
-            f"{context}: Expected '{agent}' in agents list, got {agents}"
-        )
-
-    tool_calls = meta.get("tool_calls", [])
-    tool_names = [tc["tool"] for tc in tool_calls]
-
-    # workflow_agent must be called
-    assert "call_workflow_agent" in tool_names, (
-        f"{context}: Expected call_workflow_agent in tool_calls, got {tool_names}"
-    )
-
-    # job_code_agent must be called enough times
-    job_code_indices = [i for i, t in enumerate(tool_names) if t == "call_job_code_agent"]
-    assert len(job_code_indices) >= min_job_code_calls, (
-        f"{context}: Expected at least {min_job_code_calls} call_job_code_agent calls, "
-        f"got {len(job_code_indices)}. Tool calls: {tool_names}"
-    )
-
-    # workflow_agent must come before all job_code_agent calls
-    workflow_idx = tool_names.index("call_workflow_agent")
-    for j in job_code_indices:
-        assert j > workflow_idx, (
-            f"{context}: call_job_code_agent at index {j} came before "
-            f"call_workflow_agent at index {workflow_idx}. Tool calls: {tool_names}"
-        )
-
-
-def test_commcare_to_dhis2_with_job_code():
-    print("==================TEST==================")
-    print(
-        "Description: From scratch - create a two-job CommCare→DHIS2 workflow and generate "
-        "job code for both steps. No existing YAML, no history. Expects planner to orchestrate "
-        "workflow_agent then job_code_agent."
-    )
-    content = (
-        "Create a workflow that fetches patient cases from CommCare and registers them in DHIS2."
-    )
-    service_input = make_service_input(content=content, history=[])
-    response = call_global_chat_service(service_input)
-    print_response_details(response, test_name="test_commcare_to_dhis2_with_job_code", content=content)
-
-    assert response is not None
-    assert isinstance(response, dict)
-    assert_routed_to(response, "planner", context="test_commcare_to_dhis2_with_job_code")
-
-    # Should return a workflow YAML attachment
-    yaml_str = get_attachment(response, "workflow_yaml")
-    assert yaml_str is not None, "Expected a workflow_yaml attachment"
-
-    parsed = yaml.safe_load(yaml_str)
-    assert "jobs" in parsed, "YAML must have a jobs section"
-    assert len(parsed["jobs"]) >= 2, f"Expected at least 2 jobs, got {len(parsed['jobs'])}"
-    assert "triggers" in parsed, "YAML must have a triggers section"
-    assert_yaml_has_ids(yaml_str, context="test_commcare_to_dhis2_with_job_code")
-
-    # Verify correct agents called in correct order: planner -> workflow_agent -> job_code_agent x2
-    meta = response.get("meta", {})
-    assert_agent_calls(
-        meta,
-        expected_agents=["planner", "workflow_agent", "job_agent"],
-        min_job_code_calls=2,
-        context="test_commcare_to_dhis2_with_job_code",
-    )
-
-    # Every job must have a non-empty body
-    assert_yaml_jobs_have_body(yaml_str, context="test_commcare_to_dhis2_with_job_code")
-
-
-def test_http_to_salesforce_three_steps_with_job_code():
-    print("==================TEST==================")
-    print(
-        "Description: From scratch - create a three-step HTTP→transform→Salesforce workflow "
-        "and generate job code for all steps. No existing YAML, no history. Expects planner "
-        "to orchestrate workflow_agent then multiple job_code_agent calls."
-    )
-    content = (
-        "Build a workflow that can fetch records from an HTTP endpoint, "
-        "transform the data, and upsert contacts to Salesforce."
-    )
-    service_input = make_service_input(content=content, history=[])
-    response = call_global_chat_service(service_input)
-    print_response_details(response, test_name="test_http_to_salesforce_three_steps_with_job_code", content=content)
-
-    assert response is not None
-    assert isinstance(response, dict)
-    assert_routed_to(response, "planner", context="test_http_to_salesforce_three_steps_with_job_code")
-
-    # Should return a workflow YAML attachment
-    yaml_str = get_attachment(response, "workflow_yaml")
-    assert yaml_str is not None, "Expected a workflow_yaml attachment"
-
-    parsed = yaml.safe_load(yaml_str)
-    assert "jobs" in parsed, "YAML must have a jobs section"
-    assert len(parsed["jobs"]) >= 3, f"Expected at least 3 jobs, got {len(parsed['jobs'])}"
-    assert_yaml_has_ids(yaml_str, context="test_http_to_salesforce_three_steps_with_job_code")
-
-    # Verify correct agents called in correct order: planner -> workflow_agent -> job_code_agent x3
-    meta = response.get("meta", {})
-    assert_agent_calls(
-        meta,
-        expected_agents=["planner", "workflow_agent", "job_agent"],
-        min_job_code_calls=3,
-        context="test_http_to_salesforce_three_steps_with_job_code",
-    )
-
-    # Every job must have a non-empty body
-    assert_yaml_jobs_have_body(yaml_str, context="test_http_to_salesforce_three_steps_with_job_code")
-
-
-def test_vague_gmail_to_database():
-    """Vague request with no adaptors or structure specified - see how planner handles ambiguity."""
-    print("==================TEST==================")
-    print(
-        "Description: Vague request - 'fetch data from gmail and send to database'. "
-        "Not enough info to fully construct a workflow. Exploring planner behavior."
-    )
-    content = "I want to fetch my data from gmail and send it to my database"
-    service_input = make_service_input(content=content, history=[])
-    response = call_global_chat_service(service_input)
-    print_response_details(response, test_name="test_vague_gmail_to_database", content=content)
-
-    assert response is not None
-    assert isinstance(response, dict)
-    assert "response" in response, "Expected a text response"
-    assert len(response["response"]) > 0, "Expected non-empty response"
-
-    meta = response.get("meta", {})
-    print(f"\n  Agents used: {meta.get('agents', [])}")
-    print(f"  Tool calls: {[tc['tool'] for tc in meta.get('tool_calls', [])]}")
-
-
-def test_gsheets_transform_salesforce_with_cron():
-    """More specific request with cron trigger, but still underspecified transform/upsert steps.
-    Using Salesforce as destination to imply specific field mapping requirements the user hasn't specified."""
-    print("==================TEST==================")
-    print(
-        "Description: Semi-specific request - cron trigger, google sheets, transform, salesforce upsert. "
-        "Salesforce upsert implies field mapping decisions the user hasn't specified. Exploring planner behavior."
-    )
-    content = (
-        "Can you make a workflow that triggers at midnight, fetches data from "
-        "Google Sheets, transforms it, and upserts it into Salesforce?"
-    )
-    service_input = make_service_input(content=content, history=[])
-    response = call_global_chat_service(service_input)
-    print_response_details(response, test_name="test_gsheets_transform_salesforce_with_cron", content=content)
-
-    assert response is not None
-    assert isinstance(response, dict)
-    assert "response" in response, "Expected a text response"
-    assert len(response["response"]) > 0, "Expected non-empty response"
-
-    meta = response.get("meta", {})
-    print(f"\n  Agents used: {meta.get('agents', [])}")
-    print(f"  Tool calls: {[tc['tool'] for tc in meta.get('tool_calls', [])]}")
-
-
-def test_commcare_to_dhis2_tracker_with_specific_functions():
-    """User provides specific function details - tests whether planner passes them faithfully to job code agent."""
-    print("==================TEST==================")
-    print(
-        "Description: User specifies exact DHIS2 tracker functions and common adaptor helpers. "
-        "Tests whether the planner passes function-level detail through to job code agent."
-    )
-    content = (
-        "Can you build a workflow that runs daily at 6am and syncs cases from CommCare to DHIS2 Tracker? "
-        "It should have 4 steps: "
-        "1. Fetch closed cases from CommCare from the last 24 hours. "
-        "2. Use each() to iterate over the cases and use fields() and field() to map each case "
-        "to a DHIS2 tracked entity instance — map case_id to trackedEntity, owner_name to a DHIS2 attribute, "
-        "and date_modified to enrollmentDate. "
-        "3. Import the mapped entities to DHIS2 using tracker.import() with strategy CREATE_AND_UPDATE "
-        "and the async option set to false. "
-        "4. Use fn() to log the import summary from state.data.stats to the console."
-    )
-    service_input = make_service_input(content=content, history=[])
-    response = call_global_chat_service(service_input)
-    print_response_details(response, test_name="test_commcare_to_dhis2_tracker_with_specific_functions", content=content)
-
-    assert response is not None
-    assert isinstance(response, dict)
-    assert "response" in response, "Expected a text response"
-    assert len(response["response"]) > 0, "Expected non-empty response"
-
-    meta = response.get("meta", {})
-    print(f"\n  Agents used: {meta.get('agents', [])}")
-    print(f"  Tool calls: {[tc['tool'] for tc in meta.get('tool_calls', [])]}")
-
-
-if __name__ == "__main__":
-    pytest.main([__file__, "-v", "-s"])
diff --git a/services/job_chat/tests/acceptance/test_basic_input.py b/services/job_chat/tests/acceptance/test_basic_input.py
new file mode 100644
index 0000000..06d5efc
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_basic_input.py
@@ -0,0 +1,51 @@
+"""Basic input: a simple job-code modification request. The service should
+return a response with both a text answer and a suggested_code patch when
+suggest_code=True is set."""
+
+from testing import judge
+from testing.payloads import build_job_chat_payload
+
+
+QUALITY_CRITERIA = []  # no test-specific criteria; mostly structural, relies on universal rules and general flags
+
+
+JOB_CODE = """// Get data from external API
+get('https://api.example.com/data');
+
+// Process and transform data
+fn(state => {
+  const transformed = state.data.map(item => ({
+    id: item.id,
+    name: item.full_name,
+    status: item.active ? 'Active' : 'Inactive'
+  }));
+
+  return { ...state, transformed };
+});
+
+// Send transformed data to destination
+post('https://destination.org/upload', state => state.transformed);"""
+
+
+def test_basic_input(apollo_client):
+    payload = build_job_chat_payload(
+        user_message=(
+            "Can you add error handling to this job that will log the error message "
+            "and retry the operation once if the API call fails?"
+        ),
+        history=[],
+        current_job_code=JOB_CODE,
+        current_adaptor="@openfn/language-gmail@2.0.2",
+        suggest_code=True,
+    )
+
+    response = apollo_client.call("job_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert "response" in response
+    assert "suggested_code" in response
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_contextualised_input.py b/services/job_chat/tests/acceptance/test_contextualised_input.py
new file mode 100644
index 0000000..303eefc
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_contextualised_input.py
@@ -0,0 +1,94 @@
+"""All payload fields populated: history, job code, multiple adaptors, ids,
+and pre-injected RAG search results in meta. The service should pick up on
+the RAG hints (HTTP adaptor error handling, retry logic) and produce a
+suggested_code that differs from the original."""
+
+from testing import judge
+from testing.payloads import build_job_chat_payload
+
+
+QUALITY_CRITERIA = [
+    "The response or suggested_code references the pre-injected RAG search results (HTTP adaptor error handling and retry logic) rather than ignoring them.",
+]
+
+
+JOB_CODE = """// Get data from external API
+get('https://api.example.com/data');
+
+// Process and transform data
+fn(state => {
+  const transformed = state.data.map(item => ({
+    id: item.id,
+    name: item.full_name,
+    status: item.active ? 'Active' : 'Inactive'
+  }));
+
+  return { ...state, transformed };
+});
+
+// Send transformed data to destination
+post('https://destination.org/upload', state => state.transformed);"""
+
+
+def test_contextualised_input(apollo_client):
+    payload = build_job_chat_payload(
+        user_message=(
+            "Can you add error handling to this job that will log the error message "
+            "and retry the operation once if the API call fails?"
+        ),
+        history=[
+            {
+                "role": "user",
+                "content": "I need to add error handling to my API integration job. What's the best approach?",
+            },
+            {
+                "role": "assistant",
+                "content": (
+                    "There are several approaches to handling errors in API calls. You can use "
+                    "try/catch blocks, implement retry logic, or use built-in error handling "
+                    "functions. Could you share your current job code so I can provide specific "
+                    "recommendations?"
+                ),
+            },
+        ],
+        current_job_code=JOB_CODE,
+        current_adaptor="@openfn/language-fhir-4@0.1.10",
+        project_adaptors=["@openfn/language-http", "@openfn/language-common"],
+        job_id="job-abc123",
+        project_id="project-xyz789",
+        rag_results=[
+            {
+                "title": "HTTP Adaptor Error Handling",
+                "url": "https://docs.openfn.org/adaptors/http#error-handling",
+                "content": (
+                    "The HTTP adaptor provides mechanisms for handling connection errors and "
+                    "retrying failed requests. Use the maxRetries option to specify retry attempts."
+                ),
+            },
+            {
+                "title": "Common Adaptor Documentation",
+                "url": "https://docs.openfn.org/adaptors/common#error-handling",
+                "content": (
+                    "Error handling can be implemented using standard JavaScript try/catch "
+                    "blocks or with the withError helper function."
+                ),
+            },
+        ],
+        rag_queries=["http adaptor error handling", "openfn retry logic"],
+        suggest_code=True,
+    )
+
+    response = apollo_client.call("job_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert "response" in response
+    assert "suggested_code" in response
+    assert "meta" in response
+    assert "usage" in response
+    assert response["suggested_code"] is not None, "suggested_code should not be None"
+    assert response["suggested_code"] != JOB_CODE, "suggested_code should differ from the original"
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_duplicate_sections.py b/services/job_chat/tests/acceptance/test_duplicate_sections.py
new file mode 100644
index 0000000..a06b36c
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_duplicate_sections.py
@@ -0,0 +1,92 @@
+"""Code contains three identical line_item creation blocks. The user asks
+for validation added only to the SECOND one. The service should apply the
+change to the right block — not all three, not the wrong one — using enough
+surrounding context to disambiguate."""
+
+from testing import judge
+from testing.payloads import build_job_chat_payload
+
+
+QUALITY_CRITERIA = [
+    "The change is applied to the second line_item creation only — the first and third remain unchanged.",
+    "The applied change adds a check that Barcode__c is not empty before creating the record.",
+]
+
+
+JOB_CODE = """each(
+  dataPath('data[*]'),
+  combine(
+    create(
+      'transaction__c',
+      fields(
+        field('Transaction_Date__c', dataValue('today')),
+        relationship(
+          'Person_Responsible__r',
+          'Staff_ID_Code__c',
+          dataValue('person_code')
+        ),
+        field('metainstanceid__c', dataValue('*meta-instance-id*'))
+      )
+    ),
+    each(
+      merge(
+        dataPath('line_items[*]'),
+        fields(
+          field('end', dataValue('time_end')),
+          field('parentId', lastReferenceValue('id'))
+        )
+      ),
+      create(
+        'line_item__c',
+        fields(
+          field('transaction__c', dataValue('parentId')),
+          field('Barcode__c', dataValue('product_barcode')),
+          field('ODK_Form_Completed__c', dataValue('end'))
+        )
+      ),
+      create(
+        'line_item__c',
+        fields(
+          field('transaction__c', dataValue('parentId')),
+          field('Barcode__c', dataValue('product_barcode')),
+          field('ODK_Form_Completed__c', dataValue('end'))
+        )
+      ),
+      create(
+        'line_item__c',
+        fields(
+          field('transaction__c', dataValue('parentId')),
+          field('Barcode__c', dataValue('product_barcode')),
+          field('ODK_Form_Completed__c', dataValue('end'))
+        )
+      )
+    )
+  )
+);"""
+
+
+def test_duplicate_sections(apollo_client):
+    payload = build_job_chat_payload(
+        user_message=(
+            "In this job, I want to add a validation step only for the second "
+            "line item creation, to check if the Barcode__c is not empty before "
+            "creating the record. How can I do that?"
+        ),
+        history=[],
+        current_job_code=JOB_CODE,
+        current_adaptor="@openfn/language-dhis2@8.0.1",
+        suggest_code=True,
+    )
+
+    response = apollo_client.call("job_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert "response" in response
+    assert "suggested_code" in response
+    assert response["suggested_code"] is not None
+    assert response["suggested_code"] != JOB_CODE, "suggested_code should differ from the original"
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_duplicate_sections_additional.py b/services/job_chat/tests/acceptance/test_duplicate_sections_additional.py
new file mode 100644
index 0000000..0f6842f
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_duplicate_sections_additional.py
@@ -0,0 +1,60 @@
+"""Six identical POST calls. The user asks for retry-on-failure error handling
+on the THIRD one only. The service must use enough context to identify the
+right call, apply the change only there, and not accidentally drop any of
+the other five."""
+
+from testing import judge
+from testing.payloads import build_job_chat_payload
+
+
+QUALITY_CRITERIA = [
+    "Error handling is added only to the third POST call — the others remain unchanged.",
+    "All six POST calls are still present in the suggested code (none accidentally removed).",
+]
+
+
+JOB_CODE = """// Process and prepare data
+fn(state => {
+  const items = state.data.items.map(item => ({
+    id: item.id,
+    name: item.name,
+    status: 'pending'
+  }));
+
+  return { ...state, items };
+});
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);"""
+
+
+def test_duplicate_sections_additional(apollo_client):
+    payload = build_job_chat_payload(
+        user_message="I need to add error handling only to the third POST request to retry once if it fails.",
+        history=[],
+        current_job_code=JOB_CODE,
+        current_adaptor="@openfn/language-mailchimp@1.0.19",
+        suggest_code=True,
+    )
+
+    response = apollo_client.call("job_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert "response" in response
+    assert "suggested_code" in response
+    assert response["suggested_code"] is not None
+    assert response["suggested_code"] != JOB_CODE, "suggested_code should differ from the original"
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.py b/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.py
new file mode 100644
index 0000000..a50dba3
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.py
@@ -0,0 +1,65 @@
+"""User was on a workflow editor discussing workflow structure, then navigated
+to a job editor and asked an abrupt question about the current code. The model
+should recognise the navigation (via meta.last_page) and respond about the
+job code, not continue talking about workflow structure."""
+
+from testing import judge
+from testing.payloads import build_job_chat_payload
+
+
+QUALITY_CRITERIA = [
+    "The response is about the current job code (the patient mapping), not about workflow structure.",
+]
+
+
+JOB_CODE = """fn(state => {
+  const patients = state.data.map(patient => ({
+    id: patient.patient_id,
+    name: patient.full_name,
+    dob: patient.date_of_birth
+  }));
+
+  return { ...state, patients };
+});
+
+post('https://destination.api/patients', state => state.patients);"""
+
+
+def test_navigation_workflow_to_job(apollo_client):
+    payload = build_job_chat_payload(
+        user_message="Add a log statement at the start",
+        history=[
+            {"role": "user", "content": "[pg:workflow/patient-sync] Create a workflow to sync patient data from source to destination"},
+            {"role": "assistant", "content": "I'll create a workflow with jobs to fetch patient data, transform it, and sync to the destination system."},
+            {"role": "user", "content": "[pg:workflow/patient-sync] Add validation between fetch and transform"},
+            {"role": "assistant", "content": "I'll add a validation job that checks the patient data before transformation."},
+        ],
+        current_job_code=JOB_CODE,
+        current_adaptor="@openfn/language-common@latest",
+        current_page="map-patient-data",
+        previous_page={"type": "workflow", "name": "patient-sync"},
+        suggest_code=True,
+    )
+
+    response = apollo_client.call("job_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert "response" in response
+    assert "suggested_code" in response
+    assert response["suggested_code"] is not None, "Model should have generated code for the job"
+
+    # Log statement was added
+    assert "console.log" in response["suggested_code"], (
+        f"Log statement not found in suggested code: {response['suggested_code'][:300]}"
+    )
+
+    # Response text is about job code, not workflow structure
+    response_text = response["response"].lower()
+    assert not any(word in response_text for word in ["workflow", "yaml", "trigger", "edge"]), (
+        f"Response should be about job code, not workflow structure. Response: {response_text[:300]}"
+    )
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/test_qualitative.py b/services/job_chat/tests/test_qualitative.py
deleted file mode 100644
index c2eaed5..0000000
--- a/services/job_chat/tests/test_qualitative.py
+++ /dev/null
@@ -1,330 +0,0 @@
-import pytest
-import json
-from .test_utils import call_job_chat_service, make_service_input, print_response_details
-
-
-def test_basic_input():
-    print("==================TEST==================")
-    print("Description: Basic input test. Check if the service can handle a simple input and generate a response.")
-    history = []
-    content = "Can you add error handling to this job that will log the error message and retry the operation once if the API call fails?"
-    context = {
-        "expression": '''// Get data from external API
-get('https://api.example.com/data');
-
-// Process and transform data
-fn(state => {
-  const transformed = state.data.map(item => ({
-    id: item.id,
-    name: item.full_name,
-    status: item.active ? 'Active' : 'Inactive'
-  }));
-  
-  return { ...state, transformed };
-});
-
-// Send transformed data to destination
-post('https://destination.org/upload', state => state.transformed);''',
-        "adaptor": "@openfn/language-gmail@2.0.2"
-    }
-    meta = {}
-    service_input = make_service_input(history=history, content=content, context=context, meta=meta, suggest_code=True)
-    response = call_job_chat_service(service_input)
-    print_response_details(response, "basic_input", content=content)
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-
-def test_contextualised_input():
-    print("==================TEST==================")
-    print("Description: Check if the service can handle an input that includes info for all fields (history, context, meta).")
-    
-    history = [
-        {"role": "user", "content": "I need to add error handling to my API integration job. What's the best approach?"},
-        {"role": "assistant", "content": "There are several approaches to handling errors in API calls. You can use try/catch blocks, implement retry logic, or use built-in error handling functions. Could you share your current job code so I can provide specific recommendations?"}
-    ]
-    
-    content = "Can you add error handling to this job that will log the error message and retry the operation once if the API call fails?"
-    
-    context = {
-        "expression": '''// Get data from external API
-get('https://api.example.com/data');
-
-// Process and transform data
-fn(state => {
-  const transformed = state.data.map(item => ({
-    id: item.id,
-    name: item.full_name,
-    status: item.active ? 'Active' : 'Inactive'
-  }));
-  
-  return { ...state, transformed };
-});
-
-// Send transformed data to destination
-post('https://destination.org/upload', state => state.transformed);''',
-        "adaptors": ["@openfn/language-http", "@openfn/language-common"],
-        "jobId": "job-abc123",
-        "projectId": "project-xyz789",
-        "adaptor": "@openfn/language-fhir-4@0.1.10"
-    }
-    
-    meta = {
-        "rag": {
-            "search_results": [
-                {
-                    "title": "HTTP Adaptor Error Handling",
-                    "url": "https://docs.openfn.org/adaptors/http#error-handling",
-                    "content": "The HTTP adaptor provides mechanisms for handling connection errors and retrying failed requests. Use the maxRetries option to specify retry attempts."
-                },
-                {
-                    "title": "Common Adaptor Documentation",
-                    "url": "https://docs.openfn.org/adaptors/common#error-handling",
-                    "content": "Error handling can be implemented using standard JavaScript try/catch blocks or with the withError helper function."
-                }
-            ],
-            "search_queries": ["http adaptor error handling", "openfn retry logic"]
-        }
-    }
-    
-    service_input = make_service_input(history=history, content=content, context=context, meta=meta, suggest_code=True)
-    response = call_job_chat_service(service_input)
-    print_response_details(response, "contextualised_input", content=content)
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-    assert "meta" in response
-    assert "usage" in response
-    assert response["suggested_code"] is not None, "JSON parsing failed - suggested_code is None"
-    assert response["suggested_code"] != context["expression"], "Suggested code should be different from the original code"
-
-def test_duplicate_sections():
-    print("==================TEST==================")
-    print("Description: Test if the service can apply a change to code with duplicate sections.")
-    history = []
-    content = "In this job, I want to add a validation step only for the second line item creation, to check if the Barcode__c is not empty before creating the record. How can I do that?"
-    context = {
-        "expression": '''each(
-  dataPath('data[*]'),
-  combine(
-    create(
-      'transaction__c',
-      fields(
-        field('Transaction_Date__c', dataValue('today')),
-        relationship(
-          'Person_Responsible__r',
-          'Staff_ID_Code__c',
-          dataValue('person_code')
-        ),
-        field('metainstanceid__c', dataValue('*meta-instance-id*'))
-      )
-    ),
-    each(
-      merge(
-        dataPath('line_items[*]'),
-        fields(
-          field('end', dataValue('time_end')),
-          field('parentId', lastReferenceValue('id'))
-        )
-      ),
-      create(
-        'line_item__c',
-        fields(
-          field('transaction__c', dataValue('parentId')),
-          field('Barcode__c', dataValue('product_barcode')),
-          field('ODK_Form_Completed__c', dataValue('end'))
-        )
-      ),
-      create(
-        'line_item__c',
-        fields(
-          field('transaction__c', dataValue('parentId')),
-          field('Barcode__c', dataValue('product_barcode')),
-          field('ODK_Form_Completed__c', dataValue('end'))
-        )
-      ),
-      create(
-        'line_item__c',
-        fields(
-          field('transaction__c', dataValue('parentId')),
-          field('Barcode__c', dataValue('product_barcode')),
-          field('ODK_Form_Completed__c', dataValue('end'))
-        )
-      )
-    )
-  )
-);''',
-        "adaptor": "@openfn/language-dhis2@8.0.1"
-    }
-    meta = {}
-    service_input = make_service_input(history=history, content=content, context=context, meta=meta, suggest_code=True)
-    response = call_job_chat_service(service_input)
-    print_response_details(response, "odk_duplicate_sections", content=content)
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-    assert response["suggested_code"] is not None, "JSON parsing failed - suggested_code is None"
-    assert response["suggested_code"] != context["expression"], "Suggested code should be different from the original code"
-
-def test_duplicate_sections_additional():
-    print("==================TEST==================")
-    print("Description: Another test to check if the service can handle duplicate sections, this time with more duplicates."
-          "Check whether it's able to provide enough context for the match to be unique, and check it doesn't accidentally delete sections")
-    history = []
-    content = "I need to add error handling only to the third POST request to retry once if it fails."
-    context = {
-        "expression": '''// Process and prepare data
-fn(state => {
-  const items = state.data.items.map(item => ({
-    id: item.id,
-    name: item.name,
-    status: 'pending'
-  }));
-  
-  return { ...state, items };
-});
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);''',
-        "adaptor": "@openfn/language-mailchimp@1.0.19"
-    }
-    meta = {}
-    service_input = make_service_input(history=history, content=content, context=context, meta=meta, suggest_code=True)
-    response = call_job_chat_service(service_input)
-    print_response_details(response, "duplicate_post_sections", content=content)
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-    assert response["suggested_code"] is not None, "JSON parsing failed - suggested_code is None"
-    assert response["suggested_code"] != context["expression"], "Suggested code should be different from the original code"
-
-
-def test_navigation_workflow_to_job():
-    print("==================TEST==================")
-    print("Description: Testing cross-service navigation from workflow editor to job editor - model should infer context change")
-
-    # History shows user was on workflow page discussing workflow structure
-    history = [
-        {"role": "user", "content": "[pg:workflow/patient-sync] Create a workflow to sync patient data from source to destination"},
-        {"role": "assistant", "content": "I'll create a workflow with jobs to fetch patient data, transform it, and sync to the destination system."},
-        {"role": "user", "content": "[pg:workflow/patient-sync] Add validation between fetch and transform"},
-        {"role": "assistant", "content": "I'll add a validation job that checks the patient data before transformation."}
-    ]
-
-    # Now user is on job editor with a different page - abrupt question about current code
-    content = "Add a log statement at the start"
-
-    # Current context is job code, not workflow
-    context = {
-        "expression": '''fn(state => {
-  const patients = state.data.map(patient => ({
-    id: patient.patient_id,
-    name: patient.full_name,
-    dob: patient.date_of_birth
-  }));
-
-  return { ...state, patients };
-});
-
-post('https://destination.api/patients', state => state.patients);''',
-        "adaptor": "@openfn/language-common@latest",
-        "page_name": "map-patient-data"
-    }
-
-    # Meta shows navigation happened
-    meta = {
-        "last_page": {
-            "type": "workflow",
-            "name": "patient-sync"
-        }
-    }
-
-    service_input = make_service_input(history=history, content=content, context=context, meta=meta, suggest_code=True)
-    response = call_job_chat_service(service_input)
-    print_response_details(response, "navigation_workflow_to_job", content=content)
-
-    # Assertions to verify model correctly inferred navigation and responded about job code
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-    assert response["suggested_code"] is not None, "Model should have generated code for the job"
-
-    # Verify logging was added to the code
-    assert "console.log" in response["suggested_code"], "Log statement not found in suggested code"
-
-    # Verify response talks about job code, not workflow
-    response_text = response["response"].lower()
-    assert not any(word in response_text for word in ["workflow", "yaml", "trigger", "edge"]), \
-        "Response should be about job code, not workflow structure"
-
-    print("\n✓ Navigation test passed: Model correctly inferred navigation from workflow to job editor")
-
-def test_adaptor_context_switching():
-    print("==================TEST==================")
-    print("Description: Test that the model pays attention to page prefix changes and provides adaptor-specific answers")
-
-    # Simulate a conversation history where:
-    # 1. User was on a Salesforce job page and asked "How do I get data?"
-    # 2. Assistant answered with Salesforce-specific guidance (query, SOQL, etc.)
-    # 3. User has now navigated to a DHIS2 job page and asks the SAME question again
-    # Expected: The model should recognize the context switch and mention DHIS2-specific functions
-
-    history = [
-        {"role": "user", "content": "[pg:job_code/fetch-records/salesforce@9.0.3] How do I get data?"},
-        {"role": "assistant", "content": "To get data from Salesforce, you can use the `query()` operation with SOQL (Salesforce Object Query Language). For example:\n\n```js\nquery('SELECT Id, Name FROM Account WHERE Status = \"Active\"');\n```\n\nThis will fetch records from Salesforce and store them in `state.data`."}
-    ]
-
-    # Now user has navigated to a DHIS2 job page and asks the same question
-    content = "How do I get data?"
-
-    context = {
-        "expression": '''
-fn(state => {
-  return state;
-});''',
-        "adaptor": "@openfn/language-dhis2@8.0.7",
-        "page_name": "fetch-data"
-    }
-
-    meta = {}
-    service_input = make_service_input(history=history, content=content, context=context, meta=meta, suggest_code=False)
-    response = call_job_chat_service(service_input)
-    print_response_details(response, "adaptor_context_switching", content=content)
-
-    assert response is not None
-    assert "response" in response
-
-    response_text = response["response"].lower()
-    print(f"\n=== RESPONSE (DHIS2 Context) ===")
-    print(response["response"])
-
-    # Check that DHIS2-specific functions are mentioned
-    dhis2_mentioned = "dhis" in response_text
-    assert dhis2_mentioned, f"Expected DHIS2 to be mentioned in response when on DHIS2 page. Response: {response['response']}"
-
-    # Check the history was properly prefixed with the new page context
-    assert "history" in response
-    updated_history = response["history"]
-    assert len(updated_history) == 4  # 2 previous turns + 1 new turn = 4 messages
-
-    # Verify the latest user message has the correct DHIS2 prefix (with version)
-    latest_user_message = updated_history[2]
-    assert latest_user_message["role"] == "user"
-    assert "[pg:job_code/fetch-data/dhis2@8.0.7]" in latest_user_message["content"], "Expected DHIS2 page prefix with version in latest user message"
-
-    print(f"\n=== CONTEXT SWITCH VERIFICATION ===")
-    print(f"Previous context: Salesforce (from history)")
-    print(f"Current context: DHIS2 (from page prefix)")
-    print(f"DHIS2 mentioned in response: {dhis2_mentioned}")
-    print(f"Latest user message prefix: [pg:job_code/fetch-data/dhis2]")
-
-    print("\n✓ Adaptor context switching test passed: Model recognizes page prefix change and provides DHIS2-specific guidance")
\ No newline at end of file
diff --git a/services/workflow_chat/tests/acceptance/test_basic_input.py b/services/workflow_chat/tests/acceptance/test_basic_input.py
new file mode 100644
index 0000000..ef3017b
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_basic_input.py
@@ -0,0 +1,40 @@
+"""Basic input: empty yaml, simple request. The service should either generate
+a YAML or ask for more information. Structural checks ensure any generated YAML
+is well-formed."""
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import (
+    assert_no_special_chars,
+    assert_yaml_has_ids,
+    assert_yaml_jobs_have_body,
+)
+
+
+QUALITY_CRITERIA = []  # mostly structural; relies on universal rules + general flags
+
+
+def test_basic_input(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml="",
+        history=[],
+        user_message=(
+            "Whenever fridge statistics are sent to you, parse and aggregate "
+            "the data and upload to a collection in redis."
+        ),
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+
+    if response.get("response_yaml"):
+        assert_yaml_has_ids(response["response_yaml"])
+        assert_yaml_jobs_have_body(response["response_yaml"])
+        assert_no_special_chars(response["response_yaml"])
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_conversational_turn.py b/services/workflow_chat/tests/acceptance/test_conversational_turn.py
new file mode 100644
index 0000000..23b4077
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_conversational_turn.py
@@ -0,0 +1,88 @@
+"""User asks a conversational question that should NOT lead to YAML changes.
+The service should respond with text only, or with a YAML that is identical
+to the existing one."""
+
+import yaml
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import assert_no_special_chars
+
+
+QUALITY_CRITERIA = [
+    "The response engages conversationally with the user's request for clarification, without unnecessarily restructuring or rewriting the workflow.",
+]
+
+
+EXISTING_YAML = """
+name: fridge-statistics-processing
+jobs:
+  parse-and-aggregate-fridge-data:
+    id: job-parse-id
+    name: Parse and Aggregate Fridge Data
+    adaptor: '@openfn/language-common@latest'
+    body: '| // Add data parsing and aggregation operations here'
+  upload-to-redis:
+    id: job-upload-id
+    name: Upload to Redis Collection
+    adaptor: '@openfn/language-redis@latest'
+    body: '| // Add Redis collection upload operations here'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->parse-and-aggregate-fridge-data:
+    id: edge-webhook-parse-id
+    source_trigger: webhook
+    target_job: parse-and-aggregate-fridge-data
+    condition_type: always
+    enabled: true
+  parse-and-aggregate-fridge-data->upload-to-redis:
+    id: edge-parse-upload-id
+    source_job: parse-and-aggregate-fridge-data
+    target_job: upload-to-redis
+    condition_type: on_job_success
+    enabled: true
+"""
+
+
+def test_conversational_turn(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml=EXISTING_YAML,
+        history=[
+            {
+                "role": "user",
+                "content": (
+                    "Whenever fridge statistics are sent to you, parse and aggregate "
+                    "the data and upload to a collection in redis."
+                ),
+            },
+            {
+                "role": "assistant",
+                "content": (
+                    "I'll create a workflow that processes fridge statistics through a "
+                    "webhook trigger, then aggregates and stores the data in Redis."
+                ),
+            },
+        ],
+        user_message="Can you explain that better",
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+
+    response_yaml_str = response.get("response_yaml")
+    if response_yaml_str and str(response_yaml_str).strip():
+        orig_yaml = yaml.safe_load(EXISTING_YAML)
+        response_yaml = yaml.safe_load(response_yaml_str)
+        assert orig_yaml == response_yaml, "If YAML is present in response, it must be unchanged."
+        assert_no_special_chars(response_yaml_str)
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_edit_job_code.py b/services/workflow_chat/tests/acceptance/test_edit_job_code.py
new file mode 100644
index 0000000..9ae938f
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_edit_job_code.py
@@ -0,0 +1,83 @@
+"""User asks workflow_chat to fill in job code, which is not its
+responsibility. The service should politely explain that and avoid generating
+or modifying actual job bodies. Any returned YAML should be unchanged from
+the existing one (or absent)."""
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import assert_no_special_chars
+
+
+QUALITY_CRITERIA = [
+    "The response politely explains that filling in job code is not workflow_chat's responsibility, or otherwise declines the request gracefully.",
+    "The response does not attempt to write actual job code into the workflow.",
+]
+
+
+EXISTING_YAML = """
+name: fridge-statistics-processing
+jobs:
+  parse-and-aggregate-fridge-data:
+    id: job-parse-id
+    name: Parse and Aggregate Fridge Data
+    adaptor: '@openfn/language-common@latest'
+    body: 'print("hello a")'
+  upload-to-redis:
+    id: job-upload-id
+    name: Upload to Redis Collection
+    adaptor: '@openfn/language-redis@latest'
+    body: 'print("hello a")'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->parse-and-aggregate-fridge-data:
+    id: edge-webhook-parse-id
+    source_trigger: webhook
+    target_job: parse-and-aggregate-fridge-data
+    condition_type: always
+    enabled: true
+  parse-and-aggregate-fridge-data->upload-to-redis:
+    id: edge-parse-upload-id
+    source_job: parse-and-aggregate-fridge-data
+    target_job: upload-to-redis
+    condition_type: on_job_success
+    enabled: true
+"""
+
+
+def test_edit_job_code(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml=EXISTING_YAML,
+        history=[
+            {
+                "role": "user",
+                "content": (
+                    "Whenever fridge statistics are sent to you, parse and aggregate "
+                    "the data and upload to a collection in redis."
+                ),
+            },
+            {
+                "role": "assistant",
+                "content": (
+                    "I'll create a workflow that processes fridge statistics through a "
+                    "webhook trigger, then aggregates and stores the data in Redis."
+                ),
+            },
+        ],
+        user_message="Can you also fill in the job code for all the steps",
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    if response.get("response_yaml"):
+        assert_no_special_chars(response["response_yaml"])
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_error_field.py b/services/workflow_chat/tests/acceptance/test_error_field.py
new file mode 100644
index 0000000..e8fad38
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_error_field.py
@@ -0,0 +1,82 @@
+"""The service is invoked with an `errors` field (replacing `content`) — used
+when the user's previous workflow attempt produced a validation error. The
+service should acknowledge the error and produce a corrected workflow."""
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import assert_no_special_chars
+
+
+QUALITY_CRITERIA = [
+    "The response acknowledges the reported error rather than ignoring it.",
+    "Any returned workflow YAML attempts to fix the cause of the error (in this case, an invalid adaptor).",
+]
+
+
+EXISTING_YAML = """
+name: fridge-statistics-processing
+jobs:
+  parse-and-aggregate-fridge-data:
+    id: job-parse-id
+    name: Parse and Aggregate Fridge Data
+    adaptor: '@openfn/language-commons@latest'
+    body: '| // Add data parsing and aggregation operations here'
+  upload-to-redis:
+    id: job-upload-id
+    name: Upload to Redis Collection
+    adaptor: '@openfn/language-redis@latest'
+    body: '| // Add Redis collection upload operations here'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->parse-and-aggregate-fridge-data:
+    id: edge-webhook-parse-id
+    source_trigger: webhook
+    target_job: parse-and-aggregate-fridge-data
+    condition_type: always
+    enabled: true
+  parse-and-aggregate-fridge-data->upload-to-redis:
+    id: edge-parse-upload-id
+    source_job: parse-and-aggregate-fridge-data
+    target_job: upload-to-redis
+    condition_type: on_job_success
+    enabled: true
+"""
+
+
+def test_error_field(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml=EXISTING_YAML,
+        history=[
+            {
+                "role": "user",
+                "content": (
+                    "Whenever fridge statistics are sent to you, parse and aggregate "
+                    "the data and upload to a collection in redis."
+                ),
+            },
+            {
+                "role": "assistant",
+                "content": (
+                    "I'll create a workflow that processes fridge statistics through a "
+                    "webhook trigger, then aggregates and stores the data in Redis."
+                ),
+            },
+        ],
+        errors="adaptor error",
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    if response.get("response_yaml"):
+        assert_no_special_chars(response["response_yaml"])
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_input_second_turn.py b/services/workflow_chat/tests/acceptance/test_input_second_turn.py
new file mode 100644
index 0000000..31faf35
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_input_second_turn.py
@@ -0,0 +1,107 @@
+"""Second conversation turn requesting a change to the YAML. The service
+should preserve every job from the original YAML while applying the requested
+addition (data deduplication before validation)."""
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import (
+    assert_no_special_chars,
+    assert_yaml_section_contains_all,
+)
+
+
+QUALITY_CRITERIA = []  # mostly structural
+
+
+EXISTING_YAML = """
+name: CommCare-to-DHIS2-Patient-Integration
+jobs:
+  receive-commcare-data:
+    id: job-receive-id
+    name: Receive CommCare Patient Data
+    adaptor: '@openfn/language-commcare@latest'
+    body: 'PLACEHOLDER 1'
+  validate-patient-data:
+    id: job-validate-id
+    name: Validate Patient Data
+    adaptor: '@openfn/language-common@latest'
+    body: 'PLACEHOLDER 2'
+  log-validation-errors:
+    id: job-log-id
+    name: Log Validation Errors to Google Sheets
+    adaptor: '@openfn/language-googlesheets@latest'
+    body: 'PLACEHOLDER 3'
+  transform-and-upload-to-dhis2:
+    id: job-transform-id
+    name: Transform and Upload to DHIS2
+    adaptor: '@openfn/language-dhis2@latest'
+    body: 'PLACEHOLDER 4'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->receive-commcare-data:
+    id: edge-webhook-receive-id
+    source_trigger: webhook
+    target_job: receive-commcare-data
+    condition_type: always
+    enabled: true
+  receive-commcare-data->validate-patient-data:
+    id: edge-receive-validate-id
+    source_job: receive-commcare-data
+    target_job: validate-patient-data
+    condition_type: on_job_success
+    enabled: true
+  validate-patient-data->log-validation-errors:
+    id: edge-validate-log-id
+    source_job: validate-patient-data
+    target_job: log-validation-errors
+    condition_type: on_job_failure
+    enabled: true
+  validate-patient-data->transform-and-upload-to-dhis2:
+    id: edge-validate-transform-id
+    source_job: validate-patient-data
+    target_job: transform-and-upload-to-dhis2
+    condition_type: on_job_success
+    enabled: true
+"""
+
+
+def test_input_second_turn(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml=EXISTING_YAML,
+        history=[
+            {
+                "role": "user",
+                "content": (
+                    "Set up an OpenFn workflow to automatically receive new patient data from "
+                    "CommCare, validate the data and if there's an issue log it to a google "
+                    "sheet, otherwise map it to the DHIS2 data model, and load it into the "
+                    "DHIS2 national health information system"
+                ),
+            },
+            {
+                "role": "assistant",
+                "content": (
+                    "I'll create a workflow to process CommCare patient data. The flow will "
+                    "receive data via webhook, validate it, then either log issues to Google "
+                    "Sheets or transform and send valid data to DHIS2."
+                ),
+            },
+        ],
+        user_message="Actually, let's add data deduplication before validation to prevent duplicate patient records",
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    assert_yaml_section_contains_all(EXISTING_YAML, response.get("response_yaml", ""), "jobs")
+    assert_no_special_chars(response["response_yaml"])
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_long_yaml.py b/services/workflow_chat/tests/acceptance/test_long_yaml.py
new file mode 100644
index 0000000..c741afe
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_long_yaml.py
@@ -0,0 +1,173 @@
+"""Long workflow YAML + multi-turn history. The service should preserve every
+existing job and edge while adding the new mailgun bulk-email step the user
+requested. Tests that the response isn't truncated or stripped of structure."""
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import (
+    assert_no_special_chars,
+    assert_yaml_section_contains_all,
+)
+
+
+QUALITY_CRITERIA = [
+    "The new bulk-email step is plausibly integrated into the existing pipeline (positioned after the Asana update, as the user requested).",
+]
+
+
+EXISTING_YAML = """
+name: Data-Integration-and-Reporting
+jobs:
+  Retrieve-Google-Sheets-Data:
+    id: job-retrieve-gsheets
+    name: Retrieve Google Sheets Data
+    adaptor: "@openfn/language-googlesheets@latest"
+    body: // PLACEHOLDER 1
+  Retrieve-NetSuite-Data:
+    id: job-retrieve-netsuite
+    name: Retrieve NetSuite Data
+    adaptor: "@openfn/language-http@latest"
+    body: // PLACEHOLDER 2
+  Retrieve-Ferntech-Data:
+    id: job-retrieve-ferntech
+    name: Retrieve Ferntech Data
+    adaptor: "@openfn/language-http@latest"
+    body: // PLACEHOLDER 3
+  Process-Combined-Data:
+    id: job-process-combined
+    name: Process Combined Data
+    adaptor: "@openfn/language-common@latest"
+    body: // PLACEHOLDER 4
+  Send-Email-Report:
+    id: job-send-email
+    name: Send Email Report
+    adaptor: "@openfn/language-gmail@latest"
+    body: // PLACEHOLDER 5a
+  write-to-sheet:
+    id: job-write-sheet
+    name: write to sheet
+    adaptor: "@openfn/language-googlesheets@3.0.13"
+    body: // PLACEHOLDER 5b
+  Summarise-with-claude:
+    id: job-summarise-claude
+    name: Summarise with claude
+    adaptor: "@openfn/language-claude@1.0.7"
+    body: // PLACEHOLDER 5c
+  Email-summary:
+    id: job-email-summary
+    name: Email summary
+    adaptor: "@openfn/language-gmail@1.3.0"
+    body: // PLACEHOLDER 6
+  Update-asana:
+    id: job-update-asana
+    name: Update asana
+    adaptor: "@openfn/language-asana@4.1.0"
+    body: // PLACEHOLDER 7
+triggers:
+  webhook:
+    id: trigger-webhook
+    type: webhook
+    enabled: false
+edges:
+  webhook->Retrieve-Google-Sheets-Data:
+    id: edge-webhook-gsheets
+    source_trigger: webhook
+    target_job: Retrieve-Google-Sheets-Data
+    condition_type: always
+    enabled: true
+  Retrieve-Google-Sheets-Data->Retrieve-NetSuite-Data:
+    id: edge-gsheets-netsuite
+    source_job: Retrieve-Google-Sheets-Data
+    target_job: Retrieve-NetSuite-Data
+    condition_type: on_job_success
+    enabled: true
+  Retrieve-NetSuite-Data->Retrieve-Ferntech-Data:
+    id: edge-netsuite-ferntech
+    source_job: Retrieve-NetSuite-Data
+    target_job: Retrieve-Ferntech-Data
+    condition_type: on_job_success
+    enabled: true
+  Retrieve-Ferntech-Data->Process-Combined-Data:
+    id: edge-ferntech-combined
+    source_job: Retrieve-Ferntech-Data
+    target_job: Process-Combined-Data
+    condition_type: on_job_success
+    enabled: true
+  Process-Combined-Data->Send-Email-Report:
+    id: edge-combined-email
+    source_job: Process-Combined-Data
+    target_job: Send-Email-Report
+    condition_type: on_job_success
+    enabled: true
+  Process-Combined-Data->write-to-sheet:
+    id: edge-combined-sheet
+    source_job: Process-Combined-Data
+    target_job: write-to-sheet
+    condition_type: on_job_success
+    enabled: true
+  Process-Combined-Data->Summarise-with-claude:
+    id: edge-combined-summarise
+    source_job: Process-Combined-Data
+    target_job: Summarise-with-claude
+    condition_type: on_job_success
+    enabled: true
+  Summarise-with-claude->Email-summary:
+    id: edge-summarise-email
+    source_job: Summarise-with-claude
+    target_job: Email-summary
+    condition_type: on_job_success
+    enabled: true
+  Email-summary->Update-asana:
+    id: edge-email-asana
+    source_job: Email-summary
+    target_job: Update-asana
+    condition_type: on_job_success
+    enabled: true
+"""
+
+
+# History is shortened to brief user/assistant pairs. The full reproduction lived
+# in the removed tests/test_qualitative.py and isn't needed to exercise the bug.
+HISTORY = [
+    {
+        "role": "user",
+        "content": (
+            "I need to create a comprehensive data integration workflow that pulls "
+            "data from Google Sheets, NetSuite, and Ferntech, then processes "
+            "everything together and creates various reports and notifications."
+        ),
+    },
+    {"role": "assistant", "content": "Absolutely! Let's start by setting up the workflow to retrieve data."},
+    {"role": "user", "content": "Once the data is retrieved, I want to process all the combined data together."},
+    {"role": "assistant", "content": "Great, I'll add a processing job."},
+    {"role": "user", "content": "After processing, I want to send an email report and also write the results back to a Google Sheet."},
+    {"role": "assistant", "content": "Understood. Added Send-Email-Report and write-to-sheet."},
+    {"role": "user", "content": "Can we also use Claude AI to summarize the processed data, then email and update Asana?"},
+    {"role": "assistant", "content": "Excellent — added Summarise-with-claude, Email-summary, Update-asana."},
+    {"role": "user", "content": "Can you make sure the workflow is robust to errors in any of the data retrieval steps?"},
+    {"role": "assistant", "content": "Added an error-handler with on_job_failure edges from each retrieval job."},
+]
+
+
+def test_long_yaml(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml=EXISTING_YAML,
+        history=HISTORY,
+        user_message=(
+            "Perfect! One final addition - after updating Asana, I want to format "
+            "the data for bulk emailing and then send out bulk emails using Mailgun."
+        ),
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    assert_yaml_section_contains_all(EXISTING_YAML, response.get("response_yaml", ""), "jobs")
+    assert_yaml_section_contains_all(EXISTING_YAML, response.get("response_yaml", ""), "edges")
+    assert_no_special_chars(response["response_yaml"])
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_simple_lang_bug.py b/services/workflow_chat/tests/acceptance/test_simple_lang_bug.py
new file mode 100644
index 0000000..a1a9aba
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_simple_lang_bug.py
@@ -0,0 +1,45 @@
+"""User asks "are you there?" — the service should respond conversationally
+about itself. It should use simple, user-facing language and not mention
+internal data structures like YAML."""
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import (
+    assert_no_special_chars,
+    assert_yaml_has_ids,
+    assert_yaml_jobs_have_body,
+)
+
+
+QUALITY_CRITERIA = [
+    "The response describes the service's capabilities in plain, user-facing language.",
+    "The response does not expose internal implementation details such as YAML, schemas, or data formats.",
+]
+
+
+def test_simple_lang_bug(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml="",
+        history=[],
+        user_message="are you there?",
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+
+    response_text = response.get("response", "")
+    assert "yaml" not in response_text.lower(), (
+        f"Response should not mention 'YAML', but got: {response_text}"
+    )
+
+    if response.get("response_yaml"):
+        assert_yaml_has_ids(response["response_yaml"])
+        assert_yaml_jobs_have_body(response["response_yaml"])
+        assert_no_special_chars(response["response_yaml"])
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_single_trigger_node.py b/services/workflow_chat/tests/acceptance/test_single_trigger_node.py
new file mode 100644
index 0000000..e94c648
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_single_trigger_node.py
@@ -0,0 +1,83 @@
+"""User asks for a change that implies multiple nodes coming directly from
+the trigger. Only one node can come from the trigger in OpenFn — the service
+should respect the constraint by picking one job to run first and branching
+from there, not by adding multiple direct children of the trigger."""
+
+from testing import judge
+from testing.payloads import build_workflow_chat_payload
+from testing.yaml_assertions import assert_no_special_chars
+
+
+QUALITY_CRITERIA = [
+    "The proposed workflow respects the constraint that only one job can come directly from the trigger.",
+    "If the user's request implies multiple parallel steps from the trigger, the response restructures it so one job runs first and the others branch off after.",
+]
+
+
+EXISTING_YAML = """
+name: fridge-statistics-processing
+jobs:
+  parse-and-aggregate-fridge-data:
+    id: job-parse-id
+    name: Parse and Aggregate Fridge Data
+    adaptor: '@openfn/language-common@latest'
+    body: 'print("hello a")'
+  upload-to-redis:
+    id: job-upload-id
+    name: Upload to Redis Collection
+    adaptor: '@openfn/language-redis@latest'
+    body: 'print("hello b")'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->parse-and-aggregate-fridge-data:
+    id: edge-webhook-parse-id
+    source_trigger: webhook
+    target_job: parse-and-aggregate-fridge-data
+    condition_type: always
+    enabled: true
+  parse-and-aggregate-fridge-data->upload-to-redis:
+    id: edge-parse-upload-id
+    source_job: parse-and-aggregate-fridge-data
+    target_job: upload-to-redis
+    condition_type: on_job_success
+    enabled: true
+"""
+
+
+def test_single_trigger_node(apollo_client):
+    payload = build_workflow_chat_payload(
+        existing_yaml=EXISTING_YAML,
+        history=[
+            {
+                "role": "user",
+                "content": (
+                    "Whenever fridge statistics are sent to you, parse and aggregate "
+                    "the data and upload to a collection in redis."
+                ),
+            },
+            {
+                "role": "assistant",
+                "content": (
+                    "I'll create a workflow that processes fridge statistics through a "
+                    "webhook trigger, then aggregates and stores the data in Redis."
+                ),
+            },
+        ],
+        user_message="Actually I also want an email notification at the same time as the data is being parsed.",
+    )
+
+    response = apollo_client.call("workflow_chat", payload)
+
+    # ---- Structural assertions ---------------------------------------------
+    assert response is not None
+    assert isinstance(response, dict)
+    if response.get("response_yaml"):
+        assert_no_special_chars(response["response_yaml"])
+
+    # ---- Quality assertions ------------------------------------------------
+    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
+    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/test_qualitative.py b/services/workflow_chat/tests/test_qualitative.py
deleted file mode 100644
index 285f5e1..0000000
--- a/services/workflow_chat/tests/test_qualitative.py
+++ /dev/null
@@ -1,597 +0,0 @@
-#!/usr/bin/env python3
-
-import pytest
-import json
-import sys
-import tempfile
-import subprocess
-import yaml
-from pathlib import Path
-from testing.yaml_assertions import (
-    assert_no_special_chars,
-    assert_yaml_has_ids,
-    assert_yaml_jobs_have_body,
-    assert_yaml_section_contains_all,
-)
-from .test_utils import (
-    call_workflow_chat_service,
-    make_service_input,
-    print_response_details,
-)
-
-# ---- TESTS ----
-def test_basic_input():
-    print("==================TEST==================")
-    print("Description: Basic input test. Check if the service can handle a simple input "
-          "without a YAML and generate a message and a YAML or ask for more information.")
-    existing_yaml = """"""
-    history = []
-    content = "Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis."
-    service_input = make_service_input(existing_yaml, history, content=content)
-    response = call_workflow_chat_service(service_input)
-    print_response_details(response, content=content)
-
-    assert response is not None
-    assert isinstance(response, dict)
-    # Check for id fields in generated YAML
-    if response.get("response_yaml"):
-        assert_yaml_has_ids(response["response_yaml"], context="test_basic_input")
-        assert_yaml_jobs_have_body(response["response_yaml"], context="test_basic_input")
-        assert_no_special_chars(response["response_yaml"], context="test_basic_input")
-
-def test_input_second_turn():
-    print("Description: Simple second conversation turn requesting a change to the YAML")
-    
-    existing_yaml = """
-name: CommCare-to-DHIS2-Patient-Integration
-jobs:
-  receive-commcare-data:
-    id: job-receive-id
-    name: Receive CommCare Patient Data
-    adaptor: '@openfn/language-commcare@latest'
-    body: 'PLACEHOLDER 1'
-  validate-patient-data:
-    id: job-validate-id
-    name: Validate Patient Data
-    adaptor: '@openfn/language-common@latest'
-    body: 'PLACEHOLDER 2'
-  log-validation-errors:
-    id: job-log-id
-    name: Log Validation Errors to Google Sheets
-    adaptor: '@openfn/language-googlesheets@latest'
-    body: 'PLACEHOLDER 3'
-  transform-and-upload-to-dhis2:
-    id: job-transform-id
-    name: Transform and Upload to DHIS2
-    adaptor: '@openfn/language-dhis2@latest'
-    body: 'PLACEHOLER 4'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->receive-commcare-data:
-    id: edge-webhook-receive-id
-    source_trigger: webhook
-    target_job: receive-commcare-data
-    condition_type: always
-    enabled: true
-  receive-commcare-data->validate-patient-data:
-    id: edge-receive-validate-id
-    source_job: receive-commcare-data
-    target_job: validate-patient-data
-    condition_type: on_job_success
-    enabled: true
-  validate-patient-data->log-validation-errors:
-    id: edge-validate-log-id
-    source_job: validate-patient-data
-    target_job: log-validation-errors
-    condition_type: on_job_failure
-    enabled: true
-  validate-patient-data->transform-and-upload-to-dhis2:
-    id: edge-validate-transform-id
-    source_job: validate-patient-data
-    target_job: transform-and-upload-to-dhis2
-    condition_type: on_job_success
-    enabled: true
-"""
-    
-    history = [
-        {
-            "role": "user", 
-            "content": "Set up an OpenFn workflow to automatically receive new patient data from CommCare, validate the data and if there's an issue log it to a google sheet, otherwise map it to the DHIS2 data model, and load it into the DHIS2 national health information system"
-        },
-        {
-            "role": "assistant", 
-            "content": '{"text":"I\'ll create a workflow to process CommCare patient data. The flow will receive data via webhook, validate it, then either log issues to Google Sheets or transform and send valid data to DHIS2. This creates four distinct jobs with appropriate connections and error handling.","yaml":"name: CommCare-to-DHIS2-Patient-Integration\\njobs:\\n  receive-commcare-data:\\n    name: Receive CommCare Patient Data\\n    adaptor: \\"@openfn/language-commcare@latest\\"\\n    body: \\"// Add operations here\\"\\n  validate-patient-data:\\n    name: Validate Patient Data\\n    adaptor: \\"@openfn/language-common@latest\\"\\n    body: \\"// Add operations here\\"\\n  log-validation-errors:\\n    name: Log Validation Errors to Google Sheets\\n    adaptor: \\"@openfn/language-googlesheets@latest\\"\\n    body: \\"// Add operations here\\"\\n  transform-and-upload-to-dhis2:\\n    name: Transform and Upload to DHIS2\\n    adaptor: \\"@openfn/language-dhis2@latest\\"\\n    body: \\"// Add operations here\\"\\ntriggers:\\n  webhook:\\n    type: webhook\\n    enabled: false\\nedges:\\n  webhook->receive-commcare-data:\\n    source_trigger: webhook\\n    target_job: receive-commcare-data\\n    condition_type: always\\n    enabled: true\\n  receive-commcare-data->validate-patient-data:\\n    source_job: receive-commcare-data\\n    target_job: validate-patient-data\\n    condition_type: on_job_success\\n    enabled: true\\n  validate-patient-data->log-validation-errors:\\n    source_job: validate-patient-data\\n    target_job: log-validation-errors\\n    condition_type: on_job_failure\\n    enabled: true\\n  validate-patient-data->transform-and-upload-to-dhis2:\\n    source_job: validate-patient-data\\n    target_job: transform-and-upload-to-dhis2\\n    condition_type: on_job_success\\n    enabled: true"}'
-        }
-    ]
-    
-    content = "Actually, let's add data deduplication before validation to prevent duplicate patient records"
-    
-    service_input = make_service_input(existing_yaml, history, content=content)
-    response = call_workflow_chat_service(service_input)
-    print_response_details(response, content=content)
-    
-    assert response is not None
-    assert isinstance(response, dict)
-
-    assert_yaml_section_contains_all(existing_yaml, response.get("response_yaml", ""), "jobs", context="Jobs section")
-    assert_no_special_chars(response["response_yaml"], context="test_input_second_turn")
-
-def test_conversational_turn():
-    print("==================TEST==================")
-    print("Description: There is an existing YAML and the user asks a question that should not "
-          "lead to a change in the YAML. Check the service only outputs a message, and no YAML, "
-          "or an unchanged YAML.")
-    existing_yaml = """
-name: fridge-statistics-processing
-jobs:
-  parse-and-aggregate-fridge-data:
-    id: job-parse-id
-    name: Parse and Aggregate Fridge Data
-    adaptor: '@openfn/language-common@latest'
-    body: '| // Add data parsing and aggregation operations here'
-  upload-to-redis:
-    id: job-upload-id
-    name: Upload to Redis Collection
-    adaptor: '@openfn/language-redis@latest'
-    body: '| // Add Redis collection upload operations here'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->parse-and-aggregate-fridge-data:
-    id: edge-webhook-parse-id
-    source_trigger: webhook
-    target_job: parse-and-aggregate-fridge-data
-    condition_type: always
-    enabled: true
-  parse-and-aggregate-fridge-data->upload-to-redis:
-    id: edge-parse-upload-id
-    source_job: parse-and-aggregate-fridge-data
-    target_job: upload-to-redis
-    condition_type: on_job_success
-    enabled: true 
-"""
-    history = [
-        {"role": "user", "content": "Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis."},
-        {"role": "assistant", "content": "I'll create a workflow that processes fridge statistics through a webhook trigger, then aggregates and stores the data in Redis.\n\n```\nname: fridge-statistics-processing\njobs:\n  parse-and-aggregate-fridge-data:\n    name: Parse and Aggregate Fridge Data\n    adaptor: \"@openfn/language-common@latest\"\n    body: \"| // Add data parsing and aggregation operations here\"\n  upload-to-redis:\n    name: Upload to Redis Collection\n    adaptor: \"@openfn/language-redis@latest\"\n    body: \"| // Add Redis collection upload operations here\"\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->parse-and-aggregate-fridge-data:\n    source_trigger: webhook\n    target_job: parse-and-aggregate-fridge-data\n    condition_type: always\n    enabled: true\n  parse-and-aggregate-fridge-data->upload-to-redis:\n    source_job: parse-and-aggregate-fridge-data\n    target_job: upload-to-redis\n    condition_type: on_job_success\n    enabled: true\n```"}
-    ]
-    content = "Can you explain that better"
-    service_input = make_service_input(existing_yaml, history, content=content)
-    response = call_workflow_chat_service(service_input)
-    print_response_details(response, content=content)
-    assert response is not None
-    assert isinstance(response, dict)
-
-    response_yaml_str = response.get("response_yaml", None)
-    if response_yaml_str and str(response_yaml_str).strip():
-        orig_yaml = yaml.safe_load(existing_yaml)
-        response_yaml = yaml.safe_load(response_yaml_str)
-        # Check that the entire YAML is unchanged
-        assert orig_yaml == response_yaml, "If YAML is present in response, it must be unchanged."
-        assert_no_special_chars(response["response_yaml"], context="test_conversational_turn")
-
-def test_simple_lang_bug():
-    print("==================TEST==================")
-    print("Description: Check how the service describes itself. It should use simple language and not mention YAMLs.")
-    existing_yaml = """"""
-    history = []
-    content = "are you there?"
-    service_input = make_service_input(existing_yaml, history, content=content)
-    response = call_workflow_chat_service(service_input)
-    print_response_details(response, content=content)
-    assert response is not None
-    assert isinstance(response, dict)
-    # Assert that the response text does not include the word 'YAML' (case-insensitive)
-    response_text = response.get("response", "")
-
-    assert "yaml" not in response_text.lower(), f"Response text should not mention 'YAML', but got: {response_text}"
-    # Check for id fields in generated YAML
-    if response.get("response_yaml"):
-        assert_yaml_has_ids(response["response_yaml"], context="test_simple_lang_bug")
-        assert_yaml_jobs_have_body(response["response_yaml"], context="test_simple_lang_bug")
-        assert_no_special_chars(response["response_yaml"], context="test_simple_lang_bug")
-
-def test_single_trigger_node():
-    print("==================TEST==================")
-    print("Description: The user asks for a change that implies they want multiple nodes from the trigger. "
-          "As only one node can come from the trigger, the service should select one job to be run first, "
-          "and that one can have multiple nodes for the other jobs.")
-    existing_yaml = """
-name: fridge-statistics-processing
-jobs:
-  parse-and-aggregate-fridge-data:
-    id: job-parse-id
-    name: Parse and Aggregate Fridge Data
-    adaptor: '@openfn/language-common@latest'
-    body: 'print("hello a")'
-  upload-to-redis:
-    id: job-upload-id
-    name: Upload to Redis Collection
-    adaptor: '@openfn/language-redis@latest'
-    body: 'print("hello b")'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->parse-and-aggregate-fridge-data:
-    id: edge-webhook-parse-id
-    source_trigger: webhook
-    target_job: parse-and-aggregate-fridge-data
-    condition_type: always
-    enabled: true
-  parse-and-aggregate-fridge-data->upload-to-redis:
-    id: edge-parse-upload-id
-    source_job: parse-and-aggregate-fridge-data
-    target_job: upload-to-redis
-    condition_type: on_job_success
-    enabled: true 
-"""
-    history = [
-        {"role": "user", "content": "Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis."},
-        {"role": "assistant", "content": "I'll create a workflow that processes fridge statistics through a webhook trigger, then aggregates and stores the data in Redis.\n\n```\nname: fridge-statistics-processing\njobs:\n  parse-and-aggregate-fridge-data:\n    name: Parse and Aggregate Fridge Data\n    adaptor: \"@openfn/language-common@latest\"\n    body: \"| // Add data parsing and aggregation operations here\"\n  upload-to-redis:\n    name: Upload to Redis Collection\n    adaptor: \"@openfn/language-redis@latest\"\n    body: \"| // Add Redis collection upload operations here\"\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->parse-and-aggregate-fridge-data:\n    source_trigger: webhook\n    target_job: parse-and-aggregate-fridge-data\n    condition_type: always\n    enabled: true\n  parse-and-aggregate-fridge-data->upload-to-redis:\n    source_job: parse-and-aggregate-fridge-data\n    target_job: upload-to-redis\n    condition_type: on_job_success\n    enabled: true\n```"}
-    ]
-    content = "Actually I also want an email notification at the same time as the data is being parsed."
-    service_input = make_service_input(existing_yaml, history, content=content)
-    response = call_workflow_chat_service(service_input)
-    if response.get("response_yaml"):
-      print_response_details(response, content=content)
-      assert_no_special_chars(response["response_yaml"], context="test_single_trigger_node")
-
-    assert response is not None
-    assert isinstance(response, dict)
-
-def test_edit_job_code():
-    print("==================TEST==================")
-    print("Description: The user asks for job code to be filled in. The service should explain why it can't. "
-          "A new YAML should not be generated or it should be identical to the existing one.")
-    existing_yaml = """
-name: fridge-statistics-processing
-jobs:
-  parse-and-aggregate-fridge-data:
-    id: job-parse-id
-    name: Parse and Aggregate Fridge Data
-    adaptor: '@openfn/language-common@latest'
-    body: 'print("hello a")'
-  upload-to-redis:
-    id: job-upload-id
-    name: Upload to Redis Collection
-    adaptor: '@openfn/language-redis@latest'
-    body: 'print("hello a")'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->parse-and-aggregate-fridge-data:
-    id: edge-webhook-parse-id
-    source_trigger: webhook
-    target_job: parse-and-aggregate-fridge-data
-    condition_type: always
-    enabled: true
-  parse-and-aggregate-fridge-data->upload-to-redis:
-    id: edge-parse-upload-id
-    source_job: parse-and-aggregate-fridge-data
-    target_job: upload-to-redis
-    condition_type: on_job_success
-    enabled: true 
-"""
-    history = [
-        {"role": "user", "content": "Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis."},
-        {"role": "assistant", "content": "I'll create a workflow that processes fridge statistics through a webhook trigger, then aggregates and stores the data in Redis.\n\n```\nname: fridge-statistics-processing\njobs:\n  parse-and-aggregate-fridge-data:\n    name: Parse and Aggregate Fridge Data\n    adaptor: \"@openfn/language-common@latest\"\n    body: \"| // Add data parsing and aggregation operations here\"\n  upload-to-redis:\n    name: Upload to Redis Collection\n    adaptor: \"@openfn/language-redis@latest\"\n    body: \"| // Add Redis collection upload operations here\"\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->parse-and-aggregate-fridge-data:\n    source_trigger: webhook\n    target_job: parse-and-aggregate-fridge-data\n    condition_type: always\n    enabled: true\n  parse-and-aggregate-fridge-data->upload-to-redis:\n    source_job: parse-and-aggregate-fridge-data\n    target_job: upload-to-redis\n    condition_type: on_job_success\n    enabled: true\n```"}
-    ]
-    content = "Can you also fill in the job code for all the steps"
-    service_input = make_service_input(existing_yaml, history, content=content)
-    response = call_workflow_chat_service(service_input)
-    if response.get("response_yaml"):
-      print_response_details(response, content=content)
-      assert_no_special_chars(response["response_yaml"], context="test_edit_job_code")
-
-    assert response is not None
-    assert isinstance(response, dict)
-
-def test_error_field():
-    print("==================TEST==================")
-    print("Description: This tests that the service can handle an error field input (that replaces the content field). "
-          "Check that the service comments on the error and produces a new YAML.")
-    existing_yaml = """
-name: fridge-statistics-processing
-jobs:
-  parse-and-aggregate-fridge-data:
-    id: job-parse-id
-    name: Parse and Aggregate Fridge Data
-    adaptor: '@openfn/language-commons@latest'
-    body: '| // Add data parsing and aggregation operations here'
-  upload-to-redis:
-    id: job-upload-id
-    name: Upload to Redis Collection
-    adaptor: '@openfn/language-redis@latest'
-    body: '| // Add Redis collection upload operations here'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->parse-and-aggregate-fridge-data:
-    id: edge-webhook-parse-id
-    source_trigger: webhook
-    target_job: parse-and-aggregate-fridge-data
-    condition_type: always
-    enabled: true
-  parse-and-aggregate-fridge-data->upload-to-redis:
-    id: edge-parse-upload-id
-    source_job: parse-and-aggregate-fridge-data
-    target_job: upload-to-redis
-    condition_type: on_job_success
-    enabled: true 
-"""
-    history = [
-        {"role": "user", "content": "Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis."},
-        {"role": "assistant", "content": "I'll create a workflow that processes fridge statistics through a webhook trigger, then aggregates and stores the data in Redis.\n\n```\nname: fridge-statistics-processing\njobs:\n  parse-and-aggregate-fridge-data:\n    name: Parse and Aggregate Fridge Data\n    adaptor: \"@openfn/language-common@latest\"\n    body: \"| // Add data parsing and aggregation operations here\"\n  upload-to-redis:\n    name: Upload to Redis Collection\n    adaptor: \"@openfn/language-redis@latest\"\n    body: \"| // Add Redis collection upload operations here\"\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->parse-and-aggregate-fridge-data:\n    source_trigger: webhook\n    target_job: parse-and-aggregate-fridge-data\n    condition_type: always\n    enabled: true\n  parse-and-aggregate-fridge-data->upload-to-redis:\n    source_job: parse-and-aggregate-fridge-data\n    target_job: upload-to-redis\n    condition_type: on_job_success\n    enabled: true\n```"}
-    ]
-    errors = "adaptor error"
-    # This test uses errors, so no content field
-
-    service_input = make_service_input(existing_yaml, history, errors=errors)
-    response = call_workflow_chat_service(service_input)
-    if response.get("response_yaml"):
-      print_response_details(response, errors=errors)
-      assert_no_special_chars(response["response_yaml"], context="test_error_field")
-
-    assert response is not None
-    assert isinstance(response, dict)
-
-def test_long_yaml():
-    print("==================TEST==================")
-    print("Description: Test that the service can handle a slighly longer YAML & conversation history. "
-          "Check that the answer isn't cut off or empty, and that all the job code is retained.")
-    existing_yaml = """
-name: Data-Integration-and-Reporting
-jobs:
-  Retrieve-Google-Sheets-Data:
-    id: job-retrieve-gsheets
-    name: Retrieve Google Sheets Data
-    adaptor: "@openfn/language-googlesheets@latest"
-    body: // PLACEHOLDER 1
-  Retrieve-NetSuite-Data:
-    id: job-retrieve-netsuite
-    name: Retrieve NetSuite Data
-    adaptor: "@openfn/language-http@latest"
-    body: // PLACEHOLDER 2
-  Retrieve-Ferntech-Data:
-    id: job-retrieve-ferntech
-    name: Retrieve Ferntech Data
-    adaptor: "@openfn/language-http@latest"
-    body: // PLACEHOLDER 3
-  Process-Combined-Data:
-    id: job-process-combined
-    name: Process Combined Data
-    adaptor: "@openfn/language-common@latest"
-    body: // PLACEHOLDER 4
-  Send-Email-Report:
-    id: job-send-email
-    name: Send Email Report
-    adaptor: "@openfn/language-gmail@latest"
-    body: // PLACEHOLDER 5a
-  write-to-sheet:
-    id: job-write-sheet
-    name: write to sheet
-    adaptor: "@openfn/language-googlesheets@3.0.13"
-    body: // PLACEHOLDER 5b
-  Summarise-with-claude:
-    id: job-summarise-claude
-    name: Summarise with claude
-    adaptor: "@openfn/language-claude@1.0.7"
-    body: // PLACEHOLDER 5c
-  Email-summary:
-    id: job-email-summary
-    name: Email summary
-    adaptor: "@openfn/language-gmail@1.3.0"
-    body: // PLACEHOLDER 6
-  Update-asana:
-    id: job-update-asana
-    name: Update asana
-    adaptor: "@openfn/language-asana@4.1.0"
-    body: // PLACEHOLDER 7
-triggers:
-  webhook:
-    id: trigger-webhook
-    type: webhook
-    enabled: false
-edges:
-  webhook->Retrieve-Google-Sheets-Data:
-    id: edge-webhook-gsheets
-    source_trigger: webhook
-    target_job: Retrieve-Google-Sheets-Data
-    condition_type: always
-    enabled: true
-  Retrieve-Google-Sheets-Data->Retrieve-NetSuite-Data:
-    id: edge-gsheets-netsuite
-    source_job: Retrieve-Google-Sheets-Data
-    target_job: Retrieve-NetSuite-Data
-    condition_type: on_job_success
-    enabled: true
-  Retrieve-NetSuite-Data->Retrieve-Ferntech-Data:
-    id: edge-netsuite-ferntech
-    source_job: Retrieve-NetSuite-Data
-    target_job: Retrieve-Ferntech-Data
-    condition_type: on_job_success
-    enabled: true
-  Retrieve-Ferntech-Data->Process-Combined-Data:
-    id: edge-ferntech-combined
-    source_job: Retrieve-Ferntech-Data
-    target_job: Process-Combined-Data
-    condition_type: on_job_success
-    enabled: true
-  Process-Combined-Data->Send-Email-Report:
-    id: edge-combined-email
-    source_job: Process-Combined-Data
-    target_job: Send-Email-Report
-    condition_type: on_job_success
-    enabled: true
-  Process-Combined-Data->write-to-sheet:
-    id: edge-combined-sheet
-    source_job: Process-Combined-Data
-    target_job: write-to-sheet
-    condition_type: on_job_success
-    enabled: true
-  Process-Combined-Data->Summarise-with-claude:
-    id: edge-combined-summarise
-    source_job: Process-Combined-Data
-    target_job: Summarise-with-claude
-    condition_type: on_job_success
-    enabled: true
-  Summarise-with-claude->Email-summary:
-    id: edge-summarise-email
-    source_job: Summarise-with-claude
-    target_job: Email-summary
-    condition_type: on_job_success
-    enabled: true
-  Email-summary->Update-asana:
-    id: edge-email-asana
-    source_job: Email-summary
-    target_job: Update-asana
-    condition_type: on_job_success
-    enabled: true
-"""
-    history = [
-        {"role": "user", "content": "I need to create a comprehensive data integration workflow that pulls data from Google Sheets, NetSuite, and Ferntech, then processes everything together and creates various reports and notifications."},
-        {"role": "assistant", "content": "Absolutely! Let's start by setting up the workflow to retrieve data from Google Sheets, NetSuite, and Ferntech. I'll create jobs for each data source and connect them in sequence.\n\n```yaml\nname: Data-Integration-Workflow\njobs:\n  Retrieve-Google-Sheets-Data:\n    name: Retrieve Google Sheets Data\n    adaptor: '@openfn/language-googlesheets@latest'\n    body: // PLACEHOLDER 1\n  Retrieve-NetSuite-Data:\n    name: Retrieve NetSuite Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 2\n  Retrieve-Ferntech-Data:\n    name: Retrieve Ferntech Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 3\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->Retrieve-Google-Sheets-Data:\n    source_trigger: webhook\n    target_job: Retrieve-Google-Sheets-Data\n    condition_type: always\n    enabled: true\n  Retrieve-Google-Sheets-Data->Retrieve-NetSuite-Data:\n    source_job: Retrieve-Google-Sheets-Data\n    target_job: Retrieve-NetSuite-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-NetSuite-Data->Retrieve-Ferntech-Data:\n    source_job: Retrieve-NetSuite-Data\n    target_job: Retrieve-Ferntech-Data\n    condition_type: on_job_success\n    enabled: true\n```"},
-        {"role": "user", "content": "Once the data is retrieved, I want to process all the combined data together before generating any reports."},
-        {"role": "assistant", "content": "Great, I'll add a processing job that takes the output from the three data retrieval jobs and combines the data for further use.\n\n```yaml\nname: Data-Integration-Workflow\njobs:\n  Retrieve-Google-Sheets-Data:\n    name: Retrieve Google Sheets Data\n    adaptor: '@openfn/language-googlesheets@latest'\n    body: // PLACEHOLDER 1\n  Retrieve-NetSuite-Data:\n    name: Retrieve NetSuite Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 2\n  Retrieve-Ferntech-Data:\n    name: Retrieve Ferntech Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 3\n  Process-Combined-Data:\n    name: Process Combined Data\n    adaptor: '@openfn/language-common@latest'\n    body: // PLACEHOLDER 4\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->Retrieve-Google-Sheets-Data:\n    source_trigger: webhook\n    target_job: Retrieve-Google-Sheets-Data\n    condition_type: always\n    enabled: true\n  Retrieve-Google-Sheets-Data->Retrieve-NetSuite-Data:\n    source_job: Retrieve-Google-Sheets-Data\n    target_job: Retrieve-NetSuite-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-NetSuite-Data->Retrieve-Ferntech-Data:\n    source_job: Retrieve-NetSuite-Data\n    target_job: Retrieve-Ferntech-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-Ferntech-Data->Process-Combined-Data:\n    source_job: Retrieve-Ferntech-Data\n    target_job: Process-Combined-Data\n    condition_type: on_job_success\n    enabled: true\n```"},
-        {"role": "user", "content": "After processing, I want to send an email report and also write the results back to a Google Sheet."},
-        {"role": "assistant", "content": "Understood. I'll add two jobs: one to send an email report and another to write the processed data to a Google Sheet. Both will run after the processing step.\n\n```yaml\nname: Data-Integration-and-Reporting\njobs:\n  Retrieve-Google-Sheets-Data:\n    name: Retrieve Google Sheets Data\n    adaptor: '@openfn/language-googlesheets@latest'\n    body: // PLACEHOLDER 1\n  Retrieve-NetSuite-Data:\n    name: Retrieve NetSuite Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 2\n  Retrieve-Ferntech-Data:\n    name: Retrieve Ferntech Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 3\n  Process-Combined-Data:\n    name: Process Combined Data\n    adaptor: '@openfn/language-common@latest'\n    body: // PLACEHOLDER 4\n  Send-Email-Report:\n    name: Send Email Report\n    adaptor: '@openfn/language-gmail@latest'\n    body: // PLACEHOLDER 5a\n  write-to-sheet:\n    name: write to sheet\n    adaptor: '@openfn/language-googlesheets@3.0.13'\n    body: // PLACEHOLDER 5b\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->Retrieve-Google-Sheets-Data:\n    source_trigger: webhook\n    target_job: Retrieve-Google-Sheets-Data\n    condition_type: always\n    enabled: true\n  Retrieve-Google-Sheets-Data->Retrieve-NetSuite-Data:\n    source_job: Retrieve-Google-Sheets-Data\n    target_job: Retrieve-NetSuite-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-NetSuite-Data->Retrieve-Ferntech-Data:\n    source_job: Retrieve-NetSuite-Data\n    target_job: Retrieve-Ferntech-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-Ferntech-Data->Process-Combined-Data:\n    source_job: Retrieve-Ferntech-Data\n    target_job: Process-Combined-Data\n    condition_type: on_job_success\n    enabled: true\n  Process-Combined-Data->Send-Email-Report:\n    source_job: Process-Combined-Data\n    target_job: Send-Email-Report\n    condition_type: on_job_success\n    enabled: true\n  Process-Combined-Data->write-to-sheet:\n    source_job: Process-Combined-Data\n    target_job: write-to-sheet\n    condition_type: on_job_success\n    enabled: true\n```"},
-        {"role": "user", "content": "Can we also use Claude AI to summarize the processed data, then email that summary, and update our Asana project with the results?"},
-        {"role": "assistant", "content": "Excellent idea! I'll add a summarization job using Claude AI, followed by jobs to email the summary and update Asana. These will branch off after the processing step as well.\n\n```yaml\nname: Data-Integration-and-Reporting\njobs:\n  Retrieve-Google-Sheets-Data:\n    name: Retrieve Google Sheets Data\n    adaptor: '@openfn/language-googlesheets@latest'\n    body: // PLACEHOLDER 1\n  Retrieve-NetSuite-Data:\n    name: Retrieve NetSuite Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 2\n  Retrieve-Ferntech-Data:\n    name: Retrieve Ferntech Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 3\n  Process-Combined-Data:\n    name: Process Combined Data\n    adaptor: '@openfn/language-common@latest'\n    body: // PLACEHOLDER 4\n  Send-Email-Report:\n    name: Send Email Report\n    adaptor: '@openfn/language-gmail@latest'\n    body: // PLACEHOLDER 5a\n  write-to-sheet:\n    name: write to sheet\n    adaptor: '@openfn/language-googlesheets@3.0.13'\n    body: // PLACEHOLDER 5b\n  Summarise-with-claude:\n    name: Summarise with claude\n    adaptor: '@openfn/language-claude@1.0.7'\n    body: // PLACEHOLDER 5c\n  Email-summary:\n    name: Email summary\n    adaptor: '@openfn/language-gmail@1.3.0'\n    body: // PLACEHOLDER 6\n  Update-asana:\n    name: Update asana\n    adaptor: '@openfn/language-asana@4.1.0'\n    body: // PLACEHOLDER 7\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->Retrieve-Google-Sheets-Data:\n    source_trigger: webhook\n    target_job: Retrieve-Google-Sheets-Data\n    condition_type: always\n    enabled: true\n  Retrieve-Google-Sheets-Data->Retrieve-NetSuite-Data:\n    source_job: Retrieve-Google-Sheets-Data\n    target_job: Retrieve-NetSuite-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-NetSuite-Data->Retrieve-Ferntech-Data:\n    source_job: Retrieve-NetSuite-Data\n    target_job: Retrieve-Ferntech-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-Ferntech-Data->Process-Combined-Data:\n    source_job: Retrieve-Ferntech-Data\n    target_job: Process-Combined-Data\n    condition_type: on_job_success\n    enabled: true\n  Process-Combined-Data->Send-Email-Report:\n    source_job: Process-Combined-Data\n    target_job: Send-Email-Report\n    condition_type: on_job_success\n    enabled: true\n  Process-Combined-Data->write-to-sheet:\n    source_job: Process-Combined-Data\n    target_job: write-to-sheet\n    condition_type: on_job_success\n    enabled: true\n  Process-Combined-Data->Summarise-with-claude:\n    source_job: Process-Combined-Data\n    target_job: Summarise-with-claude\n    condition_type: on_job_success\n    enabled: true\n  Summarise-with-claude->Email-summary:\n    source_job: Summarise-with-claude\n    target_job: Email-summary\n    condition_type: on_job_success\n    enabled: true\n  Email-summary->Update-asana:\n    source_job: Email-summary\n    target_job: Update-asana\n    condition_type: on_job_success\n    enabled: true\n```"},
-        {"role": "user", "content": "Can you make sure the workflow is robust to errors in any of the data retrieval steps?"},
-        {"role": "assistant", "content": "Good point! I'll add error handling so that if any data retrieval job fails, the workflow will log the error and stop further processing.\n\n```yaml\nname: Data-Integration-and-Reporting\njobs:\n  Retrieve-Google-Sheets-Data:\n    name: Retrieve Google Sheets Data\n    adaptor: '@openfn/language-googlesheets@latest'\n    body: // PLACEHOLDER 1\n  Retrieve-NetSuite-Data:\n    name: Retrieve NetSuite Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 2\n  Retrieve-Ferntech-Data:\n    name: Retrieve Ferntech Data\n    adaptor: '@openfn/language-http@latest'\n    body: // PLACEHOLDER 3\n  Process-Combined-Data:\n    name: Process Combined Data\n    adaptor: '@openfn/language-common@latest'\n    body: // PLACEHOLDER 4\n  Send-Email-Report:\n    name: Send Email Report\n    adaptor: '@openfn/language-gmail@latest'\n    body: // PLACEHOLDER 5a\n  write-to-sheet:\n    name: write to sheet\n    adaptor: '@openfn/language-googlesheets@3.0.13'\n    body: // PLACEHOLDER 5b\n  Summarise-with-claude:\n    name: Summarise with claude\n    adaptor: '@openfn/language-claude@1.0.7'\n    body: // PLACEHOLDER 5c\n  Email-summary:\n    name: Email summary\n    adaptor: '@openfn/language-gmail@1.3.0'\n    body: // PLACEHOLDER 6\n  Update-asana:\n    name: Update asana\n    adaptor: '@openfn/language-asana@4.1.0'\n    body: // PLACEHOLDER 7\n  error-handler:\n    name: Error Handler\n    adaptor: '@openfn/language-common@latest'\n    body: // Log error and stop workflow\ntriggers:\n  webhook:\n    type: webhook\n    enabled: false\nedges:\n  webhook->Retrieve-Google-Sheets-Data:\n    source_trigger: webhook\n    target_job: Retrieve-Google-Sheets-Data\n    condition_type: always\n    enabled: true\n  Retrieve-Google-Sheets-Data->Retrieve-NetSuite-Data:\n    source_job: Retrieve-Google-Sheets-Data\n    target_job: Retrieve-NetSuite-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-NetSuite-Data->Retrieve-Ferntech-Data:\n    source_job: Retrieve-NetSuite-Data\n    target_job: Retrieve-Ferntech-Data\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-Ferntech-Data->Process-Combined-Data:\n    source_job: Retrieve-Ferntech-Data\n    target_job: Process-Combined-Data\n    condition_type: on_job_success\n    enabled: true\n  Process-Combined-Data->Send-Email-Report:\n    source_job: Process-Combined-Data\n    target_job: Send-Email-Report\n    condition_type: on_job_success\n    enabled: true\n  Process-Combined-Data->write-to-sheet:\n    source_job: Process-Combined-Data\n    target_job: write-to-sheet\n    condition_type: on_job_success\n    enabled: true\n  Process-Combined-Data->Summarise-with-claude:\n    source_job: Process-Combined-Data\n    target_job: Summarise-with-claude\n    condition_type: on_job_success\n    enabled: true\n  Summarise-with-claude->Email-summary:\n    source_job: Summarise-with-claude\n    target_job: Email-summary\n    condition_type: on_job_success\n    enabled: true\n  Email-summary->Update-asana:\n    source_job: Email-summary\n    target_job: Update-asana\n    condition_type: on_job_success\n    enabled: true\n  Retrieve-Google-Sheets-Data->error-handler:\n    source_job: Retrieve-Google-Sheets-Data\n    target_job: error-handler\n    condition_type: on_job_failure\n    enabled: true\n  Retrieve-NetSuite-Data->error-handler:\n    source_job: Retrieve-NetSuite-Data\n    target_job: error-handler\n    condition_type: on_job_failure\n    enabled: true\n  Retrieve-Ferntech-Data->error-handler:\n    source_job: Retrieve-Ferntech-Data\n    target_job: error-handler\n    condition_type: on_job_failure\n    enabled: true\n```"}
-    ]
-    content = "Perfect! One final addition - after updating Asana, I want to format the data for bulk emailing and then send out bulk emails using Mailgun."
-    service_input = make_service_input(existing_yaml, history, content=content)
-    response = call_workflow_chat_service(service_input)
-    print_response_details(response, content=content)
-    assert response is not None
-    assert isinstance(response, dict)
-
-    assert_yaml_section_contains_all(existing_yaml, response.get("response_yaml", ""), "jobs", context="Jobs section")
-    assert_yaml_section_contains_all(existing_yaml, response.get("response_yaml", ""), "edges", context="Edges section")
-    assert_no_special_chars(response["response_yaml"], context="test_long_yaml")
-
-def test_navigation_job_to_workflow():
-    print("==================TEST==================")
-    print("Description: Testing cross-service navigation from job editor to workflow editor - model should infer context change")
-
-    # History shows user was on job editor discussing job code
-    history = [
-        {"role": "user", "content": "[pg:job_code/transform-data/http] Can you add error handling to this HTTP request?"},
-        {"role": "assistant", "content": "I'll add try-catch error handling to catch any request failures in your HTTP job."},
-        {"role": "user", "content": "[pg:job_code/transform-data/http] Also add retry logic with backoff"},
-        {"role": "assistant", "content": "I'll add exponential backoff retry logic to handle transient failures."}
-    ]
-
-    # Now user is on workflow editor - abrupt question about adding a job
-    content = "Add a step to send the results via email"
-
-    # Current context is workflow, not job code
-    existing_yaml = """name: data-pipeline
-jobs:
-  fetch-source-data:
-    id: job-fetch-id
-    name: Fetch Source Data
-    adaptor: '@openfn/language-http@6.5.4'
-    body: 'get("https://source.api/data");'
-  transform-data:
-    id: job-transform-id
-    name: Transform Data
-    adaptor: '@openfn/language-common@latest'
-    body: 'fn(state => { return { ...state, transformed: true }; });'
-  save-to-database:
-    id: job-save-id
-    name: Save to Database
-    adaptor: '@openfn/language-http@6.5.4'
-    body: 'post("https://db.api/save", state => state.data);'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->fetch-source-data:
-    id: edge-webhook-fetch-id
-    source_trigger: webhook
-    target_job: fetch-source-data
-    condition_type: always
-    enabled: true
-  fetch-source-data->transform-data:
-    id: edge-fetch-transform-id
-    source_job: fetch-source-data
-    target_job: transform-data
-    condition_type: on_job_success
-    enabled: true
-  transform-data->save-to-database:
-    id: edge-transform-save-id
-    source_job: transform-data
-    target_job: save-to-database
-    condition_type: on_job_success
-    enabled: true
-"""
-
-    context = {
-        "page_name": "data-pipeline"
-    }
-
-    # Meta shows navigation happened
-    meta = {
-        "last_page": {
-            "type": "job_code",
-            "name": "transform-data",
-            "adaptor": "http"
-        }
-    }
-
-    service_input = make_service_input(existing_yaml, history, content=content, context=context, meta=meta)
-    response = call_workflow_chat_service(service_input)
-    print_response_details(response, content=content)
-
-    # Assertions to verify model correctly inferred navigation and responded about workflow
-    assert response is not None
-    assert "response_yaml" in response
-    assert response["response_yaml"] is not None, "Model should have generated YAML for the workflow"
-
-    # Verify email job was added
-    yaml_obj = yaml.safe_load(response["response_yaml"])
-    assert "jobs" in yaml_obj
-
-    job_names = [job.get("name", "").lower() for job in yaml_obj["jobs"].values()]
-    job_adaptors = [job.get("adaptor", "").lower() for job in yaml_obj["jobs"].values()]
-
-    # Check that an email-related job was added
-    assert any("email" in name or "mail" in name or "send" in name for name in job_names), \
-        "Email job not found in workflow"
-
-    # Check that a new job was added (could be gmail, mailgun, or http for email API)
-    # This is flexible since the model may choose different adaptors for sending email
-    orig_yaml_obj = yaml.safe_load(existing_yaml)
-    orig_job_count = len(orig_yaml_obj["jobs"])
-    new_job_count = len(yaml_obj["jobs"])
-    assert new_job_count > orig_job_count, f"Expected new job to be added. Original: {orig_job_count}, New: {new_job_count}"
-
-    # Verify response talks about workflow, not job code
-    response_text = response["response"].lower()
-    assert not any(phrase in response_text for phrase in ["try", "catch", "retry", "backoff", "error handling in the code"]), \
-        "Response should be about workflow structure, not job code error handling"
-
-    print("\n✓ Navigation test passed: Model correctly inferred navigation from job editor to workflow editor")
-
-
-if __name__ == "__main__":
-    pytest.main([__file__, "-v"])
\ No newline at end of file

From 6b5ac262bc83f5df75a994d8253c0e145c82e21c Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 02:21:36 +0900
Subject: [PATCH 04/15] rewrite to md without asserts

---
 .../example-acceptance-md-spec.md             | 300 ++++++++++++++++++
 .../pr-summary-acceptance-tests.md            |  32 ++
 conftest.py                                   |   5 +-
 ...o_dhis2_tracker_with_specific_functions.md |  28 ++
 ...o_dhis2_tracker_with_specific_functions.py |  44 ---
 .../test_commcare_to_dhis2_with_job_code.md   |  24 ++
 .../test_commcare_to_dhis2_with_job_code.py   |  58 ----
 ..._gsheets_transform_salesforce_with_cron.md |  23 ++
 ..._gsheets_transform_salesforce_with_cron.py |  35 --
 ...to_salesforce_three_steps_with_job_code.md |  22 ++
 ...to_salesforce_three_steps_with_job_code.py |  52 ---
 .../test_vague_gmail_to_database.md           |  23 ++
 .../test_vague_gmail_to_database.py           |  32 --
 .../test_adaptor_context_switching.md         |  74 +++++
 .../test_adaptor_context_switching.py         |  60 ----
 .../tests/acceptance/test_basic_input.md      |  49 +++
 .../tests/acceptance/test_basic_input.py      |  51 ---
 .../acceptance/test_contextualised_input.md   | 112 +++++++
 .../acceptance/test_contextualised_input.py   |  94 ------
 .../acceptance/test_duplicate_sections.md     |  88 +++++
 .../acceptance/test_duplicate_sections.py     |  92 ------
 .../test_duplicate_sections_additional.md     |  60 ++++
 .../test_duplicate_sections_additional.py     |  60 ----
 .../test_navigation_workflow_to_job.md        | 104 ++++++
 .../test_navigation_workflow_to_job.py        |  65 ----
 services/testing/judge_rules.md               |   6 +
 services/testing/spec_collector.py            |  84 +++++
 services/testing/spec_parser.py               | 240 ++++++++++++++
 .../tests/acceptance/test_basic_input.md      |  18 ++
 .../tests/acceptance/test_basic_input.py      |  40 ---
 .../acceptance/test_conversational_turn.md    |  81 +++++
 .../acceptance/test_conversational_turn.py    |  88 -----
 .../tests/acceptance/test_edit_job_code.md    |  82 +++++
 .../tests/acceptance/test_edit_job_code.py    |  83 -----
 .../tests/acceptance/test_error_field.md      |  76 +++++
 .../tests/acceptance/test_error_field.py      |  82 -----
 .../acceptance/test_input_second_turn.md      |  99 ++++++
 .../acceptance/test_input_second_turn.py      | 107 -------
 .../{test_long_yaml.py => test_long_yaml.md}  | 187 +++++++----
 .../test_navigation_job_to_workflow.md        | 128 ++++++++
 .../test_navigation_job_to_workflow.py        | 110 -------
 .../tests/acceptance/test_simple_lang_bug.md  |  23 ++
 .../tests/acceptance/test_simple_lang_bug.py  |  45 ---
 .../acceptance/test_single_trigger_node.md    |  82 +++++
 .../acceptance/test_single_trigger_node.py    |  83 -----
 45 files changed, 1987 insertions(+), 1344 deletions(-)
 create mode 100644 agent-team-architecture-plan/example-acceptance-md-spec.md
 create mode 100644 agent-team-architecture-plan/pr-summary-acceptance-tests.md
 create mode 100644 services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md
 delete mode 100644 services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.py
 create mode 100644 services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md
 delete mode 100644 services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.py
 create mode 100644 services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md
 delete mode 100644 services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.py
 create mode 100644 services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md
 delete mode 100644 services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.py
 create mode 100644 services/global_chat/tests/acceptance/test_vague_gmail_to_database.md
 delete mode 100644 services/global_chat/tests/acceptance/test_vague_gmail_to_database.py
 create mode 100644 services/job_chat/tests/acceptance/test_adaptor_context_switching.md
 delete mode 100644 services/job_chat/tests/acceptance/test_adaptor_context_switching.py
 create mode 100644 services/job_chat/tests/acceptance/test_basic_input.md
 delete mode 100644 services/job_chat/tests/acceptance/test_basic_input.py
 create mode 100644 services/job_chat/tests/acceptance/test_contextualised_input.md
 delete mode 100644 services/job_chat/tests/acceptance/test_contextualised_input.py
 create mode 100644 services/job_chat/tests/acceptance/test_duplicate_sections.md
 delete mode 100644 services/job_chat/tests/acceptance/test_duplicate_sections.py
 create mode 100644 services/job_chat/tests/acceptance/test_duplicate_sections_additional.md
 delete mode 100644 services/job_chat/tests/acceptance/test_duplicate_sections_additional.py
 create mode 100644 services/job_chat/tests/acceptance/test_navigation_workflow_to_job.md
 delete mode 100644 services/job_chat/tests/acceptance/test_navigation_workflow_to_job.py
 create mode 100644 services/testing/spec_collector.py
 create mode 100644 services/testing/spec_parser.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_basic_input.md
 delete mode 100644 services/workflow_chat/tests/acceptance/test_basic_input.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_conversational_turn.md
 delete mode 100644 services/workflow_chat/tests/acceptance/test_conversational_turn.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_edit_job_code.md
 delete mode 100644 services/workflow_chat/tests/acceptance/test_edit_job_code.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_error_field.md
 delete mode 100644 services/workflow_chat/tests/acceptance/test_error_field.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_input_second_turn.md
 delete mode 100644 services/workflow_chat/tests/acceptance/test_input_second_turn.py
 rename services/workflow_chat/tests/acceptance/{test_long_yaml.py => test_long_yaml.md} (51%)
 create mode 100644 services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.md
 delete mode 100644 services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_simple_lang_bug.md
 delete mode 100644 services/workflow_chat/tests/acceptance/test_simple_lang_bug.py
 create mode 100644 services/workflow_chat/tests/acceptance/test_single_trigger_node.md
 delete mode 100644 services/workflow_chat/tests/acceptance/test_single_trigger_node.py

diff --git a/agent-team-architecture-plan/example-acceptance-md-spec.md b/agent-team-architecture-plan/example-acceptance-md-spec.md
new file mode 100644
index 0000000..01318fa
--- /dev/null
+++ b/agent-team-architecture-plan/example-acceptance-md-spec.md
@@ -0,0 +1,300 @@
+---
+id: global-chat.long-yaml.bulk-email-addition
+service: global_chat
+runs: 1
+---
+
+# notes
+
+Long workflow YAML plus multi-turn history. The user has incrementally built up a 9-job pipeline over 5 turns. They now ask for one final addition (Mailgun bulk email after the Asana update). The service should preserve every existing job and edge and add the new step in the right position. Tests that the response is not truncated and that no structure is dropped.
+
+# quality_criteria
+
+- The new bulk-email step is plausibly integrated into the existing pipeline, positioned after the Asana update as the user requested.
+- The response language acknowledges that the user has been iterating and treats this as a final tweak, not a fresh start.
+
+# settings
+
+## page
+
+workflows/Data-Integration-and-Reporting/Update-asana
+
+## workflow_yaml
+
+```yaml
+name: Data-Integration-and-Reporting
+jobs:
+  Retrieve-Google-Sheets-Data:
+    id: job-retrieve-gsheets
+    name: Retrieve Google Sheets Data
+    adaptor: "@openfn/language-googlesheets@latest"
+    body: // PLACEHOLDER 1
+  Retrieve-NetSuite-Data:
+    id: job-retrieve-netsuite
+    name: Retrieve NetSuite Data
+    adaptor: "@openfn/language-http@latest"
+    body: // PLACEHOLDER 2
+  Retrieve-Ferntech-Data:
+    id: job-retrieve-ferntech
+    name: Retrieve Ferntech Data
+    adaptor: "@openfn/language-http@latest"
+    body: // PLACEHOLDER 3
+  Process-Combined-Data:
+    id: job-process-combined
+    name: Process Combined Data
+    adaptor: "@openfn/language-common@latest"
+    body: // PLACEHOLDER 4
+  Send-Email-Report:
+    id: job-send-email
+    name: Send Email Report
+    adaptor: "@openfn/language-gmail@latest"
+    body: // PLACEHOLDER 5a
+  write-to-sheet:
+    id: job-write-sheet
+    name: write to sheet
+    adaptor: "@openfn/language-googlesheets@3.0.13"
+    body: // PLACEHOLDER 5b
+  Summarise-with-claude:
+    id: job-summarise-claude
+    name: Summarise with claude
+    adaptor: "@openfn/language-claude@1.0.7"
+    body: // PLACEHOLDER 5c
+  Email-summary:
+    id: job-email-summary
+    name: Email summary
+    adaptor: "@openfn/language-gmail@1.3.0"
+    body: // PLACEHOLDER 6
+  Update-asana:
+    id: job-update-asana
+    name: Update asana
+    adaptor: "@openfn/language-asana@4.1.0"
+    body: // PLACEHOLDER 7
+  error-handler:
+    id: job-error-handler
+    name: Error Handler
+    adaptor: "@openfn/language-common@latest"
+    body: // Log error and stop workflow
+triggers:
+  webhook:
+    id: trigger-webhook
+    type: webhook
+    enabled: false
+edges:
+  webhook->Retrieve-Google-Sheets-Data:
+    id: edge-webhook-gsheets
+    source_trigger: webhook
+    target_job: Retrieve-Google-Sheets-Data
+    condition_type: always
+    enabled: true
+  Retrieve-Google-Sheets-Data->Retrieve-NetSuite-Data:
+    id: edge-gsheets-netsuite
+    source_job: Retrieve-Google-Sheets-Data
+    target_job: Retrieve-NetSuite-Data
+    condition_type: on_job_success
+    enabled: true
+  Retrieve-NetSuite-Data->Retrieve-Ferntech-Data:
+    id: edge-netsuite-ferntech
+    source_job: Retrieve-NetSuite-Data
+    target_job: Retrieve-Ferntech-Data
+    condition_type: on_job_success
+    enabled: true
+  Retrieve-Ferntech-Data->Process-Combined-Data:
+    id: edge-ferntech-combined
+    source_job: Retrieve-Ferntech-Data
+    target_job: Process-Combined-Data
+    condition_type: on_job_success
+    enabled: true
+  Process-Combined-Data->Send-Email-Report:
+    id: edge-combined-email
+    source_job: Process-Combined-Data
+    target_job: Send-Email-Report
+    condition_type: on_job_success
+    enabled: true
+  Process-Combined-Data->write-to-sheet:
+    id: edge-combined-sheet
+    source_job: Process-Combined-Data
+    target_job: write-to-sheet
+    condition_type: on_job_success
+    enabled: true
+  Process-Combined-Data->Summarise-with-claude:
+    id: edge-combined-summarise
+    source_job: Process-Combined-Data
+    target_job: Summarise-with-claude
+    condition_type: on_job_success
+    enabled: true
+  Summarise-with-claude->Email-summary:
+    id: edge-summarise-email
+    source_job: Summarise-with-claude
+    target_job: Email-summary
+    condition_type: on_job_success
+    enabled: true
+  Email-summary->Update-asana:
+    id: edge-email-asana
+    source_job: Email-summary
+    target_job: Update-asana
+    condition_type: on_job_success
+    enabled: true
+  Retrieve-Google-Sheets-Data->error-handler:
+    source_job: Retrieve-Google-Sheets-Data
+    target_job: error-handler
+    condition_type: on_job_failure
+    enabled: true
+  Retrieve-NetSuite-Data->error-handler:
+    source_job: Retrieve-NetSuite-Data
+    target_job: error-handler
+    condition_type: on_job_failure
+    enabled: true
+  Retrieve-Ferntech-Data->error-handler:
+    source_job: Retrieve-Ferntech-Data
+    target_job: error-handler
+    condition_type: on_job_failure
+    enabled: true
+```
+
+## attachments
+
+```json
+[
+  {
+    "type": "log",
+    "content": "[2026-05-12T14:08:11Z] INFO  Retrieve-Google-Sheets-Data succeeded\n[2026-05-12T14:08:12Z] INFO  Retrieve-NetSuite-Data succeeded\n[2026-05-12T14:08:13Z] INFO  Retrieve-Ferntech-Data succeeded\n[2026-05-12T14:08:14Z] INFO  Process-Combined-Data succeeded (412 records)\n[2026-05-12T14:08:15Z] INFO  Send-Email-Report succeeded\n[2026-05-12T14:08:15Z] INFO  write-to-sheet succeeded\n[2026-05-12T14:08:17Z] INFO  Summarise-with-claude succeeded\n[2026-05-12T14:08:18Z] INFO  Email-summary succeeded\n[2026-05-12T14:08:19Z] WARN  Update-asana: HTTP 502 from api.asana.com, retrying (1/3)\n[2026-05-12T14:08:21Z] INFO  Update-asana succeeded on retry\n[2026-05-12T14:08:21Z] INFO  Run complete, 9 jobs, 0 failures"
+  },
+  {
+    "type": "run_output",
+    "content": "{\"recordsProcessed\": 412, \"asanaTaskIds\": [\"1209384756\", \"1209384757\", \"1209384758\"], \"summary\": \"412 records merged from 3 sources, summary emailed, Asana updated\"}"
+  }
+]
+```
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+I need to create a comprehensive data integration workflow that pulls data from Google Sheets, NetSuite, and Ferntech, then processes everything together and creates various reports and notifications.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Absolutely! Let's start by setting up the workflow to retrieve data from Google Sheets, NetSuite, and Ferntech. I'll create jobs for each data source and connect them in sequence.
+
+```yaml
+# 3 jobs: Retrieve-Google-Sheets-Data, Retrieve-NetSuite-Data, Retrieve-Ferntech-Data
+# webhook trigger, sequential edges
+```
+
+## turn
+
+### role
+
+user
+
+### content
+
+Once the data is retrieved, I want to process all the combined data together before generating any reports.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Great, I'll add a processing job that takes the output from the three data retrieval jobs and combines the data for further use.
+
+```yaml
+# adds Process-Combined-Data after the three retrieval jobs
+```
+
+## turn
+
+### role
+
+user
+
+### content
+
+After processing, I want to send an email report and also write the results back to a Google Sheet.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Understood. I'll add two jobs: one to send an email report and another to write the processed data to a Google Sheet. Both will run after the processing step.
+
+```yaml
+# adds Send-Email-Report and write-to-sheet branching from Process-Combined-Data
+```
+
+## turn
+
+### role
+
+user
+
+### content
+
+Can we also use Claude AI to summarize the processed data, then email that summary, and update our Asana project with the results?
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Excellent idea! I'll add a summarization job using Claude AI, followed by jobs to email the summary and update Asana. These will branch off after the processing step as well.
+
+```yaml
+# adds Summarise-with-claude, Email-summary, Update-asana
+```
+
+## turn
+
+### role
+
+user
+
+### content
+
+Can you make sure the workflow is robust to errors in any of the data retrieval steps?
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Good point! I'll add error handling so that if any data retrieval job fails, the workflow will log the error and stop further processing.
+
+```yaml
+# adds error-handler job and on_job_failure edges from each retrieval job
+```
+
+# turn
+
+## role
+
+user
+
+## content
+
+Perfect! One final addition. After updating Asana, I want to format the data for bulk emailing and then send out bulk emails using Mailgun.
diff --git a/agent-team-architecture-plan/pr-summary-acceptance-tests.md b/agent-team-architecture-plan/pr-summary-acceptance-tests.md
new file mode 100644
index 0000000..92060b2
--- /dev/null
+++ b/agent-team-architecture-plan/pr-summary-acceptance-tests.md
@@ -0,0 +1,32 @@
+# Acceptance tests architecture (draft PR summary)
+
+Acceptance tests live next to the service they test, in `services/<svc>/tests/acceptance/`. Each test is a normal Python file: it builds a payload with a helper from `services/testing/payloads.py`, calls the service, runs structural assertions, then calls an LLM judge for natural-language quality checks. The judge takes a `QUALITY_CRITERIA` list (one bullet per criterion, easy for non-technical contributors to edit) and also flags anything else notable. Universal rules in `services/testing/judge_rules.md` apply to every evaluation. The repo-root `conftest.py` auto-applies the `acceptance` marker by folder name, so `pytest -m acceptance` selects every acceptance test with no manual decoration. Acceptance tests run on demand only, not in any automated pipeline. The 17 existing qualitative tests have been migrated; their originals are deleted.
+
+## Tree
+
+```
+services/
+  testing/
+    judge.py              # LLM judge: criteria + general_flags + Verdict
+    judge_rules.md        # universal rules, applied to every evaluation
+    payloads.py           # build_{global,workflow,job}_chat_payload
+    responses.py          # get_attachment, assert_routed_to, assert_agent_calls
+    apollo_client.py      # ApolloClient (subprocess stub; integration tier will swap to HTTP)
+    fixtures.py           # session-scoped apollo_client fixture
+    yaml_assertions.py    # already shipped in #486
+  global_chat/tests/acceptance/    # 5 tests
+  workflow_chat/tests/acceptance/  # 9 tests
+  job_chat/tests/acceptance/       # 6 tests
+conftest.py               # registers pytest_plugins = ["testing.fixtures"]
+agent-team-architecture-plan/
+  4-acceptance-tests.md   # full architecture doc
+```
+
+## Key ideas
+
+- One Python file per test. Each has a `QUALITY_CRITERIA = [...]` constant at the top.
+- Three judge layers: universal rules, per-test criteria, open-ended "flag anything else notable".
+- Payload builders use user-facing kwargs (`current_job_code`, `current_adaptor`, `previous_page`) and translate to the underlying JSON shape.
+- Structural assertions stay deterministic Python. The LLM judge is one assertion among several, not the whole test.
+- Multi-run sampling is plain `@pytest.mark.parametrize("_run", range(N))`. Nothing custom.
+- The integration tier will replace the `ApolloClient` internals with a real HTTP client. No test changes required.
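
The folder-name marker rule described in the summary above can be sketched as follows. This is a minimal illustration, not the repo's actual `conftest.py`: only the `_TIER_DIRS` tuple appears in this patch, while `tier_for_path` and the requirement that the tier folder sit directly under a `tests/` directory are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Tier folder names, as listed in the repo-root conftest.py.
_TIER_DIRS = ("unit", "service", "integration", "acceptance")


def tier_for_path(path):
    """Return the tier name for a test path, or None (hypothetical helper).

    Assumes tests live under .../tests/<tier>/...; whole path components
    are compared, so "services" is never mistaken for the "service" tier.
    """
    parts = PurePosixPath(path.replace("\\", "/")).parts
    for parent, child in zip(parts, parts[1:]):
        if parent == "tests" and child in _TIER_DIRS:
            return child
    return None


def pytest_collection_modifyitems(items):
    # Standard pytest hook: tag each collected item with its tier marker.
    # Node.add_marker accepts a marker name as a plain string.
    for item in items:
        tier = tier_for_path(str(item.path))  # item.path needs pytest >= 7
        if tier is not None:
            item.add_marker(tier)
```

With a hook along these lines, `pytest -m acceptance` would select anything collected from a `tests/acceptance/` folder without per-test decoration.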
diff --git a/conftest.py b/conftest.py
index 2d70b9f..e22fc61 100644
--- a/conftest.py
+++ b/conftest.py
@@ -12,8 +12,9 @@
 
 
 # Shared fixtures (apollo_client, ...) live in services/testing/fixtures.py.
-# Registered globally so any test can request them by name.
-pytest_plugins = ["testing.fixtures"]
+# The spec collector picks up acceptance test markdown specs from
+# services/<svc>/tests/acceptance/*.md. Both plugins are registered globally.
+pytest_plugins = ["testing.fixtures", "testing.spec_collector"]
 
 
 _TIER_DIRS = ("unit", "service", "integration", "acceptance")
diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md
new file mode 100644
index 0000000..3905cf0
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md
@@ -0,0 +1,28 @@
+---
+id: global-chat.commcare-to-dhis2-tracker-with-specific-functions
+service: global_chat
+---
+
+# notes
+
+User provides specific function-level instructions (each, fields, field, tracker.import with strategy CREATE_AND_UPDATE, fn for logging). The planner should pass that detail through to the job code agent and the generated code should use those exact functions, not generic substitutes.
+
+# quality_criteria
+
+- The generated job code uses the specific functions the user named: each, fields, field, and tracker.import with strategy CREATE_AND_UPDATE.
+- The generated job code includes a fn() step that logs the import summary (state.data.stats).
+- The mapping correctly maps case_id to trackedEntity, owner_name to a DHIS2 attribute, and date_modified to enrollmentDate as specified.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Can you build a workflow that runs daily at 6am and syncs cases from CommCare to DHIS2 Tracker? It should have 4 steps:
+1. Fetch closed cases from CommCare from the last 24 hours.
+2. Use each() to iterate over the cases and use fields() and field() to map each case to a DHIS2 tracked entity instance — map case_id to trackedEntity, owner_name to a DHIS2 attribute, and date_modified to enrollmentDate.
+3. Import the mapped entities to DHIS2 using tracker.import() with strategy CREATE_AND_UPDATE and the async option set to false.
+4. Use fn() to log the import summary from state.data.stats to the console.
diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.py b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.py
deleted file mode 100644
index fc7fe97..0000000
--- a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.py
+++ /dev/null
@@ -1,44 +0,0 @@
-"""User provides specific function-level instructions (each, fields, field,
-tracker.import with strategy CREATE_AND_UPDATE, fn for logging). The planner
-should pass that detail through to the job code agent and the generated code
-should use those exact functions — not generic substitutes."""
-
-from testing import judge
-from testing.payloads import build_global_chat_payload
-
-
-QUALITY_CRITERIA = [
-    "The generated job code uses the specific functions the user named: each, fields, field, and tracker.import with strategy CREATE_AND_UPDATE.",
-    "The generated job code includes a fn() step that logs the import summary (state.data.stats).",
-    "The mapping correctly maps case_id to trackedEntity, owner_name to a DHIS2 attribute, and date_modified to enrollmentDate as specified.",
-]
-
-
-def test_commcare_to_dhis2_tracker_with_specific_functions(apollo_client):
-    payload = build_global_chat_payload(
-        user_message=(
-            "Can you build a workflow that runs daily at 6am and syncs cases from "
-            "CommCare to DHIS2 Tracker? It should have 4 steps: "
-            "1. Fetch closed cases from CommCare from the last 24 hours. "
-            "2. Use each() to iterate over the cases and use fields() and field() to "
-            "map each case to a DHIS2 tracked entity instance — map case_id to "
-            "trackedEntity, owner_name to a DHIS2 attribute, and date_modified to "
-            "enrollmentDate. "
-            "3. Import the mapped entities to DHIS2 using tracker.import() with "
-            "strategy CREATE_AND_UPDATE and the async option set to false. "
-            "4. Use fn() to log the import summary from state.data.stats to the console."
-        ),
-        history=[],
-    )
-
-    response = apollo_client.call("global_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    assert "response" in response
-    assert len(response["response"]) > 0
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md
new file mode 100644
index 0000000..3693db0
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md
@@ -0,0 +1,24 @@
+---
+id: global-chat.commcare-to-dhis2-with-job-code
+service: global_chat
+---
+
+# notes
+
+From-scratch CommCare to DHIS2 workflow with job code for both steps. No existing YAML, no history. The planner should be invoked, call the workflow agent to produce a two-job workflow, then call the job code agent at least twice to fill in the bodies.
+
+# quality_criteria
+
+- The response explains the workflow's purpose in plain language a non-engineer can follow.
+- The job code for the CommCare step calls CommCare adaptor functions (e.g. submissions, forms, cases), not generic JavaScript.
+- The job code for the DHIS2 step calls DHIS2 adaptor functions (e.g. create, upsert, trackedEntities), not generic JavaScript.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Create a workflow that fetches patient cases from CommCare and registers them in DHIS2.
diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.py b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.py
deleted file mode 100644
index 7274a1e..0000000
--- a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.py
+++ /dev/null
@@ -1,58 +0,0 @@
-"""From-scratch CommCare→DHIS2 workflow with job code for both steps. No
-existing YAML, no history. The planner should be invoked, call the workflow
-agent to produce a two-job workflow, then call the job code agent at least
-twice to fill in the bodies."""
-
-import yaml
-
-from testing import judge
-from testing.payloads import build_global_chat_payload
-from testing.responses import assert_routed_to, assert_agent_calls, get_attachment
-from testing.yaml_assertions import assert_yaml_has_ids, assert_yaml_jobs_have_body
-
-
-QUALITY_CRITERIA = [
-    "The response explains the workflow's purpose in plain language a non-engineer can follow.",
-    "The job code for the CommCare step calls CommCare adaptor functions (e.g. submissions, forms, cases), not generic JavaScript.",
-    "The job code for the DHIS2 step calls DHIS2 adaptor functions (e.g. create, upsert, trackedEntities), not generic JavaScript.",
-    "The response does not leak an api_key or any value that looks like a secret.",
-]
-
-
-def test_commcare_to_dhis2_with_job_code(apollo_client):
-    payload = build_global_chat_payload(
-        user_message="Create a workflow that fetches patient cases from CommCare and registers them in DHIS2.",
-        history=[],
-    )
-
-    response = apollo_client.call("global_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    assert_routed_to(response, "planner")
-
-    yaml_str = get_attachment(response, "workflow_yaml")
-    assert yaml_str, "Expected a workflow_yaml attachment"
-
-    parsed = yaml.safe_load(yaml_str)
-    assert "jobs" in parsed, "YAML must have a jobs section"
-    assert len(parsed["jobs"]) >= 2, f"Expected at least 2 jobs, got {len(parsed['jobs'])}"
-    assert "triggers" in parsed, "YAML must have a triggers section"
-
-    assert_yaml_has_ids(yaml_str)
-    assert_yaml_jobs_have_body(yaml_str)
-
-    assert_agent_calls(
-        response.get("meta") or {},
-        expected_agents=["planner", "workflow_agent", "job_agent"],
-        min_job_code_calls=2,
-    )
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(
-        criteria=QUALITY_CRITERIA,
-        candidate=response,
-        test_notes=__doc__,
-    )
-    assert verdict.passed, verdict.summary
diff --git a/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md b/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md
new file mode 100644
index 0000000..87e1816
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md
@@ -0,0 +1,23 @@
+---
+id: global-chat.gsheets-transform-salesforce-with-cron
+service: global_chat
+---
+
+# notes
+
+Semi-specific request: cron trigger at midnight, fetch from Google Sheets, transform, upsert to Salesforce. The Salesforce upsert step requires field mapping decisions the user hasn't provided. The planner should acknowledge the missing details rather than inventing field mappings silently.
+
+# quality_criteria
+
+- The response acknowledges that the Salesforce upsert needs field-mapping details from the user (object type, key fields, source-to-target mapping).
+- If the response generates job code or YAML, it does not silently fabricate field mappings the user did not provide.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Can you make a workflow that triggers at midnight, fetches data from Google Sheets, transforms it, and upserts it into Salesforce?
diff --git a/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.py b/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.py
deleted file mode 100644
index 9078a57..0000000
--- a/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.py
+++ /dev/null
@@ -1,35 +0,0 @@
-"""Semi-specific request: cron trigger at midnight, fetch from Google Sheets,
-transform, upsert to Salesforce. The Salesforce upsert step requires field
-mapping decisions the user hasn't provided. The planner should acknowledge
-the missing details rather than inventing field mappings silently."""
-
-from testing import judge
-from testing.payloads import build_global_chat_payload
-
-
-QUALITY_CRITERIA = [
-    "The response acknowledges that the Salesforce upsert needs field-mapping details from the user (object type, key fields, source-to-target mapping).",
-    "If the response generates job code or YAML, it does not silently fabricate field mappings the user did not provide.",
-]
-
-
-def test_gsheets_transform_salesforce_with_cron(apollo_client):
-    payload = build_global_chat_payload(
-        user_message=(
-            "Can you make a workflow that triggers at midnight, fetches data from "
-            "Google Sheets, transforms it, and upserts it into Salesforce?"
-        ),
-        history=[],
-    )
-
-    response = apollo_client.call("global_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    assert "response" in response
-    assert len(response["response"]) > 0
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md b/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md
new file mode 100644
index 0000000..e5f1263
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md
@@ -0,0 +1,22 @@
+---
+id: global-chat.http-to-salesforce-three-steps-with-job-code
+service: global_chat
+---
+
+# notes
+
+From-scratch three-step workflow (HTTP fetch, transform, Salesforce upsert) with job code for all three steps. The planner should call the workflow agent to produce a three-job workflow, then call the job code agent at least three times to fill in the bodies.
+
+# quality_criteria
+
+- Each job's body uses functions appropriate to its adaptor (HTTP get/post for the fetch step, JS for transform, Salesforce upsert for the destination).
+
+# turn
+
+## role
+
+user
+
+## content
+
+Build a workflow that can fetch records from an HTTP endpoint, transform the data, and upsert contacts to Salesforce.
diff --git a/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.py b/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.py
deleted file mode 100644
index 243f92e..0000000
--- a/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.py
+++ /dev/null
@@ -1,52 +0,0 @@
-"""From-scratch three-step HTTP→transform→Salesforce workflow with job code
-for all three steps. The planner should call the workflow agent to produce a
-three-job workflow, then call the job code agent at least three times to fill
-in the bodies."""
-
-import yaml
-
-from testing import judge
-from testing.payloads import build_global_chat_payload
-from testing.responses import assert_routed_to, assert_agent_calls, get_attachment
-from testing.yaml_assertions import assert_yaml_has_ids, assert_yaml_jobs_have_body
-
-
-QUALITY_CRITERIA = [
-    "Each job's body uses functions appropriate to its adaptor (HTTP get/post for the fetch step, JS for transform, Salesforce upsert for the destination).",
-]
-
-
-def test_http_to_salesforce_three_steps_with_job_code(apollo_client):
-    payload = build_global_chat_payload(
-        user_message=(
-            "Build a workflow that can fetch records from an HTTP endpoint, "
-            "transform the data, and upsert contacts to Salesforce."
-        ),
-        history=[],
-    )
-
-    response = apollo_client.call("global_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    assert_routed_to(response, "planner")
-
-    yaml_str = get_attachment(response, "workflow_yaml")
-    assert yaml_str, "Expected a workflow_yaml attachment"
-
-    parsed = yaml.safe_load(yaml_str)
-    assert "jobs" in parsed
-    assert len(parsed["jobs"]) >= 3, f"Expected at least 3 jobs, got {len(parsed['jobs'])}"
-    assert_yaml_has_ids(yaml_str)
-    assert_yaml_jobs_have_body(yaml_str)
-
-    assert_agent_calls(
-        response.get("meta") or {},
-        expected_agents=["planner", "workflow_agent", "job_agent"],
-        min_job_code_calls=3,
-    )
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/global_chat/tests/acceptance/test_vague_gmail_to_database.md b/services/global_chat/tests/acceptance/test_vague_gmail_to_database.md
new file mode 100644
index 0000000..698e6f8
--- /dev/null
+++ b/services/global_chat/tests/acceptance/test_vague_gmail_to_database.md
@@ -0,0 +1,23 @@
+---
+id: global-chat.vague-gmail-to-database
+service: global_chat
+---
+
+# notes
+
+Vague request: "fetch my data from gmail and send it to my database". No specifics on which gmail data, which database, how to map between them. The planner should surface the ambiguity (or ask clarifying questions) rather than silently inventing details.
+
+# quality_criteria
+
+- The response surfaces the ambiguities in the user's request (e.g. which gmail data, which database, how to authenticate) rather than silently inventing unstated requirements.
+- If the response asks clarifying questions, they are concrete and answerable, not generic.
+
+# turn
+
+## role
+
+user
+
+## content
+
+I want to fetch my data from gmail and send it to my database
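
A spec like the one above could be read with a small parser along these lines. This is a sketch only, assuming the conventions visible in these files (frontmatter between `---` lines, top-level `# ` headings as sections, `- ` bullets as criteria). `parse_spec` and `quality_criteria` are hypothetical names, and the shipped `testing.spec_collector` may work differently.

```python
def parse_spec(text):
    """Split a markdown spec into (frontmatter dict, {section name: body})."""
    lines = text.splitlines()
    meta, start = {}, 0
    if lines and lines[0].strip() == "---":
        # Frontmatter ends at the next line that is exactly '---'.
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
        for line in lines[1:end]:
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        start = end + 1

    sections, current, in_fence = {}, None, False
    for line in lines[start:]:
        if line.startswith("```"):
            in_fence = not in_fence
        # A '# ' line inside a fenced block is a code comment, not a heading.
        if not in_fence and line.startswith("# "):
            current = line[2:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return meta, {name: "\n".join(body).strip() for name, body in sections.items()}


def quality_criteria(sections):
    """Return the bullet items of the quality_criteria section, if any."""
    body = sections.get("quality_criteria", "")
    return [line[2:].strip() for line in body.splitlines() if line.startswith("- ")]


# Abbreviated copy of the spec above, for illustration only.
SPEC = """---
id: global-chat.vague-gmail-to-database
service: global_chat
---

# notes

Vague request with no specifics.

# quality_criteria

- The response surfaces the ambiguities in the user's request.
- If the response asks clarifying questions, they are concrete.

# turn

## role

user

## content

I want to fetch my data from gmail and send it to my database
"""

meta, sections = parse_spec(SPEC)
criteria = quality_criteria(sections)
```

Tracking fence state matters for the multi-turn specs, where fenced YAML blocks in the history contain `# ` comment lines that would otherwise be mistaken for section headings.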
diff --git a/services/global_chat/tests/acceptance/test_vague_gmail_to_database.py b/services/global_chat/tests/acceptance/test_vague_gmail_to_database.py
deleted file mode 100644
index d73f2ea..0000000
--- a/services/global_chat/tests/acceptance/test_vague_gmail_to_database.py
+++ /dev/null
@@ -1,32 +0,0 @@
-"""Vague request: "fetch my data from gmail and send it to my database". No
-specifics on which gmail data, which database, how to map between them. The
-planner should surface the ambiguity (or ask clarifying questions) rather
-than silently inventing details."""
-
-from testing import judge
-from testing.payloads import build_global_chat_payload
-
-
-QUALITY_CRITERIA = [
-    "The response surfaces the ambiguities in the user's request (e.g. which gmail data, which database, how to authenticate) rather than silently inventing unstated requirements.",
-    "If the response asks clarifying questions, they are concrete and answerable, not generic.",
-]
-
-
-def test_vague_gmail_to_database(apollo_client):
-    payload = build_global_chat_payload(
-        user_message="I want to fetch my data from gmail and send it to my database",
-        history=[],
-    )
-
-    response = apollo_client.call("global_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    assert "response" in response, "Expected a text response"
-    assert len(response["response"]) > 0, "Expected non-empty response"
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_adaptor_context_switching.md b/services/job_chat/tests/acceptance/test_adaptor_context_switching.md
new file mode 100644
index 0000000..009b801
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_adaptor_context_switching.md
@@ -0,0 +1,74 @@
+---
+id: job-chat.adaptor-context-switching
+service: job_chat
+---
+
+# notes
+
+User was on a Salesforce job page and asked "How do I get data?"; the assistant answered with SOQL. The user has now navigated to a DHIS2 job page and asks the same question again. The model should notice the page-prefix change and switch context to DHIS2-specific guidance.
+
+# quality_criteria
+
+- The response is specifically about fetching data from DHIS2 — not from Salesforce.
+- The response references DHIS2 concepts (tracker, data values, events, programs, etc.) rather than SOQL or SQL.
+- The response does not assume the previous Salesforce context still applies.
+
+# settings
+
+## context.expression
+
+```js
+fn(state => {
+  return state;
+});
+```
+
+## context.adaptor
+
+@openfn/language-dhis2@8.0.7
+
+## context.page_name
+
+fetch-data
+
+## suggest_code
+
+false
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+[pg:job_code/fetch-records/salesforce@9.0.3] How do I get data?
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+To get data from Salesforce, you can use the `query()` operation with SOQL (Salesforce Object Query Language). For example:
+
+```js
+query("SELECT Id, Name FROM Account WHERE Status = 'Active'");
+```
+
+This will fetch records from Salesforce and store them in `state.data`.
+
+# turn
+
+## role
+
+user
+
+## content
+
+How do I get data?
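
The page prefix this spec depends on can be illustrated with a small helper. The format is inferred from the fixture alone (`[pg:job_code/fetch-records/salesforce@9.0.3]` in the history, and the `[pg:job_code/fetch-data/dhis2@8.0.7]` prefix the original Python test asserted on). `page_prefix` is a hypothetical name, and how the service actually builds the prefix is an assumption.

```python
def page_prefix(page_name, adaptor):
    """Build a '[pg:job_code/<page>/<adaptor>@<version>]' prefix (hypothetical).

    Assumes adaptor specifiers look like '@openfn/language-dhis2@8.0.7'.
    """
    short = adaptor.rsplit("/", 1)[-1]  # e.g. 'language-dhis2@8.0.7'
    if short.startswith("language-"):
        short = short[len("language-"):]  # e.g. 'dhis2@8.0.7'
    return f"[pg:job_code/{page_name}/{short}]"


previous = page_prefix("fetch-records", "@openfn/language-salesforce@9.0.3")
current = page_prefix("fetch-data", "@openfn/language-dhis2@8.0.7")
```

The two prefixes differ in both page name and adaptor, which is the signal the model is expected to pick up when the same question is asked again.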
diff --git a/services/job_chat/tests/acceptance/test_adaptor_context_switching.py b/services/job_chat/tests/acceptance/test_adaptor_context_switching.py
deleted file mode 100644
index 7e94174..0000000
--- a/services/job_chat/tests/acceptance/test_adaptor_context_switching.py
+++ /dev/null
@@ -1,60 +0,0 @@
-"""User was on a Salesforce job page and asked "How do I get data?"; the
-assistant answered with SOQL. The user has now navigated to a DHIS2 job page
-and asks the same question again. The model should notice the page-prefix
-change and switch context to DHIS2-specific guidance."""
-
-from testing import judge
-from testing.payloads import build_job_chat_payload
-from testing.responses import latest_user_message
-
-
-QUALITY_CRITERIA = [
-    "The response is specifically about fetching data from DHIS2 — not from Salesforce.",
-    "The response references DHIS2 concepts (tracker, data values, events, programs, etc.) rather than SOQL or SQL.",
-    "The response does not assume the previous Salesforce context still applies.",
-]
-
-
-def test_adaptor_context_switching(apollo_client):
-    payload = build_job_chat_payload(
-        user_message="How do I get data?",
-        history=[
-            {"role": "user", "content": "[pg:job_code/fetch-records/salesforce@9.0.3] How do I get data?"},
-            {"role": "assistant", "content": (
-                "To get data from Salesforce, you can use the `query()` operation with SOQL "
-                "(Salesforce Object Query Language). For example:\n\n"
-                "```js\nquery('SELECT Id, Name FROM Account WHERE Status = \"Active\"');\n```\n\n"
-                "This will fetch records from Salesforce and store them in `state.data`."
-            )},
-        ],
-        current_job_code="fn(state => {\n  return state;\n});",
-        current_adaptor="@openfn/language-dhis2@8.0.7",
-        current_page="fetch-data",
-        suggest_code=False,
-    )
-
-    response = apollo_client.call("job_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert "response" in response
-
-    # Page prefix was applied to the latest user turn in updated history
-    latest_user = latest_user_message(response)
-    assert latest_user is not None, "Expected at least one user message in updated history"
-    assert "[pg:job_code/fetch-data/dhis2@8.0.7]" in latest_user["content"], (
-        f"Expected DHIS2 page prefix in latest user message, got: {latest_user['content'][:200]}"
-    )
-
-    # DHIS2 mentioned in response text
-    assert "dhis" in response["response"].lower(), (
-        f"Expected DHIS2 to be mentioned in response. Got: {response['response'][:300]}"
-    )
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(
-        criteria=QUALITY_CRITERIA,
-        candidate=response,
-        test_notes=__doc__,
-    )
-    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_basic_input.md b/services/job_chat/tests/acceptance/test_basic_input.md
new file mode 100644
index 0000000..c2064c0
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_basic_input.md
@@ -0,0 +1,49 @@
+---
+id: job-chat.basic-input
+service: job_chat
+---
+
+# notes
+
+Basic input: a simple job-code modification request. The service should return a response containing both a text answer and a `suggested_code` patch when `suggest_code` is true.
+
+# settings
+
+## context.expression
+
+```js
+// Get data from external API
+get('https://api.example.com/data');
+
+// Process and transform data
+fn(state => {
+  const transformed = state.data.map(item => ({
+    id: item.id,
+    name: item.full_name,
+    status: item.active ? 'Active' : 'Inactive'
+  }));
+
+  return { ...state, transformed };
+});
+
+// Send transformed data to destination
+post('https://destination.org/upload', state => state.transformed);
+```
+
+## context.adaptor
+
+@openfn/language-gmail@2.0.2
+
+## suggest_code
+
+true
+
+# turn
+
+## role
+
+user
+
+## content
+
+Can you add error handling to this job that will log the error message and retry the operation once if the API call fails?
diff --git a/services/job_chat/tests/acceptance/test_basic_input.py b/services/job_chat/tests/acceptance/test_basic_input.py
deleted file mode 100644
index 06d5efc..0000000
--- a/services/job_chat/tests/acceptance/test_basic_input.py
+++ /dev/null
@@ -1,51 +0,0 @@
-"""Basic input: a simple job-code modification request. The service should
-return a response with both a text answer and a suggested_code patch when
-suggest_code=True is set."""
-
-from testing import judge
-from testing.payloads import build_job_chat_payload
-
-
-QUALITY_CRITERIA = []  # mostly structural; relies on universal rules + general flags
-
-
-JOB_CODE = """// Get data from external API
-get('https://api.example.com/data');
-
-// Process and transform data
-fn(state => {
-  const transformed = state.data.map(item => ({
-    id: item.id,
-    name: item.full_name,
-    status: item.active ? 'Active' : 'Inactive'
-  }));
-
-  return { ...state, transformed };
-});
-
-// Send transformed data to destination
-post('https://destination.org/upload', state => state.transformed);"""
-
-
-def test_basic_input(apollo_client):
-    payload = build_job_chat_payload(
-        user_message=(
-            "Can you add error handling to this job that will log the error message "
-            "and retry the operation once if the API call fails?"
-        ),
-        history=[],
-        current_job_code=JOB_CODE,
-        current_adaptor="@openfn/language-gmail@2.0.2",
-        suggest_code=True,
-    )
-
-    response = apollo_client.call("job_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_contextualised_input.md b/services/job_chat/tests/acceptance/test_contextualised_input.md
new file mode 100644
index 0000000..248401d
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_contextualised_input.md
@@ -0,0 +1,112 @@
+---
+id: job-chat.contextualised-input
+service: job_chat
+---
+
+# notes
+
+All payload fields populated: history, job code, multiple adaptors, ids, and pre-injected RAG search results in meta. The service should pick up on the RAG hints (HTTP adaptor error handling, retry logic) and produce a suggested_code that differs from the original.
+
+# quality_criteria
+
+- The suggested_code is meaningfully modified from the original to add error handling and retry logic, drawing on the pre-injected RAG search results rather than ignoring them.
+
+# settings
+
+## context.expression
+
+```js
+// Get data from external API
+get('https://api.example.com/data');
+
+// Process and transform data
+fn(state => {
+  const transformed = state.data.map(item => ({
+    id: item.id,
+    name: item.full_name,
+    status: item.active ? 'Active' : 'Inactive'
+  }));
+
+  return { ...state, transformed };
+});
+
+// Send transformed data to destination
+post('https://destination.org/upload', state => state.transformed);
+```
+
+## context.adaptor
+
+@openfn/language-fhir-4@0.1.10
+
+## context.adaptors
+
+```json
+["@openfn/language-http", "@openfn/language-common"]
+```
+
+## context.projectId
+
+project-xyz789
+
+## context.jobId
+
+job-abc123
+
+## meta.rag
+
+```json
+{
+  "search_queries": [
+    "http adaptor error handling",
+    "openfn retry logic"
+  ],
+  "search_results": [
+    {
+      "title": "HTTP Adaptor Error Handling",
+      "url": "https://docs.openfn.org/adaptors/http#error-handling",
+      "content": "The HTTP adaptor provides mechanisms for handling connection errors and retrying failed requests. Use the maxRetries option to specify retry attempts."
+    },
+    {
+      "title": "Common Adaptor Documentation",
+      "url": "https://docs.openfn.org/adaptors/common#error-handling",
+      "content": "Error handling can be implemented using standard JavaScript try/catch blocks or with the withError helper function."
+    }
+  ]
+}
+```
+
+## suggest_code
+
+true
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+I need to add error handling to my API integration job. What's the best approach?
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+There are several approaches to handling errors in API calls. You can use try/catch blocks, implement retry logic, or use built-in error handling functions. Could you share your current job code so I can provide specific recommendations?
+
+# turn
+
+## role
+
+user
+
+## content
+
+Can you add error handling to this job that will log the error message and retry the operation once if the API call fails?
diff --git a/services/job_chat/tests/acceptance/test_contextualised_input.py b/services/job_chat/tests/acceptance/test_contextualised_input.py
deleted file mode 100644
index 303eefc..0000000
--- a/services/job_chat/tests/acceptance/test_contextualised_input.py
+++ /dev/null
@@ -1,94 +0,0 @@
-"""All payload fields populated: history, job code, multiple adaptors, ids,
-and pre-injected RAG search results in meta. The service should pick up on
-the RAG hints (HTTP adaptor error handling, retry logic) and produce a
-suggested_code that differs from the original."""
-
-from testing import judge
-from testing.payloads import build_job_chat_payload
-
-
-QUALITY_CRITERIA = [
-    "The response or suggested_code references the pre-injected RAG search results (HTTP adaptor error handling and retry logic) rather than ignoring them.",
-]
-
-
-JOB_CODE = """// Get data from external API
-get('https://api.example.com/data');
-
-// Process and transform data
-fn(state => {
-  const transformed = state.data.map(item => ({
-    id: item.id,
-    name: item.full_name,
-    status: item.active ? 'Active' : 'Inactive'
-  }));
-
-  return { ...state, transformed };
-});
-
-// Send transformed data to destination
-post('https://destination.org/upload', state => state.transformed);"""
-
-
-def test_contextualised_input(apollo_client):
-    payload = build_job_chat_payload(
-        user_message=(
-            "Can you add error handling to this job that will log the error message "
-            "and retry the operation once if the API call fails?"
-        ),
-        history=[
-            {
-                "role": "user",
-                "content": "I need to add error handling to my API integration job. What's the best approach?",
-            },
-            {
-                "role": "assistant",
-                "content": (
-                    "There are several approaches to handling errors in API calls. You can use "
-                    "try/catch blocks, implement retry logic, or use built-in error handling "
-                    "functions. Could you share your current job code so I can provide specific "
-                    "recommendations?"
-                ),
-            },
-        ],
-        current_job_code=JOB_CODE,
-        current_adaptor="@openfn/language-fhir-4@0.1.10",
-        project_adaptors=["@openfn/language-http", "@openfn/language-common"],
-        job_id="job-abc123",
-        project_id="project-xyz789",
-        rag_results=[
-            {
-                "title": "HTTP Adaptor Error Handling",
-                "url": "https://docs.openfn.org/adaptors/http#error-handling",
-                "content": (
-                    "The HTTP adaptor provides mechanisms for handling connection errors and "
-                    "retrying failed requests. Use the maxRetries option to specify retry attempts."
-                ),
-            },
-            {
-                "title": "Common Adaptor Documentation",
-                "url": "https://docs.openfn.org/adaptors/common#error-handling",
-                "content": (
-                    "Error handling can be implemented using standard JavaScript try/catch "
-                    "blocks or with the withError helper function."
-                ),
-            },
-        ],
-        rag_queries=["http adaptor error handling", "openfn retry logic"],
-        suggest_code=True,
-    )
-
-    response = apollo_client.call("job_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-    assert "meta" in response
-    assert "usage" in response
-    assert response["suggested_code"] is not None, "suggested_code should not be None"
-    assert response["suggested_code"] != JOB_CODE, "suggested_code should differ from the original"
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_duplicate_sections.md b/services/job_chat/tests/acceptance/test_duplicate_sections.md
new file mode 100644
index 0000000..f6900e8
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_duplicate_sections.md
@@ -0,0 +1,88 @@
+---
+id: job-chat.duplicate-sections
+service: job_chat
+---
+
+# notes
+
+Code contains three identical line_item creation blocks. The user asks for validation added only to the SECOND one. The service should apply the change to the right block — not all three, not the wrong one — using enough surrounding context to disambiguate.
+
+# quality_criteria
+
+- The change is applied to the second line_item creation only — the first and third remain unchanged.
+- The applied change adds a check that Barcode__c is not empty before creating the record.
+
+# settings
+
+## context.expression
+
+```js
+each(
+  dataPath('data[*]'),
+  combine(
+    create(
+      'transaction__c',
+      fields(
+        field('Transaction_Date__c', dataValue('today')),
+        relationship(
+          'Person_Responsible__r',
+          'Staff_ID_Code__c',
+          dataValue('person_code')
+        ),
+        field('metainstanceid__c', dataValue('*meta-instance-id*'))
+      )
+    ),
+    each(
+      merge(
+        dataPath('line_items[*]'),
+        fields(
+          field('end', dataValue('time_end')),
+          field('parentId', lastReferenceValue('id'))
+        )
+      ),
+      create(
+        'line_item__c',
+        fields(
+          field('transaction__c', dataValue('parentId')),
+          field('Barcode__c', dataValue('product_barcode')),
+          field('ODK_Form_Completed__c', dataValue('end'))
+        )
+      ),
+      create(
+        'line_item__c',
+        fields(
+          field('transaction__c', dataValue('parentId')),
+          field('Barcode__c', dataValue('product_barcode')),
+          field('ODK_Form_Completed__c', dataValue('end'))
+        )
+      ),
+      create(
+        'line_item__c',
+        fields(
+          field('transaction__c', dataValue('parentId')),
+          field('Barcode__c', dataValue('product_barcode')),
+          field('ODK_Form_Completed__c', dataValue('end'))
+        )
+      )
+    )
+  )
+);
+```
+
+## context.adaptor
+
+@openfn/language-dhis2@8.0.1
+
+## suggest_code
+
+true
+
+# turn
+
+## role
+
+user
+
+## content
+
+In this job, I want to add a validation step only for the second line item creation, to check if the Barcode__c is not empty before creating the record. How can I do that?
diff --git a/services/job_chat/tests/acceptance/test_duplicate_sections.py b/services/job_chat/tests/acceptance/test_duplicate_sections.py
deleted file mode 100644
index a06b36c..0000000
--- a/services/job_chat/tests/acceptance/test_duplicate_sections.py
+++ /dev/null
@@ -1,92 +0,0 @@
-"""Code contains three identical line_item creation blocks. The user asks
-for validation added only to the SECOND one. The service should apply the
-change to the right block — not all three, not the wrong one — using enough
-surrounding context to disambiguate."""
-
-from testing import judge
-from testing.payloads import build_job_chat_payload
-
-
-QUALITY_CRITERIA = [
-    "The change is applied to the second line_item creation only — the first and third remain unchanged.",
-    "The applied change adds a check that Barcode__c is not empty before creating the record.",
-]
-
-
-JOB_CODE = """each(
-  dataPath('data[*]'),
-  combine(
-    create(
-      'transaction__c',
-      fields(
-        field('Transaction_Date__c', dataValue('today')),
-        relationship(
-          'Person_Responsible__r',
-          'Staff_ID_Code__c',
-          dataValue('person_code')
-        ),
-        field('metainstanceid__c', dataValue('*meta-instance-id*'))
-      )
-    ),
-    each(
-      merge(
-        dataPath('line_items[*]'),
-        fields(
-          field('end', dataValue('time_end')),
-          field('parentId', lastReferenceValue('id'))
-        )
-      ),
-      create(
-        'line_item__c',
-        fields(
-          field('transaction__c', dataValue('parentId')),
-          field('Barcode__c', dataValue('product_barcode')),
-          field('ODK_Form_Completed__c', dataValue('end'))
-        )
-      ),
-      create(
-        'line_item__c',
-        fields(
-          field('transaction__c', dataValue('parentId')),
-          field('Barcode__c', dataValue('product_barcode')),
-          field('ODK_Form_Completed__c', dataValue('end'))
-        )
-      ),
-      create(
-        'line_item__c',
-        fields(
-          field('transaction__c', dataValue('parentId')),
-          field('Barcode__c', dataValue('product_barcode')),
-          field('ODK_Form_Completed__c', dataValue('end'))
-        )
-      )
-    )
-  )
-);"""
-
-
-def test_duplicate_sections(apollo_client):
-    payload = build_job_chat_payload(
-        user_message=(
-            "In this job, I want to add a validation step only for the second "
-            "line item creation, to check if the Barcode__c is not empty before "
-            "creating the record. How can I do that?"
-        ),
-        history=[],
-        current_job_code=JOB_CODE,
-        current_adaptor="@openfn/language-dhis2@8.0.1",
-        suggest_code=True,
-    )
-
-    response = apollo_client.call("job_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-    assert response["suggested_code"] is not None
-    assert response["suggested_code"] != JOB_CODE, "suggested_code should differ from the original"
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_duplicate_sections_additional.md b/services/job_chat/tests/acceptance/test_duplicate_sections_additional.md
new file mode 100644
index 0000000..0565f5c
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_duplicate_sections_additional.md
@@ -0,0 +1,60 @@
+---
+id: job-chat.duplicate-sections-additional
+service: job_chat
+---
+
+# notes
+
+Six identical POST calls. The user asks for retry-on-failure error handling on the THIRD one only. The service must use enough context to identify the right call, apply the change only there, and not accidentally drop any of the other five.
+
+# quality_criteria
+
+- Error handling is added only to the third POST call — the others remain unchanged.
+- All six POST calls are still present in the suggested code (none accidentally removed).
+
+# settings
+
+## context.expression
+
+```js
+// Process and prepare data
+fn(state => {
+  const items = state.data.items.map(item => ({
+    id: item.id,
+    name: item.name,
+    status: 'pending'
+  }));
+
+  return { ...state, items };
+});
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+
+post('https://api.example.com/endpoint', state => state.items);
+```
+
+## context.adaptor
+
+@openfn/language-mailchimp@1.0.19
+
+## suggest_code
+
+true
+
+# turn
+
+## role
+
+user
+
+## content
+
+I need to add error handling only to the third POST request to retry once if it fails.
diff --git a/services/job_chat/tests/acceptance/test_duplicate_sections_additional.py b/services/job_chat/tests/acceptance/test_duplicate_sections_additional.py
deleted file mode 100644
index 0f6842f..0000000
--- a/services/job_chat/tests/acceptance/test_duplicate_sections_additional.py
+++ /dev/null
@@ -1,60 +0,0 @@
-"""Six identical POST calls. The user asks for retry-on-failure error handling
-on the THIRD one only. The service must use enough context to identify the
-right call, apply the change only there, and not accidentally drop any of
-the other five."""
-
-from testing import judge
-from testing.payloads import build_job_chat_payload
-
-
-QUALITY_CRITERIA = [
-    "Error handling is added only to the third POST call — the others remain unchanged.",
-    "All six POST calls are still present in the suggested code (none accidentally removed).",
-]
-
-
-JOB_CODE = """// Process and prepare data
-fn(state => {
-  const items = state.data.items.map(item => ({
-    id: item.id,
-    name: item.name,
-    status: 'pending'
-  }));
-
-  return { ...state, items };
-});
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);
-
-post('https://api.example.com/endpoint', state => state.items);"""
-
-
-def test_duplicate_sections_additional(apollo_client):
-    payload = build_job_chat_payload(
-        user_message="I need to add error handling only to the third POST request to retry once if it fails.",
-        history=[],
-        current_job_code=JOB_CODE,
-        current_adaptor="@openfn/language-mailchimp@1.0.19",
-        suggest_code=True,
-    )
-
-    response = apollo_client.call("job_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-    assert response["suggested_code"] is not None
-    assert response["suggested_code"] != JOB_CODE, "suggested_code should differ from the original"
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.md b/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.md
new file mode 100644
index 0000000..9c2c821
--- /dev/null
+++ b/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.md
@@ -0,0 +1,104 @@
+---
+id: job-chat.navigation-workflow-to-job
+service: job_chat
+---
+
+# notes
+
+User was on a workflow editor discussing workflow structure, then navigated to a job editor and asked an abrupt question about the current code. The model should recognise the navigation (via meta.last_page) and respond about the job code, not continue talking about workflow structure.
+
+# quality_criteria
+
+- The response is about the current job code (the patient mapping), not about workflow structure.
+- The suggested_code includes a log statement at the start of the job body (e.g. `console.log(...)`).
+
+# settings
+
+## context.expression
+
+```js
+fn(state => {
+  const patients = state.data.map(patient => ({
+    id: patient.patient_id,
+    name: patient.full_name,
+    dob: patient.date_of_birth
+  }));
+
+  return { ...state, patients };
+});
+
+post('https://destination.api/patients', state => state.patients);
+```
+
+## context.adaptor
+
+@openfn/language-common@latest
+
+## context.page_name
+
+map-patient-data
+
+## meta.last_page
+
+```json
+{
+  "type": "workflow",
+  "name": "patient-sync"
+}
+```
+
+## suggest_code
+
+true
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+[pg:workflow/patient-sync] Create a workflow to sync patient data from source to destination
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll create a workflow with jobs to fetch patient data, transform it, and sync to the destination system.
+
+## turn
+
+### role
+
+user
+
+### content
+
+[pg:workflow/patient-sync] Add validation between fetch and transform
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll add a validation job that checks the patient data before transformation.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Add a log statement at the start
diff --git a/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.py b/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.py
deleted file mode 100644
index a50dba3..0000000
--- a/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.py
+++ /dev/null
@@ -1,65 +0,0 @@
-"""User was on a workflow editor discussing workflow structure, then navigated
-to a job editor and asked an abrupt question about the current code. The model
-should recognise the navigation (via meta.last_page) and respond about the
-job code, not continue talking about workflow structure."""
-
-from testing import judge
-from testing.payloads import build_job_chat_payload
-
-
-QUALITY_CRITERIA = [
-    "The response is about the current job code (the patient mapping), not about workflow structure.",
-]
-
-
-JOB_CODE = """fn(state => {
-  const patients = state.data.map(patient => ({
-    id: patient.patient_id,
-    name: patient.full_name,
-    dob: patient.date_of_birth
-  }));
-
-  return { ...state, patients };
-});
-
-post('https://destination.api/patients', state => state.patients);"""
-
-
-def test_navigation_workflow_to_job(apollo_client):
-    payload = build_job_chat_payload(
-        user_message="Add a log statement at the start",
-        history=[
-            {"role": "user", "content": "[pg:workflow/patient-sync] Create a workflow to sync patient data from source to destination"},
-            {"role": "assistant", "content": "I'll create a workflow with jobs to fetch patient data, transform it, and sync to the destination system."},
-            {"role": "user", "content": "[pg:workflow/patient-sync] Add validation between fetch and transform"},
-            {"role": "assistant", "content": "I'll add a validation job that checks the patient data before transformation."},
-        ],
-        current_job_code=JOB_CODE,
-        current_adaptor="@openfn/language-common@latest",
-        current_page="map-patient-data",
-        previous_page={"type": "workflow", "name": "patient-sync"},
-        suggest_code=True,
-    )
-
-    response = apollo_client.call("job_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert "response" in response
-    assert "suggested_code" in response
-    assert response["suggested_code"] is not None, "Model should have generated code for the job"
-
-    # Log statement was added
-    assert "console.log" in response["suggested_code"], (
-        f"Log statement not found in suggested code: {response['suggested_code'][:300]}"
-    )
-
-    # Response text is about job code, not workflow structure
-    response_text = response["response"].lower()
-    assert not any(word in response_text for word in ["workflow", "yaml", "trigger", "edge"]), (
-        f"Response should be about job code, not workflow structure. Response: {response_text[:300]}"
-    )
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/testing/judge_rules.md b/services/testing/judge_rules.md
index e69de29..cf017ee 100644
--- a/services/testing/judge_rules.md
+++ b/services/testing/judge_rules.md
@@ -0,0 +1,6 @@
+- Every job, trigger, and edge in a returned workflow YAML has a non-empty `id` field.
+- Every job in a returned workflow YAML has a `body` that is either real adaptor code or the canonical empty-job placeholder `// Add operations here`. Reject other placeholder-style markers such as `// PLACEHOLDER`, numbered placeholders, `TODO`, `FIXME`, or `<insert ... here>` — these are leftover generation artifacts.
+- Job names and edge source/target/key references in a returned workflow YAML use only letters, numbers, spaces, hyphens, and underscores.
+- When the user is editing an existing workflow, every job and edge from the existing YAML is present and unchanged in the response unless the user asked to remove or modify it. Additions are fine.
+- Any returned YAML parses as valid YAML.
+- If a criterion expects a specific concrete output (e.g. a workflow with particular jobs, code using specific functions) but the model instead asks the user a reasonable clarifying question to disambiguate the request, treat the criterion as satisfied. Asking for more information is valid behaviour. Exception: when the test notes or a criterion explicitly evaluate the model's decision about when to act versus when to ask, grade strictly.
diff --git a/services/testing/spec_collector.py b/services/testing/spec_collector.py
new file mode 100644
index 0000000..1c46ab1
--- /dev/null
+++ b/services/testing/spec_collector.py
@@ -0,0 +1,84 @@
+"""Pytest plugin: collect acceptance test markdown specs and run them.
+
+Detects `*.md` files (excluding `_*.md`) directly inside any `acceptance/` folder under
+the configured testpaths. Each spec becomes one or more pytest items
+(`spec.runs` defaults to 1; multi-run specs yield `<id>[run=N]` items).
+
+Each item:
+  1. Builds the service payload from `# settings`, `# history`, and `# turn`.
+  2. Dispatches to the named service via ApolloClient.
+  3. Calls `judge.evaluate()` with the spec's quality_criteria.
+  4. Fails with the judge's reasoning summary if `verdict.passed` is False.
+"""
+
+import pytest
+
+from testing import judge
+from testing.apollo_client import ApolloClient
+from testing.spec_parser import Spec, parse_spec
+
+
+def pytest_collect_file(parent, file_path):
+    if (
+        file_path.suffix == ".md"
+        and not file_path.name.startswith("_")
+        and file_path.parent.name == "acceptance"
+    ):
+        return SpecFile.from_parent(parent, path=file_path)
+
+
+class SpecFile(pytest.File):
+    def collect(self):
+        spec = parse_spec(self.path)
+        for run_index in range(spec.runs):
+            suffix = f"[run={run_index}]" if spec.runs > 1 else ""
+            yield SpecItem.from_parent(
+                self,
+                name=f"{spec.id}{suffix}",
+                spec=spec,
+                run_index=run_index,
+            )
+
+
+class SpecItem(pytest.Item):
+    def __init__(self, *, name, parent, spec: Spec, run_index: int):
+        super().__init__(name, parent)
+        self.spec = spec
+        self.run_index = run_index
+
+    def runtest(self):
+        payload = _build_payload(self.spec)
+        client = ApolloClient()
+        response = client.call(self.spec.service, payload)
+
+        verdict = judge.evaluate(
+            criteria=self.spec.quality_criteria,
+            candidate=response,
+            test_notes=self.spec.notes or None,
+        )
+
+        if not verdict.passed:
+            raise AssertionError(verdict.summary)
+
+    def repr_failure(self, excinfo, style=None):
+        return str(excinfo.value)
+
+    def reportinfo(self):
+        return self.path, 0, f"acceptance: {self.spec.id}"
+
+
+def _build_payload(spec: Spec) -> dict:
+    """Assemble the JSON payload from a parsed spec.
+
+    Settings form the base of the payload. History (if present) goes into the `history` key.
+    The current turn's content (if present and role=user) goes into `content`.
+    """
+    payload = dict(spec.settings)
+
+    if spec.history:
+        payload["history"] = spec.history
+
+    if spec.current_turn and spec.current_turn.get("role") == "user":
+        payload["content"] = spec.current_turn["content"]
+
+    return payload
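Review note on `_build_payload` above: a minimal, standalone sketch of the payload it would produce for a spec shaped like `test_basic_input.md`. `SpecSketch` and `build_payload` are hypothetical stand-ins, not the shipped `Spec` or `_build_payload`, and the sketch assumes the parser has already turned `## context.expression` style headings into a nested `context` dict, as the `spec_parser.py` docstring describes.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SpecSketch:
    """Hypothetical stand-in for testing.spec_parser.Spec (payload fields only)."""
    settings: dict[str, Any] = field(default_factory=dict)
    history: list[dict[str, str]] = field(default_factory=list)
    current_turn: Optional[dict[str, str]] = None


def build_payload(spec: SpecSketch) -> dict:
    # Same three steps as _build_payload: settings, then history, then the user turn.
    payload = dict(spec.settings)  # shallow copy; nested dicts are shared with the spec
    if spec.history:
        payload["history"] = spec.history
    if spec.current_turn and spec.current_turn.get("role") == "user":
        payload["content"] = spec.current_turn["content"]
    return payload


# Assumed parser output for a spec with no `# history` section.
basic = SpecSketch(
    settings={
        "context": {
            "expression": "get('https://api.example.com/data');",
            "adaptor": "@openfn/language-gmail@2.0.2",
        },
        "suggest_code": True,
    },
    current_turn={"role": "user", "content": "Can you add error handling to this job?"},
)
payload = build_payload(basic)

# A current turn whose role is not "user" contributes no `content` key.
assistant_only = build_payload(
    SpecSketch(current_turn={"role": "assistant", "content": "ignored"})
)
```

Under these assumptions an empty history leaves the `history` key out of the payload entirely instead of sending an empty list, which may matter if a service distinguishes the two cases.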
diff --git a/services/testing/spec_parser.py b/services/testing/spec_parser.py
new file mode 100644
index 0000000..ebb4daa
--- /dev/null
+++ b/services/testing/spec_parser.py
@@ -0,0 +1,240 @@
+"""Parse acceptance test markdown specs into Spec dataclasses.
+
+Spec format (see `agent-team-architecture-plan/example-acceptance-md-spec.md`):
+
+    ---
+    id: ...
+    service: ...
+    runs: 1            # optional, default 1
+    ---
+
+    # notes
+    prose description
+
+    # quality_criteria
+    - criterion 1
+    - criterion 2
+
+    # settings
+    ## page
+    workflows/...
+    ## context.expression
+    ```js
+    some code
+    ```
+    ## suggest_code
+    true
+
+    # history
+    ## turn
+    ### role
+    user
+    ### content
+    message
+    ## turn
+    ### role
+    assistant
+    ### content
+    reply
+
+    # turn
+    ## role
+    user
+    ## content
+    current message
+
+Section semantics:
+  - frontmatter: `id` (defaults to filename stem), `service` (required), `runs` (int, default 1).
+  - `# notes`: free-form prose, used as `test_notes` for the judge.
+  - `# quality_criteria`: bullet list, each `-` is one criterion.
+  - `# settings`: each `## key.path` becomes a nested dict entry. JSON code fences are parsed;
+    yaml/js/plain fences are kept as strings; `true`/`false` plain text becomes a bool.
+  - `# history`: repeated `## turn` blocks, each with `### role` and `### content`.
+  - `# turn`: the single current user message under test. Has `## role` and `## content`.
+    May be omitted (e.g. workflow_chat error-correction tests where `errors` is the trigger).
+"""
+
+import json
+import re
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, Optional
+
+import yaml
+
+
+@dataclass
+class Spec:
+    id: str
+    service: str
+    runs: int = 1
+    notes: str = ""
+    quality_criteria: list[str] = field(default_factory=list)
+    settings: dict[str, Any] = field(default_factory=dict)
+    history: list[dict[str, str]] = field(default_factory=list)
+    current_turn: Optional[dict[str, str]] = None
+    path: Optional[Path] = None
+
+
+def parse_spec(path: Path) -> Spec:
+    text = path.read_text()
+    frontmatter, body = _split_frontmatter(text)
+    sections = _split_headers(body, level=1)
+
+    return Spec(
+        id=frontmatter.get("id") or path.stem,
+        service=frontmatter["service"],
+        runs=int(frontmatter.get("runs", 1)),
+        notes=sections.get("notes", "").strip(),
+        quality_criteria=_parse_bullets(sections.get("quality_criteria", "")),
+        settings=_parse_settings(sections.get("settings", "")),
+        history=_parse_history(sections.get("history", "")),
+        current_turn=_parse_role_content(sections.get("turn", ""), level=2),
+        path=path,
+    )
+
+
+def _split_frontmatter(text: str) -> tuple[dict, str]:
+    if not text.startswith("---\n"):
+        return {}, text
+    end = text.find("\n---", 4)
+    if end < 0:
+        return {}, text
+    frontmatter = yaml.safe_load(text[4:end]) or {}
+    body = text[end + 4:].lstrip("\n")
+    return frontmatter, body
+
+
+def _split_headers(text: str, *, level: int) -> dict[str, str]:
+    """Split markdown text into {name: content} on headers at the given level.
+
+    Respects fenced code blocks — `#` lines inside ``` ... ``` are NOT treated as headers.
+    """
+    prefix = "#" * level + " "
+    prefix_deeper = "#" * (level + 1)
+    sections: dict[str, str] = {}
+    current_name: Optional[str] = None
+    current_lines: list[str] = []
+    in_fence = False
+
+    for line in text.splitlines():
+        if line.lstrip().startswith("```"):
+            in_fence = not in_fence
+            current_lines.append(line)
+            continue
+
+        if (
+            not in_fence
+            and line.startswith(prefix)
+            and not line.startswith(prefix_deeper + " ")
+        ):
+            if current_name is not None:
+                sections[current_name] = "\n".join(current_lines).rstrip()
+            current_name = line[len(prefix):].strip().lower()
+            current_lines = []
+        else:
+            current_lines.append(line)
+
+    if current_name is not None:
+        sections[current_name] = "\n".join(current_lines).rstrip()
+    return sections
+
+
+def _parse_bullets(text: str) -> list[str]:
+    bullets = []
+    for line in text.splitlines():
+        m = re.match(r"^- +(.*)$", line.rstrip())
+        if m:
+            bullets.append(m.group(1).strip())
+    return bullets
+
+
+def _parse_settings(text: str) -> dict[str, Any]:
+    sub_sections = _split_headers(text, level=2)
+    settings: dict[str, Any] = {}
+    for key, value_text in sub_sections.items():
+        _set_dotted(settings, key, _parse_value(value_text))
+    return settings
+
+
+def _parse_value(text: str) -> Any:
+    text = text.strip()
+    if not text:
+        return ""
+
+    fence = re.match(r"^```(\w*)\s*\n(.*?)\n```\s*$", text, re.DOTALL)
+    if fence:
+        lang = fence.group(1).lower()
+        content = fence.group(2)
+        if lang == "json":
+            return json.loads(content)
+        return content
+
+    lower = text.lower()
+    if lower == "true":
+        return True
+    if lower == "false":
+        return False
+    return text
+
+
+def _set_dotted(target: dict, dotted: str, value: Any) -> None:
+    parts = dotted.split(".")
+    cur = target
+    for p in parts[:-1]:
+        if not isinstance(cur.get(p), dict):
+            cur[p] = {}
+        cur = cur[p]
+    cur[parts[-1]] = value
+
+
+def _parse_history(text: str) -> list[dict[str, str]]:
+    blocks = _split_repeated_header(text, level=2, name="turn")
+    turns = []
+    for block in blocks:
+        turn = _parse_role_content(block, level=3)
+        if turn:
+            turns.append(turn)
+    return turns
+
+
+def _split_repeated_header(text: str, *, level: int, name: str) -> list[str]:
+    """Split text on repeated headers like `## turn` (case-insensitive). Returns the
+    body following each header, respecting fenced code blocks."""
+    target = ("#" * level + " " + name).lower()
+    blocks: list[str] = []
+    current: list[str] = []
+    started = False
+    in_fence = False
+
+    for line in text.splitlines():
+        if line.lstrip().startswith("```"):
+            in_fence = not in_fence
+            if started:
+                current.append(line)
+            continue
+
+        if not in_fence and line.strip().lower() == target:
+            if started:
+                blocks.append("\n".join(current))
+            current = []
+            started = True
+            continue
+
+        if started:
+            current.append(line)
+
+    if started:
+        blocks.append("\n".join(current))
+    return blocks
+
+
+def _parse_role_content(text: str, *, level: int) -> Optional[dict[str, str]]:
+    if not text.strip():
+        return None
+    sub = _split_headers(text, level=level)
+    role = sub.get("role", "").strip()
+    content = sub.get("content", "").strip()
+    if not role:
+        return None
+    return {"role": role, "content": content}
diff --git a/services/workflow_chat/tests/acceptance/test_basic_input.md b/services/workflow_chat/tests/acceptance/test_basic_input.md
new file mode 100644
index 0000000..99aff9e
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_basic_input.md
@@ -0,0 +1,18 @@
+---
+id: workflow-chat.basic-input
+service: workflow_chat
+---
+
+# notes
+
+Basic input test. The service handles a simple input without an existing YAML and either generates a workflow YAML or asks for more information.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis.
diff --git a/services/workflow_chat/tests/acceptance/test_basic_input.py b/services/workflow_chat/tests/acceptance/test_basic_input.py
deleted file mode 100644
index ef3017b..0000000
--- a/services/workflow_chat/tests/acceptance/test_basic_input.py
+++ /dev/null
@@ -1,40 +0,0 @@
-"""Basic input: empty yaml, simple request. The service should either generate
-a YAML or ask for more information. Structural checks ensure any generated YAML
-is well-formed."""
-
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import (
-    assert_no_special_chars,
-    assert_yaml_has_ids,
-    assert_yaml_jobs_have_body,
-)
-
-
-QUALITY_CRITERIA = []  # mostly structural; relies on universal rules + general flags
-
-
-def test_basic_input(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml="",
-        history=[],
-        user_message=(
-            "Whenever fridge statistics are send to you, parse and aggregate "
-            "the data and upload to a collection in redis."
-        ),
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-
-    if response.get("response_yaml"):
-        assert_yaml_has_ids(response["response_yaml"])
-        assert_yaml_jobs_have_body(response["response_yaml"])
-        assert_no_special_chars(response["response_yaml"])
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_conversational_turn.md b/services/workflow_chat/tests/acceptance/test_conversational_turn.md
new file mode 100644
index 0000000..ec666de
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_conversational_turn.md
@@ -0,0 +1,81 @@
+---
+id: workflow-chat.conversational-turn
+service: workflow_chat
+---
+
+# notes
+
+User asks a conversational question that should not lead to YAML changes. The service should respond with text only, or with a YAML identical to the existing one.
+
+# quality_criteria
+
+- The response engages conversationally with the user's request for clarification, without unnecessarily restructuring or rewriting the workflow.
+
+# settings
+
+## existing_yaml
+
+```yaml
+name: fridge-statistics-processing
+jobs:
+  parse-and-aggregate-fridge-data:
+    id: job-parse-id
+    name: Parse and Aggregate Fridge Data
+    adaptor: '@openfn/language-common@latest'
+    body: '| // Add data parsing and aggregation operations here'
+  upload-to-redis:
+    id: job-upload-id
+    name: Upload to Redis Collection
+    adaptor: '@openfn/language-redis@latest'
+    body: '| // Add Redis collection upload operations here'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->parse-and-aggregate-fridge-data:
+    id: edge-webhook-parse-id
+    source_trigger: webhook
+    target_job: parse-and-aggregate-fridge-data
+    condition_type: always
+    enabled: true
+  parse-and-aggregate-fridge-data->upload-to-redis:
+    id: edge-parse-upload-id
+    source_job: parse-and-aggregate-fridge-data
+    target_job: upload-to-redis
+    condition_type: on_job_success
+    enabled: true
+```
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll create a workflow that processes fridge statistics through a webhook trigger, then aggregates and stores the data in Redis.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Can you explain that better
diff --git a/services/workflow_chat/tests/acceptance/test_conversational_turn.py b/services/workflow_chat/tests/acceptance/test_conversational_turn.py
deleted file mode 100644
index 23b4077..0000000
--- a/services/workflow_chat/tests/acceptance/test_conversational_turn.py
+++ /dev/null
@@ -1,88 +0,0 @@
-"""User asks a conversational question that should NOT lead to YAML changes.
-The service should respond with text only, or with a YAML that is identical
-to the existing one."""
-
-import yaml
-
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import assert_no_special_chars
-
-
-QUALITY_CRITERIA = [
-    "The response engages conversationally with the user's request for clarification, without unnecessarily restructuring or rewriting the workflow.",
-]
-
-
-EXISTING_YAML = """
-name: fridge-statistics-processing
-jobs:
-  parse-and-aggregate-fridge-data:
-    id: job-parse-id
-    name: Parse and Aggregate Fridge Data
-    adaptor: '@openfn/language-common@latest'
-    body: '| // Add data parsing and aggregation operations here'
-  upload-to-redis:
-    id: job-upload-id
-    name: Upload to Redis Collection
-    adaptor: '@openfn/language-redis@latest'
-    body: '| // Add Redis collection upload operations here'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->parse-and-aggregate-fridge-data:
-    id: edge-webhook-parse-id
-    source_trigger: webhook
-    target_job: parse-and-aggregate-fridge-data
-    condition_type: always
-    enabled: true
-  parse-and-aggregate-fridge-data->upload-to-redis:
-    id: edge-parse-upload-id
-    source_job: parse-and-aggregate-fridge-data
-    target_job: upload-to-redis
-    condition_type: on_job_success
-    enabled: true
-"""
-
-
-def test_conversational_turn(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml=EXISTING_YAML,
-        history=[
-            {
-                "role": "user",
-                "content": (
-                    "Whenever fridge statistics are send to you, parse and aggregate "
-                    "the data and upload to a collection in redis."
-                ),
-            },
-            {
-                "role": "assistant",
-                "content": (
-                    "I'll create a workflow that processes fridge statistics through a "
-                    "webhook trigger, then aggregates and stores the data in Redis."
-                ),
-            },
-        ],
-        user_message="Can you explain that better",
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-
-    response_yaml_str = response.get("response_yaml")
-    if response_yaml_str and str(response_yaml_str).strip():
-        orig_yaml = yaml.safe_load(EXISTING_YAML)
-        response_yaml = yaml.safe_load(response_yaml_str)
-        assert orig_yaml == response_yaml, "If YAML is present in response, it must be unchanged."
-        assert_no_special_chars(response_yaml_str)
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_edit_job_code.md b/services/workflow_chat/tests/acceptance/test_edit_job_code.md
new file mode 100644
index 0000000..08cb0a6
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_edit_job_code.md
@@ -0,0 +1,82 @@
+---
+id: workflow-chat.edit-job-code
+service: workflow_chat
+---
+
+# notes
+
+User asks workflow_chat to fill in job code, which is not its responsibility. The service should politely explain that and avoid generating or modifying actual job bodies. Any returned YAML should be unchanged from the existing one (or absent).
+
+# quality_criteria
+
+- The response politely explains that filling in job code is not workflow_chat's responsibility, or otherwise declines the request gracefully.
+- The response does not attempt to write actual job code into the workflow.
+
+# settings
+
+## existing_yaml
+
+```yaml
+name: fridge-statistics-processing
+jobs:
+  parse-and-aggregate-fridge-data:
+    id: job-parse-id
+    name: Parse and Aggregate Fridge Data
+    adaptor: '@openfn/language-common@latest'
+    body: 'print("hello a")'
+  upload-to-redis:
+    id: job-upload-id
+    name: Upload to Redis Collection
+    adaptor: '@openfn/language-redis@latest'
+    body: 'print("hello a")'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->parse-and-aggregate-fridge-data:
+    id: edge-webhook-parse-id
+    source_trigger: webhook
+    target_job: parse-and-aggregate-fridge-data
+    condition_type: always
+    enabled: true
+  parse-and-aggregate-fridge-data->upload-to-redis:
+    id: edge-parse-upload-id
+    source_job: parse-and-aggregate-fridge-data
+    target_job: upload-to-redis
+    condition_type: on_job_success
+    enabled: true
+```
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll create a workflow that processes fridge statistics through a webhook trigger, then aggregates and stores the data in Redis.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Can you also fill in the job code for all the steps
diff --git a/services/workflow_chat/tests/acceptance/test_edit_job_code.py b/services/workflow_chat/tests/acceptance/test_edit_job_code.py
deleted file mode 100644
index 9ae938f..0000000
--- a/services/workflow_chat/tests/acceptance/test_edit_job_code.py
+++ /dev/null
@@ -1,83 +0,0 @@
-"""User asks workflow_chat to fill in job code, which is not its
-responsibility. The service should politely explain that and avoid generating
-or modifying actual job bodies. Any returned YAML should be unchanged from
-the existing one (or absent)."""
-
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import assert_no_special_chars
-
-
-QUALITY_CRITERIA = [
-    "The response politely explains that filling in job code is not workflow_chat's responsibility, or otherwise declines the request gracefully.",
-    "The response does not attempt to write actual job code into the workflow.",
-]
-
-
-EXISTING_YAML = """
-name: fridge-statistics-processing
-jobs:
-  parse-and-aggregate-fridge-data:
-    id: job-parse-id
-    name: Parse and Aggregate Fridge Data
-    adaptor: '@openfn/language-common@latest'
-    body: 'print("hello a")'
-  upload-to-redis:
-    id: job-upload-id
-    name: Upload to Redis Collection
-    adaptor: '@openfn/language-redis@latest'
-    body: 'print("hello a")'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->parse-and-aggregate-fridge-data:
-    id: edge-webhook-parse-id
-    source_trigger: webhook
-    target_job: parse-and-aggregate-fridge-data
-    condition_type: always
-    enabled: true
-  parse-and-aggregate-fridge-data->upload-to-redis:
-    id: edge-parse-upload-id
-    source_job: parse-and-aggregate-fridge-data
-    target_job: upload-to-redis
-    condition_type: on_job_success
-    enabled: true
-"""
-
-
-def test_edit_job_code(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml=EXISTING_YAML,
-        history=[
-            {
-                "role": "user",
-                "content": (
-                    "Whenever fridge statistics are send to you, parse and aggregate "
-                    "the data and upload to a collection in redis."
-                ),
-            },
-            {
-                "role": "assistant",
-                "content": (
-                    "I'll create a workflow that processes fridge statistics through a "
-                    "webhook trigger, then aggregates and stores the data in Redis."
-                ),
-            },
-        ],
-        user_message="Can you also fill in the job code for all the steps",
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    if response.get("response_yaml"):
-        assert_no_special_chars(response["response_yaml"])
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_error_field.md b/services/workflow_chat/tests/acceptance/test_error_field.md
new file mode 100644
index 0000000..9d855af
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_error_field.md
@@ -0,0 +1,76 @@
+---
+id: workflow-chat.error-field
+service: workflow_chat
+---
+
+# notes
+
+The service is invoked with an `errors` field (replacing `content`) — used when the user's previous workflow attempt produced a validation error. The service should acknowledge the error and produce a corrected workflow.
+
+# quality_criteria
+
+- The response acknowledges the reported error rather than ignoring it.
+- Any returned workflow YAML attempts to fix the cause of the error (in this case, an invalid adaptor).
+
+# settings
+
+## existing_yaml
+
+```yaml
+name: fridge-statistics-processing
+jobs:
+  parse-and-aggregate-fridge-data:
+    id: job-parse-id
+    name: Parse and Aggregate Fridge Data
+    adaptor: '@openfn/language-commons@latest'
+    body: '| // Add data parsing and aggregation operations here'
+  upload-to-redis:
+    id: job-upload-id
+    name: Upload to Redis Collection
+    adaptor: '@openfn/language-redis@latest'
+    body: '| // Add Redis collection upload operations here'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->parse-and-aggregate-fridge-data:
+    id: edge-webhook-parse-id
+    source_trigger: webhook
+    target_job: parse-and-aggregate-fridge-data
+    condition_type: always
+    enabled: true
+  parse-and-aggregate-fridge-data->upload-to-redis:
+    id: edge-parse-upload-id
+    source_job: parse-and-aggregate-fridge-data
+    target_job: upload-to-redis
+    condition_type: on_job_success
+    enabled: true
+```
+
+## errors
+
+adaptor error
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll create a workflow that processes fridge statistics through a webhook trigger, then aggregates and stores the data in Redis.
diff --git a/services/workflow_chat/tests/acceptance/test_error_field.py b/services/workflow_chat/tests/acceptance/test_error_field.py
deleted file mode 100644
index e8fad38..0000000
--- a/services/workflow_chat/tests/acceptance/test_error_field.py
+++ /dev/null
@@ -1,82 +0,0 @@
-"""The service is invoked with an `errors` field (replacing `content`) — used
-when the user's previous workflow attempt produced a validation error. The
-service should acknowledge the error and produce a corrected workflow."""
-
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import assert_no_special_chars
-
-
-QUALITY_CRITERIA = [
-    "The response acknowledges the reported error rather than ignoring it.",
-    "Any returned workflow YAML attempts to fix the cause of the error (in this case, an invalid adaptor).",
-]
-
-
-EXISTING_YAML = """
-name: fridge-statistics-processing
-jobs:
-  parse-and-aggregate-fridge-data:
-    id: job-parse-id
-    name: Parse and Aggregate Fridge Data
-    adaptor: '@openfn/language-commons@latest'
-    body: '| // Add data parsing and aggregation operations here'
-  upload-to-redis:
-    id: job-upload-id
-    name: Upload to Redis Collection
-    adaptor: '@openfn/language-redis@latest'
-    body: '| // Add Redis collection upload operations here'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->parse-and-aggregate-fridge-data:
-    id: edge-webhook-parse-id
-    source_trigger: webhook
-    target_job: parse-and-aggregate-fridge-data
-    condition_type: always
-    enabled: true
-  parse-and-aggregate-fridge-data->upload-to-redis:
-    id: edge-parse-upload-id
-    source_job: parse-and-aggregate-fridge-data
-    target_job: upload-to-redis
-    condition_type: on_job_success
-    enabled: true
-"""
-
-
-def test_error_field(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml=EXISTING_YAML,
-        history=[
-            {
-                "role": "user",
-                "content": (
-                    "Whenever fridge statistics are send to you, parse and aggregate "
-                    "the data and upload to a collection in redis."
-                ),
-            },
-            {
-                "role": "assistant",
-                "content": (
-                    "I'll create a workflow that processes fridge statistics through a "
-                    "webhook trigger, then aggregates and stores the data in Redis."
-                ),
-            },
-        ],
-        errors="adaptor error",
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    if response.get("response_yaml"):
-        assert_no_special_chars(response["response_yaml"])
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_input_second_turn.md b/services/workflow_chat/tests/acceptance/test_input_second_turn.md
new file mode 100644
index 0000000..a0f84ac
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_input_second_turn.md
@@ -0,0 +1,99 @@
+---
+id: workflow-chat.input-second-turn
+service: workflow_chat
+---
+
+# notes
+
+Second conversation turn requesting a change to the YAML. The service should preserve every job from the original YAML while applying the requested addition (data deduplication before validation).
+
+# settings
+
+## existing_yaml
+
+```yaml
+name: CommCare-to-DHIS2-Patient-Integration
+jobs:
+  receive-commcare-data:
+    id: job-receive-id
+    name: Receive CommCare Patient Data
+    adaptor: '@openfn/language-commcare@latest'
+    body: 'PLACEHOLDER 1'
+  validate-patient-data:
+    id: job-validate-id
+    name: Validate Patient Data
+    adaptor: '@openfn/language-common@latest'
+    body: 'PLACEHOLDER 2'
+  log-validation-errors:
+    id: job-log-id
+    name: Log Validation Errors to Google Sheets
+    adaptor: '@openfn/language-googlesheets@latest'
+    body: 'PLACEHOLDER 3'
+  transform-and-upload-to-dhis2:
+    id: job-transform-id
+    name: Transform and Upload to DHIS2
+    adaptor: '@openfn/language-dhis2@latest'
+    body: 'PLACEHOLDER 4'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->receive-commcare-data:
+    id: edge-webhook-receive-id
+    source_trigger: webhook
+    target_job: receive-commcare-data
+    condition_type: always
+    enabled: true
+  receive-commcare-data->validate-patient-data:
+    id: edge-receive-validate-id
+    source_job: receive-commcare-data
+    target_job: validate-patient-data
+    condition_type: on_job_success
+    enabled: true
+  validate-patient-data->log-validation-errors:
+    id: edge-validate-log-id
+    source_job: validate-patient-data
+    target_job: log-validation-errors
+    condition_type: on_job_failure
+    enabled: true
+  validate-patient-data->transform-and-upload-to-dhis2:
+    id: edge-validate-transform-id
+    source_job: validate-patient-data
+    target_job: transform-and-upload-to-dhis2
+    condition_type: on_job_success
+    enabled: true
+```
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+Set up an OpenFn workflow to automatically receive new patient data from CommCare, validate the data and if there's an issue log it to a google sheet, otherwise map it to the DHIS2 data model, and load it into the DHIS2 national health information system
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll create a workflow to process CommCare patient data. The flow will receive data via webhook, validate it, then either log issues to Google Sheets or transform and send valid data to DHIS2.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Actually, let's add data deduplication before validation to prevent duplicate patient records
diff --git a/services/workflow_chat/tests/acceptance/test_input_second_turn.py b/services/workflow_chat/tests/acceptance/test_input_second_turn.py
deleted file mode 100644
index 31faf35..0000000
--- a/services/workflow_chat/tests/acceptance/test_input_second_turn.py
+++ /dev/null
@@ -1,107 +0,0 @@
-"""Second conversation turn requesting a change to the YAML. The service
-should preserve every job from the original YAML while applying the requested
-addition (data deduplication before validation)."""
-
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import (
-    assert_no_special_chars,
-    assert_yaml_section_contains_all,
-)
-
-
-QUALITY_CRITERIA = []  # mostly structural
-
-
-EXISTING_YAML = """
-name: CommCare-to-DHIS2-Patient-Integration
-jobs:
-  receive-commcare-data:
-    id: job-receive-id
-    name: Receive CommCare Patient Data
-    adaptor: '@openfn/language-commcare@latest'
-    body: 'PLACEHOLDER 1'
-  validate-patient-data:
-    id: job-validate-id
-    name: Validate Patient Data
-    adaptor: '@openfn/language-common@latest'
-    body: 'PLACEHOLDER 2'
-  log-validation-errors:
-    id: job-log-id
-    name: Log Validation Errors to Google Sheets
-    adaptor: '@openfn/language-googlesheets@latest'
-    body: 'PLACEHOLDER 3'
-  transform-and-upload-to-dhis2:
-    id: job-transform-id
-    name: Transform and Upload to DHIS2
-    adaptor: '@openfn/language-dhis2@latest'
-    body: 'PLACEHOLER 4'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->receive-commcare-data:
-    id: edge-webhook-receive-id
-    source_trigger: webhook
-    target_job: receive-commcare-data
-    condition_type: always
-    enabled: true
-  receive-commcare-data->validate-patient-data:
-    id: edge-receive-validate-id
-    source_job: receive-commcare-data
-    target_job: validate-patient-data
-    condition_type: on_job_success
-    enabled: true
-  validate-patient-data->log-validation-errors:
-    id: edge-validate-log-id
-    source_job: validate-patient-data
-    target_job: log-validation-errors
-    condition_type: on_job_failure
-    enabled: true
-  validate-patient-data->transform-and-upload-to-dhis2:
-    id: edge-validate-transform-id
-    source_job: validate-patient-data
-    target_job: transform-and-upload-to-dhis2
-    condition_type: on_job_success
-    enabled: true
-"""
-
-
-def test_input_second_turn(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml=EXISTING_YAML,
-        history=[
-            {
-                "role": "user",
-                "content": (
-                    "Set up an OpenFn workflow to automatically receive new patient data from "
-                    "CommCare, validate the data and if there's an issue log it to a google "
-                    "sheet, otherwise map it to the DHIS2 data model, and load it into the "
-                    "DHIS2 national health information system"
-                ),
-            },
-            {
-                "role": "assistant",
-                "content": (
-                    "I'll create a workflow to process CommCare patient data. The flow will "
-                    "receive data via webhook, validate it, then either log issues to Google "
-                    "Sheets or transform and send valid data to DHIS2."
-                ),
-            },
-        ],
-        user_message="Actually, let's add data deduplication before validation to prevent duplicate patient records",
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    assert_yaml_section_contains_all(EXISTING_YAML, response.get("response_yaml", ""), "jobs")
-    assert_no_special_chars(response["response_yaml"])
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_long_yaml.py b/services/workflow_chat/tests/acceptance/test_long_yaml.md
similarity index 51%
rename from services/workflow_chat/tests/acceptance/test_long_yaml.py
rename to services/workflow_chat/tests/acceptance/test_long_yaml.md
index c741afe..7837d98 100644
--- a/services/workflow_chat/tests/acceptance/test_long_yaml.py
+++ b/services/workflow_chat/tests/acceptance/test_long_yaml.md
@@ -1,21 +1,21 @@
-"""Long workflow YAML + multi-turn history. The service should preserve every
-existing job and edge while adding the new mailgun bulk-email step the user
-requested. Tests that the response isn't truncated or stripped of structure."""
+---
+id: workflow-chat.long-yaml
+service: workflow_chat
+---
 
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import (
-    assert_no_special_chars,
-    assert_yaml_section_contains_all,
-)
+# notes
 
+Long workflow YAML plus multi-turn history. The user has incrementally built up a 9-job pipeline over 5 turns. They now ask for one final addition (Mailgun bulk email after the Asana update). The service should preserve every existing job and edge and add the new step in the right position. Tests that the response is not truncated and that no structure is dropped.
 
-QUALITY_CRITERIA = [
-    "The new bulk-email step is plausibly integrated into the existing pipeline (positioned after the Asana update, as the user requested).",
-]
+# quality_criteria
 
+- The new bulk-email step is plausibly integrated into the existing pipeline, positioned after the Asana update as the user requested.
 
-EXISTING_YAML = """
+# settings
+
+## existing_yaml
+
+```yaml
 name: Data-Integration-and-Reporting
 jobs:
   Retrieve-Google-Sheets-Data:
@@ -123,51 +123,116 @@
     target_job: Update-asana
     condition_type: on_job_success
     enabled: true
-"""
-
-
-# History is shortened to user/assistant pairs — the full reproduction lives in
-# the original test_qualitative.py and isn't needed to exercise the bug.
-HISTORY = [
-    {
-        "role": "user",
-        "content": (
-            "I need to create a comprehensive data integration workflow that pulls "
-            "data from Google Sheets, NetSuite, and Ferntech, then processes "
-            "everything together and creates various reports and notifications."
-        ),
-    },
-    {"role": "assistant", "content": "Absolutely! Let's start by setting up the workflow to retrieve data."},
-    {"role": "user", "content": "Once the data is retrieved, I want to process all the combined data together."},
-    {"role": "assistant", "content": "Great, I'll add a processing job."},
-    {"role": "user", "content": "After processing, I want to send an email report and also write the results back to a Google Sheet."},
-    {"role": "assistant", "content": "Understood. Added Send-Email-Report and write-to-sheet."},
-    {"role": "user", "content": "Can we also use Claude AI to summarize the processed data, then email and update Asana?"},
-    {"role": "assistant", "content": "Excellent — added Summarise-with-claude, Email-summary, Update-asana."},
-    {"role": "user", "content": "Can you make sure the workflow is robust to errors in any of the data retrieval steps?"},
-    {"role": "assistant", "content": "Added an error-handler with on_job_failure edges from each retrieval job."},
-]
-
-
-def test_long_yaml(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml=EXISTING_YAML,
-        history=HISTORY,
-        user_message=(
-            "Perfect! One final addition - after updating Asana, I want to format "
-            "the data for bulk emailing and then send out bulk emails using Mailgun."
-        ),
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    assert_yaml_section_contains_all(EXISTING_YAML, response.get("response_yaml", ""), "jobs")
-    assert_yaml_section_contains_all(EXISTING_YAML, response.get("response_yaml", ""), "edges")
-    assert_no_special_chars(response["response_yaml"])
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
+```
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+I need to create a comprehensive data integration workflow that pulls data from Google Sheets, NetSuite, and Ferntech, then processes everything together and creates various reports and notifications.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Absolutely! Let's start by setting up the workflow to retrieve data.
+
+## turn
+
+### role
+
+user
+
+### content
+
+Once the data is retrieved, I want to process all the combined data together.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Great, I'll add a processing job.
+
+## turn
+
+### role
+
+user
+
+### content
+
+After processing, I want to send an email report and also write the results back to a Google Sheet.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Understood. Added Send-Email-Report and write-to-sheet.
+
+## turn
+
+### role
+
+user
+
+### content
+
+Can we also use Claude AI to summarize the processed data, then email and update Asana?
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Excellent — added Summarise-with-claude, Email-summary, Update-asana.
+
+## turn
+
+### role
+
+user
+
+### content
+
+Can you make sure the workflow is robust to errors in any of the data retrieval steps?
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+Added an error-handler with on_job_failure edges from each retrieval job.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Perfect! One final addition - after updating Asana, I want to format the data for bulk emailing and then send out bulk emails using Mailgun.
diff --git a/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.md b/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.md
new file mode 100644
index 0000000..970f7a2
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.md
@@ -0,0 +1,128 @@
+---
+id: workflow-chat.navigation-job-to-workflow
+service: workflow_chat
+---
+
+# notes
+
+User has just navigated from a job-code editor (where they were discussing HTTP error handling) to a workflow editor and asks to add a new step. The model should infer the context switch and respond about the workflow as a structure, not continue talking about job-code error handling.
+
+# quality_criteria
+
+- The response talks about the workflow as a structure (jobs, edges, triggers), not about job-code-level error handling.
+- The tone is warm and collaborative, not clinical or terse.
+- The response adds a new email-sending step to the workflow (gmail, mailgun, or similar adaptor) and the rationale is plausible — e.g. notification, summary, or alerting.
+
+# settings
+
+## existing_yaml
+
+```yaml
+name: data-pipeline
+jobs:
+  fetch-source-data:
+    id: job-fetch-id
+    name: Fetch Source Data
+    adaptor: '@openfn/language-http@6.5.4'
+    body: 'get("https://source.api/data");'
+  transform-data:
+    id: job-transform-id
+    name: Transform Data
+    adaptor: '@openfn/language-common@latest'
+    body: 'fn(state => { return { ...state, transformed: true }; });'
+  save-to-database:
+    id: job-save-id
+    name: Save to Database
+    adaptor: '@openfn/language-http@6.5.4'
+    body: 'post("https://db.api/save", state => state.data);'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->fetch-source-data:
+    id: edge-webhook-fetch-id
+    source_trigger: webhook
+    target_job: fetch-source-data
+    condition_type: always
+    enabled: true
+  fetch-source-data->transform-data:
+    id: edge-fetch-transform-id
+    source_job: fetch-source-data
+    target_job: transform-data
+    condition_type: on_job_success
+    enabled: true
+  transform-data->save-to-database:
+    id: edge-transform-save-id
+    source_job: transform-data
+    target_job: save-to-database
+    condition_type: on_job_success
+    enabled: true
+```
+
+## context.page_name
+
+data-pipeline
+
+## meta.last_page
+
+```json
+{
+  "type": "job_code",
+  "name": "transform-data",
+  "adaptor": "http"
+}
+```
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+[pg:job_code/transform-data/http] Can you add error handling to this HTTP request?
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll add try-catch error handling to catch any request failures in your HTTP job.
+
+## turn
+
+### role
+
+user
+
+### content
+
+[pg:job_code/transform-data/http] Also add retry logic with backoff
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll add exponential backoff retry logic to handle transient failures.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Add a step to send the results via email
diff --git a/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.py b/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.py
deleted file mode 100644
index 80a6ba4..0000000
--- a/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.py
+++ /dev/null
@@ -1,110 +0,0 @@
-"""User has just navigated from a job-code editor (where they were discussing
-HTTP error handling) to a workflow editor and asks to add a new step. The
-model should infer the context switch and respond about the workflow as a
-structure, not continue talking about job-code error handling."""
-
-import yaml
-
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import assert_no_special_chars
-
-
-QUALITY_CRITERIA = [
-    "The response talks about the workflow as a structure (jobs, edges, triggers), not about job-code-level error handling.",
-    "The tone is warm and collaborative, not clinical or terse.",
-    "If the response proposes a new email step, the rationale is plausible (e.g. mentions notification, summary, or alerting).",
-]
-
-
-EXISTING_YAML = """name: data-pipeline
-jobs:
-  fetch-source-data:
-    id: job-fetch-id
-    name: Fetch Source Data
-    adaptor: '@openfn/language-http@6.5.4'
-    body: 'get("https://source.api/data");'
-  transform-data:
-    id: job-transform-id
-    name: Transform Data
-    adaptor: '@openfn/language-common@latest'
-    body: 'fn(state => { return { ...state, transformed: true }; });'
-  save-to-database:
-    id: job-save-id
-    name: Save to Database
-    adaptor: '@openfn/language-http@6.5.4'
-    body: 'post("https://db.api/save", state => state.data);'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->fetch-source-data:
-    id: edge-webhook-fetch-id
-    source_trigger: webhook
-    target_job: fetch-source-data
-    condition_type: always
-    enabled: true
-  fetch-source-data->transform-data:
-    id: edge-fetch-transform-id
-    source_job: fetch-source-data
-    target_job: transform-data
-    condition_type: on_job_success
-    enabled: true
-  transform-data->save-to-database:
-    id: edge-transform-save-id
-    source_job: transform-data
-    target_job: save-to-database
-    condition_type: on_job_success
-    enabled: true
-"""
-
-
-def test_navigation_job_to_workflow(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml=EXISTING_YAML,
-        history=[
-            {"role": "user", "content": "[pg:job_code/transform-data/http] Can you add error handling to this HTTP request?"},
-            {"role": "assistant", "content": "I'll add try-catch error handling to catch any request failures in your HTTP job."},
-            {"role": "user", "content": "[pg:job_code/transform-data/http] Also add retry logic with backoff"},
-            {"role": "assistant", "content": "I'll add exponential backoff retry logic to handle transient failures."},
-        ],
-        user_message="Add a step to send the results via email",
-        current_page="data-pipeline",
-        previous_page={"type": "job_code", "name": "transform-data", "adaptor": "http"},
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response["response_yaml"], "Model should have generated YAML"
-
-    yaml_obj = yaml.safe_load(response["response_yaml"])
-    assert "jobs" in yaml_obj
-
-    job_names = [job.get("name", "").lower() for job in yaml_obj["jobs"].values()]
-    assert any("email" in n or "mail" in n or "send" in n for n in job_names), (
-        f"Email job not found in workflow. Job names: {job_names}"
-    )
-
-    orig_yaml_obj = yaml.safe_load(EXISTING_YAML)
-    assert len(yaml_obj["jobs"]) > len(orig_yaml_obj["jobs"]), (
-        f"Expected a new job to be added. Original: {len(orig_yaml_obj['jobs'])}, "
-        f"new: {len(yaml_obj['jobs'])}"
-    )
-
-    response_text = response["response"].lower()
-    assert not any(phrase in response_text for phrase in ["try", "catch", "retry", "backoff", "error handling in the code"]), (
-        "Response should be about workflow structure, not job-code error handling"
-    )
-
-    assert_no_special_chars(response["response_yaml"])
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(
-        criteria=QUALITY_CRITERIA,
-        candidate=response,
-        test_notes=__doc__,
-    )
-    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_simple_lang_bug.md b/services/workflow_chat/tests/acceptance/test_simple_lang_bug.md
new file mode 100644
index 0000000..2023591
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_simple_lang_bug.md
@@ -0,0 +1,23 @@
+---
+id: workflow-chat.simple-lang-bug
+service: workflow_chat
+---
+
+# notes
+
+User asks "are you there?" — the service should respond conversationally about itself. It should use simple, user-facing language and not mention internal data structures like YAML.
+
+# quality_criteria
+
+- The response describes the service's capabilities in plain, user-facing language.
+- The response does not expose internal implementation details such as YAML, schemas, or data formats.
+
+# turn
+
+## role
+
+user
+
+## content
+
+are you there?
diff --git a/services/workflow_chat/tests/acceptance/test_simple_lang_bug.py b/services/workflow_chat/tests/acceptance/test_simple_lang_bug.py
deleted file mode 100644
index a1a9aba..0000000
--- a/services/workflow_chat/tests/acceptance/test_simple_lang_bug.py
+++ /dev/null
@@ -1,45 +0,0 @@
-"""User asks "are you there?" — the service should respond conversationally
-about itself. It should use simple, user-facing language and not mention
-internal data structures like YAML."""
-
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import (
-    assert_no_special_chars,
-    assert_yaml_has_ids,
-    assert_yaml_jobs_have_body,
-)
-
-
-QUALITY_CRITERIA = [
-    "The response describes the service's capabilities in plain, user-facing language.",
-    "The response does not expose internal implementation details such as YAML, schemas, or data formats.",
-]
-
-
-def test_simple_lang_bug(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml="",
-        history=[],
-        user_message="are you there?",
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-
-    response_text = response.get("response", "")
-    assert "yaml" not in response_text.lower(), (
-        f"Response should not mention 'YAML', but got: {response_text}"
-    )
-
-    if response.get("response_yaml"):
-        assert_yaml_has_ids(response["response_yaml"])
-        assert_yaml_jobs_have_body(response["response_yaml"])
-        assert_no_special_chars(response["response_yaml"])
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary
diff --git a/services/workflow_chat/tests/acceptance/test_single_trigger_node.md b/services/workflow_chat/tests/acceptance/test_single_trigger_node.md
new file mode 100644
index 0000000..3f9dbec
--- /dev/null
+++ b/services/workflow_chat/tests/acceptance/test_single_trigger_node.md
@@ -0,0 +1,82 @@
+---
+id: workflow-chat.single-trigger-node
+service: workflow_chat
+---
+
+# notes
+
+User asks for a change that implies multiple nodes coming directly from the trigger. Only one node can come from the trigger in OpenFn — the service should respect the constraint by picking one job to run first and branching from there, not by adding multiple direct children of the trigger.
+
+# quality_criteria
+
+- The proposed workflow respects the constraint that only one job can come directly from the trigger.
+- If the user's request implies multiple parallel steps from the trigger, the response restructures it so one job runs first and the others branch off after.
+
+# settings
+
+## existing_yaml
+
+```yaml
+name: fridge-statistics-processing
+jobs:
+  parse-and-aggregate-fridge-data:
+    id: job-parse-id
+    name: Parse and Aggregate Fridge Data
+    adaptor: '@openfn/language-common@latest'
+    body: 'print("hello a")'
+  upload-to-redis:
+    id: job-upload-id
+    name: Upload to Redis Collection
+    adaptor: '@openfn/language-redis@latest'
+    body: 'print("hello b")'
+triggers:
+  webhook:
+    id: trigger-webhook-id
+    type: webhook
+    enabled: false
+edges:
+  webhook->parse-and-aggregate-fridge-data:
+    id: edge-webhook-parse-id
+    source_trigger: webhook
+    target_job: parse-and-aggregate-fridge-data
+    condition_type: always
+    enabled: true
+  parse-and-aggregate-fridge-data->upload-to-redis:
+    id: edge-parse-upload-id
+    source_job: parse-and-aggregate-fridge-data
+    target_job: upload-to-redis
+    condition_type: on_job_success
+    enabled: true
+```
+
+# history
+
+## turn
+
+### role
+
+user
+
+### content
+
+Whenever fridge statistics are send to you, parse and aggregate the data and upload to a collection in redis.
+
+## turn
+
+### role
+
+assistant
+
+### content
+
+I'll create a workflow that processes fridge statistics through a webhook trigger, then aggregates and stores the data in Redis.
+
+# turn
+
+## role
+
+user
+
+## content
+
+Actually I also want an email notification at the same time as the data is being parsed.
diff --git a/services/workflow_chat/tests/acceptance/test_single_trigger_node.py b/services/workflow_chat/tests/acceptance/test_single_trigger_node.py
deleted file mode 100644
index e94c648..0000000
--- a/services/workflow_chat/tests/acceptance/test_single_trigger_node.py
+++ /dev/null
@@ -1,83 +0,0 @@
-"""User asks for a change that implies multiple nodes coming directly from
-the trigger. Only one node can come from the trigger in OpenFn — the service
-should respect the constraint by picking one job to run first and branching
-from there, not by adding multiple direct children of the trigger."""
-
-from testing import judge
-from testing.payloads import build_workflow_chat_payload
-from testing.yaml_assertions import assert_no_special_chars
-
-
-QUALITY_CRITERIA = [
-    "The proposed workflow respects the constraint that only one job can come directly from the trigger.",
-    "If the user's request implies multiple parallel steps from the trigger, the response restructures it so one job runs first and the others branch off after.",
-]
-
-
-EXISTING_YAML = """
-name: fridge-statistics-processing
-jobs:
-  parse-and-aggregate-fridge-data:
-    id: job-parse-id
-    name: Parse and Aggregate Fridge Data
-    adaptor: '@openfn/language-common@latest'
-    body: 'print("hello a")'
-  upload-to-redis:
-    id: job-upload-id
-    name: Upload to Redis Collection
-    adaptor: '@openfn/language-redis@latest'
-    body: 'print("hello b")'
-triggers:
-  webhook:
-    id: trigger-webhook-id
-    type: webhook
-    enabled: false
-edges:
-  webhook->parse-and-aggregate-fridge-data:
-    id: edge-webhook-parse-id
-    source_trigger: webhook
-    target_job: parse-and-aggregate-fridge-data
-    condition_type: always
-    enabled: true
-  parse-and-aggregate-fridge-data->upload-to-redis:
-    id: edge-parse-upload-id
-    source_job: parse-and-aggregate-fridge-data
-    target_job: upload-to-redis
-    condition_type: on_job_success
-    enabled: true
-"""
-
-
-def test_single_trigger_node(apollo_client):
-    payload = build_workflow_chat_payload(
-        existing_yaml=EXISTING_YAML,
-        history=[
-            {
-                "role": "user",
-                "content": (
-                    "Whenever fridge statistics are send to you, parse and aggregate "
-                    "the data and upload to a collection in redis."
-                ),
-            },
-            {
-                "role": "assistant",
-                "content": (
-                    "I'll create a workflow that processes fridge statistics through a "
-                    "webhook trigger, then aggregates and stores the data in Redis."
-                ),
-            },
-        ],
-        user_message="Actually I also want an email notification at the same time as the data is being parsed.",
-    )
-
-    response = apollo_client.call("workflow_chat", payload)
-
-    # ---- Structural assertions ---------------------------------------------
-    assert response is not None
-    assert isinstance(response, dict)
-    if response.get("response_yaml"):
-        assert_no_special_chars(response["response_yaml"])
-
-    # ---- Quality assertions ------------------------------------------------
-    verdict = judge.evaluate(criteria=QUALITY_CRITERIA, candidate=response, test_notes=__doc__)
-    assert verdict.passed, verdict.summary

From d149e866874be995f3f9ba0ac7284296ed69f50a Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 02:23:36 +0900
Subject: [PATCH 05/15] update model usage and readme

---
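Reviewer note (not part of the commit): the README hunk below describes
`spec_parser.py` as parsing acceptance specs into `Spec` dataclasses. What
follows is a minimal sketch of how such a parser could work, assuming the
layout used by the specs in this series: frontmatter between two `---`
fences, then `# notes`, `# quality_criteria`, and a final `# turn` with
`## role` and `## content`. The names `SketchSpec` and `parse_spec_sketch`
are hypothetical, `# settings` and `# history` are skipped for brevity, and
the shipped implementation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SketchSpec:
    """Hypothetical stand-in for the real Spec dataclass."""
    id: str = ""
    service: str = ""
    notes: str = ""
    quality_criteria: list[str] = field(default_factory=list)
    role: str = ""
    content: str = ""


def parse_spec_sketch(text: str) -> SketchSpec:
    lines = text.strip().splitlines()
    spec = SketchSpec()

    # Frontmatter: plain "key: value" pairs between two '---' fences.
    body_start = 0
    if lines and lines[0].strip() == "---":
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
        for raw in lines[1:end]:
            key, _, value = raw.partition(":")
            if key.strip() in ("id", "service"):
                setattr(spec, key.strip(), value.strip())
        body_start = end + 1

    # Split the body into top-level '# name' sections. Lines inside fenced
    # code blocks are never treated as headings.
    sections: dict[str, list[str]] = {}
    current = None
    in_fence = False
    for raw in lines[body_start:]:
        if raw.startswith("```"):
            in_fence = not in_fence
        if not in_fence and raw.startswith("# "):
            current = raw[2:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(raw)

    spec.notes = "\n".join(sections.get("notes", [])).strip()
    spec.quality_criteria = [
        item[2:].strip()
        for item in sections.get("quality_criteria", [])
        if item.startswith("- ")
    ]

    # Final turn: collect the text under '## role' and '## content'.
    parts: dict[str, list[str]] = {"role": [], "content": []}
    sub = None
    for raw in sections.get("turn", []):
        if raw.startswith("## "):
            sub = raw[3:].strip()
        elif sub in parts:
            parts[sub].append(raw)
    spec.role = "\n".join(parts["role"]).strip()
    spec.content = "\n".join(parts["content"]).strip()
    return spec


# Usage: the simple-lang-bug spec from this series, abbreviated.
SAMPLE = "\n".join([
    "---",
    "id: workflow-chat.simple-lang-bug",
    "service: workflow_chat",
    "---",
    "",
    "# notes",
    "",
    "User asks whether the service is there.",
    "",
    "# quality_criteria",
    "",
    "- The response uses plain, user-facing language.",
    "- The response does not expose internal details such as YAML.",
    "",
    "# turn",
    "",
    "## role",
    "",
    "user",
    "",
    "## content",
    "",
    "are you there?",
])
spec = parse_spec_sketch(SAMPLE)
```

The sketch keeps headings as the only structure, which matches the stated
goal that a product owner edits a text file rather than Python.
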
 services/testing/README.md | 15 ++++++++-------
 services/testing/judge.py  |  4 +++-
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/services/testing/README.md b/services/testing/README.md
index dcb1786..d594830 100644
--- a/services/testing/README.md
+++ b/services/testing/README.md
@@ -12,18 +12,19 @@ This directory is on the Python path via `pyproject.toml`
   every tier (unit included).
 - `judge.py` — LLM-as-judge helper for acceptance tests. Evaluates chat-service
   responses against natural-language criteria. Loads universal rules from
-  `judge_rules.md` at evaluation time.
+  `judge_rules.md` at evaluation time. Defaults to `CLAUDE_SONNET` from
+  `services/models.py`.
 - `judge_rules.md` — universal rules prepended to every acceptance judge
   evaluation. Edit this file to add project-wide standards (voice, style,
   refusal handling, etc.). One rule per bullet. Empty file = no universal rules.
-- `payloads.py` — `build_global_chat_payload`, `build_workflow_chat_payload`,
-  `build_job_chat_payload`. Intuitive named kwargs that translate to the JSON
-  shape each service expects.
-- `responses.py` — `get_attachment`, `assert_routed_to`, `assert_agent_calls`.
-  Shared response helpers used across acceptance tests.
+- `spec_parser.py` — parses acceptance test markdown specs
+  (`services/<svc>/tests/acceptance/*.md`) into `Spec` dataclasses.
+- `spec_collector.py` — pytest plugin (registered via `pytest_plugins` in the
+  repo-root `conftest.py`). Turns each MD spec into a pytest item that builds
+  the service payload, calls the service via `ApolloClient`, and runs the judge.
 - `apollo_client.py` — `ApolloClient` for dispatching to a chat service.
   Currently a subprocess-based stub; the integration tier will replace its
-  internals with a real HTTP client.
+  internals with a real HTTP client (same `.call()` signature, no test changes).
 - `fixtures.py` — pytest fixtures (`apollo_client`). Registered via
   `pytest_plugins = ["testing.fixtures"]` in the repo-root `conftest.py`.
 
diff --git a/services/testing/judge.py b/services/testing/judge.py
index cd618b0..d68a22c 100644
--- a/services/testing/judge.py
+++ b/services/testing/judge.py
@@ -29,8 +29,10 @@
 
 from anthropic import Anthropic
 
+from models import CLAUDE_SONNET
 
-DEFAULT_MODEL = "claude-sonnet-4-5"
+
+DEFAULT_MODEL = CLAUDE_SONNET
 _RULES_PATH = Path(__file__).parent / "judge_rules.md"
 
 

From b2c02c26bd1ac48b93502846ba2fb983956f506e Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 13:05:06 +0900
Subject: [PATCH 06/15] clean

---
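Reviewer note (not part of the commit): with `payloads.py` removed, the
specs carry payload fields as `# settings` subsections such as
`## existing_yaml`, `## context.page_name`, and `## meta.last_page`. One
plausible way to rebuild the nested JSON shape that
`build_workflow_chat_payload` used to produce is to treat each dot as a
nesting level. The function name `settings_to_payload` is hypothetical and
the collector in this series may do this differently.

```python
from typing import Any, Optional


def settings_to_payload(
    settings: dict[str, Any],
    user_message: str,
    history: Optional[list[dict]] = None,
) -> dict[str, Any]:
    """Expand dotted setting names into a nested payload (illustrative only)."""
    payload: dict[str, Any] = {"content": user_message, "history": history or []}
    for dotted, value in settings.items():
        # "meta.last_page" -> parents ["meta"], leaf "last_page"
        *parents, leaf = dotted.split(".")
        node = payload
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return payload


# Usage: values taken from the navigation-job-to-workflow spec.
payload = settings_to_payload(
    {
        "existing_yaml": "name: data-pipeline\n",
        "context.page_name": "data-pipeline",
        "meta.last_page": {
            "type": "job_code",
            "name": "transform-data",
            "adaptor": "http",
        },
    },
    user_message="Add a step to send the results via email",
)
```

Under this convention a spec author writes the same key paths the old
docstrings documented (`context.page_name`, `meta.last_page`) without needing
a per-service builder function.
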
 conftest.py                   |   5 +-
 services/testing/README.md    |   2 -
 services/testing/fixtures.py  |  20 ----
 services/testing/payloads.py  | 188 ----------------------------------
 services/testing/responses.py |  90 ----------------
 5 files changed, 2 insertions(+), 303 deletions(-)
 delete mode 100644 services/testing/fixtures.py
 delete mode 100644 services/testing/payloads.py
 delete mode 100644 services/testing/responses.py

diff --git a/conftest.py b/conftest.py
index e22fc61..ba0cf17 100644
--- a/conftest.py
+++ b/conftest.py
@@ -11,10 +11,9 @@
 import pytest
 
 
-# Shared fixtures (apollo_client, ...) live in services/testing/fixtures.py.
 # The spec collector picks up acceptance test markdown specs from
-# services/<svc>/tests/acceptance/*.md. Both registered globally.
-pytest_plugins = ["testing.fixtures", "testing.spec_collector"]
+# services/<svc>/tests/acceptance/*.md and turns them into pytest items.
+pytest_plugins = ["testing.spec_collector"]
 
 
 _TIER_DIRS = ("unit", "service", "integration", "acceptance")
diff --git a/services/testing/README.md b/services/testing/README.md
index d594830..0ebc8c4 100644
--- a/services/testing/README.md
+++ b/services/testing/README.md
@@ -25,8 +25,6 @@ This directory is on the Python path via `pyproject.toml`
 - `apollo_client.py` — `ApolloClient` for dispatching to a chat service.
   Currently a subprocess-based stub; the integration tier will replace its
   internals with a real HTTP client (same `.call()` signature, no test changes).
-- `fixtures.py` — pytest fixtures (`apollo_client`). Registered via
-  `pytest_plugins = ["testing.fixtures"]` in the repo-root `conftest.py`.
 
 ## Why under `services/` and not a top-level `tests/`?
 
diff --git a/services/testing/fixtures.py b/services/testing/fixtures.py
deleted file mode 100644
index fb15f72..0000000
--- a/services/testing/fixtures.py
+++ /dev/null
@@ -1,20 +0,0 @@
-"""Pytest fixtures shared across acceptance and integration tests.
-
-Registered via `pytest_plugins = ["testing.fixtures"]` in the repo-root
-`conftest.py` so any test can request these fixtures by name.
-"""
-
-import pytest
-
-from testing.apollo_client import ApolloClient
-
-
-@pytest.fixture(scope="session")
-def apollo_client() -> ApolloClient:
-    """Session-scoped client for dispatching to chat services.
-
-    Today: subprocess-based stub. The integration tier will swap the
-    underlying implementation for a real HTTP client backed by a long-lived
-    bun server, without changing this fixture's interface.
-    """
-    return ApolloClient()
diff --git a/services/testing/payloads.py b/services/testing/payloads.py
deleted file mode 100644
index 25576ac..0000000
--- a/services/testing/payloads.py
+++ /dev/null
@@ -1,188 +0,0 @@
-"""Payload builders for acceptance tests.
-
-The function signatures are deliberately closer to user-facing concepts
-(`current_job_code`, `current_adaptor`, `previous_page`) than the underlying
-JSON shape (`context.expression`, `context.adaptor`, `meta.last_page`). The
-goal is that a contributor who has not memorised every service's payload spec
-can still author a test from these signatures alone.
-
-Each builder is "set if not None" — pass only the fields a test cares about.
-"""
-
-from typing import Any, Optional
-
-
-def build_global_chat_payload(
-    *,
-    user_message: str,
-    history: Optional[list[dict]] = None,
-    workflow_yaml: Optional[str] = None,
-    page: Optional[str] = None,
-    attachments: Optional[list[dict]] = None,
-    api_key: Optional[str] = None,
-    stream: bool = False,
-) -> dict:
-    """Build a global_chat service payload.
-
-    global_chat is the orchestrator entry point — it routes to workflow_chat,
-    job_chat, or the planner depending on context. For workflow scenarios pass
-    `workflow_yaml`; for job-code scenarios the planner will extract the
-    relevant job from the YAML using `page`.
-    """
-    payload: dict[str, Any] = {
-        "content": user_message,
-        "history": history or [],
-    }
-    if workflow_yaml is not None:
-        payload["workflow_yaml"] = workflow_yaml
-    if page is not None:
-        payload["page"] = page
-    if attachments is not None:
-        payload["attachments"] = attachments
-    if api_key is not None:
-        payload["api_key"] = api_key
-    if stream:
-        payload["options"] = {"stream": True}
-    return payload
-
-
-def build_workflow_chat_payload(
-    *,
-    user_message: Optional[str] = None,
-    existing_yaml: str = "",
-    history: Optional[list[dict]] = None,
-    errors: Optional[str] = None,
-    current_page: Optional[str] = None,
-    previous_page: Optional[dict] = None,
-    api_key: Optional[str] = None,
-) -> dict:
-    """Build a workflow_chat service payload.
-
-    Args:
-        user_message: The user's latest message. Required unless `errors` is set.
-        existing_yaml: Current workflow YAML the user is editing.
-        history: Chat history as a list of {role, content} dicts.
-        errors: An error string. When set, replaces `content` to put the
-            service in error-correction mode.
-        current_page: The page the user is currently on (e.g. workflow name).
-            Threaded into `context.page_name`.
-        previous_page: Where the user navigated from. Threaded into
-            `meta.last_page`. Shape: {"type": "job_code" | "workflow", "name": str, "adaptor": str}.
-        api_key: Optional Anthropic API key override.
-    """
-    payload: dict[str, Any] = {
-        "existing_yaml": existing_yaml,
-        "history": history or [],
-    }
-    if user_message is not None:
-        payload["content"] = user_message
-    if errors is not None:
-        payload["errors"] = errors
-
-    context: dict[str, Any] = {}
-    if current_page is not None:
-        context["page_name"] = current_page
-    if context:
-        payload["context"] = context
-
-    meta: dict[str, Any] = {}
-    if previous_page is not None:
-        meta["last_page"] = previous_page
-    if meta:
-        payload["meta"] = meta
-
-    if api_key is not None:
-        payload["api_key"] = api_key
-    return payload
-
-
-def build_job_chat_payload(
-    *,
-    user_message: str,
-    history: Optional[list[dict]] = None,
-    current_job_code: Optional[str] = None,
-    current_adaptor: Optional[str] = None,
-    project_adaptors: Optional[list[str]] = None,
-    current_page: Optional[str] = None,
-    project_id: Optional[str] = None,
-    job_id: Optional[str] = None,
-    input_data: Any = None,
-    output_data: Any = None,
-    log_data: Any = None,
-    rag_results: Optional[list[dict]] = None,
-    rag_queries: Optional[list[str]] = None,
-    previous_page: Optional[dict] = None,
-    suggest_code: Optional[bool] = None,
-    api_key: Optional[str] = None,
-    stream: Optional[bool] = None,
-    download_adaptor_docs: Optional[bool] = None,
-) -> dict:
-    """Build a job_chat service payload.
-
-    Args:
-        user_message: The user's latest message.
-        history: Chat history as a list of {role, content} dicts.
-        current_job_code: The job code currently in the editor. → context.expression
-        current_adaptor: The adaptor specifier (e.g. "@openfn/language-http@6.5.4"). → context.adaptor
-        project_adaptors: Other adaptors used in the project. → context.adaptors
-        current_page: Current page / job name. → context.page_name
-        project_id: → context.projectId
-        job_id: → context.jobId
-        input_data / output_data / log_data: Sample data the user has available.
-        rag_results: Pre-injected RAG search results. → meta.rag.search_results
-        rag_queries: Pre-injected RAG search queries. → meta.rag.search_queries
-        previous_page: Where the user navigated from. → meta.last_page
-        suggest_code: Enable code-suggestion mode (returns suggested_code in response).
-        api_key: Optional Anthropic API key override.
-        stream: Enable streaming.
-        download_adaptor_docs: Whether to load adaptor docs (default True in service).
-    """
-    payload: dict[str, Any] = {
-        "content": user_message,
-        "history": history or [],
-    }
-
-    context: dict[str, Any] = {}
-    if current_job_code is not None:
-        context["expression"] = current_job_code
-    if current_adaptor is not None:
-        context["adaptor"] = current_adaptor
-    if project_adaptors is not None:
-        context["adaptors"] = project_adaptors
-    if current_page is not None:
-        context["page_name"] = current_page
-    if project_id is not None:
-        context["projectId"] = project_id
-    if job_id is not None:
-        context["jobId"] = job_id
-    if input_data is not None:
-        context["input"] = input_data
-    if output_data is not None:
-        context["output"] = output_data
-    if log_data is not None:
-        context["log"] = log_data
-    if context:
-        payload["context"] = context
-
-    meta: dict[str, Any] = {}
-    if rag_results is not None or rag_queries is not None:
-        rag: dict[str, Any] = {}
-        if rag_results is not None:
-            rag["search_results"] = rag_results
-        if rag_queries is not None:
-            rag["search_queries"] = rag_queries
-        meta["rag"] = rag
-    if previous_page is not None:
-        meta["last_page"] = previous_page
-    if meta:
-        payload["meta"] = meta
-
-    if api_key is not None:
-        payload["api_key"] = api_key
-    if suggest_code is not None:
-        payload["suggest_code"] = suggest_code
-    if stream is not None:
-        payload["stream"] = stream
-    if download_adaptor_docs is not None:
-        payload["download_adaptor_docs"] = download_adaptor_docs
-    return payload
diff --git a/services/testing/responses.py b/services/testing/responses.py
deleted file mode 100644
index 07ded4a..0000000
--- a/services/testing/responses.py
+++ /dev/null
@@ -1,90 +0,0 @@
-"""Shared response helpers for acceptance tests.
-
-Originally lived as duplicate code in `services/global_chat/tests/test_utils.py`
-and the equivalent for other services. Centralised here so all acceptance tests
-share one implementation.
-"""
-
-from typing import Any, Optional
-
-
-def get_attachment(response: dict, attachment_type: str) -> Optional[str]:
-    """Return the `content` of the first attachment matching `attachment_type`.
-
-    Works for any service that returns `attachments: [{"type": ..., "content": ...}]`
-    (today: global_chat). Returns None if no matching attachment.
-    """
-    for attachment in response.get("attachments", []) or []:
-        if attachment.get("type") == attachment_type:
-            return attachment.get("content")
-    return None
-
-
-def assert_routed_to(response: dict, expected_agent: str, *, context: str = "") -> None:
-    """Assert `expected_agent` is present in `response["meta"]["agents"]`.
-
-    `meta.agents` is a list like ["router", "planner"] or ["router", "workflow_agent"].
-    Use this to verify the orchestrator routed the request the way you expected.
-    """
-    assert response is not None, f"{context}: response is None"
-    assert isinstance(response, dict), f"{context}: response is not a dict ({type(response).__name__})"
-
-    meta = response.get("meta") or {}
-    agents = meta.get("agents") or []
-    assert expected_agent in agents, (
-        f"{context}: expected '{expected_agent}' in meta.agents, got {agents}"
-    )
-
-
-def assert_agent_calls(
-    meta: dict,
-    *,
-    expected_agents: list[str],
-    min_job_code_calls: int = 0,
-    context: str = "",
-) -> None:
-    """Assert the planner orchestrated the expected sub-agents in the right order.
-
-    Checks:
-    - Every agent in `expected_agents` appears in `meta["agents"]`.
-    - `meta["tool_calls"]` contains at least one `call_workflow_agent`.
-    - `meta["tool_calls"]` contains at least `min_job_code_calls` of `call_job_code_agent`.
-    - Every `call_job_code_agent` comes after the first `call_workflow_agent`.
-
-    Used by global_chat planner-chain tests.
-    """
-    agents = meta.get("agents") or []
-    for agent in expected_agents:
-        assert agent in agents, f"{context}: expected '{agent}' in agents, got {agents}"
-
-    tool_calls = meta.get("tool_calls") or []
-    tool_names = [call.get("tool") for call in tool_calls]
-
-    assert "call_workflow_agent" in tool_names, (
-        f"{context}: expected call_workflow_agent in tool_calls, got {tool_names}"
-    )
-
-    job_code_indices = [i for i, name in enumerate(tool_names) if name == "call_job_code_agent"]
-    assert len(job_code_indices) >= min_job_code_calls, (
-        f"{context}: expected at least {min_job_code_calls} call_job_code_agent calls, "
-        f"got {len(job_code_indices)}. Tool calls: {tool_names}"
-    )
-
-    workflow_idx = tool_names.index("call_workflow_agent")
-    for j in job_code_indices:
-        assert j > workflow_idx, (
-            f"{context}: call_job_code_agent at index {j} came before "
-            f"call_workflow_agent at index {workflow_idx}. Tool calls: {tool_names}"
-        )
-
-
-def latest_user_message(response: dict) -> Optional[dict]:
-    """Return the most recent `role=user` message from `response["history"]`.
-
-    Useful for verifying page-prefix tagging applied to the user's input.
-    """
-    history = response.get("history") or []
-    for entry in reversed(history):
-        if entry.get("role") == "user":
-            return entry
-    return None

From de27a1c83d99226f722a0830dba85f4fec4789d2 Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 17:25:13 +0900
Subject: [PATCH 07/15] add judges

---
 services/testing/README.md                    |  14 ++-
 services/testing/judge.py                     |  72 ++++++-----
 services/testing/judges.py                    |  64 ++++++++++
 .../{judge_rules.md => judges/general.md}     |   8 ++
 .../testing/judges/openfn_code_quality.md     | 113 ++++++++++++++++++
 .../testing/judges/openfn_workflow_expert.md  |  65 ++++++++++
 services/testing/spec_collector.py            |  23 ++--
 services/testing/spec_parser.py               |  10 ++
 8 files changed, 318 insertions(+), 51 deletions(-)
 create mode 100644 services/testing/judges.py
 rename services/testing/{judge_rules.md => judges/general.md} (82%)
 create mode 100644 services/testing/judges/openfn_code_quality.md
 create mode 100644 services/testing/judges/openfn_workflow_expert.md

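A minimal sketch of the judge file format this patch introduces, assuming the parsing rule shown in `_extract_section` further down (a section ends only at the next line that starts with `# `). The function name `extract_section` and the sample judge text are hypothetical stand-ins for illustration; the point is that `## ` subheadings, as used in `openfn_code_quality.md`, stay inside the `# rules` body.

```python
def extract_section(text: str, section_name: str) -> str:
    """Return the body under `# <section_name>`, up to the next `# ` header."""
    marker = f"# {section_name}".lower()
    in_section = False
    out: list[str] = []
    for line in text.splitlines():
        if line.strip().lower() == marker:
            in_section = True
            continue
        # Only a top-level header ends the section; "## " does not start with "# ".
        if in_section and line.startswith("# "):
            break
        if in_section:
            out.append(line)
    return "\n".join(out)


# Hypothetical judge file, for illustration only.
SAMPLE_JUDGE = """\
# role

You review workflow YAML.

# rules

## Triggers

- Exactly one trigger per workflow.
"""

role = extract_section(SAMPLE_JUDGE, "role").strip()
rules = extract_section(SAMPLE_JUDGE, "rules").strip()
missing = extract_section(SAMPLE_JUDGE, "examples").strip()
```

Two consequences follow from this rule, if the sketch matches the shipped behaviour: a judge file with no `# rules` header yields an empty rules string rather than an error, and a line starting with `# ` inside a fenced code block in a judge file would end the section early.
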
diff --git a/services/testing/README.md b/services/testing/README.md
index 0ebc8c4..149f529 100644
--- a/services/testing/README.md
+++ b/services/testing/README.md
@@ -11,12 +11,14 @@ This directory is on the Python path via `pyproject.toml`
 - `yaml_assertions.py` — pure-function YAML structural assertions, safe for
   every tier (unit included).
 - `judge.py` — LLM-as-judge helper for acceptance tests. Evaluates chat-service
-  responses against natural-language criteria. Loads universal rules from
-  `judge_rules.md` at evaluation time. Defaults to `CLAUDE_SONNET` from
-  `services/models.py`.
-- `judge_rules.md` — universal rules prepended to every acceptance judge
-  evaluation. Edit this file to add project-wide standards (voice, style,
-  refusal handling, etc.). One rule per bullet. Empty file = no universal rules.
+  responses against natural-language criteria under a named judge. Defaults to
+  `CLAUDE_SONNET` from `services/models.py` and the `general` judge.
+- `judges.py` — registry that loads judge configs from `judges/<name>.md`.
+- `judges/` — one markdown file per named judge. Each has a `# role` section
+  (who the judge is and what it evaluates) and a `# rules` section (universal
+  bullets that apply to every evaluation under this judge). Today: `general`,
+  `openfn_code_quality`, and `openfn_workflow_expert`. Specs select judges via
+  the `judges:` frontmatter field; default is `[general]`.
 - `spec_parser.py` — parses acceptance test markdown specs
   (`services/<svc>/tests/acceptance/*.md`) into `Spec` dataclasses.
 - `spec_collector.py` — pytest plugin (registered via `pytest_plugins` in the
diff --git a/services/testing/judge.py b/services/testing/judge.py
index d68a22c..0d46d48 100644
--- a/services/testing/judge.py
+++ b/services/testing/judge.py
@@ -3,9 +3,13 @@
 Evaluates a chat service's response against a small list of natural-language
 criteria, returning a structured verdict.
 
+Each evaluation runs under a named judge (configured in
+`services/testing/judges/<name>.md` — see `services/testing/judges.py`).
+Different judges have different roles and rules, but share the same JSON output
+contract.
+
 Three layers feed the judge prompt:
-1. Universal rules — loaded from `judge_rules.md` next to this file. Apply to
-   every evaluation. Edit the markdown file to change them; no Python touched.
+1. Judge role + universal rules — loaded from the judge's MD config.
 2. Per-test criteria — passed in via `evaluate(criteria=[...])`.
 3. Open-ended "flag anything else notable" — hardcoded at the end of the
    prompt. Means the criteria list never has to be exhaustive.
@@ -17,6 +21,7 @@
         criteria=["The response uses British English spelling.", ...],
         candidate=response_dict,
         test_notes=__doc__,
+        judge="general",
     )
     assert verdict.passed, verdict.summary
 """
@@ -24,16 +29,16 @@
 import json
 import os
 from dataclasses import dataclass, field
-from pathlib import Path
 from typing import Optional
 
 from anthropic import Anthropic
 
 from models import CLAUDE_SONNET
+from testing.judges import load_judge
 
 
 DEFAULT_MODEL = CLAUDE_SONNET
-_RULES_PATH = Path(__file__).parent / "judge_rules.md"
+DEFAULT_JUDGE = "general"
 
 
 @dataclass
@@ -57,23 +62,13 @@ class Verdict:
     general_flags: list[GeneralFlag]
     summary: str
     judge_usage: dict = field(default_factory=dict)
+    judge_name: str = DEFAULT_JUDGE
 
 
-def _load_universal_rules() -> str:
-    """Read the universal-rules markdown file. Empty string if absent or empty."""
-    if not _RULES_PATH.exists():
-        return ""
-    text = _RULES_PATH.read_text().strip()
-    return text
-
-
-def _build_system_prompt() -> str:
-    universal = _load_universal_rules()
-    parts = [
-        "You are a strict but fair quality reviewer for an AI assistant's responses.",
-        "You will be given (a) optional universal rules that apply to every response, "
-        "(b) a list of test-specific criteria, and (c) the AI assistant's full response "
-        "to evaluate.",
+def _build_system_prompt(judge_name: str) -> str:
+    config = load_judge(judge_name)
+    parts = [config.role.strip()]
+    parts += [
         "",
         "Return JSON with this exact shape:",
         "{",
@@ -91,11 +86,11 @@ def _build_system_prompt() -> str:
         "  - 'regression': would surprise a reviewer or hurt a user",
         "If nothing is notable, return an empty general_flags array.",
     ]
-    if universal:
+    if config.rules:
         parts += [
             "",
             "UNIVERSAL RULES (apply to every response):",
-            universal,
+            config.rules,
         ]
     return "\n".join(parts)
 
@@ -130,8 +125,8 @@ def _parse_verdict(
     raw_text: str,
     criteria: list[str],
     usage: dict,
+    judge_name: str,
 ) -> Verdict:
-    """Parse JSON judge output into a Verdict. Lenient: missing fields → defaults."""
     try:
         data = json.loads(raw_text)
     except json.JSONDecodeError as e:
@@ -140,8 +135,9 @@ def _parse_verdict(
             score=0.0,
             criteria=[],
             general_flags=[GeneralFlag(description=f"judge_error: {e}", severity="regression")],
-            summary=f"judge_error: failed to parse JSON output\n---\n{raw_text[:500]}",
+            summary=f"judge_error ({judge_name}): failed to parse JSON output\n---\n{raw_text[:500]}",
             judge_usage=usage,
+            judge_name=judge_name,
         )
 
     raw_criteria = data.get("criteria", [])
@@ -175,7 +171,7 @@ def _parse_verdict(
     passed = all_criteria_passed and not has_regression
     score = (sum(1 for c in parsed_criteria if c.passed) / len(parsed_criteria)) if parsed_criteria else 1.0
 
-    summary = _format_summary(parsed_criteria, parsed_flags, passed)
+    summary = _format_summary(judge_name, parsed_criteria, parsed_flags, passed)
 
     return Verdict(
         passed=passed,
@@ -184,11 +180,12 @@ def _parse_verdict(
         general_flags=parsed_flags,
         summary=summary,
         judge_usage=usage,
+        judge_name=judge_name,
     )
 
 
-def _format_summary(criteria: list[CriterionResult], flags: list[GeneralFlag], passed: bool) -> str:
-    lines = [f"Verdict: {'PASS' if passed else 'FAIL'}"]
+def _format_summary(judge_name: str, criteria: list[CriterionResult], flags: list[GeneralFlag], passed: bool) -> str:
+    lines = [f"Verdict ({judge_name}): {'PASS' if passed else 'FAIL'}"]
     if criteria:
         lines.append("")
         lines.append("Criteria:")
@@ -210,24 +207,25 @@ def evaluate(
     criteria: list[str],
     candidate: dict,
     test_notes: Optional[str] = None,
+    judge: str = DEFAULT_JUDGE,
     model: str = DEFAULT_MODEL,
     client: Optional[Anthropic] = None,
 ) -> Verdict:
-    """Evaluate a candidate response against criteria using an LLM judge.
+    """Evaluate a candidate response under a named judge.
 
     Args:
-        criteria: List of natural-language criteria. Can be empty — universal
-            rules and general_flags still apply.
-        candidate: Full response dict from the chat service. Whatever it
-            contains is shown verbatim to the judge.
+        criteria: Test-specific bullets the judge grades against.
+        candidate: Full response dict from the chat service.
         test_notes: Optional background context (typically the test's __doc__).
-            Shown to the judge but not graded against directly.
-        model: Judge model. Defaults to Sonnet.
+            Shown to the judge but not graded directly.
+        judge: Name of the judge (file at services/testing/judges/<name>.md).
+            Defaults to "general".
+        model: Model to use. Defaults to CLAUDE_SONNET from services/models.py.
         client: Optional Anthropic client. Constructed from env if not given.
 
     Returns:
-        A Verdict. Test code typically asserts on verdict.passed and uses
-        verdict.summary as the failure message.
+        A Verdict. Test code asserts on verdict.passed and uses verdict.summary
+        as the failure message.
     """
     if client is None:
         api_key = os.environ.get("ANTHROPIC_API_KEY")
@@ -238,7 +236,7 @@ def evaluate(
             )
         client = Anthropic(api_key=api_key)
 
-    system_prompt = _build_system_prompt()
+    system_prompt = _build_system_prompt(judge)
     user_prompt = _build_user_prompt(criteria, candidate, test_notes)
 
     response = client.messages.create(
@@ -256,4 +254,4 @@ def evaluate(
         "input_tokens": response.usage.input_tokens,
         "output_tokens": response.usage.output_tokens,
     }
-    return _parse_verdict(raw_text, criteria, usage)
+    return _parse_verdict(raw_text, criteria, usage, judge)
diff --git a/services/testing/judges.py b/services/testing/judges.py
new file mode 100644
index 0000000..8d416a9
--- /dev/null
+++ b/services/testing/judges.py
@@ -0,0 +1,64 @@
+"""Registry of named acceptance-test judges.
+
+Each judge is a `(role, rules)` pair defined in
+`services/testing/judges/<name>.md`. The file uses two top-level sections:
+
+    # role
+    <prose: who the judge is and what it evaluates>
+
+    # rules
+    - bullet rules that apply to every evaluation under this judge
+
+To add a new judge: drop a new markdown file in `services/testing/judges/`
+and reference its filename (without `.md`) in a spec's `judges:` frontmatter
+field. Default judge is `general`.
+"""
+
+from dataclasses import dataclass
+from pathlib import Path
+
+
+_JUDGES_DIR = Path(__file__).parent / "judges"
+
+
+@dataclass
+class JudgeConfig:
+    name: str
+    role: str
+    rules: str
+
+
+def load_judge(name: str) -> JudgeConfig:
+    """Load a judge config from `services/testing/judges/<name>.md`.
+
+    Raises FileNotFoundError if the file doesn't exist.
+    """
+    path = _JUDGES_DIR / f"{name}.md"
+    if not path.exists():
+        available = sorted(p.stem for p in _JUDGES_DIR.glob("*.md"))
+        raise FileNotFoundError(
+            f"Judge '{name}' not found at {path}. Available: {available}"
+        )
+    text = path.read_text(encoding="utf-8")
+    return JudgeConfig(
+        name=name,
+        role=_extract_section(text, "role").strip(),
+        rules=_extract_section(text, "rules").strip(),
+    )
+
+
+def _extract_section(text: str, section_name: str) -> str:
+    """Pull the body under `# <section_name>` up to the next `# ` header."""
+    marker = f"# {section_name}".lower()
+    lines = text.splitlines()
+    in_section = False
+    out: list[str] = []
+    for line in lines:
+        if line.strip().lower() == marker:
+            in_section = True
+            continue
+        if in_section and line.startswith("# "):
+            break
+        if in_section:
+            out.append(line)
+    return "\n".join(out)
diff --git a/services/testing/judge_rules.md b/services/testing/judges/general.md
similarity index 82%
rename from services/testing/judge_rules.md
rename to services/testing/judges/general.md
index cf017ee..154dc43 100644
--- a/services/testing/judge_rules.md
+++ b/services/testing/judges/general.md
@@ -1,3 +1,11 @@
+# role
+
+You are a strict but fair quality reviewer for an AI assistant's responses.
+
+You will be given (a) optional universal rules that apply to every response, (b) a list of test-specific criteria, and (c) the AI assistant's full response to evaluate.
+
+# rules
+
 - Every job, trigger, and edge in a returned workflow YAML has a non-empty `id` field.
 - Every job in a returned workflow YAML has a `body` that is either real adaptor code or the canonical empty-job placeholder `// Add operations here`. Reject other placeholder-style markers such as `// PLACEHOLDER`, numbered placeholders, `TODO`, `FIXME`, or `<insert ... here>` — these are leftover generation artifacts.
 - Job names and edge source/target/key references in a returned workflow YAML use only letters, numbers, spaces, hyphens, and underscores.
diff --git a/services/testing/judges/openfn_code_quality.md b/services/testing/judges/openfn_code_quality.md
new file mode 100644
index 0000000..425366b
--- /dev/null
+++ b/services/testing/judges/openfn_code_quality.md
@@ -0,0 +1,113 @@
+# role
+
+You are an OpenFn code quality reviewer. Your job is to evaluate the quality of any OpenFn job code returned by the AI assistant — looking at idiomatic use of OpenFn adaptor operations, correct state-chaining, and adherence to platform conventions.
+
+Ignore non-code aspects (tone, explanation, structure) — focus only on the code itself. If the assistant did not return any job code, the code-quality rules below are vacuously satisfied; do not penalize for absence.
+
+# rules
+
+## How OpenFn job code differs from normal JavaScript
+
+OpenFn job code looks like JavaScript but is a DSL that is compiled before it runs. Several patterns that are normal in JS are wrong in an OpenFn job, and several patterns that look unusual are correct. Grade against the rules below, not against generic JS intuition.
+
+- **No `import` statements.** The compiler injects the adaptor for you. Top-level `import` is a sign the model is treating the file as plain JS.
+- **No top-level `async` or top-level `await`.** Top-level statements run synchronously even if they perform async work. If async chaining is needed, use `.then(...)` / `.catch(...)` on an operation instead.
+- **Only operations at the top level.** Quoting the docs: *"Your job code should only contain Operations at the top level/scope - you should NOT include any other JavaScript statements."* Variable declarations, loops, helper definitions, conditionals, etc. belong inside an `fn(state => { ... return state; })` block, not at the top level.
+- **Don't use `alterState`.** Use `fn()` instead. `alterState` is discouraged.
+
+## Operations as factory functions, and the `)(state)` anti-pattern
+
+In OpenFn, operations are factory functions. When called (e.g. `get('/patients')`), they don't execute immediately — they return a new function. The OpenFn runtime collects all top-level operations and executes them in sequence, passing state through the pipeline automatically.
+
+This means writing `get('/patients')(state)` — manually invoking an operation with state — bypasses the runtime entirely. The tell-tale sign in code is the pattern `)(state)`, which shows an operation being immediately invoked rather than registered with the pipeline.
+
+This is almost always caused by a misunderstanding of lazy state evaluation. The user wants to access `state.data` from a prior operation, doesn't realise it hasn't been assigned yet at load time, and tries to force execution themselves. The correct fix is to pass a function instead of a value, so state is resolved at runtime:
+
+```js
+// ❌ Bad
+post('/endpoint', state.data)(state)
+
+// ✅ Good
+post('/endpoint', state => state.data)
+// or
+post('/endpoint', $.data)
+```
+
+Flag any use of `)(state)` as an anti-pattern and explain the lazy state model when doing so. The docs are explicit: *"you should never need to nest an operation."*
+
+## Lazy state evaluation
+
+Operation arguments are evaluated at **load time**, before any operation has run. So a bare `state.x` in an operation argument resolves to `undefined`, because the prior operation hasn't yet written to `state`. To defer resolution to run time, pass a function (or use the `$` shorthand).
+
+- ❌ `post('/x', state.data)` — `state.data` is undefined at load time.
+- ✅ `post('/x', state => state.data)`
+- ✅ `post('/x', $.data)`
+
+Flag every occurrence of a bare `state.<something>` used as an operation argument (i.e. not inside a function body) as a lazy-state bug.
+
+## The `$` lazy-state operator
+
+`$` is syntactic sugar for `state => state...`. It is **only** valid inside an argument to an operation. The docs are explicit: *"The `$` operator is not an alias for `state`. It cannot be used in place of the `state` variable... can only be used inside an argument to a function."*
+
+Valid:
+- `get($.data.url)`
+- `create({ name: $.patient.name, country: $.patient.country })`
+- `` get(`/patients/${$.patient.id}`) ``
+- `` each($.items, post(`patient/${$.data.id}`, $.data)) ``
+
+Invalid (flag these):
+- `const url = $.data.url;` — `$` outside an operation argument.
+- `$.data.x = something;` — `$` on the left side of assignment.
+- Using `$` inside an `fn(state => { ... })` callback body where `state` is already in scope — should just be `state.data`, etc.
+
+## Callbacks must return state
+
+Every callback passed to an operation (whether `fn`, `each`, `.then`, `.catch`, or a state-function argument) must return state. A missing return drops state for downstream operations.
+
+- ❌ `fn(state => { state.x = 1; })` — no return; downstream operations see no state.
+- ✅ `fn(state => { state.x = 1; return state; })`
+
+Flag any callback whose final statement isn't `return state` (or `return { ...state, ... }`, or a thenable that ultimately resolves to state).
+
+## State chaining and destructive mutation
+
+Each operation receives state and returns state. When constructing a new state object, preserve the rest of state via spread so `configuration`, `references`, and other adaptor-set keys aren't dropped mid-pipeline.
+
+- ❌ `return { data: state.data };` mid-pipeline — drops `configuration` and everything else.
+- ✅ `return { ...state, data: newData };`
+
+Returning a trimmed object is fine **only as the final cleanup step**, where dropping bulky/sensitive scratch data is intentional (the docs show `return { data: state.data }` as a final-state cleaner). Use judgment: if it's the last operation in the job and the trimming looks deliberate, it's fine; mid-pipeline trimming is a bug.
+
+## Credentials and configuration
+
+- Credentials live on `state.configuration`, populated by the OpenFn runtime. Read them with `$.configuration.<key>` (or `state.configuration.<key>` inside `fn`).
+- The model must never inline a literal API key, password, bearer token, OAuth token, or other credential into job code. If a credential-shaped literal appears in code (e.g. `apiKey: 'sk-...'`, `password: 'hunter2'`, `Authorization: 'Bearer eyJ...'`), flag it.
+- The runtime scrubs `configuration` and functions from final state and logs — treat that as a safety net, not a license to hardcode secrets.
+
+## Adaptor usage
+
+- Use the adaptor operations available in the named adaptor — `each`, `fn`, `fields`, `field`, `dataValue`, `lastReferenceValue`, `combine`, `cursor`, plus the adaptor-specific ones (`get`/`post`/`upsert`/`create`/`bulk`/etc.) — in preference to raw JS loops or hand-rolled HTTP calls, when an equivalent operation exists.
+- Do not invent adaptor functions. If the assistant calls a function that isn't part of the declared adaptor (and isn't a documented `language-common` helper), flag it as a hallucinated function.
+- `each('$.path[*]', op)` uses JSONPath strings with the leading `$.`. Flag malformed JSONPath (e.g. missing `$.`, mismatched brackets) when the assistant clearly intends a JSONPath.
+- `cursor(...)` requires `@openfn/language-common` ≥ 1.13.0 — don't flag version mismatches unless the version is visible and clearly lower; the assistant rarely controls this.
+
+## Top-level structure
+
+- Only operation calls (and operation chains with `.then` / `.catch`) at the top level.
+- Free-floating `const`/`let`/`var`/`function`/control-flow statements at the top level are anti-patterns. Move them inside an `fn()`.
+- Helper functions intended for reuse across operations should be defined inside `fn(state => { state.helperName = ...; return state; })` so subsequent operations can read them off state, or inlined where they're used.
+
+## Final-state hygiene
+
+- Final state must be JSON-serializable. Flag obviously non-serializable values being assigned to state at the end (open DB clients, raw streams, functions intended to survive to final state).
+- It's fine — and often correct — for the last `fn()` to prune `state` down to just the keys the next step needs.
+
+## Scaffolding / placeholder leftovers
+
+- Job code must not contain leftover generation scaffolding: `// PLACEHOLDER`, numbered placeholders (`// step 1`, `// PLACEHOLDER_1`), `TODO`, `FIXME`, `<insert ... here>`, or empty `() => {}` callbacks meant to be filled in. The one acceptable placeholder is the canonical empty-job marker `// Add operations here`, and only when the assistant has deliberately declined to generate code.
+
+## Grading guidance
+
+- A clarifying question instead of code (when the request is genuinely ambiguous) is not a code-quality failure — there's no code to grade. Defer to the criterion-level expectations the test sets.
+- When job code is present, grade against the rules above strictly. Quote the offending snippet when you flag something so the verdict is checkable.
+- If the test criteria conflict with these baseline rules, the test criteria win.
diff --git a/services/testing/judges/openfn_workflow_expert.md b/services/testing/judges/openfn_workflow_expert.md
new file mode 100644
index 0000000..173e14d
--- /dev/null
+++ b/services/testing/judges/openfn_workflow_expert.md
@@ -0,0 +1,65 @@
+# role
+
+You are an OpenFn workflow expert. Your job is to evaluate any workflow YAML or workflow-design proposal returned by the AI assistant — looking at whether the structure is valid, the trigger/step/edge choices make sense, and the design reflects how OpenFn workflows actually work.
+
+Focus on workflow structure and shape, not on job code inside step bodies. The code-quality judge handles `body` contents. If the assistant returned no YAML and gave only a textual answer, the YAML-shape rules below are vacuously satisfied — grade the design-level rules against any structure the assistant described in prose.
+
+# rules
+
+## What an OpenFn workflow is
+
+A workflow is a trigger plus steps (jobs) connected by paths (edges). Each step is one task — typically one interaction with one backend system — described by a Name, an Adaptor, an Adaptor Version, optional Credentials, and a Job (the code in `body`). The trigger fires the workflow; edges control which step runs next based on conditions. Workflows do not loop — the docs are explicit that *"Looping workflows are not supported."*
+
+## YAML structural rules
+
+These mirror the workflow-generation contract. Reject the YAML if any are violated:
+
+- Output parses as valid YAML.
+- Every job, trigger, and edge has a non-empty `id` field. (Exception: in workflow-creation contexts where ids are auto-assigned downstream, missing ids for newly added items are tolerated — but the existing-item ids in an edit must be preserved.)
+- Every job has a `body` that is either real adaptor code or the canonical empty-job placeholder `// Add operations here`. Reject other placeholder markers such as `// PLACEHOLDER`, numbered placeholders, `TODO`, `FIXME`, or `<insert ... here>` — these are leftover generation artifacts.
+- Job names and edge `source_*` / `target_*` / key references contain only letters, numbers, spaces, hyphens, and underscores. Job names must be unique within a workflow and under 100 characters.
+- When the user is editing an existing workflow, every job and edge from the existing YAML is present and unchanged in the response unless the user asked to remove or modify it. Additions are fine.
+
+## Triggers
+
+- Exactly one trigger per workflow. Choose `webhook` for event-driven workflows (HTTP POST in) and `cron` for scheduled workflows.
+- A cron trigger needs a valid `cron_expression` (5-field: minute hour day month weekday).
+- New workflows should default `enabled: false` on the trigger.
+- **Only one step can come off the trigger.** If the user describes multiple parallel things "to do first," the assistant should pick one of them as the first step and either fan out from there or sequence the others, rather than attaching two edges directly to the trigger.
+- For cron triggers, input state on each run is the final state of the previous successful run — useful for incremental sync via `cursor(...)`. Flag a design that contradicts this (e.g. assumes the cron trigger receives a fresh payload).
+
+## Steps (jobs)
+
+- One step per backend system or per clearly distinct action. If the user's description involves fetching from system A, transforming, and posting to system B, that's typically three steps: fetch with the A adaptor → transform with `@openfn/language-common` → post with the B adaptor.
+- Adaptor choice should match the system named by the user. Use `@openfn/language-common@latest` for pure transforms and `@openfn/language-http@latest` for generic HTTP integrations where no specific adaptor exists. Prefer the most specific adaptor available over `http`.
+- Do not invent adaptor packages. If an adaptor name doesn't follow the `@openfn/language-<name>` convention or doesn't correspond to a real system, flag it.
+- Don't pin random versions: use `@latest` for new workflows; preserve the version pinned on existing steps when editing.
+- Step names should be descriptive of the action ("Fetch visits from CommCare"), not generic ("Job 1"). Each name must be unique within the workflow.
+- The workflow_chat agent must NOT generate job code into `body` when creating workflow structure — `body` stays as `'// Add operations here'`. If the assistant fills `body` with real code during a workflow-shaping turn, flag it. (Code is written separately in the per-job code page.)
+
+## Edges (paths)
+
+- `condition_type` must be one of: `always`, `on_job_success`, `on_job_failure`, `js_expression`.
+- For `js_expression`, a `condition_expression` (a JavaScript expression with `state` in scope) must be supplied as a quoted string. The expression cannot use adaptor functions, the `$` operator, or control statements (`if`/`while`/`for`). Flag a `js_expression` edge that's missing `condition_expression` or that contains adaptor functions, `$`, or control flow.
+- The trigger→first-step edge typically uses `condition_type: always`.
+- Branching: multiple edges off one source step is the standard way to express parallel paths or conditional routing.
+- **Edges do not merge or wait.** When multiple edges target the same step, that target step runs *once per incoming edge*, not once after all converge. If the assistant describes "and then both feed into a single merge step that combines their results," flag this as a misconception — that's not how edges behave.
+- Edges should be `enabled: true` by default.
+
+## Workflow design judgment
+
+- The proposed shape should reflect the user's described process. A discovery-only request ("what time does the trigger run?") should not produce a new YAML.
+- For ambiguous requests, asking a clarifying question instead of generating a workflow is acceptable behavior, not a failure. The exception is when a test criterion explicitly evaluates whether the model acted versus asked — defer to the criterion in that case.
+- For requests that imply looping ("keep polling until X"), the correct response is to model the polling as a cron-triggered workflow with cursor-based state, not to invent a self-edge or back-edge.
+- Don't add steps the user didn't ask for. Conversely, don't collapse genuinely distinct integrations into one step just to keep the workflow short.
+
+## Out-of-scope concerns (do not grade)
+
+- The contents of `body` (the job code itself) — that's the code-quality judge's job. Even if you can see code in `body`, don't grade it here unless the issue is that code was written when it shouldn't have been.
+- Tone, conversational style, length of the textual explanation.
+- Whether the trailing `text` answer is well-phrased — only whether it contradicts the YAML or claims unsupported behavior.
+
+## Grading guidance
+
+- Quote the offending YAML fragment when you flag something so the verdict is checkable.
+- If the test criteria conflict with these baseline rules, the test criteria win.
diff --git a/services/testing/spec_collector.py b/services/testing/spec_collector.py
index 1c46ab1..e3a8710 100644
--- a/services/testing/spec_collector.py
+++ b/services/testing/spec_collector.py
@@ -51,14 +51,21 @@ def runtest(self):
         client = ApolloClient()
         response = client.call(self.spec.service, payload)
 
-        verdict = judge.evaluate(
-            criteria=self.spec.quality_criteria,
-            candidate=response,
-            test_notes=self.spec.notes or None,
-        )
-
-        if not verdict.passed:
-            raise AssertionError(verdict.summary)
+        # One service call, N judges evaluate the same response.
+        # Consensus: the test passes only if every judge passes.
+        verdicts = [
+            judge.evaluate(
+                criteria=self.spec.quality_criteria,
+                candidate=response,
+                test_notes=self.spec.notes or None,
+                judge=judge_name,
+            )
+            for judge_name in self.spec.judges
+        ]
+
+        if not all(v.passed for v in verdicts):
+            summary = "\n\n".join(v.summary for v in verdicts)
+            raise AssertionError(summary)
 
     def repr_failure(self, excinfo, style=None):
         return str(excinfo.value)
diff --git a/services/testing/spec_parser.py b/services/testing/spec_parser.py
index ebb4daa..4cba5df 100644
--- a/services/testing/spec_parser.py
+++ b/services/testing/spec_parser.py
@@ -68,6 +68,7 @@ class Spec:
     id: str
     service: str
     runs: int = 1
+    judges: list[str] = field(default_factory=lambda: ["general"])
     notes: str = ""
     quality_criteria: list[str] = field(default_factory=list)
     settings: dict[str, Any] = field(default_factory=dict)
@@ -81,10 +82,19 @@ def parse_spec(path: Path) -> Spec:
     frontmatter, body = _split_frontmatter(text)
     sections = _split_headers(body, level=1)
 
+    judges_field = frontmatter.get("judges")
+    if judges_field is None:
+        judges = ["general"]
+    elif isinstance(judges_field, str):
+        judges = [judges_field]
+    else:
+        judges = list(judges_field)
+
     return Spec(
         id=frontmatter.get("id") or path.stem,
         service=frontmatter["service"],
         runs=int(frontmatter.get("runs", 1)),
+        judges=judges,
         notes=sections.get("notes", "").strip(),
         quality_criteria=_parse_bullets(sections.get("quality_criteria", "")),
         settings=_parse_settings(sections.get("settings", "")),

From c62acf079766b98446f760ce4d67658e385bc947 Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 17:30:00 +0900
Subject: [PATCH 08/15] edit judges for service prompt consistency

---
 services/testing/judges/openfn_code_quality.md    | 1 -
 services/testing/judges/openfn_workflow_expert.md | 8 ++++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/services/testing/judges/openfn_code_quality.md b/services/testing/judges/openfn_code_quality.md
index 425366b..0513248 100644
--- a/services/testing/judges/openfn_code_quality.md
+++ b/services/testing/judges/openfn_code_quality.md
@@ -58,7 +58,6 @@ Valid:
 Invalid (flag these):
 - `const url = $.data.url;` — `$` outside an operation argument.
 - `$.data.x = something;` — `$` on the left side of assignment.
-- Using `$` inside an `fn(state => { ... })` callback body where `state` is already in scope — should just be `state.data`, etc.
 
 ## Callbacks must return state
 
diff --git a/services/testing/judges/openfn_workflow_expert.md b/services/testing/judges/openfn_workflow_expert.md
index 173e14d..b66581a 100644
--- a/services/testing/judges/openfn_workflow_expert.md
+++ b/services/testing/judges/openfn_workflow_expert.md
@@ -15,7 +15,7 @@ A workflow is a trigger plus steps (jobs) connected by paths (edges). Each step
 These mirror the workflow-generation contract. Reject the YAML if any are violated:
 
 - Output parses as valid YAML.
-- Every job, trigger, and edge has a non-empty `id` field. (Exception: in workflow-creation contexts where ids are auto-assigned downstream, missing ids for newly added items are tolerated — but the existing-item ids in an edit must be preserved.)
+- Every job, trigger, and edge in the returned workflow YAML has a non-empty `id` field. (The workflow_chat service auto-generates IDs for newly added items during post-processing, so the YAML you grade should already have them — flag any item that is still missing one.)
 - Every job has a `body` that is either real adaptor code or the canonical empty-job placeholder `// Add operations here`. Reject other placeholder markers such as `// PLACEHOLDER`, numbered placeholders, `TODO`, `FIXME`, or `<insert ... here>` — these are leftover generation artifacts.
 - Job names and edge `source_*` / `target_*` / key references contain only letters, numbers, spaces, hyphens, and underscores. Job names must be unique within a workflow and under 100 characters.
 - When the user is editing an existing workflow, every job and edge from the existing YAML is present and unchanged in the response unless the user asked to remove or modify it. Additions are fine.
@@ -31,9 +31,9 @@ These mirror the workflow-generation contract. Reject the YAML if any are violat
 ## Steps (jobs)
 
 - One step per backend system or per clearly distinct action. If the user's description involves fetching from system A, transforming, and posting to system B, that's typically three steps: fetch with the A adaptor → transform with `@openfn/language-common` → post with the B adaptor.
-- Adaptor choice should match the system named by the user. Use `@openfn/language-common@latest` for pure transforms and `@openfn/language-http@latest` for generic HTTP integrations where no specific adaptor exists. Prefer the most specific adaptor available over `http`.
+- Adaptor choice should match the system named by the user. Use `@openfn/language-common` for pure transforms and `@openfn/language-http` for generic HTTP integrations where no specific adaptor exists. Prefer the most specific adaptor available over `http`.
 - Do not invent adaptor packages. If an adaptor name doesn't follow the `@openfn/language-<name>` convention or doesn't correspond to a real system, flag it.
-- Don't pin random versions: use `@latest` for new workflows; preserve the version pinned on existing steps when editing.
+- Version handling: by default, adaptors should use the latest version (the workflow_chat service surfaces it as `@latest` in its YAML examples). When the user is editing existing YAML that already pins a specific version on a step, that pin must be preserved unless the user asks to change it. Flag silent downgrades or upgrades of existing pins.
 - Step names should be descriptive of the action ("Fetch visits from CommCare"), not generic ("Job 1"). Each name must be unique within the workflow.
 - The workflow_chat agent must NOT generate job code into `body` when creating workflow structure — `body` stays as `'// Add operations here'`. If the assistant fills `body` with real code during a workflow-shaping turn, flag it. (Code is written separately in the per-job code page.)
 
@@ -50,7 +50,7 @@ These mirror the workflow-generation contract. Reject the YAML if any are violat
 
 - The proposed shape should reflect the user's described process. A discovery-only request ("what time does the trigger run?") should not produce a new YAML.
 - For ambiguous requests, asking a clarifying question instead of generating a workflow is acceptable behavior, not a failure. The exception is when a test criterion explicitly evaluates whether the model acted versus asked — defer to the criterion in that case.
-- For requests that imply looping ("keep polling until X"), the correct response is to model the polling as a cron-triggered workflow with cursor-based state, not to invent a self-edge or back-edge.
+- For requests that imply looping ("keep polling until X"), a cron-triggered workflow is the right shape — looping workflows are not supported. Flag any structure that introduces a self-edge or back-edge to a previous step.
 - Don't add steps the user didn't ask for. Conversely, don't collapse genuinely distinct integrations into one step just to keep the workflow short.
 
 ## Out-of-scope concerns (do not grade)

From 22a1f97de86c3db0bf2dbb39c565235c1857a7c1 Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 17:53:50 +0900
Subject: [PATCH 09/15] specify judge

---
 .../test_commcare_to_dhis2_tracker_with_specific_functions.md    | 1 +
 .../tests/acceptance/test_commcare_to_dhis2_with_job_code.md     | 1 +
 .../acceptance/test_gsheets_transform_salesforce_with_cron.md    | 1 +
 .../test_http_to_salesforce_three_steps_with_job_code.md         | 1 +
 .../global_chat/tests/acceptance/test_vague_gmail_to_database.md | 1 +
 .../job_chat/tests/acceptance/test_adaptor_context_switching.md  | 1 +
 services/job_chat/tests/acceptance/test_basic_input.md           | 1 +
 services/job_chat/tests/acceptance/test_contextualised_input.md  | 1 +
 services/job_chat/tests/acceptance/test_duplicate_sections.md    | 1 +
 .../tests/acceptance/test_duplicate_sections_additional.md       | 1 +
 .../job_chat/tests/acceptance/test_navigation_workflow_to_job.md | 1 +
 services/workflow_chat/tests/acceptance/test_basic_input.md      | 1 +
 .../workflow_chat/tests/acceptance/test_conversational_turn.md   | 1 +
 services/workflow_chat/tests/acceptance/test_edit_job_code.md    | 1 +
 services/workflow_chat/tests/acceptance/test_error_field.md      | 1 +
 .../workflow_chat/tests/acceptance/test_input_second_turn.md     | 1 +
 services/workflow_chat/tests/acceptance/test_long_yaml.md        | 1 +
 .../tests/acceptance/test_navigation_job_to_workflow.md          | 1 +
 services/workflow_chat/tests/acceptance/test_simple_lang_bug.md  | 1 +
 .../workflow_chat/tests/acceptance/test_single_trigger_node.md   | 1 +
 20 files changed, 20 insertions(+)
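
Reviewer note: the `judges:` key added to each spec in this patch is read by `parse_spec`, which was changed earlier in this series to accept a missing key, a single string, or a list. The sketch below restates that normalization as a standalone function so the three cases are easy to check. `normalize_judges` is a hypothetical name used only for illustration; the series inlines this logic in `parse_spec`.

```python
from typing import Any


def normalize_judges(frontmatter: dict[str, Any]) -> list[str]:
    """Hypothetical helper mirroring the inline logic in parse_spec."""
    judges_field = frontmatter.get("judges")
    if judges_field is None:
        # Key absent: fall back to the default judge.
        return ["general"]
    if isinstance(judges_field, str):
        # `judges: general` is shorthand for a one-element list.
        return [judges_field]
    # `judges: [a, b]` arrives from the YAML parser as a sequence already.
    return list(judges_field)


print(normalize_judges({"judges": ["general", "openfn_workflow_expert"]}))
```

Under this reading, the specs in this patch all take the list branch, and any spec left without the key would still run with the `general` judge alone.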

diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md
index 3905cf0..f3a5bb9 100644
--- a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md
+++ b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md
@@ -1,6 +1,7 @@
 ---
 id: global-chat.commcare-to-dhis2-tracker-with-specific-functions
 service: global_chat
+judges: [general, openfn_workflow_expert, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md
index 3693db0..35270b8 100644
--- a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md
+++ b/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md
@@ -1,6 +1,7 @@
 ---
 id: global-chat.commcare-to-dhis2-with-job-code
 service: global_chat
+judges: [general, openfn_workflow_expert, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md b/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md
index 87e1816..b288815 100644
--- a/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md
+++ b/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md
@@ -1,6 +1,7 @@
 ---
 id: global-chat.gsheets-transform-salesforce-with-cron
 service: global_chat
+judges: [general, openfn_workflow_expert, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md b/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md
index e5f1263..5f60930 100644
--- a/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md
+++ b/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md
@@ -1,6 +1,7 @@
 ---
 id: global-chat.http-to-salesforce-three-steps-with-job-code
 service: global_chat
+judges: [general, openfn_workflow_expert, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/global_chat/tests/acceptance/test_vague_gmail_to_database.md b/services/global_chat/tests/acceptance/test_vague_gmail_to_database.md
index 698e6f8..d376ccc 100644
--- a/services/global_chat/tests/acceptance/test_vague_gmail_to_database.md
+++ b/services/global_chat/tests/acceptance/test_vague_gmail_to_database.md
@@ -1,6 +1,7 @@
 ---
 id: global-chat.vague-gmail-to-database
 service: global_chat
+judges: [general, openfn_workflow_expert, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/job_chat/tests/acceptance/test_adaptor_context_switching.md b/services/job_chat/tests/acceptance/test_adaptor_context_switching.md
index 009b801..06c6151 100644
--- a/services/job_chat/tests/acceptance/test_adaptor_context_switching.md
+++ b/services/job_chat/tests/acceptance/test_adaptor_context_switching.md
@@ -1,6 +1,7 @@
 ---
 id: job-chat.adaptor-context-switching
 service: job_chat
+judges: [general, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/job_chat/tests/acceptance/test_basic_input.md b/services/job_chat/tests/acceptance/test_basic_input.md
index c2064c0..4f98c2e 100644
--- a/services/job_chat/tests/acceptance/test_basic_input.md
+++ b/services/job_chat/tests/acceptance/test_basic_input.md
@@ -1,6 +1,7 @@
 ---
 id: job-chat.basic-input
 service: job_chat
+judges: [general, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/job_chat/tests/acceptance/test_contextualised_input.md b/services/job_chat/tests/acceptance/test_contextualised_input.md
index 248401d..920a36b 100644
--- a/services/job_chat/tests/acceptance/test_contextualised_input.md
+++ b/services/job_chat/tests/acceptance/test_contextualised_input.md
@@ -1,6 +1,7 @@
 ---
 id: job-chat.contextualised-input
 service: job_chat
+judges: [general, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/job_chat/tests/acceptance/test_duplicate_sections.md b/services/job_chat/tests/acceptance/test_duplicate_sections.md
index f6900e8..686c0be 100644
--- a/services/job_chat/tests/acceptance/test_duplicate_sections.md
+++ b/services/job_chat/tests/acceptance/test_duplicate_sections.md
@@ -1,6 +1,7 @@
 ---
 id: job-chat.duplicate-sections
 service: job_chat
+judges: [general, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/job_chat/tests/acceptance/test_duplicate_sections_additional.md b/services/job_chat/tests/acceptance/test_duplicate_sections_additional.md
index 0565f5c..3a3eb46 100644
--- a/services/job_chat/tests/acceptance/test_duplicate_sections_additional.md
+++ b/services/job_chat/tests/acceptance/test_duplicate_sections_additional.md
@@ -1,6 +1,7 @@
 ---
 id: job-chat.duplicate-sections-additional
 service: job_chat
+judges: [general, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.md b/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.md
index 9c2c821..86ec2b0 100644
--- a/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.md
+++ b/services/job_chat/tests/acceptance/test_navigation_workflow_to_job.md
@@ -1,6 +1,7 @@
 ---
 id: job-chat.navigation-workflow-to-job
 service: job_chat
+judges: [general, openfn_code_quality]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_basic_input.md b/services/workflow_chat/tests/acceptance/test_basic_input.md
index 99aff9e..74e891d 100644
--- a/services/workflow_chat/tests/acceptance/test_basic_input.md
+++ b/services/workflow_chat/tests/acceptance/test_basic_input.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.basic-input
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_conversational_turn.md b/services/workflow_chat/tests/acceptance/test_conversational_turn.md
index ec666de..ef6866e 100644
--- a/services/workflow_chat/tests/acceptance/test_conversational_turn.md
+++ b/services/workflow_chat/tests/acceptance/test_conversational_turn.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.conversational-turn
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_edit_job_code.md b/services/workflow_chat/tests/acceptance/test_edit_job_code.md
index 08cb0a6..59bdd28 100644
--- a/services/workflow_chat/tests/acceptance/test_edit_job_code.md
+++ b/services/workflow_chat/tests/acceptance/test_edit_job_code.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.edit-job-code
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_error_field.md b/services/workflow_chat/tests/acceptance/test_error_field.md
index 9d855af..3ffb194 100644
--- a/services/workflow_chat/tests/acceptance/test_error_field.md
+++ b/services/workflow_chat/tests/acceptance/test_error_field.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.error-field
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_input_second_turn.md b/services/workflow_chat/tests/acceptance/test_input_second_turn.md
index a0f84ac..40ca4bf 100644
--- a/services/workflow_chat/tests/acceptance/test_input_second_turn.md
+++ b/services/workflow_chat/tests/acceptance/test_input_second_turn.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.input-second-turn
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_long_yaml.md b/services/workflow_chat/tests/acceptance/test_long_yaml.md
index 7837d98..fde1a16 100644
--- a/services/workflow_chat/tests/acceptance/test_long_yaml.md
+++ b/services/workflow_chat/tests/acceptance/test_long_yaml.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.long-yaml
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.md b/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.md
index 970f7a2..2e277f9 100644
--- a/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.md
+++ b/services/workflow_chat/tests/acceptance/test_navigation_job_to_workflow.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.navigation-job-to-workflow
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_simple_lang_bug.md b/services/workflow_chat/tests/acceptance/test_simple_lang_bug.md
index 2023591..1e76bfc 100644
--- a/services/workflow_chat/tests/acceptance/test_simple_lang_bug.md
+++ b/services/workflow_chat/tests/acceptance/test_simple_lang_bug.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.simple-lang-bug
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes
diff --git a/services/workflow_chat/tests/acceptance/test_single_trigger_node.md b/services/workflow_chat/tests/acceptance/test_single_trigger_node.md
index 3f9dbec..eac31f6 100644
--- a/services/workflow_chat/tests/acceptance/test_single_trigger_node.md
+++ b/services/workflow_chat/tests/acceptance/test_single_trigger_node.md
@@ -1,6 +1,7 @@
 ---
 id: workflow-chat.single-trigger-node
 service: workflow_chat
+judges: [general, openfn_workflow_expert]
 ---
 
 # notes

From 7a5c272e58f632a4957a0952731812acad5ede71 Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 19:43:57 +0900
Subject: [PATCH 10/15] fix prefix

---
 conftest.py                        |  9 +++++
 services/testing/judge.py          | 60 ++++++++++++++++++++++++++----
 services/testing/spec_collector.py | 35 +++++++++++------
 3 files changed, 85 insertions(+), 19 deletions(-)
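
Reviewer note: `_extract_json_object` in this patch tries a fenced block first, then scans braces by hand from the first `{`, and returns None if that first balanced span does not parse. The sketch below is not the patch's implementation. It is an alternative that leans on the standard library's `json.JSONDecoder.raw_decode` and, unlike the patch, keeps trying later `{` positions. `extract_json_object` and `_TICKS` are names assumed here for illustration.

```python
import json
import re
from typing import Optional

# Built indirectly so this example nests cleanly inside a fenced block.
_TICKS = "`" * 3
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(\{.*?\})\s*" + _TICKS, re.DOTALL)


def extract_json_object(text: str) -> Optional[dict]:
    """Sketch: return the first parseable JSON object in text, else None."""
    fence = _FENCE.search(text)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass  # Fall through to scanning the raw text.

    decoder = json.JSONDecoder()
    start = text.find("{")
    while start >= 0:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not a JSON object here; try the next opening brace.
            start = text.find("{", start + 1)
    return None


print(extract_json_object('Verdict follows: {"criteria": [], "general_flags": []}'))
```

Whether retrying later braces is desirable is a judgment call: it recovers from prose such as `{not json}` ahead of the real object, at the risk of picking up a nested fragment when the outer object is malformed.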

diff --git a/conftest.py b/conftest.py
index ba0cf17..9b7d001 100644
--- a/conftest.py
+++ b/conftest.py
@@ -6,9 +6,18 @@
   construction so accidental I/O fails loud instead of timing out.
 """
 
+from pathlib import Path
 from unittest.mock import patch
 
 import pytest
+from dotenv import load_dotenv
+
+
+# Load services/.env into the pytest process so the judge (running in-process)
+# can read ANTHROPIC_API_KEY etc. Services load this same .env themselves via
+# entry.py — we point at the same file rather than maintaining a separate copy
+# at the repo root. `override=False` means real env vars win.
+load_dotenv(Path(__file__).parent / "services" / ".env", override=False)
 
 
 # The spec collector picks up acceptance test markdown specs from
diff --git a/services/testing/judge.py b/services/testing/judge.py
index 0d46d48..4ae16f4 100644
--- a/services/testing/judge.py
+++ b/services/testing/judge.py
@@ -28,6 +28,7 @@
 
 import json
 import os
+import re
 from dataclasses import dataclass, field
 from typing import Optional
 
@@ -70,7 +71,8 @@ def _build_system_prompt(judge_name: str) -> str:
     parts = [config.role.strip()]
     parts += [
         "",
-        "Return JSON with this exact shape:",
+        "Respond with ONLY a JSON object, no prose, no markdown fences. The object "
+        "must have this exact shape:",
         "{",
         '  "criteria": [{"criterion": str, "passed": bool, "reasoning": str}, ...],',
         '  "general_flags": [{"description": str, "severity": "note" | "regression"}, ...]',
@@ -121,21 +123,64 @@ def _build_user_prompt(
     return "\n".join(parts)
 
 
+def _extract_json_object(text: str) -> Optional[dict]:
+    """Find and parse the first top-level JSON object in text.
+
+    Tolerates markdown fences and leading/trailing prose. Returns None if no
+    parseable object is found.
+    """
+    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
+    if fence:
+        try:
+            return json.loads(fence.group(1))
+        except json.JSONDecodeError:
+            pass
+
+    start = text.find("{")
+    if start < 0:
+        return None
+    depth = 0
+    in_string = False
+    escape = False
+    for i in range(start, len(text)):
+        ch = text[i]
+        if escape:
+            escape = False
+            continue
+        if ch == "\\" and in_string:
+            escape = True
+            continue
+        if ch == '"':
+            in_string = not in_string
+            continue
+        if in_string:
+            continue
+        if ch == "{":
+            depth += 1
+        elif ch == "}":
+            depth -= 1
+            if depth == 0:
+                try:
+                    return json.loads(text[start:i + 1])
+                except json.JSONDecodeError:
+                    return None
+    return None
+
+
 def _parse_verdict(
     raw_text: str,
     criteria: list[str],
     usage: dict,
     judge_name: str,
 ) -> Verdict:
-    try:
-        data = json.loads(raw_text)
-    except json.JSONDecodeError as e:
+    data = _extract_json_object(raw_text)
+    if data is None:
         return Verdict(
             passed=False,
             score=0.0,
             criteria=[],
-            general_flags=[GeneralFlag(description=f"judge_error: {e}", severity="regression")],
-            summary=f"judge_error ({judge_name}): failed to parse JSON output\n---\n{raw_text[:500]}",
+            general_flags=[GeneralFlag(description="judge_error: no JSON object in output", severity="regression")],
+            summary=f"judge_error ({judge_name}): no JSON object found in output\n---\n{raw_text[:500]}",
             judge_usage=usage,
             judge_name=judge_name,
         )
@@ -245,11 +290,10 @@ def evaluate(
         system=system_prompt,
         messages=[
             {"role": "user", "content": user_prompt},
-            {"role": "assistant", "content": "{"},  # prefill to force JSON
         ],
     )
 
-    raw_text = "{" + response.content[0].text
+    raw_text = response.content[0].text
     usage = {
         "input_tokens": response.usage.input_tokens,
         "output_tokens": response.usage.output_tokens,
diff --git a/services/testing/spec_collector.py b/services/testing/spec_collector.py
index e3a8710..ae399ad 100644
--- a/services/testing/spec_collector.py
+++ b/services/testing/spec_collector.py
@@ -47,24 +47,37 @@ def __init__(self, *, name, parent, spec: Spec, run_index: int):
         self.run_index = run_index
 
     def runtest(self):
-        payload = _build_payload(self.spec)
+        spec = self.spec
+        print(f"\n→ {spec.id}")
+        print(f"  service: {spec.service}")
+        print(f"  judges:  {', '.join(spec.judges)}")
+
+        payload = _build_payload(spec)
         client = ApolloClient()
-        response = client.call(self.spec.service, payload)
+
+        print(f"  calling {spec.service}...", flush=True)
+        response = client.call(spec.service, payload)
+        print("  ✓ service responded")
 
         # One service call, N judges evaluate the same response.
         # Consensus: the test passes only if every judge passes.
-        verdicts = [
-            judge.evaluate(
-                criteria=self.spec.quality_criteria,
+        verdicts = []
+        for judge_name in spec.judges:
+            print(f"  judging with {judge_name}...", flush=True)
+            v = judge.evaluate(
+                criteria=spec.quality_criteria,
                 candidate=response,
-                test_notes=self.spec.notes or None,
+                test_notes=spec.notes or None,
                 judge=judge_name,
             )
-            for judge_name in self.spec.judges
-        ]
-
-        if not all(v.passed for v in verdicts):
-            summary = "\n\n".join(v.summary for v in verdicts)
+            mark = "✓" if v.passed else "✗"
+            print(f"  {mark} {judge_name}: {'PASS' if v.passed else 'FAIL'} "
+                  f"(score={v.score:.2f}, flags={len(v.general_flags)})")
+            verdicts.append(v)
+
+        failing = [v for v in verdicts if not v.passed]
+        if failing:
+            summary = "\n\n".join(v.summary for v in failing)
             raise AssertionError(summary)
 
     def repr_failure(self, excinfo, style=None):

From 254619bd0149f14ce9e4524f3559c87a87e5215b Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 19:51:44 +0900
Subject: [PATCH 11/15] add summary

---
 services/testing/spec_collector.py | 75 ++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)
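
Reviewer note: the rollup in this patch counts a test as passed only when every verdict recorded under its spec id passed, while the average score is taken over all verdicts rather than per test. A minimal sketch of that arithmetic follows. `Verdict` here is a hypothetical stand-in carrying only the fields the calculation reads, and `rollup` is an assumed name; the patch does this inline in `pytest_sessionfinish`.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical stand-in for judge.Verdict (subset of fields)."""
    judge_name: str
    passed: bool
    score: float


def rollup(session_verdicts: list[tuple[str, Verdict]]) -> dict:
    """Assumes a non-empty list; the hook in the patch returns early otherwise."""
    by_test: dict[str, list[Verdict]] = {}
    for spec_id, v in session_verdicts:
        by_test.setdefault(spec_id, []).append(v)

    # A test passes only if all of its verdicts passed.
    tests_passed = sum(1 for vs in by_test.values() if all(v.passed for v in vs))
    # The average is over verdicts, so a test with more judges weighs more.
    avg_score = sum(v.score for _, v in session_verdicts) / len(session_verdicts)
    return {
        "tests_passed": tests_passed,
        "tests_total": len(by_test),
        "avg_score": avg_score,
    }


example = [
    ("job-chat.basic-input", Verdict("general", True, 1.0)),
    ("job-chat.basic-input", Verdict("openfn_code_quality", False, 0.5)),
    ("workflow-chat.basic-input", Verdict("general", True, 1.0)),
]
print(rollup(example))
```

In this example one of two tests passes, and the average is taken over three verdicts.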

diff --git a/services/testing/spec_collector.py b/services/testing/spec_collector.py
index ae399ad..124607a 100644
--- a/services/testing/spec_collector.py
+++ b/services/testing/spec_collector.py
@@ -18,6 +18,11 @@
 from testing.spec_parser import Spec, parse_spec
 
 
+# Verdicts collected across the session for the end-of-run rollup.
+# Each entry is (spec_id, Verdict).
+_session_verdicts: list[tuple[str, judge.Verdict]] = []
+
+
 def pytest_collect_file(parent, file_path):
     if (
         file_path.suffix == ".md"
@@ -74,6 +79,7 @@ def runtest(self):
             print(f"  {mark} {judge_name}: {'PASS' if v.passed else 'FAIL'} "
                   f"(score={v.score:.2f}, flags={len(v.general_flags)})")
             verdicts.append(v)
+            _session_verdicts.append((spec.id, v))
 
         failing = [v for v in verdicts if not v.passed]
         if failing:
@@ -102,3 +108,72 @@ def _build_payload(spec: Spec) -> dict:
         payload["content"] = spec.current_turn["content"]
 
     return payload
+
+
+def pytest_sessionfinish(session, exitstatus):
+    """Print an acceptance-tier rollup after the session ends.
+
+    Only fires when at least one acceptance spec ran. Stays silent on unit /
+    integration runs.
+    """
+    if not _session_verdicts:
+        return
+
+    by_test: dict[str, list[judge.Verdict]] = {}
+    for spec_id, v in _session_verdicts:
+        by_test.setdefault(spec_id, []).append(v)
+
+    by_judge: dict[str, list[judge.Verdict]] = {}
+    for _, v in _session_verdicts:
+        by_judge.setdefault(v.judge_name, []).append(v)
+
+    test_pass_count = sum(1 for vs in by_test.values() if all(v.passed for v in vs))
+    test_total = len(by_test)
+    pct = (test_pass_count / test_total * 100) if test_total else 0
+
+    avg_score = sum(v.score for _, v in _session_verdicts) / len(_session_verdicts)
+
+    flag_counts = {"note": 0, "regression": 0}
+    for _, v in _session_verdicts:
+        for f in v.general_flags:
+            flag_counts[f.severity] = flag_counts.get(f.severity, 0) + 1
+
+    failing = [(tid, vs) for tid, vs in by_test.items() if not all(v.passed for v in vs)]
+
+    judge_col = max(len(name) for name in by_judge) + 2
+
+    print()
+    print("=== Acceptance summary ===")
+    print(f"Tests:  {test_pass_count}/{test_total} passed ({pct:.0f}%)")
+    print(f"Average score across all verdicts: {avg_score:.2f}")
+    print()
+    print("Per judge:")
+    for judge_name in sorted(by_judge):
+        verdicts = by_judge[judge_name]
+        passed = sum(1 for v in verdicts if v.passed)
+        print(f"  {judge_name:<{judge_col}}{passed}/{len(verdicts)} pass")
+    print()
+    print("Flags:")
+    for severity in ("regression", "note"):
+        print(f"  {severity:12} {flag_counts.get(severity, 0)}")
+
+    if failing:
+        print()
+        print("Failing tests:")
+        for spec_id, verdicts in failing:
+            print(f"  {spec_id}")
+            for v in verdicts:
+                if v.passed:
+                    continue
+                n_reg = sum(1 for f in v.general_flags if f.severity == "regression")
+                n_note = sum(1 for f in v.general_flags if f.severity == "note")
+                detail = []
+                if n_reg:
+                    detail.append(f"{n_reg} regression(s)")
+                if n_note:
+                    detail.append(f"{n_note} note(s)")
+                if v.score < 1.0:
+                    n_failed_criteria = sum(1 for c in v.criteria if not c.passed)
+                    detail.append(f"{n_failed_criteria} failed criteria")
+                detail_str = ", ".join(detail) if detail else "—"
+                print(f"    ✗ {v.judge_name}  ({detail_str})")

From ff0bf526b0c5dbc576010c0f099968c8621beed6 Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 19:54:45 +0900
Subject: [PATCH 12/15] parallel judge

---
 services/testing/spec_collector.py | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)
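
Note on the parallel judge run: a minimal sketch, not the collector's actual code, of why `executor.map` keeps verdicts aligned with `spec.judges`. `fake_judge` and `run_judges` are hypothetical stand-ins for `judge.evaluate()` and the inline logic in `runtest`. The empty-list guard is an assumption added for the sketch, because `ThreadPoolExecutor` rejects `max_workers=0`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_judge(name: str) -> dict:
    """Hypothetical stand-in for judge.evaluate(); the sleep simulates an LLM call."""
    delay = {"slow": 0.05, "fast": 0.0}.get(name, 0.01)
    time.sleep(delay)
    # Assumption for the demo: only the judge named "strict" fails.
    return {"judge": name, "passed": name != "strict"}


def run_judges(names: list[str]) -> list[dict]:
    """Run every judge concurrently and return verdicts in input order."""
    if not names:
        # ThreadPoolExecutor raises ValueError when max_workers is 0.
        return []
    with ThreadPoolExecutor(max_workers=len(names)) as executor:
        # Executor.map yields results in the order of `names`,
        # not in the order the calls happen to finish.
        return list(executor.map(fake_judge, names))


judges = ["slow", "fast", "strict"]
verdicts = run_judges(judges)

# Consensus rule from the collector: pass only if every judge passes.
all_passed = all(v["passed"] for v in verdicts)
```

Because results come back in input order, pairing them with `zip(spec.judges, verdicts)` stays correct even when a fast judge finishes before a slow one. An exception raised inside a judge surfaces when its result is consumed from the iterator.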

diff --git a/services/testing/spec_collector.py b/services/testing/spec_collector.py
index 124607a..3dd1553 100644
--- a/services/testing/spec_collector.py
+++ b/services/testing/spec_collector.py
@@ -11,6 +11,8 @@
   4. Fails with the judge's reasoning summary if `verdict.passed` is False.
 """
 
+from concurrent.futures import ThreadPoolExecutor
+
 import pytest
 
 from testing import judge
@@ -64,21 +66,24 @@ def runtest(self):
         response = client.call(spec.service, payload)
         print("  ✓ service responded")
 
-        # One service call, N judges evaluate the same response.
+        # One service call, N judges evaluate the same response in parallel.
         # Consensus: the test passes only if every judge passes.
-        verdicts = []
-        for judge_name in spec.judges:
-            print(f"  judging with {judge_name}...", flush=True)
-            v = judge.evaluate(
+        def _run_judge(judge_name: str) -> judge.Verdict:
+            return judge.evaluate(
                 criteria=spec.quality_criteria,
                 candidate=response,
                 test_notes=spec.notes or None,
                 judge=judge_name,
             )
+
+        print(f"  running {len(spec.judges)} judge(s) in parallel...", flush=True)
+        with ThreadPoolExecutor(max_workers=max(1, len(spec.judges))) as executor:
+            verdicts = list(executor.map(_run_judge, spec.judges))
+
+        for judge_name, v in zip(spec.judges, verdicts):
             mark = "✓" if v.passed else "✗"
             print(f"  {mark} {judge_name}: {'PASS' if v.passed else 'FAIL'} "
                   f"(score={v.score:.2f}, flags={len(v.general_flags)})")
-            verdicts.append(v)
             _session_verdicts.append((spec.id, v))
 
         failing = [v for v in verdicts if not v.passed]

From 23841d01d26ca3b8eb5cfa7ded977b41dd12ac03 Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Thu, 14 May 2026 21:32:36 +0900
Subject: [PATCH 13/15] add oneshot folder

---
 .../test_commcare_to_dhis2_tracker_with_specific_functions.md   | 0
 .../test_commcare_to_dhis2_with_job_code.md                     | 0
 .../test_gsheets_transform_salesforce_with_cron.md              | 0
 .../test_http_to_salesforce_three_steps_with_job_code.md        | 0
 .../{ => one_shot_workflows}/test_vague_gmail_to_database.md    | 0
 services/testing/spec_collector.py                              | 2 +-
 6 files changed, 1 insertion(+), 1 deletion(-)
 rename services/global_chat/tests/acceptance/{ => one_shot_workflows}/test_commcare_to_dhis2_tracker_with_specific_functions.md (100%)
 rename services/global_chat/tests/acceptance/{ => one_shot_workflows}/test_commcare_to_dhis2_with_job_code.md (100%)
 rename services/global_chat/tests/acceptance/{ => one_shot_workflows}/test_gsheets_transform_salesforce_with_cron.md (100%)
 rename services/global_chat/tests/acceptance/{ => one_shot_workflows}/test_http_to_salesforce_three_steps_with_job_code.md (100%)
 rename services/global_chat/tests/acceptance/{ => one_shot_workflows}/test_vague_gmail_to_database.md (100%)
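
Note on the collector change: a small sketch, under stated assumptions, of how the old and new collection predicates differ once specs live in subfolders. `old_match` and `new_match` are hypothetical names that mirror the two versions of the condition in `pytest_collect_file`. The example paths are illustrative, apart from the renamed spec taken from this patch.

```python
from pathlib import PurePosixPath


def old_match(path: PurePosixPath) -> bool:
    """Pre-patch rule: the spec must sit directly inside `acceptance/`."""
    return (
        path.suffix == ".md"
        and not path.name.startswith("_")
        and path.parent.name == "acceptance"
    )


def new_match(path: PurePosixPath) -> bool:
    """Post-patch rule: any path component may be `acceptance`."""
    return (
        path.suffix == ".md"
        and not path.name.startswith("_")
        and "acceptance" in path.parts
    )


nested = PurePosixPath(
    "services/global_chat/tests/acceptance/one_shot_workflows/"
    "test_vague_gmail_to_database.md"
)
# Hypothetical paths, used only to show the edge cases.
direct = PurePosixPath("services/job_chat/tests/acceptance/test_example.md")
private = PurePosixPath("services/job_chat/tests/acceptance/_template.md")
ancestor = PurePosixPath("/home/acceptance/repo/docs/notes.md")

results = {
    "nested": (old_match(nested), new_match(nested)),
    "direct": (old_match(direct), new_match(direct)),
    "private": (old_match(private), new_match(private)),
    "ancestor": (old_match(ancestor), new_match(ancestor)),
}
```

One consequence worth keeping in mind: the `parts` check also matches a markdown file under any ancestor directory that happens to be named `acceptance`, as the `ancestor` case shows. Whether that can occur depends on where the repository is checked out.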

diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md b/services/global_chat/tests/acceptance/one_shot_workflows/test_commcare_to_dhis2_tracker_with_specific_functions.md
similarity index 100%
rename from services/global_chat/tests/acceptance/test_commcare_to_dhis2_tracker_with_specific_functions.md
rename to services/global_chat/tests/acceptance/one_shot_workflows/test_commcare_to_dhis2_tracker_with_specific_functions.md
diff --git a/services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md b/services/global_chat/tests/acceptance/one_shot_workflows/test_commcare_to_dhis2_with_job_code.md
similarity index 100%
rename from services/global_chat/tests/acceptance/test_commcare_to_dhis2_with_job_code.md
rename to services/global_chat/tests/acceptance/one_shot_workflows/test_commcare_to_dhis2_with_job_code.md
diff --git a/services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md b/services/global_chat/tests/acceptance/one_shot_workflows/test_gsheets_transform_salesforce_with_cron.md
similarity index 100%
rename from services/global_chat/tests/acceptance/test_gsheets_transform_salesforce_with_cron.md
rename to services/global_chat/tests/acceptance/one_shot_workflows/test_gsheets_transform_salesforce_with_cron.md
diff --git a/services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md b/services/global_chat/tests/acceptance/one_shot_workflows/test_http_to_salesforce_three_steps_with_job_code.md
similarity index 100%
rename from services/global_chat/tests/acceptance/test_http_to_salesforce_three_steps_with_job_code.md
rename to services/global_chat/tests/acceptance/one_shot_workflows/test_http_to_salesforce_three_steps_with_job_code.md
diff --git a/services/global_chat/tests/acceptance/test_vague_gmail_to_database.md b/services/global_chat/tests/acceptance/one_shot_workflows/test_vague_gmail_to_database.md
similarity index 100%
rename from services/global_chat/tests/acceptance/test_vague_gmail_to_database.md
rename to services/global_chat/tests/acceptance/one_shot_workflows/test_vague_gmail_to_database.md
diff --git a/services/testing/spec_collector.py b/services/testing/spec_collector.py
index 3dd1553..60c0cde 100644
--- a/services/testing/spec_collector.py
+++ b/services/testing/spec_collector.py
@@ -29,7 +29,7 @@ def pytest_collect_file(parent, file_path):
     if (
         file_path.suffix == ".md"
         and not file_path.name.startswith("_")
-        and file_path.parent.name == "acceptance"
+        and "acceptance" in file_path.parts
     ):
         return SpecFile.from_parent(parent, path=file_path)
 

From a69fdf37287b360dbfcd0035f60e400a3ce4102e Mon Sep 17 00:00:00 2001
From: Joe Clark <jclark@openfn.org>
Date: Tue, 19 May 2026 11:22:10 +0100
Subject: [PATCH 14/15] fix imports

---
 services/global_chat/tests/test_routing_matrix.py | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/services/global_chat/tests/test_routing_matrix.py b/services/global_chat/tests/test_routing_matrix.py
index 995ef05..09ae300 100644
--- a/services/global_chat/tests/test_routing_matrix.py
+++ b/services/global_chat/tests/test_routing_matrix.py
@@ -15,14 +15,16 @@
 
 import pytest
 import yaml
+from testing.yaml_assertions import (
+    assert_yaml_equal_except,
+    assert_yaml_jobs_have_body,
+)
 from .test_utils import (
     call_global_chat_service,
     make_service_input,
     print_response_details,
     assert_routed_to,
     get_response_yaml,
-    assert_yaml_equal_except,
-    assert_yaml_jobs_have_body,
 )
 
 

From 853e26fd73ab76ce6a3c71972e7734e9189bb3e4 Mon Sep 17 00:00:00 2001
From: "Hanna Paasivirta (OpenFn)" <hanna@openfn.org>
Date: Tue, 19 May 2026 23:48:09 +0900
Subject: [PATCH 15/15] change folder

---
 services/testing/README.md         |  4 ++
 services/testing/judge.py          |  4 +-
 services/testing/spec_collector.py | 66 +++++++++++++++++++++++++++++-
 3 files changed, 70 insertions(+), 4 deletions(-)
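
Note on the captured YAML files: a minimal sketch of the naming scheme used by `_capture_response_yaml`, assuming the suffix and slash-replacement rules shown in the diff. `capture_filename` and `write_capture` are hypothetical helpers, the spec id `global_chat/test_vague` is made up for illustration, and the explicit UTF-8 encoding is a choice of this sketch.

```python
import tempfile
from pathlib import Path


def capture_filename(spec_id: str, run_index: int, runs: int) -> str:
    """Build the output filename; multi-run specs get a `__run-N` suffix."""
    suffix = f"__run-{run_index}" if runs > 1 else ""
    # Slashes in the spec id would otherwise create nested directories.
    safe_id = spec_id.replace("/", "_")
    return f"{safe_id}{suffix}.yaml"


def write_capture(
    yaml_str: str, spec_id: str, run_index: int, runs: int, output_dir: Path
) -> Path:
    """Write the YAML under `output_dir`, creating the folder if needed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / capture_filename(spec_id, run_index, runs)
    # Explicit encoding keeps non-ASCII content stable across platforms.
    path.write_text(yaml_str, encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as root:
    out = Path(root) / "tmp"
    single = write_capture("name: demo\n", "global_chat/test_vague", 1, 1, out)
    multi = write_capture("name: café\n", "global_chat/test_vague", 2, 3, out)
    names = sorted(p.name for p in out.iterdir())
    roundtrip = multi.read_text(encoding="utf-8")
```

A single-run spec overwrites its file on every session, while each run of a multi-run spec keeps its own file, so earlier runs are not lost when a later one differs.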

diff --git a/services/testing/README.md b/services/testing/README.md
index 149f529..cb6c908 100644
--- a/services/testing/README.md
+++ b/services/testing/README.md
@@ -24,6 +24,10 @@ This directory is on the Python path via `pyproject.toml`
 - `spec_collector.py` — pytest plugin (registered via `pytest_plugins` in the
   repo-root `conftest.py`). Turns each MD spec into a pytest item that builds
   the service payload, calls the service via `ApolloClient`, and runs the judge.
+  Any project YAML in the response (`response_yaml`, `workflow_yaml`,
+  `content_yaml`, or a `workflow_yaml` attachment) is written to a `tmp/`
+  folder next to the spec file (e.g.
+  `services/workflow_chat/tests/acceptance/tmp/<spec_id>.yaml`) for inspection.
 - `apollo_client.py` — `ApolloClient` for dispatching to a chat service.
   Currently a subprocess-based stub; the integration tier will replace its
   internals with a real HTTP client (same `.call()` signature, no test changes).
diff --git a/services/testing/judge.py b/services/testing/judge.py
index 4ae16f4..d3e3d38 100644
--- a/services/testing/judge.py
+++ b/services/testing/judge.py
@@ -241,9 +241,9 @@ def _format_summary(judge_name: str, criteria: list[CriterionResult], flags: lis
                 lines.append(f"      → {c.reasoning}")
     if flags:
         lines.append("")
-        lines.append("General flags:")
+        lines.append("Flags raised by the judge:")
         for f in flags:
-            lines.append(f"  [{f.severity}] {f.description}")
+            lines.append(f"  ✗ [{f.severity}] {f.description}")
     return "\n".join(lines)
 
 
diff --git a/services/testing/spec_collector.py b/services/testing/spec_collector.py
index 60c0cde..5e6ccf9 100644
--- a/services/testing/spec_collector.py
+++ b/services/testing/spec_collector.py
@@ -7,11 +7,14 @@
 Each item:
   1. Builds the service payload from `# settings`, `# history`, and `# turn`.
   2. Dispatches to the named service via ApolloClient.
-  3. Calls `judge.evaluate()` with the spec's quality_criteria.
-  4. Fails with the judge's reasoning summary if `verdict.passed` is False.
+  3. Captures any YAML in the response to a `tmp/` folder next to the spec
+     file (e.g. `services/workflow_chat/tests/acceptance/tmp/<spec_id>.yaml`).
+  4. Calls `judge.evaluate()` with the spec's quality_criteria.
+  5. Fails with the judge's reasoning summary if `verdict.passed` is False.
 """
 
 from concurrent.futures import ThreadPoolExecutor
+from pathlib import Path
 
 import pytest
 
@@ -66,6 +69,12 @@ def runtest(self):
         response = client.call(spec.service, payload)
         print("  ✓ service responded")
 
+        yaml_path = _capture_response_yaml(
+            response, spec.id, self.run_index, spec.runs, self.path.parent / "tmp"
+        )
+        if yaml_path is not None:
+            print(f"  ✓ project YAML saved to {yaml_path}")
+
         # One service call, N judges evaluate the same response in parallel.
         # Consensus: the test passes only if every judge passes.
         def _run_judge(judge_name: str) -> judge.Verdict:
@@ -115,6 +124,59 @@ def _build_payload(spec: Spec) -> dict:
     return payload
 
 
+def _extract_yaml_from_response(response: dict) -> str | None:
+    """Pull the full project YAML out of a service response, if any.
+
+    Different services use different keys:
+      - workflow_chat: `response_yaml`
+      - global_chat: `attachments` list with `type=workflow_yaml`
+    Falls back to `workflow_yaml` / `content_yaml` for any other service that
+    might use them.
+    """
+    if not isinstance(response, dict):
+        return None
+
+    for key in ("response_yaml", "workflow_yaml", "content_yaml"):
+        value = response.get(key)
+        if isinstance(value, str) and value.strip():
+            return value
+
+    for attachment in response.get("attachments") or []:
+        if (
+            isinstance(attachment, dict)
+            and attachment.get("type") == "workflow_yaml"
+            and isinstance(attachment.get("content"), str)
+            and attachment["content"].strip()
+        ):
+            return attachment["content"]
+
+    return None
+
+
+def _capture_response_yaml(
+    response: dict,
+    spec_id: str,
+    run_index: int,
+    runs: int,
+    output_dir: Path,
+) -> Path | None:
+    """Write the response's project YAML to `output_dir/<spec_id>.yaml`.
+
+    For multi-run specs, appends `__run-N` to the filename so each run is
+    preserved. Returns the written path, or None if no YAML was present.
+    """
+    yaml_str = _extract_yaml_from_response(response)
+    if yaml_str is None:
+        return None
+
+    output_dir.mkdir(parents=True, exist_ok=True)
+    suffix = f"__run-{run_index}" if runs > 1 else ""
+    safe_id = spec_id.replace("/", "_")
+    path = output_dir / f"{safe_id}{suffix}.yaml"
+    path.write_text(yaml_str, encoding="utf-8")
+    return path
+
+
 def pytest_sessionfinish(session, exitstatus):
     """Print an acceptance-tier rollup after the session ends.