Behavioral tests for the robotcode skill. They don't check exact output — they check that an agent with the skill loaded reaches for the right robotcode command and avoids the habits the skill warns against (grepping .robot files, loading output.xml, writing a test for an exploratory task, guessing keyword args instead of using libdoc).
Each case targets one behavior. They're ordered simplest first — from "just run it" through inventory and lookups to the debugger ("why does this fail?").
| # | Case | Checks |
|---|---|---|
| 01 | run-tests | runs via robotcode robot -i smoke, reports counts |
| 02 | results-summary | inspects a finished run with results, not raw output.xml |
| 03 | results-diff | results diff baseline vs current to find the regression |
| 04 | inventory-discover-not-grep | discover, never grep over .robot |
| 05 | libdoc-first | libdoc before generic knowledge / web |
| 06 | analyze-before-run | analyze code (static) before executing |
| 07 | repl-explore-no-file | a "watch me" task in the REPL, no test file written |
| 08 | debug-why-test-fails | debug the actual failing test with robot-debug — don't paste it into a REPL |
| 09 | debug-break-at-line | line breakpoint + inspect variables in scope |
| 10 | repl-interactive-breakpoint | break into a keyword you build at the REPL prompt |
(Cases 02 and 03 need the fixture's setup.sh to have been run first; case 07 needs a browser library. See the fixture README.)
One JSON file per case in cases/. The first three fields are the standard Skill-eval shape from Anthropic's best-practices guide; the must_* fields are this harness's machine-checkable additions.
| Field | Meaning |
|---|---|
| `skills` | Skills that should be active (`["robotcode"]`). |
| `query` | The user request to send the agent. |
| `files` | Fixtures the case assumes exist in the project (informational; you provide them). |
| `expected_behavior` | Free-text rubric, judged by you or an LLM, not by the harness. |
| `must_run` | Regexes that should match a command the agent ran (e.g. `robotcode (robot-debug\|run-debug)`). |
| `must_not_run` | Regexes that must not match any command (e.g. `cat …output\.xml`). |
| `must_not_create` | Regexes on written file paths that must not match (e.g. `\.robot$`). |
All regexes are Python re.search, case-insensitive.
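As a rough illustration of what the must_* checks amount to, here is a minimal sketch of applying them with re.search and re.IGNORECASE. It is not the code in run.py: the function name check_case, the failure messages, and the sample must_not_run pattern are assumptions made for this example.

```python
import re


def check_case(case, commands, written_paths):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []

    def matches_any(pattern, items):
        # re.search finds the pattern anywhere in the string; matching is case-insensitive.
        return any(re.search(pattern, item, re.IGNORECASE) for item in items)

    for pattern in case.get("must_run", []):
        if not matches_any(pattern, commands):
            failures.append(f"must_run not satisfied: {pattern}")
    for pattern in case.get("must_not_run", []):
        if matches_any(pattern, commands):
            failures.append(f"must_not_run violated: {pattern}")
    for pattern in case.get("must_not_create", []):
        if matches_any(pattern, written_paths):
            failures.append(f"must_not_create violated: {pattern}")
    return failures


# Hypothetical case fragment; the must_not_run pattern is invented for illustration.
case = {
    "must_run": [r"robotcode (robot-debug|run-debug)"],
    "must_not_run": [r"grep\b.*\.robot"],
    "must_not_create": [r"\.robot$"],
}
```

Each must_run pattern has to match at least one command, while a single match on a must_not_* pattern is enough to fail the case.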
Per the best-practices guide, "there is not currently a built-in way to run these evaluations." Two ways to do it here:
Open a fresh Claude Code session in a real (or fixture) Robot Framework project with the skill installed, paste a case's query, and watch what it does:
- Did the skill trigger at all?
- Do the commands it runs satisfy must_run / avoid must_not_run?
- Does its behavior match every expected_behavior bullet?
Best for a quick read on a few cases, and the most faithful to real usage.
run.py drives a headless claude -p session per case, extracts the Bash commands and files written from the stream-json transcript, and applies the must_* regex checks. The expected_behavior rubric is printed for you to tick off. It defaults to the bundled fixtures/demo-project — a self-contained, offline Robot project built for these cases (see its README).
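The extraction step described above could look roughly like the sketch below. It assumes each transcript line is one JSON event, that tool calls appear as tool_use blocks inside assistant messages, and that Bash calls carry input.command while Write and Edit calls carry input.file_path. Those field and tool names are assumptions to verify against a real transcript, and extract_actions is a hypothetical name, not a function taken from run.py.

```python
import json


def extract_actions(transcript_lines):
    """Collect Bash commands and written file paths from stream-json lines."""
    commands, written = [], []
    for line in transcript_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON event
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if not isinstance(block, dict) or block.get("type") != "tool_use":
                continue
            tool_input = block.get("input", {})
            # Assumed tool names: "Bash" for shell commands, "Write"/"Edit" for file changes.
            if block.get("name") == "Bash" and "command" in tool_input:
                commands.append(tool_input["command"])
            elif block.get("name") in ("Write", "Edit") and "file_path" in tool_input:
                written.append(tool_input["file_path"])
    return commands, written
```

The two returned lists are what the must_run / must_not_run and must_not_create regexes would then be applied to.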
cd evals/robotcode
./fixtures/demo-project/setup.sh # once — runs the suites so the results/diff cases have finished runs
./run.py --allow-all # all cases against the bundled fixture
./run.py --case 01 --allow-all # just one case
# test across the models you ship for
./run.py --case 01 --model opus --allow-all
./run.py --case 01 --model sonnet --allow-all
./run.py --case 01 --model haiku --allow-all
--project DIR points it at a different project instead. Exit code is 0 only if every case passes its regex checks (the rubric stays manual).
Prerequisites
- claude CLI on PATH, with the robotcode skill available to it (install the plugin, as described in the marketplace README, or run where it is already loaded).
- robotcode installed in the target project's environment (pip install robotcode[all]). The bundled fixture needs nothing else and runs offline.
- --allow-all adds --dangerously-skip-permissions so bash isn't gated. That is fine for the throwaway fixture; be careful pointing --project at a real project, since the harness really executes the commands the agent chooses.
- Cases 02 and 03 (results, diff) need finished runs, so run fixtures/demo-project/setup.sh first. Case 07 needs a browser library installed (SeleniumLibrary or Browser) and is the only non-offline case.
The harness only grades the objective must_* checks. For the expected_behavior rubric, capture the full transcript and hand it plus the rubric to a grader model for a pass/fail verdict — useful when behavior is fuzzier than "which command ran".
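One way to prepare that hand-off is sketched below. It only builds the grader prompt and reads a verdict back; it calls no model API. The prompt wording, the PASS/FAIL first-line convention, and both function names are assumptions for illustration, and expected_behavior is assumed to be a list of rubric bullets.

```python
def build_grader_prompt(query, expected_behavior, transcript):
    """Assemble a grading prompt from the case query, rubric bullets, and transcript."""
    rubric = "\n".join(f"{i}. {item}" for i, item in enumerate(expected_behavior, 1))
    return (
        "You are grading an agent transcript against a rubric.\n"
        f"User request:\n{query}\n\n"
        f"Rubric (every item must hold):\n{rubric}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Answer with PASS or FAIL on the first line, then one line per rubric item."
    )


def parse_verdict(reply):
    """Return True only if the first line of the grader's reply starts with PASS."""
    stripped = reply.strip()
    first_line = stripped.splitlines()[0].strip().upper() if stripped else ""
    return first_line.startswith("PASS")
```

Asking for the verdict on the first line keeps the parsing trivial, and the per-item lines leave a trail for checking the grader's reasoning by hand.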
- Regex checks are necessary, not sufficient. They confirm the agent reached for the right command; the rubric covers the rest (did it report counts first? step through the debugger interactively and resume? avoid hanging?).
- Test across models. The guide recommends Haiku, Sonnet, and Opus — a skill that works on Opus may need more guidance for Haiku.
- Evals are the source of truth for changes. When you edit the skill, re-run the affected cases and compare; add a new case whenever you find a behavior the skill should enforce but doesn't.
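For the cross-model note, a small wrapper can run one case per model and record which runs exited with code 0. This is a sketch only: it relies on the --case, --model and --allow-all flags shown earlier, and run_across_models and its runner parameter are hypothetical helpers, not part of the harness.

```python
import subprocess

MODELS = ["haiku", "sonnet", "opus"]


def run_across_models(case_id, models=MODELS, runner=subprocess.run):
    """Run one case per model; map each model to True if run.py exited with code 0."""
    results = {}
    for model in models:
        cmd = ["./run.py", "--case", case_id, "--model", model, "--allow-all"]
        completed = runner(cmd)  # runner is injectable so the loop can be tested offline
        results[model] = completed.returncode == 0
    return results
```

Calling run_across_models("01") from evals/robotcode would launch three headless sessions, so the same caution about --allow-all applies.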
Copy any file in cases/, give it the next number, write the query and expected_behavior, and add must_run / must_not_run / must_not_create for the behavioral signal. Keep one behavior per case.
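Before committing a new case, it can help to check that the fields are present and every regex compiles. The sketch below does that for a made-up case: the query, rubric bullets and patterns are hypothetical, and treating skills, query and expected_behavior as required is an assumption rather than a rule the harness is known to enforce.

```python
import json
import re

REQUIRED = ("skills", "query", "expected_behavior")  # assumed to be mandatory
REGEX_FIELDS = ("must_run", "must_not_run", "must_not_create")


def validate_case(case):
    """Return a list of problems; an empty list means the case looks well-formed."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in case]
    for field in REGEX_FIELDS:
        for pattern in case.get(field, []):
            try:
                re.compile(pattern, re.IGNORECASE)
            except re.error as exc:
                problems.append(f"{field}: bad regex {pattern!r}: {exc}")
    return problems


# Hypothetical new case; every value here is invented for illustration.
new_case = {
    "skills": ["robotcode"],
    "query": "What arguments does the Login keyword take?",
    "files": ["tests/login.robot"],
    "expected_behavior": ["looks the keyword up instead of guessing its arguments"],
    "must_run": [r"robotcode libdoc"],
    "must_not_run": [r"grep\b.*\.robot"],
    "must_not_create": [r"\.robot$"],
}

print(json.dumps(new_case, indent=2))
```

The printed JSON can be saved under cases/ with the next number once validate_case(new_case) returns an empty list.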