A preference-ranking harness for evaluating AI-generated code.
codejudge takes several candidate solutions to the same programming task —
typically generated by different LLMs — runs each one in an isolated subprocess,
scores it on correctness, performance, and static quality, and emits
a ranked leaderboard plus pairwise preference judgments.
Those pairwise preferences are exactly the shape of the data used to train reward
models in RLHF (Reinforcement Learning from Human Feedback): given two
responses to the same prompt, which is better, and why. codejudge produces them
mechanically and reproducibly so a human reviewer can start from a defensible
ranking instead of a blank page.
Task: two-sum

 #  candidate           pass  correct  perf  qual  score
-------------------------------------------------------------
 1  optimal_hashmap     5/5      1.00  0.06  0.88  0.748
 2  brute_force         5/5      1.00  0.00  0.76  0.713
 3  buggy_offbyone      1/5      0.20  1.00  0.88  0.502
 4  crashes_indexerror  0/5      0.00  0.00  0.80  0.120

Pairwise preferences (RLHF-style):
optimal_hashmap > brute_force (delta 0.035) - optimal_hashmap preferred over brute_force: stronger quality (+0.12); aggregate +0.035
optimal_hashmap > buggy_offbyone (delta 0.246) - optimal_hashmap preferred over buggy_offbyone: stronger correctness (+0.80); aggregate +0.246
optimal_hashmap > crashes_indexerror (delta 0.628) - optimal_hashmap preferred over crashes_indexerror: stronger correctness (+1.00); aggregate +0.628
... and 3 more
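
As a rough illustration of how the pairwise list relates to the score column, the sketch below derives every (winner, loser, delta) pair from the aggregate scores in the sample leaderboard. The function name `pairwise_preferences` and the tuple layout are assumptions made for this example, not codejudge's actual API, and ties are not handled.

```python
from itertools import combinations

# Aggregate scores copied from the two-sum sample leaderboard.
scores = {
    "optimal_hashmap": 0.748,
    "brute_force": 0.713,
    "buggy_offbyone": 0.502,
    "crashes_indexerror": 0.120,
}


def pairwise_preferences(scores):
    """Return (winner, loser, delta) for every unordered pair, best first.

    Hypothetical helper for illustration; not part of codejudge.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    prefs = []
    # combinations() keeps the ranked order, so the first name always wins.
    for winner, loser in combinations(ranked, 2):
        delta = round(scores[winner] - scores[loser], 3)
        prefs.append((winner, loser, delta))
    return prefs


prefs = pairwise_preferences(scores)
for winner, loser, delta in prefs:
    print(f"{winner} > {loser} (delta {delta:.3f})")
```

Four candidates yield six pairs, which matches the three lines shown plus "3 more".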
When you evaluate AI-generated code for a living, three questions recur for every
candidate: Does it actually work? Is it efficient? Is it clean enough to ship?
codejudge answers all three automatically and turns the answers into a ranking
you can audit — every number traces back to a test result or an AST metric, and
every preference comes with a one-line reason.
It is intentionally small, dependency-light (one runtime dependency), and fully tested, so it is easy to read end to end and extend.
- Process-isolated execution: each candidate runs in its own `python -I` subprocess with a wall-clock timeout, so infinite loops, crashes, and stray `print`s can't take down or corrupt the harness.
- Three-dimension scoring: correctness (test pass rate), performance (relative speed on passing cases), and static quality (AST complexity, docstrings, brevity), combined with configurable weights.
- RLHF-style preferences: full pairwise judgments with margins and rationales.
- Plain-text, Markdown, and JSON output: drop the Markdown into a PR comment or pipe the JSON into a dataset.
- Bring-your-own tasks: a task is just a `task.yaml` plus a folder of `.py` candidates. No code changes required.
- Single-dependency core: only `PyYAML` at runtime; `pytest` + `ruff` for dev.
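
To make the isolation idea concrete, here is a minimal sketch of running a snippet in a fresh `python -I` subprocess with a wall-clock timeout. The helper `run_isolated` and its status strings are assumptions for illustration; the real runner exchanges JSON with a worker script and may differ in every detail.

```python
import subprocess
import sys


def run_isolated(source, timeout_s=2.0):
    """Run `source` in a fresh `python -I` process; return (status, stdout).

    Hypothetical helper for illustration; not codejudge's runner.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],
            capture_output=True,  # keep the child's prints out of our stdout
            text=True,
            timeout=timeout_s,    # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "crashed"
    return status, proc.stdout


print(run_isolated("print(1 + 1)"))
print(run_isolated("[][0]"))                            # IndexError in the child
print(run_isolated("while True: pass", timeout_s=0.5))  # infinite loop
```

A crash or an infinite loop in the child surfaces as a status value in the parent instead of an exception or a hang.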
git clone https://github.com/khaitha/codejudge.git
cd codejudge
pip install -e ".[dev]"

Requires Python 3.9+.
# Evaluate a bundled example
codejudge run examples/two_sum
# Limit the preference list, and export machine-readable output
codejudge run examples/two_sum --max-prefs 5 --json report.json --markdown report.md
# Re-weight the dimensions (correctness, performance, quality)
codejudge run examples/two_sum --weights 0.8,0.1,0.1
# Inspect a task without running it
codejudge show examples/two_sum

Use it as a library, too:
from codejudge import evaluate_task
report = evaluate_task("examples/two_sum")
print(report.reports[0].candidate_id) # 'optimal_hashmap'
for pref in report.preferences:
    print(pref.winner, "≻", pref.loser, "—", pref.reason)

           ┌── load_task / load_candidates (loader.py)
           │
task.yaml ─┤ for each candidate:
candidates ┤   run_candidate ── python -I _worker.py (runner.py + _worker.py)
           │     │ isolated subprocess, JSON in/out, timeout
           │   analyze_quality ── AST metrics (scorer.py)
           │
           └── rank ── normalize → weight → sort → pairwise preferences (ranker.py)
               │
               └── render text / markdown / json (report.py)

Each stage is a pure, separately tested function; evaluator.evaluate_task wires
them together.
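
As one example of what an AST-based quality stage might measure, the sketch below counts branch points, checks for a docstring, and measures function length with the standard `ast` module. The metric names and the choice of branch nodes are assumptions for illustration; the exact quality formula lives in docs/DESIGN.md and is not reproduced here.

```python
import ast

# Node types counted as branch points in this sketch (an assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.IfExp)


def quality_metrics(source, entrypoint):
    """Return simple static metrics for the function named `entrypoint`.

    Hypothetical helper for illustration; not codejudge's analyze_quality.
    """
    tree = ast.parse(source)
    func = next(
        node for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name == entrypoint
    )
    branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(func))
    return {
        "complexity": 1 + branches,  # cyclomatic-style count
        "has_docstring": ast.get_docstring(func) is not None,
        "lines": func.end_lineno - func.lineno + 1,
    }


sample = '''
def reverse_words(s):
    """Reverse the order of the words in s."""
    return " ".join(reversed(s.split()))
'''
print(quality_metrics(sample, "reverse_words"))
```

Because the source is only parsed and never executed, this stage is safe to run in the harness process itself.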
Create a directory with a task.yaml and a candidates/ folder:
my_task/
  task.yaml
  candidates/
    gpt.py
    claude.py
    human_baseline.py
# my_task/task.yaml
id: reverse-words
prompt: |
  Given a string, reverse the order of the words.
entrypoint: reverse_words # the function each candidate must define
time_limit_s: 2.0
weights:
  correctness: 0.6
  performance: 0.25
  quality: 0.15
cases:
  - name: basic
    args: ["the sky is blue"]
    expected: "blue is sky the"
  - name: trailing_spaces
    args: [" hello world "]
    expected: "world hello"

Each candidate file defines the entrypoint function:
# my_task/candidates/claude.py
def reverse_words(s):
    return " ".join(reversed(s.split()))

Then run `codejudge run my_task`.
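
To show how the `cases` in a task file map to a correctness score, here is a small sketch that checks a candidate against the two cases from the example task. It runs the function in-process for brevity, which the harness deliberately does not do, and `pass_rate` is a hypothetical helper rather than a codejudge function. The cases are written as a plain Python list so the example needs no YAML parser.

```python
def reverse_words(s):
    return " ".join(reversed(s.split()))


# The two cases from the example task.yaml, as already-parsed data.
cases = [
    {"name": "basic", "args": ["the sky is blue"], "expected": "blue is sky the"},
    {"name": "trailing_spaces", "args": [" hello world "], "expected": "world hello"},
]


def pass_rate(func, cases):
    """Fraction of cases where func(*args) equals the expected value."""
    passed = 0
    for case in cases:
        try:
            if func(*case["args"]) == case["expected"]:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


print(pass_rate(reverse_words, cases))
```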
| Dimension | Source | Range |
|---|---|---|
| Correctness | fraction of test cases passed | 0–1 |
| Performance | fastest candidate's avg time ÷ this candidate's avg time, on passing cases | 0–1 |
| Quality | AST complexity + docstring + brevity heuristic | 0–1 |
The aggregate is the weight-normalized sum. The full methodology, including the
exact quality formula and the known limitations of each metric, is documented in
docs/DESIGN.md.
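
A weight-normalized sum can be sketched as follows. The default weights here are taken from the example `task.yaml` (0.6, 0.25, 0.15), which is an assumption about what the sample run used; with them the arithmetic reproduces the `buggy_offbyone` row of the sample leaderboard. The function name `aggregate` is hypothetical.

```python
def aggregate(correctness, performance, quality, weights=(0.6, 0.25, 0.15)):
    """Weight-normalized sum of the three dimension scores (each in 0..1)."""
    total = sum(weights)
    # Dividing by the total lets callers pass weights that don't sum to 1.
    w_c, w_p, w_q = (w / total for w in weights)
    return w_c * correctness + w_p * performance + w_q * quality


# buggy_offbyone: correct 0.20, perf 1.00, qual 0.88
print(round(aggregate(0.20, 1.00, 0.88), 3))                          # 0.502
# Same candidate with weights 0.8,0.1,0.1
print(round(aggregate(0.20, 1.00, 0.88, weights=(0.8, 0.1, 0.1)), 3))  # 0.348
```

Shifting weight toward correctness pushes a fast but mostly wrong candidate further down the ranking.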
The runner provides process isolation with a timeout, not a security
sandbox. Candidate code runs with your user's permissions and can touch the
filesystem and network. Only evaluate code you are willing to run locally, or
wrap the worker in a container/seccomp profile. See docs/DESIGN.md for the full
threat model and a list of metric caveats (e.g. why a candidate that only passes
the cheap cases can show a high perf score).
make install # editable install with dev tools
make test # pytest
make lint # ruff
make run # evaluate the two_sum example

MIT © Khai Tha. See LICENSE.