codejudge

A preference-ranking harness for evaluating AI-generated code.

codejudge takes several candidate solutions to the same programming task — typically generated by different LLMs — runs each one in an isolated subprocess, scores it on correctness, performance, and static quality, and emits a ranked leaderboard plus pairwise preference judgments.

Those pairwise preferences are exactly the shape of the data used to train reward models in RLHF (Reinforcement Learning from Human Feedback): given two responses to the same prompt, which is better, and why. codejudge produces them mechanically and reproducibly so a human reviewer can start from a defensible ranking instead of a blank page.

Task: two-sum
 #  candidate             pass  correct   perf   qual   score
-------------------------------------------------------------
 1  optimal_hashmap        5/5     1.00   0.06   0.88   0.748
 2  brute_force            5/5     1.00   0.00   0.76   0.713
 3  buggy_offbyone         1/5     0.20   1.00   0.88   0.502
 4  crashes_indexerror     0/5     0.00   0.00   0.80   0.120

Pairwise preferences (RLHF-style):
  optimal_hashmap  >  brute_force   (delta 0.035)  - optimal_hashmap preferred over brute_force: stronger quality (+0.12); aggregate +0.035
  optimal_hashmap  >  buggy_offbyone   (delta 0.246)  - optimal_hashmap preferred over buggy_offbyone: stronger correctness (+0.80); aggregate +0.246
  optimal_hashmap  >  crashes_indexerror   (delta 0.628)  - optimal_hashmap preferred over crashes_indexerror: stronger correctness (+1.00); aggregate +0.628
  ... and 3 more
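
The pairwise lines above can be derived from the aggregate scores alone. The following is a minimal sketch of that step, not the project's ranker: the `pairwise_preferences` helper and its tuple layout are hypothetical, the scores are copied from the sample leaderboard, and ties are not handled.

```python
from itertools import combinations

# Aggregate scores copied from the sample leaderboard above.
scores = {
    "optimal_hashmap": 0.748,
    "brute_force": 0.713,
    "buggy_offbyone": 0.502,
    "crashes_indexerror": 0.120,
}


def pairwise_preferences(scores):
    """Return (winner, loser, delta) for every pair of candidates.

    Hypothetical helper: candidates are sorted by score, so the first
    element of each pair is always the higher-scoring one.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    prefs = []
    for winner, loser in combinations(ranked, 2):
        delta = round(scores[winner] - scores[loser], 3)
        prefs.append((winner, loser, delta))
    return prefs


prefs = pairwise_preferences(scores)
for winner, loser, delta in prefs:
    print(f"{winner} > {loser} (delta {delta:.3f})")
```

Four candidates yield six pairs, which matches the three preferences shown plus the "3 more" in the sample output.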

Why this exists

When you evaluate AI-generated code for a living, three questions recur for every candidate: Does it actually work? Is it efficient? Is it clean enough to ship? codejudge answers all three automatically and turns the answers into a ranking you can audit — every number traces back to a test result or an AST metric, and every preference comes with a one-line reason.

It is intentionally small, dependency-light (one runtime dependency), and fully tested, so it is easy to read end to end and extend.

Features

Sandboxed execution: each candidate runs in its own python -I subprocess with a wall-clock timeout; infinite loops, crashes, and stray prints can't take down or corrupt the harness.
Three-dimension scoring: correctness (test pass rate), performance (relative speed on passing cases), and static quality (AST complexity, docstrings, brevity), combined with configurable weights.
RLHF-style preferences: full pairwise judgments with margins and rationales.
Plain-text, Markdown, and JSON output: drop the Markdown into a PR comment or pipe the JSON into a dataset.
Bring-your-own tasks: a task is just a task.yaml plus a folder of .py candidates. No code changes required.
Minimal dependencies: only PyYAML at runtime; pytest and ruff for development.

Install

git clone https://github.com/khaitha/codejudge.git
cd codejudge
pip install -e ".[dev]"

Requires Python 3.9+.

Quickstart

# Evaluate a bundled example
codejudge run examples/two_sum

# Limit the preference list, and export machine-readable output
codejudge run examples/two_sum --max-prefs 5 --json report.json --markdown report.md

# Re-weight the dimensions (correctness, performance, quality)
codejudge run examples/two_sum --weights 0.8,0.1,0.1

# Inspect a task without running it
codejudge show examples/two_sum

Use it as a library, too:

from codejudge import evaluate_task

report = evaluate_task("examples/two_sum")
print(report.reports[0].candidate_id)        # 'optimal_hashmap'
for pref in report.preferences:
    print(pref.winner, "≻", pref.loser, "—", pref.reason)

How it works

            ┌── load_task / load_candidates  (loader.py)
            │
 task.yaml ─┤   for each candidate:
 candidates ┤     run_candidate ── python -I _worker.py  (runner.py + _worker.py)
            │         │  isolated subprocess, JSON in/out, timeout
            │     analyze_quality ── AST metrics           (scorer.py)
            │
            └── rank ── normalize → weight → sort → pairwise preferences (ranker.py)
                          │
                          └── render text / markdown / json (report.py)

Each stage is a pure, separately tested function; evaluator.evaluate_task wires them together.
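
The runner stage in the diagram (isolated subprocess, JSON in/out, timeout) could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of runner.py or _worker.py: the `run_isolated` function, the inline `WORKER_SOURCE` script, and the request/result field names are all hypothetical.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a worker script. It reads a JSON request on
# stdin, runs the candidate with its own stdout silenced (so stray prints
# cannot corrupt the JSON reply), and writes a JSON result on stdout.
WORKER_SOURCE = r"""
import contextlib, io, json, sys
request = json.load(sys.stdin)
namespace = {}
with contextlib.redirect_stdout(io.StringIO()):
    exec(request["source"], namespace)
    result = namespace[request["entrypoint"]](*request["args"])
json.dump({"result": result}, sys.stdout)
"""


def run_isolated(source, entrypoint, args, timeout_s=2.0):
    """Run one call to a candidate in a separate `python -I` process."""
    request = json.dumps(
        {"source": source, "entrypoint": entrypoint, "args": args}
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", WORKER_SOURCE],
            input=request,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return {"status": "timeout"}
    if proc.returncode != 0:
        return {"status": "crash", "stderr": proc.stderr.strip()}
    return {"status": "ok", **json.loads(proc.stdout)}


candidate = "def add(a, b):\n    print('debug noise')\n    return a + b\n"
print(run_isolated(candidate, "add", [2, 3], timeout_s=10.0))
```

As the safety section notes, this gives process isolation and a timeout only; the child still runs with the caller's permissions.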

Defining your own task

Create a directory with a task.yaml and a candidates/ folder:

my_task/
  task.yaml
  candidates/
    gpt.py
    claude.py
    human_baseline.py

# my_task/task.yaml
id: reverse-words
prompt: |
  Given a string, reverse the order of the words.
entrypoint: reverse_words        # the function each candidate must define
time_limit_s: 2.0
weights:
  correctness: 0.6
  performance: 0.25
  quality: 0.15
cases:
  - name: basic
    args: ["the sky is blue"]
    expected: "blue is sky the"
  - name: trailing_spaces
    args: ["  hello world  "]
    expected: "world hello"

Each candidate file defines the entrypoint function:

# my_task/candidates/claude.py
def reverse_words(s):
    return " ".join(reversed(s.split()))

Then evaluate it with codejudge run my_task.
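
To make the correctness dimension concrete, here is a minimal in-process sketch of how a pass rate could be computed from the cases above. It assumes outputs are compared with plain equality and that an exception counts as a failed case; the `pass_rate` helper is hypothetical, and the real harness runs candidates in a subprocess rather than in-process.

```python
def reverse_words(s):
    return " ".join(reversed(s.split()))


# The two cases from the task.yaml above, written as plain Python data.
cases = [
    {"name": "basic", "args": ["the sky is blue"], "expected": "blue is sky the"},
    {"name": "trailing_spaces", "args": ["  hello world  "], "expected": "world hello"},
]


def pass_rate(func, cases):
    """Fraction of cases where func(*args) equals the expected value."""
    passed = 0
    for case in cases:
        try:
            if func(*case["args"]) == case["expected"]:
                passed += 1
        except Exception:
            pass  # assumption: a crash counts as a failed case
    return passed / len(cases)


print(pass_rate(reverse_words, cases))  # 1.0
```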

Scoring at a glance

Dimension     Source                                                            Range
Correctness   fraction of test cases passed                                     0–1
Performance   fastest passing candidate's avg time ÷ this candidate's avg time  0–1
Quality       AST complexity + docstring + brevity heuristic                    0–1

The aggregate score is the weighted sum of the three dimension scores, with the weights normalized to sum to 1. The full methodology, including the exact quality formula and the known limitations of each metric, is documented in docs/DESIGN.md.
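
The performance ratio and the weight-normalized sum can be sketched as follows. This is an illustration, not the code in ranker.py: the function names are hypothetical, and the default weights are the 0.6 / 0.25 / 0.15 values from the task.yaml example, which appear consistent with the sample leaderboard rows.

```python
def performance_score(avg_time_s, fastest_time_s):
    """Relative speed: fastest passing candidate's time over this one's."""
    if avg_time_s <= 0:
        return 0.0
    return min(1.0, fastest_time_s / avg_time_s)


def aggregate(correctness, performance, quality, weights=(0.6, 0.25, 0.15)):
    """Weighted sum of the three dimensions, with weights normalized to 1."""
    total = sum(weights)
    w_c, w_p, w_q = (w / total for w in weights)
    return w_c * correctness + w_p * performance + w_q * quality


# Rounded inputs from the buggy_offbyone row of the sample leaderboard.
print(round(aggregate(0.20, 1.00, 0.88), 3))  # 0.502

# Unnormalized weights give the same result after normalization.
print(round(aggregate(0.20, 1.00, 0.88, weights=(6, 2.5, 1.5)), 3))  # 0.502
```

Because the weights are normalized, only their ratios matter, so `--weights 0.8,0.1,0.1` and `--weights 8,1,1` would be equivalent under this scheme.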

Safety & limitations

The runner provides process isolation with a timeout, not a security sandbox. Candidate code runs with your user's permissions and can touch the filesystem and network. Only evaluate code you are willing to run locally, or wrap the worker in a container/seccomp profile. See docs/DESIGN.md for the full threat model and a list of metric caveats (e.g. why a candidate that only passes the cheap cases can show a high perf score).

Development

make install   # editable install with dev tools
make test      # pytest
make lint      # ruff
make run       # evaluate the two_sum example
