codejudge

A preference-ranking harness for evaluating AI-generated code.


codejudge takes several candidate solutions to the same programming task — typically generated by different LLMs — runs each one in an isolated subprocess, scores it on correctness, performance, and static quality, and emits a ranked leaderboard plus pairwise preference judgments.

Those pairwise preferences are exactly the shape of the data used to train reward models in RLHF (Reinforcement Learning from Human Feedback): given two responses to the same prompt, which is better, and why. codejudge produces them mechanically and reproducibly so a human reviewer can start from a defensible ranking instead of a blank page.

Task: two-sum
 #  candidate             pass  correct   perf   qual   score
-------------------------------------------------------------
 1  optimal_hashmap        5/5     1.00   0.06   0.88   0.748
 2  brute_force            5/5     1.00   0.00   0.76   0.713
 3  buggy_offbyone         1/5     0.20   1.00   0.88   0.502
 4  crashes_indexerror     0/5     0.00   0.00   0.80   0.120

Pairwise preferences (RLHF-style):
  optimal_hashmap  >  brute_force   (delta 0.035)  - optimal_hashmap preferred over brute_force: stronger quality (+0.12); aggregate +0.035
  optimal_hashmap  >  buggy_offbyone   (delta 0.246)  - optimal_hashmap preferred over buggy_offbyone: stronger correctness (+0.80); aggregate +0.246
  optimal_hashmap  >  crashes_indexerror   (delta 0.628)  - optimal_hashmap preferred over crashes_indexerror: stronger correctness (+1.00); aggregate +0.628
  ... and 3 more
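As a rough illustration of how pairwise judgments can be derived from a ranking like the one above, the sketch below pairs every candidate with every lower-scoring one and records the margin. The function name `pairwise_preferences` is hypothetical and is not taken from codejudge's source; the scores are copied from the sample leaderboard, and ties are not handled.

```python
from itertools import combinations

# Aggregate scores copied from the sample leaderboard above.
scores = {
    "optimal_hashmap": 0.748,
    "brute_force": 0.713,
    "buggy_offbyone": 0.502,
    "crashes_indexerror": 0.120,
}


def pairwise_preferences(scores):
    """Return (winner, loser, delta) for every unordered pair, best first.

    Hypothetical helper for illustration; codejudge's own ranker may differ.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    # combinations() preserves input order, so the first name in each
    # pair is always the higher-ranked candidate.
    return [
        (a, b, round(scores[a] - scores[b], 3))
        for a, b in combinations(ranked, 2)
    ]


prefs = pairwise_preferences(scores)
for winner, loser, delta in prefs[:3]:
    print(f"{winner}  >  {loser}   (delta {delta:.3f})")
```

Four candidates yield six pairs, which matches the three preferences shown plus the "3 more" in the sample output.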

Why this exists

When you evaluate AI-generated code for a living, three questions recur for every candidate: Does it actually work? Is it efficient? Is it clean enough to ship? codejudge answers all three automatically and turns the answers into a ranking you can audit — every number traces back to a test result or an AST metric, and every preference comes with a one-line reason.

It is intentionally small, dependency-light (one runtime dependency), and fully tested, so it is easy to read end to end and extend.

Features

  • Isolated execution — each candidate runs in its own python -I subprocess with a wall-clock timeout; infinite loops, crashes, and stray prints can't take down or corrupt the harness. This is process isolation, not a security sandbox.
  • Three-dimension scoring — correctness (test pass rate), performance (relative speed on passing cases), and static quality (AST complexity, docstrings, brevity), combined with configurable weights.
  • RLHF-style preferences — full pairwise judgments with margins and rationales.
  • Plain-text, Markdown, and JSON output — drop the Markdown into a PR comment or pipe the JSON into a dataset.
  • Bring-your-own tasks — a task is just a task.yaml plus a folder of .py candidates. No code changes required.
  • Minimal dependencies — only PyYAML at runtime; pytest + ruff for dev.

Install

git clone https://github.com/khaitha/codejudge.git
cd codejudge
pip install -e ".[dev]"

Requires Python 3.9+.

Quickstart

# Evaluate a bundled example
codejudge run examples/two_sum

# Limit the preference list, and export machine-readable output
codejudge run examples/two_sum --max-prefs 5 --json report.json --markdown report.md

# Re-weight the dimensions (correctness, performance, quality)
codejudge run examples/two_sum --weights 0.8,0.1,0.1

# Inspect a task without running it
codejudge show examples/two_sum

Use it as a library, too:

from codejudge import evaluate_task

report = evaluate_task("examples/two_sum")
print(report.reports[0].candidate_id)        # 'optimal_hashmap'
for pref in report.preferences:
    print(pref.winner, "≻", pref.loser, "—", pref.reason)
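If the goal is a preference dataset, the attributes used above (`winner`, `loser`, `reason`) could be flattened into JSON lines. This is a minimal sketch under assumptions: `to_chosen_rejected` is a hypothetical helper, the `chosen`/`rejected` field names are a common convention rather than anything codejudge prescribes, and `SimpleNamespace` objects stand in for real preference records so the snippet runs without codejudge installed.

```python
import json
from types import SimpleNamespace


def to_chosen_rejected(prompt, preferences):
    """Yield one JSON line per preference (hypothetical export layout)."""
    for pref in preferences:
        yield json.dumps({
            "prompt": prompt,
            "chosen": pref.winner,    # candidate id, not the source code
            "rejected": pref.loser,
            "reason": pref.reason,
        })


# Stand-in records; with codejudge installed, report.preferences would
# presumably be passed here instead.
demo = [
    SimpleNamespace(
        winner="optimal_hashmap",
        loser="brute_force",
        reason="stronger quality (+0.12)",
    )
]
lines = list(to_chosen_rejected("two-sum", demo))
print(lines[0])
```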

How it works

            ┌── load_task / load_candidates  (loader.py)
            │
 task.yaml ─┤   for each candidate:
 candidates ┤     run_candidate ── python -I _worker.py  (runner.py + _worker.py)
            │         │  isolated subprocess, JSON in/out, timeout
            │     analyze_quality ── AST metrics           (scorer.py)
            │
            └── rank ── normalize → weight → sort → pairwise preferences (ranker.py)
                          │
                          └── render text / markdown / json (report.py)

Each stage is a pure, separately tested function; evaluator.evaluate_task wires them together.
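The isolated-subprocess step in the diagram can be approximated as follows. This is a simplified sketch, not the project's runner.py or _worker.py: the inline `WORKER` script, the `run_isolated` function, and the request/response fields are all assumptions made for illustration. It shows the three ingredients the diagram names: `python -I`, JSON in and out, and a timeout.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for _worker.py: reads a JSON request on stdin,
# runs the candidate, and writes a JSON result on the real stdout.
WORKER = """
import json, sys
req = json.load(sys.stdin)
out, sys.stdout = sys.stdout, sys.stderr  # keep candidate prints off the JSON channel
namespace = {}
exec(req["source"], namespace)
result = namespace[req["entrypoint"]](*req["args"])
json.dump({"result": result}, out)
"""


def run_isolated(source, entrypoint, args, timeout_s=2.0):
    """Run one call in a fresh `python -I` process and classify the outcome."""
    request = json.dumps({"source": source, "entrypoint": entrypoint, "args": args})
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", WORKER],
            input=request,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout"}
    if proc.returncode != 0:
        return {"status": "crash", "stderr": proc.stderr[-200:]}
    return {"status": "ok", **json.loads(proc.stdout)}


ok = run_isolated("def add(a, b):\n    print('noise')\n    return a + b", "add", [1, 2])
crash = run_isolated("def f():\n    return [][0]", "f", [])
hang = run_isolated("def f():\n    while True:\n        pass", "f", [], timeout_s=0.5)
print(ok["status"], crash["status"], hang["status"])
```

A stray `print` in the candidate goes to stderr here, so it cannot corrupt the JSON result; a crash or an infinite loop surfaces as a status value rather than an exception in the harness.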

Defining your own task

Create a directory with a task.yaml and a candidates/ folder:

my_task/
  task.yaml
  candidates/
    gpt.py
    claude.py
    human_baseline.py

# my_task/task.yaml
id: reverse-words
prompt: |
  Given a string, reverse the order of the words.
entrypoint: reverse_words        # the function each candidate must define
time_limit_s: 2.0
weights:
  correctness: 0.6
  performance: 0.25
  quality: 0.15
cases:
  - name: basic
    args: ["the sky is blue"]
    expected: "blue is sky the"
  - name: trailing_spaces
    args: ["  hello world  "]
    expected: "world hello"

Each candidate file defines the entrypoint function:

# my_task/candidates/claude.py
def reverse_words(s):
    return " ".join(reversed(s.split()))

Then run codejudge run my_task.
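To sanity-check a candidate against its cases before running the harness, something like the sketch below can help. It assumes each case's `args` list is spread as positional arguments and that the result is compared to `expected` with `==`; that matches the shape of the task.yaml above, but the harness's exact comparison rules are not stated here. `pass_rate` is a hypothetical helper.

```python
def reverse_words(s):
    return " ".join(reversed(s.split()))


# The two cases from the task.yaml example, written as Python literals.
cases = [
    {"name": "basic", "args": ["the sky is blue"], "expected": "blue is sky the"},
    {"name": "trailing_spaces", "args": ["  hello world  "], "expected": "world hello"},
]


def pass_rate(func, cases):
    """Fraction of cases where func(*args) equals the expected value."""
    passed = sum(1 for case in cases if func(*case["args"]) == case["expected"])
    return passed / len(cases)


rate = pass_rate(reverse_words, cases)
print(rate)
```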

Scoring at a glance

Dimension     Source                                                             Range
--------------------------------------------------------------------------------------
Correctness   fraction of test cases passed                                      0–1
Performance   fastest passing candidate's avg time ÷ this candidate's avg time   0–1
Quality       AST complexity + docstring + brevity heuristic                     0–1

The aggregate is the weight-normalized sum. The full methodology, including the exact quality formula and the known limitations of each metric, is documented in docs/DESIGN.md.
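A weight-normalized sum can be sketched as below. The `aggregate` function is illustrative rather than the project's ranker code. The weights are the ones from the task.yaml example in this README; whether the bundled two_sum task uses the same weights is an assumption, and the dimension scores are the rounded values from the sample leaderboard, so the result only approximates the displayed aggregate.

```python
def aggregate(scores, weights):
    """Weighted sum of dimension scores, divided by the total weight."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total


# Rounded dimension scores for optimal_hashmap from the sample leaderboard.
row = {"correctness": 1.00, "performance": 0.06, "quality": 0.88}

# Weights from the task.yaml example (assumed, not confirmed, for two_sum).
default_weights = {"correctness": 0.6, "performance": 0.25, "quality": 0.15}

# Equivalent of --weights 0.8,0.1,0.1; normalization makes 8,1,1 identical.
cli_weights = {"correctness": 8, "performance": 1, "quality": 1}

print(round(aggregate(row, default_weights), 3))
print(round(aggregate(row, cli_weights), 3))
```

Because the sum is divided by the total weight, only the ratio between weights matters, not their absolute scale.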

Safety & limitations

The runner provides process isolation with a timeout, not a security sandbox. Candidate code runs with your user's permissions and can touch the filesystem and network. Only evaluate code you are willing to run locally, or wrap the worker in a container/seccomp profile. See docs/DESIGN.md for the full threat model and a list of metric caveats (e.g. why a candidate that only passes the cheap cases can show a high perf score).
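The performance caveat mentioned above can be reproduced with a few made-up numbers. This sketch assumes performance is the fastest average time divided by each candidate's average time, measured over passing cases only, as the scoring table describes; the `perf_scores` function and all timings are hypothetical.

```python
def perf_scores(avg_times):
    """Map candidate -> fastest avg time / own avg time.

    avg_times holds mean seconds over each candidate's passing cases,
    or None when the candidate passed nothing (scored 0.0 here).
    """
    timed = {name: t for name, t in avg_times.items() if t is not None}
    if not timed:
        return {name: 0.0 for name in avg_times}
    fastest = min(timed.values())
    return {
        name: (fastest / t if t is not None else 0.0)
        for name, t in avg_times.items()
    }


# Hypothetical timings: the second candidate passes only a trivial case,
# so its average is taken over far cheaper work than the first one's.
avg_times = {
    "passes_everything": 0.020,
    "passes_only_trivial_case": 0.001,
    "passes_nothing": None,
}
perf = perf_scores(avg_times)
print(perf)
```

The mostly failing candidate sets the baseline and earns the top performance score, which is why performance should be read alongside correctness rather than on its own.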

Development

make install   # editable install with dev tools
make test      # pytest
make lint      # ruff
make run       # evaluate the two_sum example

License

MIT © Khai Tha — see LICENSE.
