This repository hosts the reproducibility package for the Programming Exercise Consistency Verification (PECV) benchmark. It contains a versioned dataset, CLI configs, and reference LLM pipeline used to evaluate cross-artifact inconsistency detectors in the accompanying paper.
The dataset is split into two versions:
- V1 — 91 variants across three Java exercises (Lectures, Panic at Seal Saloon, Space Seal Farm), annotating 93 inconsistencies across six ontology categories. This is the version used for the published benchmark results.
- V2 — Extended dataset covering 13 exercises across multiple programming languages (Java, Python, Assembly, SQL, Swift, and others), adding more variety in exercise type and language.
The reference pipeline, implemented in `pecv-reference/` and requiring external LLM API credentials, reproduces the published results and writes traceable reports under `results/`.
Workflow overview for the packaged benchmark: materialize dataset variants, run the reference pipeline with `run-benchmark`, then aggregate metrics with `report` and `variants-analysis` (each step is documented below).
The results below are for V1 (`results/V1/pecv-reference/`).
| Benchmark | Config Key | N runs | TP | FP | FN | Precision | Recall | F1 | Span F1 | IoU | Avg Time (s) | Avg Cost ($) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pecv-reference | model=openai:o4-mini, reasoning_effort=medium | 3 | 254 | 148 | 25 | 0.632 | 0.910 | 0.746 | 0.676 | 0.565 | 32.958 | 0.0338 |
| pecv-reference | model=openrouter:google/gemini-2.5-flash, reasoning_effort=medium | 3 | 263 | 623 | 15 | 0.297 | 0.946 | 0.452 | 0.597 | 0.474 | 26.380 | 0.0244 |
| pecv-reference | model=openrouter:google/gemini-2.5-flash-lite-preview-06-17, reasoning_effort=medium | 3 | 216 | 288 | 21 | 0.429 | 0.911 | 0.583 | 0.594 | 0.485 | 16.975 | 0.0063 |
| pecv-reference | model=openrouter:x-ai/grok-3-mini, reasoning_effort=medium | 3 | 233 | 222 | 46 | 0.512 | 0.835 | 0.635 | 0.640 | 0.534 | 14.306 | 0.0061 |
- N runs: Number of benchmark executions aggregated to compute the averages above.
- TP / FP / FN: True positives, false positives, and false negatives describing how many inconsistencies were correctly found, mistakenly flagged, or missed.
- Precision: Share of flagged inconsistencies that are correct, helping gauge how often the detector avoids false alarms.
- Recall: Share of gold inconsistencies the detector recovers, reflecting its ability to avoid misses.
- F1: Harmonic mean of precision and recall, balancing the trade-off between catching issues and avoiding false alerts (worked through in the sketch below this list).
- Span F1: Harmonic mean of span-level precision and recall, rewarding predictions that capture both the correct label and exact boundaries of inconsistent spans.
- IoU: Intersection over Union between predicted and gold spans, measuring overlap quality (also known as the Jaccard index).
- Avg Time (s): Mean wall-clock seconds per benchmark run, indicating latency.
- Avg Cost ($): Mean API cost per run, useful for budgeting experiments.
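The detection metrics follow directly from the TP/FP/FN counts. As a minimal sanity check in Python, using the o4-mini row of the table above:

```python
# Recompute the headline metrics from the o4-mini row of the table:
# TP=254, FP=148, FN=25 (counts aggregated over 3 runs).
tp, fp, fn = 254, 148, 25

precision = tp / (tp + fp)                          # 254/402 ≈ 0.632
recall = tp / (tp + fn)                             # 254/279 ≈ 0.910
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.746

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Span F1 and IoU additionally require the span-level alignments recorded in the per-run reports, so they cannot be rederived from the aggregate counts alone.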
- Catch misalignments across problem statements, templates, solutions, and tests before students see them.
- Benchmark new detection approaches on annotated variants with labeled inconsistencies spanning six ontology categories.
- Reproduce and extend validated LLM baselines using the packaged CLI, configs, and reporting pipeline.
```bash
git clone https://github.com/ls1intum/pecv-bench.git
cd pecv-bench
python3 -m venv .venv
source .venv/bin/activate
```

Install the workspace directly in editable mode:

```bash
pip install --upgrade pip
pip install -e .
```

For development workflows that need linters and formatters:

```bash
pip install -e .[dev]
```

Whenever you start a new shell session, reactivate the virtual environment with `source .venv/bin/activate` before running any CLI commands.
The reference runner loads credentials from environment variables (or a `.env` file via `pydantic-settings`). Copy the template and fill in the keys for the providers you plan to use:

```bash
cp pecv-reference/.env.example pecv-reference/.env
```

Update `pecv-reference/.env` (or export the variables in your shell) with the following values (a filled-in example follows the list):
- **OpenAI** (for `openai:*` models)
  - `OPENAI_API_KEY` – required
  - `OPENAI_BASE_URL` – optional override if you proxy requests (defaults to the official API)
- **Azure OpenAI** (for `azure_openai:*` models)
  - `AZURE_OPENAI_API_KEY` – required
  - `AZURE_OPENAI_ENDPOINT` – required (format: `https://<resource-name>.openai.azure.com`)
  - `AZURE_OPENAI_API_VERSION` – optional, defaults to the latest supported version
- **OpenRouter** (for `openrouter:*` models)
  - `OPENROUTER_API_KEY` – required
  - `OPENROUTER_BASE_URL` – optional if you self-host or use a regional endpoint
- **Other providers**
  - Add keys for any other providers you plan to use (e.g., `ANTHROPIC_API_KEY`, `AI21_API_KEY`) as described in the LangChain `init_chat_model` documentation.
- **LangSmith tracing & cost aggregation**
  - `LANGSMITH_API_KEY` – required for authenticated tracing
  - `LANGSMITH_TRACING=true` – leave enabled to collect traces
  - `LANGCHAIN_TRACING_V2=true` – ensures LangChain routes telemetry to LangSmith
  - `LANGCHAIN_PROJECT=pecv-bench` (or any project name you prefer)
  - Optional: `LANGCHAIN_ENDPOINT=https://api.smith.langchain.com` if you use a self-hosted deployment
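For orientation, a minimal filled-in `pecv-reference/.env` for OpenAI plus OpenRouter with LangSmith tracing might look like this; every value below is a placeholder, and you only need entries for the providers you actually call:

```bash
# pecv-reference/.env (all values are placeholders; set only the providers you use)
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...

# LangSmith tracing & cost aggregation
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_TRACING=true
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=pecv-bench
```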
Exercise paths always include the version prefix (V1 or V2).
Run the reference pipeline with OpenAI o4-mini at medium reasoning effort. For example, on a specific exercise from V1:
```bash
pecv-bench run-benchmark \
  --exercise V1/ITP2425/H01E01-Lectures \
  --model openai:o4-mini \
  --reasoning-effort medium \
  --max-concurrency 5
```

Run on a specific exercise from V2:

```bash
pecv-bench run-benchmark \
  --exercise V2/QCSL25/QC03-Magic_State_Distillation \
  --model openai:o4-mini \
  --reasoning-effort medium \
  --max-concurrency 5
```

Results are written to `results/V1/pecv-reference/` or `results/V2/pecv-reference/` depending on the exercise version.
Aggregate completed runs into Markdown/JSON/LaTeX summaries. Pass `--results-dir` as the full path to the benchmark directory; the version is inferred from the path automatically.
```bash
# V1 (default — equivalent to omitting --results-dir)
pecv-bench report

# V2
pecv-bench report --results-dir results/V2/pecv-reference

# Custom benchmark name
pecv-bench report --results-dir results/V1/my-experiment
```

The `variants-analysis` command groups results by model and generates scatter plots (tokens vs. F1).
```bash
# V1 (default)
pecv-bench variants-analysis

# V2
pecv-bench variants-analysis --results-dir results/V2/pecv-reference

# Clear previous analysis artifacts, then re-run and plot
pecv-bench variants-analysis --results-dir results/V2/pecv-reference --clear
pecv-bench variants-analysis --results-dir results/V2/pecv-reference --plot
```

The entry point `pecv-bench` exposes all automation helpers. Use the built-in help to explore each command:
```bash
pecv-bench --help
pecv-bench variants --help
pecv-bench run-benchmark --help
pecv-bench report --help
pecv-bench variants-analysis --help
```

`pecv-bench variants` manages dataset variants (list, init, materialize, clean, annotate).
All subcommands accept `-e VERSION/COURSE/EXERCISE`. Omitting `-e` operates across all versions and exercises.
```bash
# List all variants across all versions
pecv-bench variants list

# List all V2 variants
pecv-bench variants list -e V2

# List variants for one exercise
pecv-bench variants list -e V1/ITP2425/H01E01-Lectures

# Materialize a variant (applies its patch)
pecv-bench variants materialize -e V2/QCSL25/QC03-Magic_State_Distillation -v 007

# Clean materialized artifacts
pecv-bench variants clean -e V2/QCSL25/QC03-Magic_State_Distillation -v 007

# Generate a gold annotation using the reference pipeline
pecv-bench variants generate-annotation \
  -e V2/QCSL25/QC03-Magic_State_Distillation -v 007 \
  --model openai:o4-mini --reasoning-effort medium
```

`pecv-bench run-benchmark` executes the benchmark pipeline for one or more exercises. Results are stored under `results/VERSION/APPROACH/RUN-ID/cases/`.
```bash
pecv-bench run-benchmark --exercise V1/ITP2425/H01E01-Lectures
pecv-bench run-benchmark --exercise V2/ISE22/H10E01-Containers --max-concurrency 4
```

`pecv-bench report` generates `summary.json`, `summary.md`, and `summary.tex` from completed runs.
```bash
pecv-bench report                                          # results/V1/pecv-reference (default)
pecv-bench report --results-dir results/V2/pecv-reference
pecv-bench report --results-dir results/V1/my-experiment
```

`pecv-bench variants-analysis` analyses result consistency across runs and models, optionally generating scatter plots.
```bash
pecv-bench variants-analysis                               # results/V1/pecv-reference (default)
pecv-bench variants-analysis --results-dir results/V2/pecv-reference --plot
```

Results and run metadata are laid out as follows:

```
results/
├── V1/
│   └── pecv-reference/
│       ├── <timestamped-run-id>/
│       │   ├── cases/
│       │   │   └── ITP2425/H01E01-Lectures/001.json
│       │   └── run_report.json
│       ├── variants_report.json
│       ├── variants_report_plots/
│       │   ├── per_model.png
│       │   └── per_model_per_exercise.png
│       ├── summary.json
│       ├── summary.md
│       └── summary.tex
└── V2/
    └── pecv-reference/
        └── <same structure>
```

```
runs/
├── V1/
│   └── pecv-reference/
│       └── <timestamped-run-id>.yaml
└── V2/
    └── pecv-reference/
        └── <timestamped-run-id>.yaml
```
Run metadata lives in `runs/VERSION/APPROACH/<run-id>.yaml`, enabling resumable and auditable experiments.
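As a sketch of what such a file records, here is an illustrative run metadata entry; the field names below are hypothetical, while the documented contents are the CLI arguments, timestamps, and configuration digest:

```yaml
# runs/V1/pecv-reference/<run-id>.yaml
# Hypothetical field names, shown only to illustrate what run metadata captures.
run_id: 2025-06-17T10-30-00
approach: pecv-reference
model: openai:o4-mini
reasoning_effort: medium
max_concurrency: 5
started_at: "2025-06-17T10:30:00Z"
config_digest: sha256:0f3a...
```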
- Tasks & datasets: Programming exercises from the Artemis learning management system. V1 covers three Java exercises; V2 extends coverage to additional courses and languages including Python, Assembly, SQL, and Swift.
- Inconsistency taxonomy: six ontology categories, namely `ATTRIBUTE_TYPE_MISMATCH`, `METHOD_RETURN_TYPE_MISMATCH`, `IDENTIFIER_NAMING_INCONSISTENCY`, `METHOD_PARAMETER_MISMATCH`, `VISIBILITY_MISMATCH`, and `CONSTRUCTOR_PARAMETER_MISMATCH`.
- Evaluation pipeline: `run-benchmark` orchestrates prompt construction, model execution, and output parsing; `report` aligns predictions with gold spans and aggregates metrics (precision, recall, F1, span F1, IoU, latency, and cost).
- Configurations: `configs/pecv-reference.yaml` captures model presets, reasoning effort, and run identifiers. Commit edited configs alongside experiments for traceability.
- Determinism: reasoning models are inherently non-deterministic, so outputs vary between runs; the benchmark therefore averages metrics over multiple runs (the "N runs" column above).
- Captured artifacts: each run stores raw case outputs under `results/VERSION/APPROACH/<run-id>/cases/` plus structured summaries (`run_report.json`; sketched below). Metadata in `runs/VERSION/APPROACH/<run-id>.yaml` records CLI arguments, timestamps, and configuration digests.
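A heavily simplified sketch of such a run report follows; the field names are invented for illustration, so inspect a generated `run_report.json` for the real schema:

```json
{
  "_note": "illustrative sketch, not the actual schema",
  "run_id": "2025-06-17T10-30-00",
  "model": "openai:o4-mini",
  "counts": { "tp": 254, "fp": 148, "fn": 25 },
  "metrics": { "precision": 0.632, "recall": 0.910, "f1": 0.746 }
}
```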
- The At-a-Glance table above surfaces cross-run V1 metrics for the packaged reference configs.
- Detailed aggregates: `results/V1/pecv-reference/summary.md`, machine-readable `summary.json`, and LaTeX-ready `summary.tex`.
- Per-run diagnostics: inspect `results/V1/pecv-reference/<run-id>/run_report.json` alongside per-case artifacts in `results/V1/pecv-reference/<run-id>/cases/`.
- Create a config (or reuse `configs/pecv-reference.yaml`) and run `pecv-bench run-benchmark ...` with your approach.
- Results are placed automatically under `results/VERSION/APPROACH/<run-id>/` and metadata in `runs/VERSION/APPROACH/<run-id>.yaml`.
- Re-run `pecv-bench report --results-dir results/VERSION/APPROACH` to update summaries.
- Software: MIT License covers the CLI, reference pipeline, and supporting tooling.
- Benchmark data, annotations, schemas, and packaged results: Creative Commons Attribution 4.0 International.
Questions, bug reports, or contributions are always welcome—open an issue or pull request to get involved.
