Introducing Code Review category#610
Conversation
…C-Bench into category/code-review
…display and refactoring comment parsing logic
…egory/code-review
…egory/code-review
haoranpb
left a comment
There was a problem hiding this comment.
Solid progress.
One thing we should discuss: do we want this synthatic dataset? Or do we want to invest in real-world production PRs?
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
…egory/code-review
| def _severity_mae(matched_pairs: list[tuple[ReviewComment, ReviewComment]]) -> float: | ||
| if not matched_pairs: | ||
| return 0.0 | ||
| total_error: int = sum(abs(expected.severity.level - generated.severity.level) for expected, generated in matched_pairs) | ||
| return total_error / len(matched_pairs) | ||
|
|
There was a problem hiding this comment.
docstring for this function, and fullname instead of shorthand
Groenbech96
left a comment
There was a problem hiding this comment.
I would consider using a LLM as a judge to compare the review comment. Probably instruct that is has to be the same intent with the comment.
|
Take a look at the score aggregation |
…t branch
Snapshots all non-experiment files from experiment/code-review-al-skill onto
category/code-review. Experiment-specific assets (al-code-review skill and
custom instructions under microsoft-BCApps/instructions/*.md) remain only on
the experiment branch.
Highlights:
- Dataset: enriched 28 zero-expected entries (security/privacy/style/upgrade)
with in-domain expected_comments; cleaned up OOD bait across pre-existing
entries; renumbered performance and privacy entries to be contiguous.
- Eval: domain-aware code-review evaluation, codereview_judge for LLM-confirmed
matches, improved review parsing, grouped per-domain summary layout.
- Results: domain-split metrics, leaderboard refresh, severity_mae, macro_f1.
- Tooling: probe_codereview_case/batch harness for local skill testing,
apply_enrichment + unindent_bait_files + fix_enrichment_iteration_{1,2}
scripts used to produce the dataset enrichment, dump_entries, ood_worklist,
run_entry helpers.
- Hooks: Python log_tool_usage hook (Linux-compatible) with process log capture
and unit tests.
- Workflow: copilot-evaluation.yml updates for category routing and metrics.
- Lint: ruff/ty cleanups across tools/, tests/, and shared hooks.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
The /al-code-review skill prompt, custom instructions, and skills are experiment-specific. Revert config.yaml to the defaults (/review prompt, instructions/skills disabled). Experiment branch keeps its own version. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…egory/code-review
| table.add_column("MCP Servers", style="yellow") | ||
| table.add_column("Custom Instructions", style="yellow") | ||
| table.add_column("Skills", style="yellow") | ||
| table.add_column("Custom Agent", style="yellow") |
| if extra_columns: | ||
| markdown_summary += f"| Instance ID | Project | Status | {extra_headers} | Error Message |\n" | ||
| markdown_summary += f"|-------------|---------|--------|{extra_separator}|---------------|\n" | ||
| markdown_summary += f"| Instance ID | Project | Status | {extra_headers} |\n" | ||
| markdown_summary += f"|-------------|---------|--------|{extra_separator}|\n" | ||
| else: | ||
| markdown_summary += "| Instance ID | Project | Status | Error Message |\n" | ||
| markdown_summary += "|-------------|---------|--------|---------------|\n" | ||
| markdown_summary += "| Instance ID | Project | Status |\n" | ||
| markdown_summary += "|-------------|---------|--------|\n" | ||
|
|
||
| for result in results: | ||
| for result in sorted(results, key=lambda r: r.sort_key): | ||
| _, status_icon = _status_style(result.status_label) | ||
| status_text = f"{status_icon} {result.status_label}" | ||
| error_msg = _get_short_error_message(result.error_message) | ||
| extra_values = " | ".join(result.display_row.values()) | ||
| if extra_columns: | ||
| markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {extra_values} | {error_msg} |\n" | ||
| markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {extra_values} |\n" | ||
| else: | ||
| markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {error_msg} |\n" | ||
| markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} |\n" | ||
|
|
| cr_runs: list[CodeReviewResultSummary] = [run for run in runs if isinstance(run, CodeReviewResultSummary)] | ||
| n = len(cr_runs) | ||
|
|
||
| f1_ci = bootstrap_ci([r.f1 for r in cr_runs]) | ||
| macro_f1_ci = bootstrap_ci([r.macro_f1 for r in cr_runs]) |
| case "invalid": | ||
| result = CodeReviewResult.create_invalid(context, output="MOCK_REVIEW_OUTPUT", expected_comments=[]) | ||
| case "valid": | ||
| result = CodeReviewResult.create(context, "MOCK_INVALID_REVIEW_OUTPUT", [], [], 0) |
| def main() -> None: | ||
| iid = sys.argv[1] | ||
| stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") | ||
| out_dir = pathlib.Path(f"evaluation_results/iter/{iid}_{stamp}") |
| print( | ||
| f"METRICS matched={r.get('matched_comment_count')} missed={r.get('missed_comment_count')} " | ||
| f"incorrect={r.get('incorrect_comment_count')} precision={r.get('precision'):.3f} " | ||
| f"recall={r.get('recall'):.3f} f1={r.get('f1'):.3f}" | ||
| ) |
| REPORT_ROOT = REPO_ROOT / "tmp" / "cr-probe-reports" | ||
| DEFAULT_MODEL = "claude-opus-4.8" | ||
|
|
||
| PROMPT_TEMPLATE = """/al-code-review |
haoranpb
left a comment
There was a problem hiding this comment.
We also need to add the display on the GitHub Page, similar to https://github.com/microsoft/BC-Bench/blob/main/docs/bug-fix.md, but that could be a follow up after we have some results to show
| class CodeReviewEntry(BaseDatasetEntry): | ||
| """Dataset entry for the code-review category.""" | ||
|
|
||
| domain: str | None = None |
There was a problem hiding this comment.
This property looks unused, I see it's already present on the comments
| def _looks_like_full_file_patch(patch: str) -> bool: | ||
| return "@@" not in patch and "\n--- " in f"\n{patch}" and "\n+++ " in f"\n{patch}" | ||
|
|
||
|
|
||
| def _materialize_full_file_patch(repo_path: Path, patch: str) -> list[str]: | ||
| file_count = 0 | ||
| current_path: Path | None = None | ||
| current_content: list[str] = [] | ||
| materialized_paths: list[str] = [] | ||
|
|
||
| def flush_current() -> None: | ||
| nonlocal file_count, current_path, current_content | ||
| if current_path is None: | ||
| return | ||
| current_path.parent.mkdir(parents=True, exist_ok=True) | ||
| current_path.write_text("\n".join(current_content) + "\n", encoding="utf-8") | ||
| materialized_paths.append(current_path.relative_to(repo_path).as_posix()) | ||
| file_count += 1 | ||
| current_path = None | ||
| current_content = [] | ||
|
|
||
| for line in patch.splitlines(): | ||
| if line.startswith("--- "): | ||
| flush_current() | ||
| continue | ||
|
|
||
| if line.startswith("+++ "): | ||
| relative_path = re.sub(r"^[ab]/", "", line[4:].strip()) | ||
| current_path = repo_path / relative_path | ||
| current_content = [] | ||
| continue | ||
|
|
||
| if current_path is None: | ||
| continue | ||
|
|
||
| if line.startswith("+"): | ||
| current_content.append(line[1:]) | ||
| continue | ||
|
|
||
| if line.startswith("\\ No newline at end of file"): | ||
| continue | ||
|
|
||
| flush_current() | ||
| return materialized_paths |
There was a problem hiding this comment.
The code here is unnecessary, it is trying to handle incorrect git patch, which shouldn't happen in the code review production envrionment, where patch is collected by git diff.
If a patch won't apply, we should fix the patch instead, see c9193e5
| if entry.patch.strip(): | ||
| try: | ||
| apply_patch(repo_path, entry.patch, f"{entry.instance_id} review patch") | ||
| except PatchApplicationError: | ||
| if not _looks_like_full_file_patch(entry.patch): | ||
| raise | ||
| materialized_paths = _materialize_full_file_patch(repo_path, entry.patch) | ||
| if not materialized_paths: | ||
| raise | ||
| # Mark untracked files as intent-to-add so `git diff HEAD` includes them. | ||
| subprocess.run(["git", "add", "-N", "--", *materialized_paths], cwd=repo_path, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, check=True) | ||
| logger.info(f"Materialized {len(materialized_paths)} file(s) from simplified review patch for {entry.instance_id}") | ||
|
|
||
| if paths := _patched_paths(entry.patch): | ||
| subprocess.run( | ||
| ["git", "add", "-N", "--", *paths], | ||
| cwd=repo_path, | ||
| stdout=subprocess.DEVNULL, | ||
| stderr=subprocess.PIPE, | ||
| check=True, | ||
| ) |
There was a problem hiding this comment.
Same here, if apply_patch fails, we should fix the patch instead, so it's aligned with production environment
| JUDGE_RESULT_FILE = "judge_results.json" | ||
|
|
||
| _JUDGE_PROMPT_TEMPLATE = """\ | ||
| You are a code review evaluation judge. Your task is to determine whether pairs of code review \ |
There was a problem hiding this comment.
Not sure if all those \ escape is necessary here
| def _parse_judge_results(result_path: Path, num_pairs: int) -> list[bool]: | ||
| if not result_path.exists(): | ||
| return [True] * num_pairs | ||
|
|
||
| try: | ||
| raw = json.loads(result_path.read_text(encoding="utf-8")) | ||
| except (json.JSONDecodeError, OSError): | ||
| return [True] * num_pairs | ||
|
|
||
| if not isinstance(raw, list): | ||
| return [True] * num_pairs |
There was a problem hiding this comment.
Shouldn't we raise or return [False] when the judge fails?
| class CodeReviewResult(BaseEvaluationResult): | ||
| """Result for the code-review category.""" | ||
|
|
||
| domain: str = "unknown" |
There was a problem hiding this comment.
Should domain only reside in ReviewComment?
| def _with_comment_domains(generated_comments: list[ReviewComment], domain: str) -> list[ReviewComment]: | ||
| """Stamp the entry domain onto comments that have no explicit domain. All comments are kept.""" | ||
| return [comment if comment.domain else comment.model_copy(update={"domain": domain}) for comment in generated_comments] |
There was a problem hiding this comment.
Hmm, bit confused now what is the relationship between domain and the code review result class
| incorrect_comment_count=max(0, len(generated_comments) - matched_count), | ||
| missed_comment_count=max(0, len(expected_comments) - matched_count), |
There was a problem hiding this comment.
Get rid of max? If the value gets under 0, we probably want to know what happened
| @property | ||
| def sort_key(self) -> tuple[object, ...]: | ||
| """Sort key for the detailed results table. | ||
|
|
||
| Default orders by natural-sorted instance_id (so ``-002`` precedes ``-010``). | ||
| Subclasses can prepend category-specific keys (e.g. domain for code-review). | ||
| """ | ||
| return (natural_sort_key(self.instance_id),) |
There was a problem hiding this comment.
Not sure this is necessary on the base class, could be specific on the code review sub class
| def _get_short_error_message(error_message: str | None) -> str: | ||
| """Extract the first line of an error message for summary display.""" | ||
| if not error_message: | ||
| return "" | ||
| first_line = error_message.split("\n")[0].rstrip(":") | ||
| return first_line.replace("|", "\\|") |
There was a problem hiding this comment.
I think the error message column here is still necessary for all categories, a common one could be: copilot CLI timeout
…egory/code-review
| path: ${{ env.EVALUATION_RESULTS_DIR }}/**/*.jsonl | ||
| path: | | ||
| ${{ env.EVALUATION_RESULTS_DIR }}/**/*.jsonl | ||
| ${{ env.EVALUATION_RESULTS_DIR }}/**/*.log |
There was a problem hiding this comment.
the debug log might include too much information, what information are we trying to extrat here? Maybe a seperate log file
A POC showing how Code Review category can be implemented.
Try it out locally:
Getting Started
Code Review team to implement
codereview.jsonlLeaderboard related logic is intentionally left unimplemented for now.