Skip to content

Introducing Code Review category#610

Open
haoranpb wants to merge 69 commits into
mainfrom
category/code-review
Open

Introducing Code Review category#610
haoranpb wants to merge 69 commits into
mainfrom
category/code-review

Conversation

@haoranpb

@haoranpb haoranpb commented Apr 12, 2026

Copy link
Copy Markdown
Collaborator

A POC showing how Code Review category can be implemented.

Try it out locally:

# See the placeholder task in the dataset
uv run bcbench dataset view synthetic__performance-001 --category code-review

# See the default copilot review live
uv run bcbench -v evaluate copilot synthetic__performance-001 --category code-review --repo-path C:\depot\BCApps

Getting Started

Code Review team to implement

  • Replace the dataset, see codereview.jsonl
  • Implement the scoring and evaluation metrics calculation, we currently have placeholder metrics: count of comments
  • Add the AL Code Review skill

Leaderboard related logic is intentionally left unimplemented for now.

Comment thread src/bcbench/evaluate/codereview.py Fixed

@haoranpb haoranpb left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid progress.

One thing we should discuss: do we want this synthatic dataset? Or do we want to invest in real-world production PRs?

Comment thread tools/code-review/run_all_evals.ps1
Comment thread src/bcbench/agent/copilot/agent.py Outdated
Comment thread src/bcbench/agent/shared/config.yaml Outdated
Comment thread src/bcbench/evaluate/codereview.py Outdated
Comment thread src/bcbench/evaluate/codereview.py Outdated
Comment thread src/bcbench/results/codereview.py Outdated
Comment thread src/bcbench/results/summary.py Outdated
Comment thread src/bcbench/results/codereview.py Fixed
haoranpb and others added 3 commits June 1, 2026 15:03
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Comment thread src/bcbench/results/codereview.py
Comment on lines +62 to +67
def _severity_mae(matched_pairs: list[tuple[ReviewComment, ReviewComment]]) -> float:
if not matched_pairs:
return 0.0
total_error: int = sum(abs(expected.severity.level - generated.severity.level) for expected, generated in matched_pairs)
return total_error / len(matched_pairs)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring for this function, and fullname instead of shorthand

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment thread src/bcbench/results/codereview.py
Comment thread src/bcbench/results/codereview.py

@Groenbech96 Groenbech96 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider using a LLM as a judge to compare the review comment. Probably instruct that is has to be the same intent with the comment.

@WaelAbuSeada

Copy link
Copy Markdown
Member

Take a look at the score aggregation

…t branch

Snapshots all non-experiment files from experiment/code-review-al-skill onto
category/code-review. Experiment-specific assets (al-code-review skill and
custom instructions under microsoft-BCApps/instructions/*.md) remain only on
the experiment branch.

Highlights:
- Dataset: enriched 28 zero-expected entries (security/privacy/style/upgrade)
  with in-domain expected_comments; cleaned up OOD bait across pre-existing
  entries; renumbered performance and privacy entries to be contiguous.
- Eval: domain-aware code-review evaluation, codereview_judge for LLM-confirmed
  matches, improved review parsing, grouped per-domain summary layout.
- Results: domain-split metrics, leaderboard refresh, severity_mae, macro_f1.
- Tooling: probe_codereview_case/batch harness for local skill testing,
  apply_enrichment + unindent_bait_files + fix_enrichment_iteration_{1,2}
  scripts used to produce the dataset enrichment, dump_entries, ood_worklist,
  run_entry helpers.
- Hooks: Python log_tool_usage hook (Linux-compatible) with process log capture
  and unit tests.
- Workflow: copilot-evaluation.yml updates for category routing and metrics.
- Lint: ruff/ty cleanups across tools/, tests/, and shared hooks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread tools/unindent_bait_files.py Fixed
WaelAbuSeada and others added 4 commits June 12, 2026 14:18
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
The /al-code-review skill prompt, custom instructions, and skills are
experiment-specific. Revert config.yaml to the defaults (/review prompt,
instructions/skills disabled). Experiment branch keeps its own version.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 8 comments.

Comment thread src/bcbench/operations/hooks_operations.py Outdated
Comment on lines +51 to +54
table.add_column("MCP Servers", style="yellow")
table.add_column("Custom Instructions", style="yellow")
table.add_column("Skills", style="yellow")
table.add_column("Custom Agent", style="yellow")
Comment on lines 112 to 127
if extra_columns:
markdown_summary += f"| Instance ID | Project | Status | {extra_headers} | Error Message |\n"
markdown_summary += f"|-------------|---------|--------|{extra_separator}|---------------|\n"
markdown_summary += f"| Instance ID | Project | Status | {extra_headers} |\n"
markdown_summary += f"|-------------|---------|--------|{extra_separator}|\n"
else:
markdown_summary += "| Instance ID | Project | Status | Error Message |\n"
markdown_summary += "|-------------|---------|--------|---------------|\n"
markdown_summary += "| Instance ID | Project | Status |\n"
markdown_summary += "|-------------|---------|--------|\n"

for result in results:
for result in sorted(results, key=lambda r: r.sort_key):
_, status_icon = _status_style(result.status_label)
status_text = f"{status_icon} {result.status_label}"
error_msg = _get_short_error_message(result.error_message)
extra_values = " | ".join(result.display_row.values())
if extra_columns:
markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {extra_values} | {error_msg} |\n"
markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {extra_values} |\n"
else:
markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {error_msg} |\n"
markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} |\n"

Comment on lines +143 to +147
cr_runs: list[CodeReviewResultSummary] = [run for run in runs if isinstance(run, CodeReviewResultSummary)]
n = len(cr_runs)

f1_ci = bootstrap_ci([r.f1 for r in cr_runs])
macro_f1_ci = bootstrap_ci([r.macro_f1 for r in cr_runs])
Comment on lines +246 to +249
case "invalid":
result = CodeReviewResult.create_invalid(context, output="MOCK_REVIEW_OUTPUT", expected_comments=[])
case "valid":
result = CodeReviewResult.create(context, "MOCK_INVALID_REVIEW_OUTPUT", [], [], 0)
Comment thread tools/run_entry.py
Comment on lines +17 to +20
def main() -> None:
iid = sys.argv[1]
stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
out_dir = pathlib.Path(f"evaluation_results/iter/{iid}_{stamp}")
Comment thread tools/run_entry.py
Comment on lines +59 to +63
print(
f"METRICS matched={r.get('matched_comment_count')} missed={r.get('missed_comment_count')} "
f"incorrect={r.get('incorrect_comment_count')} precision={r.get('precision'):.3f} "
f"recall={r.get('recall'):.3f} f1={r.get('f1'):.3f}"
)
REPORT_ROOT = REPO_ROOT / "tmp" / "cr-probe-reports"
DEFAULT_MODEL = "claude-opus-4.8"

PROMPT_TEMPLATE = """/al-code-review

@haoranpb haoranpb left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to add the display on the GitHub Page, similar to https://github.com/microsoft/BC-Bench/blob/main/docs/bug-fix.md, but that could be a follow up after we have some results to show

class CodeReviewEntry(BaseDatasetEntry):
"""Dataset entry for the code-review category."""

domain: str | None = None

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This property looks unused, I see it's already present on the comments

Comment on lines +24 to +67
def _looks_like_full_file_patch(patch: str) -> bool:
return "@@" not in patch and "\n--- " in f"\n{patch}" and "\n+++ " in f"\n{patch}"


def _materialize_full_file_patch(repo_path: Path, patch: str) -> list[str]:
file_count = 0
current_path: Path | None = None
current_content: list[str] = []
materialized_paths: list[str] = []

def flush_current() -> None:
nonlocal file_count, current_path, current_content
if current_path is None:
return
current_path.parent.mkdir(parents=True, exist_ok=True)
current_path.write_text("\n".join(current_content) + "\n", encoding="utf-8")
materialized_paths.append(current_path.relative_to(repo_path).as_posix())
file_count += 1
current_path = None
current_content = []

for line in patch.splitlines():
if line.startswith("--- "):
flush_current()
continue

if line.startswith("+++ "):
relative_path = re.sub(r"^[ab]/", "", line[4:].strip())
current_path = repo_path / relative_path
current_content = []
continue

if current_path is None:
continue

if line.startswith("+"):
current_content.append(line[1:])
continue

if line.startswith("\\ No newline at end of file"):
continue

flush_current()
return materialized_paths

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here is unnecessary, it is trying to handle incorrect git patch, which shouldn't happen in the code review production envrionment, where patch is collected by git diff.

If a patch won't apply, we should fix the patch instead, see c9193e5

Comment on lines +84 to +104
if entry.patch.strip():
try:
apply_patch(repo_path, entry.patch, f"{entry.instance_id} review patch")
except PatchApplicationError:
if not _looks_like_full_file_patch(entry.patch):
raise
materialized_paths = _materialize_full_file_patch(repo_path, entry.patch)
if not materialized_paths:
raise
# Mark untracked files as intent-to-add so `git diff HEAD` includes them.
subprocess.run(["git", "add", "-N", "--", *materialized_paths], cwd=repo_path, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, check=True)
logger.info(f"Materialized {len(materialized_paths)} file(s) from simplified review patch for {entry.instance_id}")

if paths := _patched_paths(entry.patch):
subprocess.run(
["git", "add", "-N", "--", *paths],
cwd=repo_path,
stdout=subprocess.DEVNULL,
stderr=subprocess.PIPE,
check=True,
)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, if apply_patch fails, we should fix the patch instead, so it's aligned with production environment

JUDGE_RESULT_FILE = "judge_results.json"

_JUDGE_PROMPT_TEMPLATE = """\
You are a code review evaluation judge. Your task is to determine whether pairs of code review \

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if all those \ escape is necessary here

Comment on lines +49 to +59
def _parse_judge_results(result_path: Path, num_pairs: int) -> list[bool]:
if not result_path.exists():
return [True] * num_pairs

try:
raw = json.loads(result_path.read_text(encoding="utf-8"))
except (json.JSONDecodeError, OSError):
return [True] * num_pairs

if not isinstance(raw, list):
return [True] * num_pairs

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we raise or return [False] when the judge fails?

Comment thread src/bcbench/operations/hooks_operations.py Outdated
class CodeReviewResult(BaseEvaluationResult):
"""Result for the code-review category."""

domain: str = "unknown"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should domain only reside in ReviewComment?

Comment on lines +67 to +69
def _with_comment_domains(generated_comments: list[ReviewComment], domain: str) -> list[ReviewComment]:
"""Stamp the entry domain onto comments that have no explicit domain. All comments are kept."""
return [comment if comment.domain else comment.model_copy(update={"domain": domain}) for comment in generated_comments]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, bit confused now what is the relationship between domain and the code review result class

Comment on lines +171 to +172
incorrect_comment_count=max(0, len(generated_comments) - matched_count),
missed_comment_count=max(0, len(expected_comments) - matched_count),

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get rid of max? If the value gets under 0, we probably want to know what happened

Comment on lines +100 to +107
@property
def sort_key(self) -> tuple[object, ...]:
"""Sort key for the detailed results table.

Default orders by natural-sorted instance_id (so ``-002`` precedes ``-010``).
Subclasses can prepend category-specific keys (e.g. domain for code-review).
"""
return (natural_sort_key(self.instance_id),)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is necessary on the base class, could be specific on the code review sub class

Comment on lines -66 to -71
def _get_short_error_message(error_message: str | None) -> str:
"""Extract the first line of an error message for summary display."""
if not error_message:
return ""
first_line = error_message.split("\n")[0].rstrip(":")
return first_line.replace("|", "\\|")

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error message column here is still necessary for all categories, a common one could be: copilot CLI timeout

path: ${{ env.EVALUATION_RESULTS_DIR }}/**/*.jsonl
path: |
${{ env.EVALUATION_RESULTS_DIR }}/**/*.jsonl
${{ env.EVALUATION_RESULTS_DIR }}/**/*.log

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the debug log might include too much information, what information are we trying to extrat here? Maybe a seperate log file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants