Introducing Code Review category by haoranpb · Pull Request #610 · microsoft/BC-Bench

haoranpb · 2026-04-12T08:56:09Z

A POC showing how Code Review category can be implemented.

Try it out locally:

# See the placeholder task in the dataset
uv run bcbench dataset view synthetic__performance-001 --category code-review

# See the default copilot review live
uv run bcbench -v evaluate copilot synthetic__performance-001 --category code-review --repo-path C:\depot\BCApps

Getting Started

Code Review team to implement

Replace the dataset, see codereview.jsonl
Implement the scoring and evaluation metrics calculation, we currently have placeholder metrics: count of comments
Add the AL Code Review skill

~~Leaderboard related logic is intentionally left unimplemented for now.~~

…C-Bench into category/code-review

…display and refactoring comment parsing logic

…egory/code-review

haoranpb

Solid progress.

One thing we should discuss: do we want this synthatic dataset? Or do we want to invest in real-world production PRs?

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

…egory/code-review

haoranpb · 2026-06-01T07:04:53Z

+def _severity_mae(matched_pairs: list[tuple[ReviewComment, ReviewComment]]) -> float:
+    if not matched_pairs:
+        return 0.0
+    total_error: int = sum(abs(expected.severity.level - generated.severity.level) for expected, generated in matched_pairs)
+    return total_error / len(matched_pairs)
+


docstring for this function, and fullname instead of shorthand

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Groenbech96

I would consider using a LLM as a judge to compare the review comment. Probably instruct that is has to be the same intent with the comment.

…egory/code-review

WaelAbuSeada · 2026-06-08T12:35:25Z

Take a look at the score aggregation

…t branch Snapshots all non-experiment files from experiment/code-review-al-skill onto category/code-review. Experiment-specific assets (al-code-review skill and custom instructions under microsoft-BCApps/instructions/*.md) remain only on the experiment branch. Highlights: - Dataset: enriched 28 zero-expected entries (security/privacy/style/upgrade) with in-domain expected_comments; cleaned up OOD bait across pre-existing entries; renumbered performance and privacy entries to be contiguous. - Eval: domain-aware code-review evaluation, codereview_judge for LLM-confirmed matches, improved review parsing, grouped per-domain summary layout. - Results: domain-split metrics, leaderboard refresh, severity_mae, macro_f1. - Tooling: probe_codereview_case/batch harness for local skill testing, apply_enrichment + unindent_bait_files + fix_enrichment_iteration_{1,2} scripts used to produce the dataset enrichment, dump_entries, ood_worklist, run_entry helpers. - Hooks: Python log_tool_usage hook (Linux-compatible) with process log capture and unit tests. - Workflow: copilot-evaluation.yml updates for category routing and metrics. - Lint: ruff/ty cleanups across tools/, tests/, and shared hooks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

The /al-code-review skill prompt, custom instructions, and skills are experiment-specific. Revert config.yaml to the defaults (/review prompt, instructions/skills disabled). Experiment branch keeps its own version. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…egory/code-review

Copilot

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 8 comments.

+    table.add_column("MCP Servers", style="yellow")
+    table.add_column("Custom Instructions", style="yellow")
+    table.add_column("Skills", style="yellow")
+    table.add_column("Custom Agent", style="yellow")


    if extra_columns:
-        markdown_summary += f"| Instance ID | Project | Status | {extra_headers} | Error Message |\n"
-        markdown_summary += f"|-------------|---------|--------|{extra_separator}|---------------|\n"
+        markdown_summary += f"| Instance ID | Project | Status | {extra_headers} |\n"
+        markdown_summary += f"|-------------|---------|--------|{extra_separator}|\n"
    else:
-        markdown_summary += "| Instance ID | Project | Status | Error Message |\n"
-        markdown_summary += "|-------------|---------|--------|---------------|\n"
+        markdown_summary += "| Instance ID | Project | Status |\n"
+        markdown_summary += "|-------------|---------|--------|\n"

-    for result in results:
+    for result in sorted(results, key=lambda r: r.sort_key):
        _, status_icon = _status_style(result.status_label)
        status_text = f"{status_icon} {result.status_label}"
-        error_msg = _get_short_error_message(result.error_message)
        extra_values = " | ".join(result.display_row.values())
        if extra_columns:
-            markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {extra_values} | {error_msg} |\n"
+            markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {extra_values} |\n"
        else:
-            markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} | {error_msg} |\n"
+            markdown_summary += f"| `{result.instance_id}` | `{result.project}` | {status_text} |\n"



+        cr_runs: list[CodeReviewResultSummary] = [run for run in runs if isinstance(run, CodeReviewResultSummary)]
+        n = len(cr_runs)
+
+        f1_ci = bootstrap_ci([r.f1 for r in cr_runs])
+        macro_f1_ci = bootstrap_ci([r.macro_f1 for r in cr_runs])


+            case "invalid":
+                result = CodeReviewResult.create_invalid(context, output="MOCK_REVIEW_OUTPUT", expected_comments=[])
+            case "valid":
+                result = CodeReviewResult.create(context, "MOCK_INVALID_REVIEW_OUTPUT", [], [], 0)


+def main() -> None:
+    iid = sys.argv[1]
+    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+    out_dir = pathlib.Path(f"evaluation_results/iter/{iid}_{stamp}")


+    print(
+        f"METRICS matched={r.get('matched_comment_count')} missed={r.get('missed_comment_count')} "
+        f"incorrect={r.get('incorrect_comment_count')} precision={r.get('precision'):.3f} "
+        f"recall={r.get('recall'):.3f} f1={r.get('f1'):.3f}"
+    )


+REPORT_ROOT = REPO_ROOT / "tmp" / "cr-probe-reports"
+DEFAULT_MODEL = "claude-opus-4.8"
+
+PROMPT_TEMPLATE = """/al-code-review


haoranpb

We also need to add the display on the GitHub Page, similar to https://github.com/microsoft/BC-Bench/blob/main/docs/bug-fix.md, but that could be a follow up after we have some results to show

haoranpb · 2026-06-15T07:38:18Z

+class CodeReviewEntry(BaseDatasetEntry):
+    """Dataset entry for the code-review category."""
+
+    domain: str | None = None


This property looks unused, I see it's already present on the comments

haoranpb · 2026-06-15T07:47:04Z

+def _looks_like_full_file_patch(patch: str) -> bool:
+    return "@@" not in patch and "\n--- " in f"\n{patch}" and "\n+++ " in f"\n{patch}"
+
+
+def _materialize_full_file_patch(repo_path: Path, patch: str) -> list[str]:
+    file_count = 0
+    current_path: Path | None = None
+    current_content: list[str] = []
+    materialized_paths: list[str] = []
+
+    def flush_current() -> None:
+        nonlocal file_count, current_path, current_content
+        if current_path is None:
+            return
+        current_path.parent.mkdir(parents=True, exist_ok=True)
+        current_path.write_text("\n".join(current_content) + "\n", encoding="utf-8")
+        materialized_paths.append(current_path.relative_to(repo_path).as_posix())
+        file_count += 1
+        current_path = None
+        current_content = []
+
+    for line in patch.splitlines():
+        if line.startswith("--- "):
+            flush_current()
+            continue
+
+        if line.startswith("+++ "):
+            relative_path = re.sub(r"^[ab]/", "", line[4:].strip())
+            current_path = repo_path / relative_path
+            current_content = []
+            continue
+
+        if current_path is None:
+            continue
+
+        if line.startswith("+"):
+            current_content.append(line[1:])
+            continue
+
+        if line.startswith("\\ No newline at end of file"):
+            continue
+
+    flush_current()
+    return materialized_paths


The code here is unnecessary, it is trying to handle incorrect git patch, which shouldn't happen in the code review production envrionment, where patch is collected by git diff.

If a patch won't apply, we should fix the patch instead, see c9193e5

haoranpb · 2026-06-15T07:48:32Z

+        if entry.patch.strip():
+            try:
+                apply_patch(repo_path, entry.patch, f"{entry.instance_id} review patch")
+            except PatchApplicationError:
+                if not _looks_like_full_file_patch(entry.patch):
+                    raise
+                materialized_paths = _materialize_full_file_patch(repo_path, entry.patch)
+                if not materialized_paths:
+                    raise
+                # Mark untracked files as intent-to-add so `git diff HEAD` includes them.
+                subprocess.run(["git", "add", "-N", "--", *materialized_paths], cwd=repo_path, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, check=True)
+                logger.info(f"Materialized {len(materialized_paths)} file(s) from simplified review patch for {entry.instance_id}")
+
+            if paths := _patched_paths(entry.patch):
+                subprocess.run(
+                    ["git", "add", "-N", "--", *paths],
+                    cwd=repo_path,
+                    stdout=subprocess.DEVNULL,
+                    stderr=subprocess.PIPE,
+                    check=True,
+                )


Same here, if apply_patch fails, we should fix the patch instead, so it's aligned with production environment

haoranpb · 2026-06-15T07:52:00Z

+JUDGE_RESULT_FILE = "judge_results.json"
+
+_JUDGE_PROMPT_TEMPLATE = """\
+You are a code review evaluation judge. Your task is to determine whether pairs of code review \


Not sure if all those \ escape is necessary here

haoranpb · 2026-06-15T07:57:00Z

+def _parse_judge_results(result_path: Path, num_pairs: int) -> list[bool]:
+    if not result_path.exists():
+        return [True] * num_pairs
+
+    try:
+        raw = json.loads(result_path.read_text(encoding="utf-8"))
+    except (json.JSONDecodeError, OSError):
+        return [True] * num_pairs
+
+    if not isinstance(raw, list):
+        return [True] * num_pairs


Shouldn't we raise or return [False] when the judge fails?

haoranpb · 2026-06-15T08:31:23Z

+class CodeReviewResult(BaseEvaluationResult):
+    """Result for the code-review category."""
+
+    domain: str = "unknown"


Should domain only reside in ReviewComment?

haoranpb · 2026-06-15T08:33:05Z

+def _with_comment_domains(generated_comments: list[ReviewComment], domain: str) -> list[ReviewComment]:
+    """Stamp the entry domain onto comments that have no explicit domain. All comments are kept."""
+    return [comment if comment.domain else comment.model_copy(update={"domain": domain}) for comment in generated_comments]


Hmm, bit confused now what is the relationship between domain and the code review result class

haoranpb · 2026-06-15T08:34:12Z

+            incorrect_comment_count=max(0, len(generated_comments) - matched_count),
+            missed_comment_count=max(0, len(expected_comments) - matched_count),


Get rid of max? If the value gets under 0, we probably want to know what happened

haoranpb · 2026-06-15T08:36:13Z

+    @property
+    def sort_key(self) -> tuple[object, ...]:
+        """Sort key for the detailed results table.
+
+        Default orders by natural-sorted instance_id (so ``-002`` precedes ``-010``).
+        Subclasses can prepend category-specific keys (e.g. domain for code-review).
+        """
+        return (natural_sort_key(self.instance_id),)


Not sure this is necessary on the base class, could be specific on the code review sub class

haoranpb · 2026-06-15T09:13:27Z

-def _get_short_error_message(error_message: str | None) -> str:
-    """Extract the first line of an error message for summary display."""
-    if not error_message:
-        return ""
-    first_line = error_message.split("\n")[0].rstrip(":")
-    return first_line.replace("|", "\\|")


I think the error message column here is still necessary for all categories, a common one could be: copilot CLI timeout

…egory/code-review

haoranpb · 2026-06-16T14:59:45Z

-          path: ${{ env.EVALUATION_RESULTS_DIR }}/**/*.jsonl
+          path: |
+            ${{ env.EVALUATION_RESULTS_DIR }}/**/*.jsonl
+            ${{ env.EVALUATION_RESULTS_DIR }}/**/*.log


the debug log might include too much information, what information are we trying to extrat here? Maybe a seperate log file

haoranpb added 9 commits April 8, 2026 13:57

few more udpates for new categories

db388bd

Refactor evaluation and dataset operations for improved workspace setup

57c004e

enable skipping container setup in action

8e2f216

fix missing implementation for MockEvaluationPipeline

69a8db8

Refactor evaluation result classes to be more generic

7549d92

Merge branch 'main' into fix/more-ready-for-categories

f32dd00

Improve readabilty of GitHub Action summary

a4089b9

fix failing tests

99af6b2

Code Review POC

e1b0b93

haoranpb assigned darjoo and WaelAbuSeada Apr 12, 2026

haoranpb and others added 12 commits April 13, 2026 08:15

Merge branch 'main' into category/code-review

1a68d78

fix merge conflict resolution mistake

3ec10a0

Merge branch 'main' into category/code-review

4e52832

Make container parameters optional in evaluate and run commands

a9f59d9

Merge branch 'category/code-review' of https://github.com/microsoft/B…

065e1aa

…C-Bench into category/code-review

Enhance code review functionality by adding expected review comments …

4ad4bd9

…display and refactoring comment parsing logic

better hanlding container for not required categories

92951c4

Merge branch 'main' into category/code-review

7902610

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

dad9289

…egory/code-review

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

f1c4894

…egory/code-review

prefer copilot.exe executable

aa48a29

Normalize code-review dataset and preserve eval outputs

a244503

github-code-quality Bot found potential problems May 16, 2026

View reviewed changes

Comment thread src/bcbench/evaluate/codereview.py Fixed

haoranpb commented May 18, 2026

View reviewed changes

WaelAbuSeada added 5 commits May 19, 2026 18:31

Fix code-review branch setup and workflow wiring

9f6c353

Require review.json and add log-based recovery fallback

1a58e44

Harden code-review prompt for Windows copilot.cmd parsing

d0e8076

Expand code-review detailed table metrics

7411246

Update config and container setup action

c7131a4

fix failing tests

49a5cef

github-code-quality Bot found potential problems Jun 1, 2026

View reviewed changes

Comment thread src/bcbench/results/codereview.py Fixed

haoranpb and others added 3 commits June 1, 2026 15:03

Potential fix for pull request finding 'Module imports itself'

35c5045

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

eaa1a2c

…egory/code-review

add CodeReview to mock tests

e00e939

haoranpb commented Jun 1, 2026

View reviewed changes

haoranpb requested a review from Copilot June 1, 2026 14:19

Copilot started reviewing on behalf of haoranpb June 1, 2026 14:20 View session

Copilot AI reviewed Jun 1, 2026

kmhansen reviewed Jun 2, 2026

View reviewed changes

Comment thread src/bcbench/results/codereview.py

kmhansen reviewed Jun 2, 2026

View reviewed changes

Comment thread src/bcbench/results/codereview.py

Groenbech96 reviewed Jun 3, 2026

View reviewed changes

haoranpb and others added 3 commits June 3, 2026 14:10

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

13b568c

…egory/code-review

Remove skills and instructions from category branch

d34742b

Restore al-test-generation assets on category branch

9eab86e

github-code-quality Bot found potential problems Jun 12, 2026

View reviewed changes

Comment thread tools/unindent_bait_files.py Fixed

WaelAbuSeada and others added 4 commits June 12, 2026 14:18

Potential fix for pull request finding 'Variable defined multiple times'

f5825f2

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Include domain and suggestedCode in code-review finding schema

e77c93b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

1f01d45

…egory/code-review

haoranpb requested a review from Copilot June 15, 2026 07:34

Copilot started reviewing on behalf of haoranpb June 15, 2026 07:35 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

haoranpb commented Jun 15, 2026

View reviewed changes

Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…

37ec7e0

…egory/code-review

haoranpb commented Jun 16, 2026

View reviewed changes

OS agnostic hook

2dad93f

		incorrect_comment_count=max(0, len(generated_comments) - matched_count),
		missed_comment_count=max(0, len(expected_comments) - matched_count),

Conversation

haoranpb commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review team to implement

Uh oh!

Uh oh!

haoranpb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Groenbech96 left a comment

Choose a reason for hiding this comment

Uh oh!

WaelAbuSeada commented Jun 8, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

haoranpb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

haoranpb commented Apr 12, 2026 •

edited

Loading