Add DiskANN benchmark pipeline to GitHub Actions#857
YuanyuanTian-hh wants to merge 22 commits into main
Conversation
- Add `benchmarks.yml` workflow using `workflow_dispatch`, comparing the current branch against a configurable baseline ref
- Add `compare_disk_index_json_output.py` to diff benchmark-crate JSON outputs into a CSV suitable for `benchmark_result_parse.py`
- Add `benchmark_result_parse.py` for validating results and posting PR comments
- Add `wikipedia-100K-disk-index.json` benchmark config using the public Wikipedia-100K dataset from big-ann-benchmarks (100K Cohere embeddings, 768-dim, cosine distance) to replace internal ADO datasets
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main     #857      +/-   ##
==========================================
+ Coverage   89.28%   90.45%   +1.17%
==========================================
  Files         442      442
  Lines       83009    83248    +239
==========================================
+ Hits        74111    75301   +1190
+ Misses       8898     7947    -951
==========================================
```

Flags with carried forward coverage won't be shown.
…or ADO mimir-enron, not applicable to public datasets on GitHub runners. Threshold calibration tracked in PBI.
…-normalized, metric is inner product)
…p, not cosine similarity)
Pull request overview
Adds a GitHub Actions workflow to run DiskANN macro-benchmarks (disk-index build + search) on two public 100K datasets, compare current vs a baseline ref, and validate performance regressions via CSV-based threshold checks.
Changes:
- Add benchmark input configs for Wikipedia-100K and OpenAI ArXiv 100K disk-index benchmarks.
- Add a new Benchmarks GitHub Actions workflow that runs baseline + target, diffs results, and uploads artifacts.
- Add Python scripts to (1) diff disk-index JSON outputs into a CSV and (2) parse/validate the CSV against thresholds (optionally commenting on PRs).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json` | Adds a disk-index build/search benchmark config for Wikipedia-100K. |
| `diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json` | Adds a disk-index build/search benchmark config for OpenAI ArXiv 100K. |
| `.github/workflows/benchmarks.yml` | Introduces a workflow to run baseline vs current benchmarks, diff, validate, and upload artifacts. |
| `.github/scripts/compare_disk_index_json_output.py` | Generates a comparison CSV from two disk-index benchmark JSON outputs. |
| `.github/scripts/benchmark_result_parse.py` | Parses the comparison CSV, checks thresholds/contracts, and reports regressions (with optional PR commenting). |
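As a rough sketch of the comparison step (metric names, the CSV layout, and function names here are illustrative assumptions, not the script's actual API), `compare_disk_index_json_output.py` essentially computes a percent change per metric and emits CSV rows:

```python
import csv
import io

def percent_change(current: float, baseline: float) -> float:
    """Relative change of current vs baseline, in percent (0 if baseline is 0)."""
    if baseline == 0:
        return 0.0
    return (current - baseline) / baseline * 100.0

def diff_to_rows(current: dict, baseline: dict) -> list:
    """[metric, current, baseline, change%] rows for metrics present in both runs."""
    rows = []
    for metric in sorted(current.keys() & baseline.keys()):
        cur, base = float(current[metric]), float(baseline[metric])
        rows.append([metric, cur, base, round(percent_change(cur, base), 2)])
    return rows

def write_change_csv(rows: list) -> str:
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["metric", "current", "baseline", "change_pct"])
    writer.writerows(rows)
    return buf.getvalue()
```

With a current QPS of 1050 against a baseline of 1000, `diff_to_rows` would report a +5.0% change for that metric.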
```python
metrics["mean_hops"] = sr.get("mean_hops", 0)
metrics["mean_io_time"] = sr.get("mean_io_time", 0)
metrics["mean_cpus"] = sr.get("mean_cpu_time", 0)
metrics["latency_95"] = sr.get("p999_latency", 0)  # Use p999 as proxy for 95th percentile
```
`latency_95` is currently populated from `p999_latency`, which makes the reported/validated latency metric incorrect (the benchmark JSON includes `p95_latency` separately). Use the real p95 field (or rename the metric/key everywhere to p999) so the CSV and threshold checks match what they claim to measure.
Suggested change:

```diff
-metrics["latency_95"] = sr.get("p999_latency", 0)  # Use p999 as proxy for 95th percentile
+metrics["latency_95"] = sr.get("p95_latency", 0)  # Use actual 95th percentile latency
```
```python
# Total build time (in seconds)
build_time = build.get("build_time")
if build_time:
```
`build_time` is checked with a truthiness test (`if build_time:`), which will skip recording `total_time` if the value is 0. Prefer an explicit `is not None` check so 0 is handled consistently and the intent is clear.
Suggested change:

```diff
-if build_time:
+if build_time is not None:
```
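The difference is easy to demonstrate with a hypothetical helper mirroring the pattern above (names are illustrative): a legitimate measurement of 0 is falsy, so a truthiness check silently drops it, while an explicit `None` check keeps it.

```python
def record_total_time(build: dict) -> dict:
    """Copy build_time into a metrics dict only when the field is present."""
    metrics = {}
    build_time = build.get("build_time")
    if build_time is not None:  # explicit check: 0 is a valid measurement
        metrics["total_time"] = build_time
    return metrics
```

A cached or instantaneous build reporting `build_time: 0` is preserved here; under `if build_time:` it would be silently skipped.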
```python
# Parse values: [current, baseline, change%]
try:
    value_current = float(values[0])
    value_baseline = float(values[1])
    change = float(values[2]) if values[2] else 0.0
except (ValueError, IndexError) as e:
    print(f"ERROR: Failed to parse {category}/{metric}: {e}")
    return True, f"Parse error for {category}/{metric}"

# Get threshold config
threshold_config = thresholds[category][metric]
threshold_pct = threshold_config[0]
direction = threshold_config[1]
contract_value = threshold_config[2]

# Check thresholds
target_range = get_target_change_range(threshold_pct, direction, mode)
threshold_failed = is_change_threshold_failed(change, target_range)
promise_broken, target_formatted = is_promise_broken(value_current, contract_value, direction)

if threshold_failed:
    print(f"THRESHOLD FAILED: {category}/{metric} change={change}% allowed={format_interval(*target_range)}")
if promise_broken:
    print(f"PROMISE BROKEN: {category}/{metric} value={value_current} required={target_formatted}")

if threshold_failed or promise_broken:
    outcome = get_outcome_message(threshold_failed, promise_broken)
    failed_rows.append(
        f"| {category}/{metric} | {value_baseline} | {value_current} | "
        f"{target_formatted} | {change}% | {format_interval(*target_range)} | {outcome} |"
    )
```
`parse_csv()` appends values for every matching row, but `check_thresholds()` only reads `values[0..2]`. If the CSV contains multiple runs/jobs for the same category/metric, later entries will be ignored (or worse, the list will be misinterpreted). Consider storing each row as a structured triple (or iterating `values` in chunks of 3) and validating all entries.
Suggested change:

```diff
-# Parse values: [current, baseline, change%]
-try:
-    value_current = float(values[0])
-    value_baseline = float(values[1])
-    change = float(values[2]) if values[2] else 0.0
-except (ValueError, IndexError) as e:
-    print(f"ERROR: Failed to parse {category}/{metric}: {e}")
-    return True, f"Parse error for {category}/{metric}"
-
-# Get threshold config
-threshold_config = thresholds[category][metric]
-threshold_pct = threshold_config[0]
-direction = threshold_config[1]
-contract_value = threshold_config[2]
-
-# Check thresholds
-target_range = get_target_change_range(threshold_pct, direction, mode)
-threshold_failed = is_change_threshold_failed(change, target_range)
-promise_broken, target_formatted = is_promise_broken(value_current, contract_value, direction)
-
-if threshold_failed:
-    print(f"THRESHOLD FAILED: {category}/{metric} change={change}% allowed={format_interval(*target_range)}")
-if promise_broken:
-    print(f"PROMISE BROKEN: {category}/{metric} value={value_current} required={target_formatted}")
-
-if threshold_failed or promise_broken:
-    outcome = get_outcome_message(threshold_failed, promise_broken)
-    failed_rows.append(
-        f"| {category}/{metric} | {value_baseline} | {value_current} | "
-        f"{target_formatted} | {change}% | {format_interval(*target_range)} | {outcome} |"
-    )
+# Values may contain multiple triples: [current, baseline, change%] * N
+# Validate each triple independently.
+for i in range(0, len(values), 3):
+    triple = values[i:i + 3]
+    if len(triple) < 3:
+        # Malformed data: incomplete triple
+        print(f"ERROR: Incomplete data for {category}/{metric} at index {i}: {triple}")
+        return True, f"Parse error for {category}/{metric}"
+    try:
+        value_current = float(triple[0])
+        value_baseline = float(triple[1])
+        change = float(triple[2]) if triple[2] else 0.0
+    except (ValueError, IndexError) as e:
+        print(f"ERROR: Failed to parse {category}/{metric} at index {i}: {e}")
+        return True, f"Parse error for {category}/{metric}"
+
+    # Get threshold config
+    threshold_config = thresholds[category][metric]
+    threshold_pct = threshold_config[0]
+    direction = threshold_config[1]
+    contract_value = threshold_config[2]
+
+    # Check thresholds
+    target_range = get_target_change_range(threshold_pct, direction, mode)
+    threshold_failed = is_change_threshold_failed(change, target_range)
+    promise_broken, target_formatted = is_promise_broken(value_current, contract_value, direction)
+
+    if threshold_failed:
+        print(f"THRESHOLD FAILED: {category}/{metric} change={change}% allowed={format_interval(*target_range)}")
+    if promise_broken:
+        print(f"PROMISE BROKEN: {category}/{metric} value={value_current} required={target_formatted}")
+
+    if threshold_failed or promise_broken:
+        outcome = get_outcome_message(threshold_failed, promise_broken)
+        failed_rows.append(
+            f"| {category}/{metric} | {value_baseline} | {value_current} | "
+            f"{target_formatted} | {change}% | {format_interval(*target_range)} | {outcome} |"
+        )
```
```python
choices=['aa', 'pr', 'lkg'],
help='Benchmark mode: aa=A/A test (symmetric), pr=PR test (directional), lkg=last known good'
```
`--mode` accepts `lkg`, but the threshold logic treats any non-`aa` mode as PR-directional. Either implement distinct `lkg` behavior (and document it), or remove `lkg` from the accepted choices to avoid misleading callers.
Suggested change:

```diff
-choices=['aa', 'pr', 'lkg'],
-help='Benchmark mode: aa=A/A test (symmetric), pr=PR test (directional), lkg=last known good'
+choices=['aa', 'pr'],
+help='Benchmark mode: aa=A/A test (symmetric), pr=PR test (directional)'
```
```yaml
  push:
    branches:
      - 'user/tianyuanyuan/add-benchmark-pipeline'
    paths:
```
The workflow is titled as a general benchmark pipeline, but the `push` trigger is hard-coded to a single user branch. That means it won't run for typical PRs/branches in this repo. Consider switching to `pull_request` and/or `push` on `main` (or removing the `push` trigger entirely if this is intended to be manual-only via `workflow_dispatch`).
```yaml
          python .github/scripts/benchmark_result_parse.py \
            --mode pr \
            --file target/tmp/wikipedia-100K_change.csv
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_REPOSITORY: ${{ github.repository }}
          GITHUB_PR_NUMBER: ${{ github.event.pull_request.number }}
          GITHUB_RUN_ID: ${{ github.run_id }}
```
`GITHUB_PR_NUMBER` is populated from `github.event.pull_request.number`, but this workflow currently triggers on `workflow_dispatch`/`push` (no `pull_request` payload), so this env var will be empty and PR commenting can't work. Either add a `pull_request` trigger or gate PR-comment behavior on the event type / provide the PR number as an input.
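A minimal way to make the script tolerant of both event types (a sketch; only the env var name follows the snippet above, the helper itself is hypothetical) is to treat an empty `GITHUB_PR_NUMBER` as "no PR context" and skip commenting:

```python
def resolve_pr_number(env: dict):
    """Return the PR number if the workflow provided one, else None.

    Under workflow_dispatch/push there is no pull_request payload, so
    GITHUB_PR_NUMBER arrives empty and commenting should be skipped.
    """
    raw = env.get("GITHUB_PR_NUMBER", "").strip()
    return int(raw) if raw.isdigit() else None

def should_comment(env: dict) -> bool:
    """Only attempt a PR comment when a PR number is actually available."""
    return resolve_pr_number(env) is not None
```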
Hey @YuanyuanTian-hh — thanks for putting this together! Porting automated benchmark regression detection into our CI is definitely something we need! I've been thinking about the overall architecture and want to share some thoughts for discussion among the broader team. This is a tricky design problem, and a bit of planning now will save us a lot of time in the future.

My main concern is the cross-language hot-potato with results and data coupling. Every step in the current pipeline chain introduces a place where something can go wrong, and with GitHub-centric checking scripts it's difficult to validate changes locally. The benchmark crate already owns the Rust types for these results, so I think the right long-term approach is to have the benchmark crate itself handle the comparison and threshold checking. That way the comparison logic lives next to the Rust types it consumes, and the checks can be run locally.
Suggested path forward

I don't want this effort to go to waste; there's good stuff here. Here's what I'd suggest:
I know this is a bigger scope change than you might have expected, and I'm sorry about that. I want to make sure we build this in a way that's maintainable as the benchmark crate evolves and can be reused more broadly. I'll put together an issue with enough detail that we can work through the design async — let me know if you have questions or want to discuss any of this further!

A smaller note on dataset downloads
The "A/B (branch vs main)" number in the description doesn't make sense. It is essentially the same as "A/A (branch vs itself)", no? Why do we want to present both numbers?
```yaml
        type: string
  push:
    branches:
      - 'user/tianyuanyuan/add-benchmark-pipeline'
```
Why is this specific branch here?
This code allows the new pipeline to be tested on my branch, I will update it to 'main' before I merge it.
```yaml
  push:
    branches:
      - 'user/tianyuanyuan/add-benchmark-pipeline'
    paths:
```
Why do we limit the triggering to these paths?
This code allows the new pipeline to be tested on my branch, I will update it to 'main' before I merge it.
```yaml
name: Benchmarks

on:
```
We want A/A testing to happen daily and fail if the deviation is more than the expected threshold, ideally notifying diskann-admins about the failure. Is it enabled?
I added it in this PR as a separate workflow, benchmarks-aa.yml. It runs daily; any deviation beyond the calibrated range fails the job. On failure, a notify-on-failure job automatically creates a GitHub issue and cc's diskann-admin. It will activate once this PR is merged to main.
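For context, the notify job's issue creation boils down to a single REST call (POST /repos/{owner}/{repo}/issues); this payload builder is a hypothetical sketch, not the actual workflow's code, and the title format, labels, and field choices are assumptions:

```python
def failure_issue_payload(workflow: str, run_url: str) -> dict:
    """JSON body for the GitHub REST 'create an issue' endpoint."""
    return {
        "title": f"[A/A benchmark] {workflow} deviated beyond the calibrated range",
        "body": (
            "The daily A/A benchmark run failed its deviation threshold.\n\n"
            f"Run: {run_url}\n\n"
            "cc @microsoft/diskann-admin"
        ),
        "labels": ["benchmarks"],
    }
```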
Could the datasets be hosted on GitHub artifacts? This would reduce the chance of upstream data being unavailable and potentially simplify the code.
You are right. Since this PR only adds workflow files and scripts — it doesn't change the actual core library — the "A/B (branch vs main)" runs build and run the exact same benchmark binary as the "A/A (main vs main)" runs. Both sides compile identical Rust code, so the results are functionally A/A in both cases. I consolidated the A/A and A/B tables in the PR summary into a single "20 A/A runs, 19 passed, 95%" since both test identical benchmark code.
I pre-downloaded the data to https://github.com/microsoft/DiskANN/releases/tag/benchmark-data-v1; the pipeline now downloads it directly from there. 1d18ae5
Agreed that we should adopt this new framework to keep things less error-prone. Please share the issue once you have it.
Force-pushed 1d18ae5 to f58e084.
@YuanyuanTian-hh please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

Contributor License Agreement

This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Add DiskANN benchmark pipeline to GitHub Actions
Summary
Adds an automated benchmark pipeline to GitHub Actions, enabling regression detection for every PR. The pipeline builds and searches two public ANN datasets, compares performance against a baseline branch, and gates on configurable thresholds for recall, QPS, latency, and I/O metrics.
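The gating described above reduces to comparing each metric's percent change against an allowed interval; the sketch below is illustrative (function names and the 'up'/'down' direction encoding are assumptions, not the script's actual API):

```python
def allowed_range(threshold_pct: float, direction: str, mode: str) -> tuple:
    """Interval of acceptable percent change for one metric.

    aa mode is symmetric (runner noise in either direction is tolerated);
    pr mode only bounds changes in the regression direction.
    """
    if mode == "aa":
        return (-threshold_pct, threshold_pct)
    if direction == "up":  # higher is better (e.g. QPS, recall): only a drop is bad
        return (-threshold_pct, float("inf"))
    return (float("-inf"), threshold_pct)  # lower is better (e.g. latency): only a rise is bad

def threshold_failed(change_pct: float, rng: tuple) -> bool:
    """True when the observed change falls outside the allowed interval."""
    lo, hi = rng
    return not (lo <= change_pct <= hi)
```

For instance, a 13.19% deviation fails a symmetric ±10% A/A bound but would pass a ±15% one, mirroring the one observed `mean_cpus` A/A failure and the proposed threshold widening.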
What's Added
- `.github/workflows/benchmarks.yml`: `workflow_dispatch` workflow with a configurable baseline ref
- `.github/workflows/benchmarks-aa.yml`: daily A/A workflow that files an issue cc'ing `@microsoft/diskann-admin` on failure
- `.github/scripts/compare_disk_index_json_output.py`: diffs disk-index benchmark JSON outputs into a comparison CSV
- `.github/scripts/benchmark_result_parse.py`: parses the CSV, checks thresholds, and reports regressions
- `diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json`: Wikipedia-100K disk-index benchmark config
- `diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json`: OpenAI ArXiv 100K disk-index benchmark config

How It Works

The workflow builds and searches each dataset twice with the `diskann-benchmark` crate: once on the baseline ref (default `main`) and once on the current branch, then diffs the two JSON outputs and validates the changes against thresholds.

Datasets
Reliability Testing (A/A)
Ran 20 A/A workflow executions (both sides build identical benchmark code) to validate pipeline stability on shared GitHub runners:
In the one failed run, `mean_cpus` deviated 13.19% on a noisy shared runner (threshold is ±10%).

Pipeline Runtime
Both jobs run in parallel → ~78 min wall-clock per workflow run, gated by OpenAI-100K.
Before Merge TODO
- Remove the `push` trigger for the feature branch (keep only `workflow_dispatch`)
- Widen the `mean_cpus` threshold from ±10% to ±15–20% for shared runners