Add DiskANN benchmark pipeline to GitHub Actions#857
YuanyuanTian-hh wants to merge 22 commits into main
Conversation
- Add `benchmarks.yml` workflow using `workflow_dispatch`, comparing the current branch against a configurable baseline ref
- Add `compare_disk_index_json_output.py` to diff benchmark-crate JSON outputs into a CSV suitable for `benchmark_result_parse.py`
- Add `benchmark_result_parse.py` for validating results and posting PR comments
- Add `wikipedia-100K-disk-index.json` benchmark config using the public Wikipedia-100K dataset from big-ann-benchmarks (100K Cohere embeddings, 768-dim, cosine distance) to replace internal ADO datasets
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main     #857      +/-   ##
==========================================
+ Coverage   89.28%   90.45%   +1.17%
==========================================
  Files         442      442
  Lines       83009    83248    +239
==========================================
+ Hits        74111    75301   +1190
+ Misses       8898     7947    -951
==========================================
```

Flags with carried forward coverage won't be shown.
…or ADO mimir-enron, not applicable to public datasets on GitHub runners. Threshold calibration tracked in PBI.
…-normalized, metric is inner product)
…p, not cosine similarity)
Pull request overview
Adds a GitHub Actions workflow to run DiskANN macro-benchmarks (disk-index build + search) on two public 100K datasets, compare current vs a baseline ref, and validate performance regressions via CSV-based threshold checks.
Changes:
- Add benchmark input configs for Wikipedia-100K and OpenAI ArXiv 100K disk-index benchmarks.
- Add a new Benchmarks GitHub Actions workflow that runs baseline + target, diffs results, and uploads artifacts.
- Add Python scripts to (1) diff disk-index JSON outputs into a CSV and (2) parse/validate the CSV against thresholds (optionally commenting on PRs).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json` | Adds a disk-index build/search benchmark config for Wikipedia-100K. |
| `diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json` | Adds a disk-index build/search benchmark config for OpenAI ArXiv 100K. |
| `.github/workflows/benchmarks.yml` | Introduces a workflow to run baseline vs current benchmarks, diff, validate, and upload artifacts. |
| `.github/scripts/compare_disk_index_json_output.py` | Generates a comparison CSV from two disk-index benchmark JSON outputs. |
| `.github/scripts/benchmark_result_parse.py` | Parses the comparison CSV, checks thresholds/contracts, and reports regressions (with optional PR commenting). |
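As a rough sketch of the comparison step (metric names, the CSV layout, and function names here are illustrative assumptions, not the script's actual API), `compare_disk_index_json_output.py` essentially computes a percent change per metric and emits CSV rows:

```python
import csv
import io

def percent_change(current: float, baseline: float) -> float:
    """Relative change of current vs baseline, in percent (0 if baseline is 0)."""
    if baseline == 0:
        return 0.0
    return (current - baseline) / baseline * 100.0

def diff_to_rows(current: dict, baseline: dict) -> list:
    """[metric, current, baseline, change%] rows for metrics present in both runs."""
    rows = []
    for metric in sorted(current.keys() & baseline.keys()):
        cur, base = float(current[metric]), float(baseline[metric])
        rows.append([metric, cur, base, round(percent_change(cur, base), 2)])
    return rows

def write_change_csv(rows: list) -> str:
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["metric", "current", "baseline", "change_pct"])
    writer.writerows(rows)
    return buf.getvalue()
```

With a current QPS of 1050 against a baseline of 1000, `diff_to_rows` would report a +5.0% change for that metric.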
```python
metrics["mean_hops"] = sr.get("mean_hops", 0)
metrics["mean_io_time"] = sr.get("mean_io_time", 0)
metrics["mean_cpus"] = sr.get("mean_cpu_time", 0)
metrics["latency_95"] = sr.get("p999_latency", 0)  # Use p999 as proxy for 95th percentile
```
`latency_95` is currently populated from `p999_latency`, which makes the reported/validated latency metric incorrect (the benchmark JSON includes `p95_latency` separately). Use the real p95 field (or rename the metric/key everywhere to p999) so the CSV and threshold checks match what they claim to measure.
Suggested change:

```diff
-metrics["latency_95"] = sr.get("p999_latency", 0)  # Use p999 as proxy for 95th percentile
+metrics["latency_95"] = sr.get("p95_latency", 0)  # Use actual 95th percentile latency
```
```python
# Total build time (in seconds)
build_time = build.get("build_time")
if build_time:
```
`build_time` is checked with a truthiness test (`if build_time:`), which will skip recording `total_time` if the value is 0. Prefer an explicit `is not None` check so 0 is handled consistently and the intent is clear.
Suggested change:

```diff
-if build_time:
+if build_time is not None:
```
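The difference is easy to demonstrate with a hypothetical helper mirroring the pattern above (names are illustrative): a legitimate measurement of 0 is falsy, so a truthiness check silently drops it, while an explicit `None` check keeps it.

```python
def record_total_time(build: dict) -> dict:
    """Copy build_time into a metrics dict only when the field is present."""
    metrics = {}
    build_time = build.get("build_time")
    if build_time is not None:  # explicit check: 0 is a valid measurement
        metrics["total_time"] = build_time
    return metrics
```

A cached or instantaneous build reporting `build_time: 0` is preserved here; under `if build_time:` it would be silently skipped.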
```python
# Parse values: [current, baseline, change%]
try:
    value_current = float(values[0])
    value_baseline = float(values[1])
    change = float(values[2]) if values[2] else 0.0
except (ValueError, IndexError) as e:
    print(f"ERROR: Failed to parse {category}/{metric}: {e}")
    return True, f"Parse error for {category}/{metric}"

# Get threshold config
threshold_config = thresholds[category][metric]
threshold_pct = threshold_config[0]
direction = threshold_config[1]
contract_value = threshold_config[2]

# Check thresholds
target_range = get_target_change_range(threshold_pct, direction, mode)
threshold_failed = is_change_threshold_failed(change, target_range)
promise_broken, target_formatted = is_promise_broken(value_current, contract_value, direction)

if threshold_failed:
    print(f"THRESHOLD FAILED: {category}/{metric} change={change}% allowed={format_interval(*target_range)}")
if promise_broken:
    print(f"PROMISE BROKEN: {category}/{metric} value={value_current} required={target_formatted}")

if threshold_failed or promise_broken:
    outcome = get_outcome_message(threshold_failed, promise_broken)
    failed_rows.append(
        f"| {category}/{metric} | {value_baseline} | {value_current} | "
        f"{target_formatted} | {change}% | {format_interval(*target_range)} | {outcome} |"
    )
```
`parse_csv()` appends values for every matching row, but `check_thresholds()` only reads `values[0..2]`. If the CSV contains multiple runs/jobs for the same category/metric, later entries will be ignored (or worse, the list will be misinterpreted). Consider storing each row as a structured triple (or iterating `values` in chunks of 3) and validating all entries.
Suggested change:

```diff
-# Parse values: [current, baseline, change%]
-try:
-    value_current = float(values[0])
-    value_baseline = float(values[1])
-    change = float(values[2]) if values[2] else 0.0
-except (ValueError, IndexError) as e:
-    print(f"ERROR: Failed to parse {category}/{metric}: {e}")
-    return True, f"Parse error for {category}/{metric}"
-
-# Get threshold config
-threshold_config = thresholds[category][metric]
-threshold_pct = threshold_config[0]
-direction = threshold_config[1]
-contract_value = threshold_config[2]
-
-# Check thresholds
-target_range = get_target_change_range(threshold_pct, direction, mode)
-threshold_failed = is_change_threshold_failed(change, target_range)
-promise_broken, target_formatted = is_promise_broken(value_current, contract_value, direction)
-
-if threshold_failed:
-    print(f"THRESHOLD FAILED: {category}/{metric} change={change}% allowed={format_interval(*target_range)}")
-if promise_broken:
-    print(f"PROMISE BROKEN: {category}/{metric} value={value_current} required={target_formatted}")
-
-if threshold_failed or promise_broken:
-    outcome = get_outcome_message(threshold_failed, promise_broken)
-    failed_rows.append(
-        f"| {category}/{metric} | {value_baseline} | {value_current} | "
-        f"{target_formatted} | {change}% | {format_interval(*target_range)} | {outcome} |"
-    )
+# Values may contain multiple triples: [current, baseline, change%] * N
+# Validate each triple independently.
+for i in range(0, len(values), 3):
+    triple = values[i:i + 3]
+    if len(triple) < 3:
+        # Malformed data: incomplete triple
+        print(f"ERROR: Incomplete data for {category}/{metric} at index {i}: {triple}")
+        return True, f"Parse error for {category}/{metric}"
+    try:
+        value_current = float(triple[0])
+        value_baseline = float(triple[1])
+        change = float(triple[2]) if triple[2] else 0.0
+    except (ValueError, IndexError) as e:
+        print(f"ERROR: Failed to parse {category}/{metric} at index {i}: {e}")
+        return True, f"Parse error for {category}/{metric}"
+
+    # Get threshold config
+    threshold_config = thresholds[category][metric]
+    threshold_pct = threshold_config[0]
+    direction = threshold_config[1]
+    contract_value = threshold_config[2]
+
+    # Check thresholds
+    target_range = get_target_change_range(threshold_pct, direction, mode)
+    threshold_failed = is_change_threshold_failed(change, target_range)
+    promise_broken, target_formatted = is_promise_broken(value_current, contract_value, direction)
+
+    if threshold_failed:
+        print(f"THRESHOLD FAILED: {category}/{metric} change={change}% allowed={format_interval(*target_range)}")
+    if promise_broken:
+        print(f"PROMISE BROKEN: {category}/{metric} value={value_current} required={target_formatted}")
+
+    if threshold_failed or promise_broken:
+        outcome = get_outcome_message(threshold_failed, promise_broken)
+        failed_rows.append(
+            f"| {category}/{metric} | {value_baseline} | {value_current} | "
+            f"{target_formatted} | {change}% | {format_interval(*target_range)} | {outcome} |"
+        )
```
```python
choices=['aa', 'pr', 'lkg'],
help='Benchmark mode: aa=A/A test (symmetric), pr=PR test (directional), lkg=last known good'
```
`--mode` accepts `lkg`, but the threshold logic treats any non-`aa` mode as PR-directional. Either implement distinct `lkg` behavior (and document it), or remove `lkg` from the accepted choices to avoid misleading callers.
Suggested change:

```diff
-choices=['aa', 'pr', 'lkg'],
-help='Benchmark mode: aa=A/A test (symmetric), pr=PR test (directional), lkg=last known good'
+choices=['aa', 'pr'],
+help='Benchmark mode: aa=A/A test (symmetric), pr=PR test (directional)'
```
```yaml
  push:
    branches:
      - 'user/tianyuanyuan/add-benchmark-pipeline'
    paths:
```
The workflow is titled as a general benchmark pipeline, but the `push` trigger is hard-coded to a single user branch. That means it won't run for typical PRs/branches in this repo. Consider switching to `pull_request` and/or `push` on `main` (or removing the `push` trigger entirely if this is intended to be manual-only via `workflow_dispatch`).
```yaml
          python .github/scripts/benchmark_result_parse.py \
            --mode pr \
            --file target/tmp/wikipedia-100K_change.csv
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_REPOSITORY: ${{ github.repository }}
          GITHUB_PR_NUMBER: ${{ github.event.pull_request.number }}
          GITHUB_RUN_ID: ${{ github.run_id }}
```
`GITHUB_PR_NUMBER` is populated from `github.event.pull_request.number`, but this workflow currently triggers on `workflow_dispatch`/`push` (no `pull_request` payload), so this env var will be empty and PR commenting can't work. Either add a `pull_request` trigger or gate PR-comment behavior on the event type / provide the PR number as an input.
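A minimal way to make the script tolerant of both event types (a sketch; only the env var name follows the snippet above, the helper itself is hypothetical) is to treat an empty `GITHUB_PR_NUMBER` as "no PR context" and skip commenting:

```python
def resolve_pr_number(env: dict):
    """Return the PR number if the workflow provided one, else None.

    Under workflow_dispatch/push there is no pull_request payload, so
    GITHUB_PR_NUMBER arrives empty and commenting should be skipped.
    """
    raw = env.get("GITHUB_PR_NUMBER", "").strip()
    return int(raw) if raw.isdigit() else None

def should_comment(env: dict) -> bool:
    """Only attempt a PR comment when a PR number is actually available."""
    return resolve_pr_number(env) is not None
```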
Hey @YuanyuanTian-hh — thanks for putting this together! Porting automated benchmark regression detection into our CI is definitely something we need! I've been thinking about the overall architecture and want to share some thoughts for discussion among the broader team. This is a tricky design problem, and a bit of planning now will save us a lot of time in the future.

My main concern is the cross-language hot-potato with results and data coupling. Every step in the current pipeline chain introduces a place where something can go wrong, and with GitHub-centric checking scripts it's difficult to validate changes locally. The benchmark crate already owns the Rust types for these results, so I think the right long-term approach is to have the benchmark crate itself handle the comparison and threshold checking. That way the comparison logic lives next to the Rust types it consumes, and the checks can be run locally.
Suggested path forward

I don't want this effort to go to waste; there's good stuff here. Here's what I'd suggest:
I know this is a bigger scope change than you might have expected, and I'm sorry about that. I want to make sure we build this in a way that's maintainable as the benchmark crate evolves and can be reused more broadly. I'll put together an issue with enough detail that we can work through the design async — let me know if you have questions or want to discuss any of this further!

A smaller note on dataset downloads
The "A/B (branch vs main)" number in the description doesn't make sense. It is essentially the same as "A/A (branch vs itself)", no? Why do we want to present both numbers?
```yaml
        type: string
  push:
    branches:
      - 'user/tianyuanyuan/add-benchmark-pipeline'
```
Why is this specific branch here?
This code allows the new pipeline to be tested on my branch, I will update it to 'main' before I merge it.
```yaml
  push:
    branches:
      - 'user/tianyuanyuan/add-benchmark-pipeline'
    paths:
```
Why do we limit the triggering to these paths?
This code allows the new pipeline to be tested on my branch, I will update it to 'main' before I merge it.
```yaml
name: Benchmarks

on:
```
We want A/A testing to happen daily and fail if the deviation is more than the expected threshold, ideally notifying diskann-admins about the failure. Is it enabled?
I added it in this PR as a separate workflow, benchmarks-aa.yml. It runs daily; any deviation beyond the calibrated range fails the job. On failure, a notify-on-failure job automatically creates a GitHub issue and cc's diskann-admin. It will activate once this PR is merged to main.
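For context, the notify job's issue creation boils down to a single REST call (POST /repos/{owner}/{repo}/issues); this payload builder is a hypothetical sketch, not the actual workflow's code, and the title format, labels, and field choices are assumptions:

```python
def failure_issue_payload(workflow: str, run_url: str) -> dict:
    """JSON body for the GitHub REST 'create an issue' endpoint."""
    return {
        "title": f"[A/A benchmark] {workflow} deviated beyond the calibrated range",
        "body": (
            "The daily A/A benchmark run failed its deviation threshold.\n\n"
            f"Run: {run_url}\n\n"
            "cc @microsoft/diskann-admin"
        ),
        "labels": ["benchmarks"],
    }
```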
Could the datasets be hosted on GitHub artifacts? This would reduce the chance of upstream data being unavailable and potentially simplify the code.
You are right. Since this PR only adds workflow files and scripts — it doesn't change the actual core library — the "A/B (branch vs main)" runs build and run the exact same benchmark binary as the "A/A (main vs main)" runs. Both sides compile identical Rust code, so the results are functionally A/A in both cases. I consolidated the A/A and A/B tables in the PR summary into a single "20 A/A runs, 19 passed, 95%" since both test identical benchmark code.
I pre-downloaded the data to https://github.com/microsoft/DiskANN/releases/tag/benchmark-data-v1; the pipeline now downloads it directly from there. 1d18ae5
Agreed that we should adopt this new framework to keep things less error-prone. Please share the issue once you have it.
Force-pushed 1d18ae5 to f58e084.
@YuanyuanTian-hh please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

Contributor License Agreement

This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Add DiskANN benchmark pipeline to GitHub Actions
Summary
Adds an automated benchmark pipeline to GitHub Actions, enabling regression detection for every PR. The pipeline builds and searches two public ANN datasets, compares performance against a baseline branch, and gates on configurable thresholds for recall, QPS, latency, and I/O metrics.
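The gating described above reduces to comparing each metric's percent change against an allowed interval; the sketch below is illustrative (function names and the 'up'/'down' direction encoding are assumptions, not the script's actual API):

```python
def allowed_range(threshold_pct: float, direction: str, mode: str) -> tuple:
    """Interval of acceptable percent change for one metric.

    aa mode is symmetric (runner noise in either direction is tolerated);
    pr mode only bounds changes in the regression direction.
    """
    if mode == "aa":
        return (-threshold_pct, threshold_pct)
    if direction == "up":  # higher is better (e.g. QPS, recall): only a drop is bad
        return (-threshold_pct, float("inf"))
    return (float("-inf"), threshold_pct)  # lower is better (e.g. latency): only a rise is bad

def threshold_failed(change_pct: float, rng: tuple) -> bool:
    """True when the observed change falls outside the allowed interval."""
    lo, hi = rng
    return not (lo <= change_pct <= hi)
```

For instance, a 13.19% deviation fails a symmetric ±10% A/A bound but would pass a ±15% one, mirroring the one observed `mean_cpus` A/A failure and the proposed threshold widening.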
What's Added
- `.github/workflows/benchmarks.yml`: `workflow_dispatch` workflow with a configurable baseline ref
- `.github/workflows/benchmarks-aa.yml`: daily A/A workflow that files an issue cc'ing `@microsoft/diskann-admin` on failure
- `.github/scripts/compare_disk_index_json_output.py`: diffs disk-index benchmark JSON outputs into a comparison CSV
- `.github/scripts/benchmark_result_parse.py`: parses the CSV, checks thresholds, and reports regressions
- `diskann-benchmark/perf_test_inputs/wikipedia-100K-disk-index.json`: Wikipedia-100K disk-index benchmark config
- `diskann-benchmark/perf_test_inputs/openai-100K-disk-index.json`: OpenAI ArXiv 100K disk-index benchmark config

How It Works

The workflow builds and searches each dataset twice with the `diskann-benchmark` crate: once on the baseline ref (default `main`) and once on the current branch, then diffs the two JSON outputs and validates the changes against thresholds.

Datasets
Reliability Testing (A/A)
Ran 20 A/A workflow executions (both sides build identical benchmark code) to validate pipeline stability on shared GitHub runners:
In the one failed run, `mean_cpus` deviated 13.19% on a noisy shared runner (threshold is ±10%).

Pipeline Runtime
Both jobs run in parallel → ~78 min wall-clock per workflow run, gated by OpenAI-100K.
Before Merge TODO
- Remove the `push` trigger for the feature branch (keep only `workflow_dispatch`)
- Widen the `mean_cpus` threshold from ±10% to ±15–20% for shared runners