feat(migrate): [7/7] Performance benchmarks #566
nkanu17 wants to merge 10 commits into feat/migrate-cli-docs
Conversation
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR. (Security scan by Jit)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99c262d1de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Pull request overview
Adds benchmark and visualization tooling under tests/benchmarks/ to measure migrator and retrieval performance (throughput/latency/memory) across multiple index configurations and dataset scales, plus scripts to render result charts.
Changes:
- Add an end-to-end migration timing benchmark (sync + async) with index-size scaling and memory reporting.
- Add retrieval benchmarks comparing FLAT vs HNSW and FP32 vs FP16, including overlap/recall-style metrics.
- Add matplotlib-based result visualization for benchmark JSON outputs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/benchmarks/visualize_results.py | New charting script to generate memory/precision/QPS/latency/recall plots from benchmark JSON results. |
| tests/benchmarks/retrieval_benchmark.py | New retrieval benchmark runner to build/query multiple index configs and output structured metrics. |
| tests/benchmarks/migration_benchmark.py | New migration benchmark runner to populate synthetic corpora and time migrations (sync/async), including timing breakdowns. |
| tests/benchmarks/index_migrator_real_benchmark.py | New “real dataset” benchmark (AG News + sentence-transformers) to measure migration + query quality/memory on realistic text data. |
@codex review
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
@codex review
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
- Fix SentenceTransformer local_files_only fallback for first run
- Remove unused imports (string, struct)
- Fix dead store enum_s (prefix with _)
- Guard ZeroDivisionError in chart_memory_savings
- Fix compute_recall to return 0.0 for empty ground truth
- Wrap logger handler in try/finally to prevent leak on exception
Scoped cleanup to current database only to avoid affecting other databases on shared Redis instances.
Force-pushed 751a0e5 to 49ce997
- Remove leftover numbered list fragment in migrate-indexes.md (#565 #20)
- Extract duplicated progress_callback into shared _make_progress_callback method in migrate.py (#565 #21)
- Add rvl migrate section to CLI reference docs (#565 #19)
- Add try/except for matplotlib import in visualize_results.py (#566 #14)
Force-pushed 99c262d to 26ce024
@codex review
💡 Codex Review
Reviewed commit: 26ce0246e7
Force-pushed 49ce997 to a1d54c9
- Clean up source index on failure in benchmark finally block
- Include queried_top_k in latency metrics for clarity when effective_top_k differs
Force-pushed 07a1312 to 00a11fb
- Skip CSV header row if present (detect non-digit first column)
- Handle short CSV rows gracefully
- Switch flushall to flushdb to avoid wiping other databases
- plan_path pointed to a deleted TemporaryDirectory; removed it from the results dict
@codex review
💡 Codex Review
Reviewed commit: 00a11fb6d6
```python
ground_truth_results,
result["result_sets"],
k_values=valid_k_values,
ground_truth_depth=top_k,
```
Compute recall against full ground-truth depth
The recall curve is currently computed with ground_truth_depth=top_k, which is 10 by default, even though queries are fetched up to recall_k_max (default 100). This makes recall@20/50/100 measure recovery of only the top-10 neighbors, so curves saturate early and can materially overstate ANN quality at larger K values. In practical runs with default flags, this misreports the intended recall-vs-K behavior and can lead to incorrect benchmarking conclusions.
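The fix the reviewer describes can be sketched as follows. `compute_recall_curve` is a hypothetical helper (not the PR's actual function), shown only to illustrate computing recall@K against ground truth fetched to at least `max(k_values)` depth, so recall@100 compares against the true top-100 rather than a truncated top-10 list:

```python
from typing import Dict, List, Sequence


def compute_recall_curve(
    ground_truth: Sequence[Sequence[str]],
    results: Sequence[Sequence[str]],
    k_values: Sequence[int],
) -> Dict[int, float]:
    """Recall@K averaged over queries.

    ground_truth[i] must contain at least max(k_values) exact neighbors
    for query i, so each recall@K compares against the true top-K rather
    than a list truncated at the default top_k.
    """
    curve: Dict[int, float] = {}
    for k in k_values:
        per_query: List[float] = []
        for truth, found in zip(ground_truth, results):
            true_top_k = set(truth[:k])
            if not true_top_k:
                per_query.append(0.0)  # empty ground truth -> recall 0.0
                continue
            hits = len(true_top_k & set(found[:k]))
            per_query.append(hits / len(true_top_k))
        curve[k] = sum(per_query) / len(per_query) if per_query else 0.0
    return curve
```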
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
```python
continue
if len(records) >= required_docs:
    break
label, title, description = row
```
CSV row unpacking crashes on more than three columns
Low Severity
The guard on line 150 only skips rows with fewer than 3 columns (len(row) < 3), but label, title, description = row on line 157 will raise ValueError: too many values to unpack if a CSV row produces more than 3 fields. This can happen with non-standard AG News CSV files that have extra columns or malformed quoting. Using row[0], row[1], row[2] or starred unpacking would be more robust.
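A sketch of the suggested hardening, assuming the row layout described above (label, title, description); `parse_ag_news_rows` and its record shape are illustrative, not the PR's actual code. Starred unpacking tolerates extra columns, the length guard skips short rows, and a non-digit first column is treated as a header:

```python
import csv
from typing import Dict, Iterable, List


def parse_ag_news_rows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse AG News CSV lines, tolerating headers, short rows, and extras."""
    records: List[Dict[str, str]] = []
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # short or malformed row: skip instead of crashing
        if not row[0].strip().isdigit():
            continue  # header row: first column is not a numeric label
        label, title, description, *_extra = row  # ignore extra columns
        records.append({"label": label, "title": title, "text": description})
    return records
```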
Throughput, latency, memory measurement, and visualization: benchmark scripts for measuring migration performance across different index sizes and configurations. Includes real-world AG News dataset benchmarks, synthetic end-to-end timing benchmarks with memory tracking, retrieval quality comparisons (FLAT vs HNSW, FP32 vs FP16), and matplotlib visualization tooling for benchmark results.
Force-pushed e851d24 to 5d52713
Force-pushed 00a11fb to 78ca865
@codex review
Closing in favor of restructured PR stack (Option A ordering).
💡 Codex Review
Reviewed commit: 78ca86567e
```python
# Use the larger of top_k and recall_k_max for querying
# so we have enough results for recall curve computation
effective_top_k = max(top_k, recall_k_max)
```
Measure latency/QPS with requested top_k
benchmark_scale inflates top_k to max(top_k, recall_k_max) and uses that value for all timed queries, so with defaults (top_k=10, recall_k_max=100) the reported p50/p95/QPS are actually for top-100 retrieval, not top-10. This skews the main throughput/latency metrics and makes comparisons across runs sensitive to recall settings rather than the requested query depth.
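One way to address this, sketched with hypothetical names (`timed_latency_metrics`, `run_benchmark`, `search_fn` are not the PR's identifiers): time queries at the requested `top_k`, then run a separate untimed pass at `recall_k_max` to gather recall-curve data:

```python
import statistics
import time


def timed_latency_metrics(search_fn, queries, top_k=10):
    """Time each query at the *requested* top_k so p50/p95/QPS reflect
    the depth the user asked for, not the recall-curve depth."""
    latencies_ms = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q, top_k)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start
    latencies_ms.sort()
    return {
        "queried_top_k": top_k,
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[int(0.95 * (len(latencies_ms) - 1))],
        "qps": len(queries) / elapsed if elapsed > 0 else 0.0,
    }


def run_benchmark(search_fn, queries, top_k=10, recall_k_max=100):
    # 1) Timed pass at the requested depth: drives p50/p95/QPS.
    metrics = timed_latency_metrics(search_fn, queries, top_k=top_k)
    # 2) Separate untimed pass at recall_k_max: feeds the recall curve only.
    recall_results = [search_fn(q, recall_k_max) for q in queries]
    return metrics, recall_results
```

This keeps throughput/latency numbers comparable across runs regardless of the recall settings chosen.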
```python
{
    "doc_id": f"ag-news-{len(records)}",
    "text": text,
    "label": AG_NEWS_LABELS[int(label) - 1],
```
Accept both 0-based and 1-based CSV labels
load_ag_news_records_from_csv always subtracts 1 from the CSV label before indexing AG_NEWS_LABELS, which crashes for common AG News exports that already use 0-3 labels (e.g., Hugging Face-style dumps). Because --dataset-csv is documented as a generic local AG News input path, this makes valid inputs fail with KeyError before benchmarking starts.
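A minimal sketch of the suggested normalization; `map_ag_news_labels` is hypothetical, and the scheme inference (a 0 anywhere in the column means 0-based, otherwise assume the original 1-based export) is one possible heuristic, not the PR's implementation:

```python
from typing import Iterable, List

# Standard AG News class names, in canonical order.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def map_ag_news_labels(raw_labels: Iterable[str]) -> List[str]:
    """Map CSV label values to class names, accepting 1-4 (original
    AG News CSVs) or 0-3 (Hugging Face-style dumps).

    The scheme is inferred from the whole column: if a 0 appears
    anywhere, the file is 0-based; otherwise assume 1-based. Columns
    containing only 1-3 are ambiguous and default to 1-based.
    """
    values = [int(v) for v in raw_labels]
    offset = 0 if 0 in values else 1
    return [AG_NEWS_LABELS[v - offset] for v in values]
```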


Summary
Adds a benchmarking suite for the Index Migrator, measuring throughput, latency, memory usage, and index size across different migration scenarios (field changes, vector algorithm changes, quantization, key prefix changes, and mixed operations).
Usage
What is included
- tests/benchmarks/migration_benchmark.py: Comprehensive benchmark suite that measures:
- tests/benchmarks/index_migrator_real_benchmark.py: Benchmark with realistic data volumes and schema complexity.
- tests/benchmarks/conftest.py: Pytest fixtures for benchmark Redis instances and data setup.

Details
Benchmark scenarios covered:
Results are written as YAML reports for comparison across runs.
PR Stack
1. feat/migrate-core
2. feat/migrate-executor
3. feat/migrate-async
4. feat/migrate-batch
5. feat/migrate-wizard
6. feat/migrate-cli-docs
7. feat/migrate-benchmarks