feat(migrate): [2/7] Quantization reliability layer #561
nkanu17 wants to merge 1 commit into feat/migrate-core
Conversation
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR
Security scan by Jit
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 50cff884e6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Pull request overview
Adds the synchronous migration execution layer for redisvl.migration, including crash-safe reliability primitives (checkpointed/resumable vector re-encoding) and integration tests to validate supported migration routes end-to-end.
Changes:
- Introduces MigrationExecutor implementing drop/recreate migrations with prefix renames, field renames, and vector datatype re-encoding.
- Adds reliability.py with quantization checkpointing, idempotent dtype detection, bounded undo buffering, and BGSAVE safety snapshot helpers.
- Adds integration test coverage for the apply+validate flow and a broad set of supported migration operations (including JSON storage and quantized dtypes).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file

| File | Description |
|---|---|
| redisvl/migration/executor.py | Sync migration executor implementing key enumeration, renames, quantization, drop/recreate, and validation orchestration. |
| redisvl/migration/reliability.py | Crash-safety utilities: dtype detection, checkpoint persistence, BGSAVE wait helpers, and rollback buffer. |
| redisvl/migration/__init__.py | Exposes MigrationExecutor as part of the public migration API. |
| tests/integration/test_migration_v1.py | Adds a basic end-to-end plan→apply→validate integration test. |
| tests/integration/test_migration_routes.py | Adds integration tests for supported algorithm/dtype/distance-metric migration routes. |
| tests/integration/test_migration_comprehensive.py | Adds a comprehensive integration suite covering many supported migration operations and JSON storage cases. |
@codex review
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
@codex review
- Fix partial key renames: fail fast on collision instead of continuing
- Fix completed checkpoint: resume from post-drop state for index recreation
- Pass rename_operations to get_vector_datatype_changes for renamed vectors
- Add warning when SCAN falls back to '*' (full keyspace scan)
- Warn when field rename overwrites existing destination field
- Fix has_prefix_change falsy check for empty string prefixes
- Add logger to migration utils
- Fix _quantize_vectors docstring: 'documents quantized' not 'processed'
- Wrap test_migration_v1 assertions in try/finally for cleanup
Force-pushed 50cff88 to 33f2a40 (Compare)
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 33f2a40b2d
- Remap datatype_changes keys to post-rename field names before quantization
- Only resume from completed checkpoint when source index is actually gone
Force-pushed 33f2a40 to 42aa7bb (Compare)
- Switch from import logging to redisvl.utils.log.get_logger in executor, reliability, utils
- Remove unused import logging from utils.py
- Verify BGSAVE success via rdb_last_bgsave_status in both sync and async versions
- Handle missing source index in current_source_matches_snapshot (return False)
- Always remap datatype_changes keys on resume (plan uses source field names)
- Delete stale completed checkpoint on fresh run to prevent skipping quantization
Force-pushed 1897678 to 63e248f (Compare)
@codex review
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
```python
import yaml

from redisvl.utils.log import get_logger

logger = get_logger(__name__)

from redisvl.index import SearchIndex
from redisvl.migration.models import (
```
Import sorting / module layout: logger = get_logger(__name__) is placed before from redisvl.index import SearchIndex, which leaves an import after executable code. This will fail make check-sort-imports/isort and violates the “imports first” rule. Move all imports (including SearchIndex) above the logger initialization.
```python
try:
    current_index = SearchIndex.from_existing(
        index_name,
        redis_url=redis_url,
        redis_client=redis_client,
    )
except Exception:
    # Index no longer exists (e.g. already dropped during migration)
    return False
```
current_source_matches_snapshot() catches Exception broadly and returns False. That will also treat transient Redis connection/auth issues as “schema mismatch / index missing”, which can mislead users and block migrations with the wrong manual action. Narrow the exception handling to the specific “index does not exist” error(s) (e.g., redis.exceptions.ResponseError matching unknown index) and re-raise unexpected exceptions.
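One way to narrow the handling, sketched with a hypothetical helper. The error-message matching below is an assumption about RediSearch's ResponseError text, not a guarantee; the real fix would catch redis.exceptions.ResponseError specifically:

```python
# Sketch: classify "index missing" separately from everything else, so
# connection/auth failures surface instead of reading as schema mismatch.
def is_unknown_index_error(exc: Exception) -> bool:
    # Assumed message fragments; RediSearch reports a missing index with
    # text along these lines in its ResponseError.
    message = str(exc).lower()
    return "unknown index" in message or "no such index" in message

def load_index_or_none(load_index, index_name):
    # load_index stands in for SearchIndex.from_existing (hypothetical hook
    # so the sketch runs without a live Redis connection).
    try:
        return load_index(index_name)
    except Exception as exc:
        if is_unknown_index_error(exc):
            return None  # index genuinely gone: mismatch result is expected
        raise  # transient Redis problems should propagate, not mask as mismatch
```

The caller would then treat `None` as "index missing, return False" while any other failure aborts with the real error.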
```python
import asyncio
from redisvl.utils.log import get_logger
import os
import tempfile
import time
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

import yaml
from pydantic import BaseModel, Field

from redisvl.migration.models import DTYPE_BYTES
```
Import sorting: stdlib imports (asyncio, os, tempfile, time, Path, typing) are interleaved with local imports (get_logger). With isort enabled (make check-sort-imports), this file should group stdlib, third-party, then local imports. Reorder these imports accordingly.
redisvl/migration/executor.py (Outdated)

```python
from redisvl.utils.log import get_logger
import time
from typing import Any, Callable, Dict, Generator, List, Optional

from redis.exceptions import ResponseError
```
Import sorting: import time (stdlib) comes after from redisvl.utils.log import get_logger (local), which will fail isort’s grouping rules. Please reorder imports to the standard (stdlib → third-party → local) layout so make check-sort-imports passes.
Suggested change:

```diff
-from redisvl.utils.log import get_logger
-import time
-from typing import Any, Callable, Dict, Generator, List, Optional
-
-from redis.exceptions import ResponseError
+import time
+from typing import Any, Callable, Dict, Generator, List, Optional
+
+from redis.exceptions import ResponseError
+
+from redisvl.utils.log import get_logger
```
```python
if checkpoint_path and has_same_width_quantization:
    report.validation.errors.append(
        "Crash-safe resume is not supported for same-width datatype "
        "changes (float16<->bfloat16 or int8<->uint8)."
    )
    report.manual_actions.append(
        "Re-run without --resume for same-width vector conversions, or "
        "split the migration to avoid same-width datatype changes."
```
Same-width dtype conversions (float16↔bfloat16, int8↔uint8) are not idempotent by byte-length detection, so any retry after a partial run (crash, timeout, manual stop) can corrupt already-converted vectors by re-interpreting target bytes as the source dtype. The code currently only blocks --resume for these cases, but a non-resume re-run is still unsafe. Consider blocking same-width conversions entirely, or require an explicit “unsafe” opt-in and add a strong warning/manual action explaining that recovery requires restoring from snapshot (e.g., the BGSAVE) rather than re-running.
Suggested change:

```diff
-if checkpoint_path and has_same_width_quantization:
-    report.validation.errors.append(
-        "Crash-safe resume is not supported for same-width datatype "
-        "changes (float16<->bfloat16 or int8<->uint8)."
-    )
-    report.manual_actions.append(
-        "Re-run without --resume for same-width vector conversions, or "
-        "split the migration to avoid same-width datatype changes."
+if has_same_width_quantization:
+    report.validation.errors.append(
+        "Same-width vector datatype changes (float16<->bfloat16 or "
+        "int8<->uint8) are not supported by this migration because they "
+        "are not crash-safe or safe to re-run: any partial run followed "
+        "by a retry can corrupt already-converted vectors."
+    )
+    report.manual_actions.append(
+        "Adjust the migration plan to avoid same-width vector datatype "
+        "changes (for example, migrate via a different dtype or use "
+        "offline tooling to convert embeddings). If this migration has "
+        "already run partially, restore the data from a clean snapshot "
+        "(e.g., a BGSAVE backup) before attempting a new migration."
```
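The corruption risk comes from byte-length dtype detection. A minimal sketch makes it concrete; `looks_converted` is a hypothetical stand-in for the executor's detection, not its actual implementation:

```python
import struct

def looks_converted(blob: bytes, dim: int, target_bytes_per_val: int) -> bool:
    # Hypothetical idempotency check: a field already at the target byte
    # width is assumed converted and skipped on retry.
    return len(blob) == dim * target_bytes_per_val

# Width-changing conversion (float32 -> float16) is retry-safe: the byte
# length tells converted and unconverted fields apart.
f32 = struct.pack("<4f", 0.5, 1.5, 2.5, 3.5)   # 4 dims x 4 bytes = 16 bytes
assert not looks_converted(f32, 4, 2)           # still float32: convert it
assert looks_converted(f32[:8], 4, 2)           # already 8 bytes: skip it

# Same-width conversion (int8 <-> uint8) is not: both blobs are 4 bytes,
# so a retry cannot tell an already-converted field from an unconverted one
# and would re-apply the conversion to target-dtype bytes.
i8 = struct.pack("<4b", -1, -2, -3, -4)
u8 = struct.pack("<4B", 255, 254, 253, 252)
assert looks_converted(i8, 4, 1) == looks_converted(u8, 4, 1)
```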
Move the sync MigrationExecutor into the core migration PR. This version handles: field renames, key renames (prefix change), drop/recreate index, wait for indexing, and validation. Quantization reliability (checkpoints, BGSAVE, undo buffers) is intentionally excluded and will be layered in PR #561.
Also includes:
- utils.py: logger, try/except in current_source_matches_snapshot, warnings in build_scan_match_patterns
- Integration tests (non-quantization): comprehensive + v1 flow
Add reliability module for crash-safe quantization during migrations:
- reliability.py: QuantizationCheckpoint, BatchUndoBuffer, trigger_bgsave_and_wait
- Enhanced executor.py: _quantize_vectors, checkpoint resume, BGSAVE sync
- test_migration_routes.py: quantization-specific integration tests
- test_migration_comprehensive.py: add datatype quantization test
Force-pushed 63e248f to 34c5f95 (Compare)
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34c5f95d1b
```python
logger.warning(
    "Checkpoint index '%s' does not match plan index '%s', ignoring",
    existing_checkpoint.index_name,
    plan.source.index_name,
)
```
Disable mismatched checkpoint before continuing apply
When the checkpoint index does not match the plan index, this branch only logs that it is being ignored, but checkpoint_path is still passed into _quantize_vectors. In quantized migrations, the run can proceed through drop/rename steps and then fail with a ValueError from _quantize_vectors on the same mismatch, leaving the source index dropped and migration incomplete. If a checkpoint is declared ignorable here, it should be detached (or removed) before the execution phase.
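A pre-apply guard along these lines would detach the stale checkpoint before the execution phase; `detach_stale_checkpoint` is a hypothetical helper, not code from this PR:

```python
from pathlib import Path

def detach_stale_checkpoint(checkpoint_path, checkpoint_index, plan_index):
    """Hypothetical pre-apply guard: if the checkpoint belongs to a different
    index, remove the file and return None so _quantize_vectors never receives
    a path it would later reject mid-run (after the source index is dropped)."""
    if checkpoint_path is None or checkpoint_index == plan_index:
        return checkpoint_path
    Path(checkpoint_path).unlink(missing_ok=True)
    return None
```

apply() would then overwrite its local `checkpoint_path` with the return value before any destructive step runs.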
```python
if checkpoint_path and has_same_width_quantization:
    report.validation.errors.append(
        "Crash-safe resume is not supported for same-width datatype "
        "changes (float16<->bfloat16 or int8<->uint8)."
    )
```
Scope same-width resume guard to hash quantization only
This early return blocks same-width datatype migrations whenever checkpoint_path is provided, even when the index uses JSON storage. For JSON, needs_quantization is false and vectors are re-indexed during recreate, so no in-place quantization/resume is performed. As written, valid JSON migrations like float16↔bfloat16 fail unnecessarily if callers pass a checkpoint path.
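A minimal sketch of the scoping fix. The flag names mirror this review's description and are assumptions about the executor's internals:

```python
def should_block_same_width(checkpoint_path, has_same_width_quantization,
                            needs_quantization):
    # Hypothetical guard: only refuse the migration when in-place hash
    # quantization will actually run. JSON storage re-indexes vectors during
    # recreate (needs_quantization is False), so no in-place conversion or
    # resume happens and the migration is safe to proceed.
    return bool(checkpoint_path
                and has_same_width_quantization
                and needs_quantization)
```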
```python
for key in batch:
    for field_name, change in datatype_changes.items():
        field_data: bytes | None = client.hget(key, field_name)  # type: ignore[misc,assignment]
        if not field_data:
```
Unpipelined individual HGET reads cause major slowdown
Medium Severity
In _quantize_vectors, each vector field for every key in a batch is read with an individual client.hget() call (one Redis round trip each), while the writes are correctly pipelined. With batch_size=500 and N vector fields, this produces 500×N sequential round trips per batch. The analogous _rename_field_in_hash method pipelines its reads (lines 314–318). For remote Redis or large key counts, this makes quantization orders of magnitude slower than necessary.
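A sketch of the pipelined-read pattern. `FakeClient`/`FakePipeline` are stand-ins for a redis-py client so the example is self-contained; the real code would call `pipeline(transaction=False)` on the actual connection, as the executor already does for writes:

```python
class FakePipeline:
    """Stand-in for redis-py's pipeline: queues HGETs, answers on execute()."""
    def __init__(self, store):
        self.store, self.ops = store, []
    def hget(self, key, field):
        self.ops.append((key, field))
    def execute(self):
        return [self.store.get(k, {}).get(f) for k, f in self.ops]

class FakeClient:
    """Stand-in for redis.Redis backed by a plain dict of hashes."""
    def __init__(self, store):
        self.store = store
    def pipeline(self, transaction=True):
        return FakePipeline(self.store)

def read_vector_fields(client, batch, field_names):
    # One round trip per batch instead of one per (key, field) pair:
    # queue every HGET, execute once, then zip results back to the pairs.
    pipe = client.pipeline(transaction=False)
    pairs = [(key, field) for key in batch for field in field_names]
    for key, field in pairs:
        pipe.hget(key, field)
    return dict(zip(pairs, pipe.execute()))

client = FakeClient({"doc:1": {"vec": b"\x01"}, "doc:2": {"vec": b"\x02"}})
results = read_vector_fields(client, ["doc:1", "doc:2"], ["vec"])
```

With batch_size=500 and N vector fields this replaces 500×N round trips with one per batch.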
Force-pushed 34c5f95 to 70520ba (Compare)
@codex review
Closing in favor of restructured PR stack (Option A ordering).
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
```python
        "split the migration to avoid same-width datatype changes."
    )
    report.finished_at = timestamp_utc()
    return report
```
Same-width check falsely blocks JSON storage migrations
Low Severity
The guard checkpoint_path and has_same_width_quantization blocks the entire migration without checking needs_quantization. Since needs_quantization is False for JSON storage (line 561), a JSON-backed index with a same-width vector datatype change (e.g., float16→bfloat16) would be unnecessarily rejected when a checkpoint_path is provided, even though no quantization is performed for JSON.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
```python
# Check if we are resuming from a checkpoint (post-drop crash).
# If so, the source index may no longer exist in Redis, so we
# skip live schema validation and construct from the plan snapshot.
resuming_from_checkpoint = False
if checkpoint_path:
```
The crash-safe resume path controlled by checkpoint_path/resuming_from_checkpoint adds substantial stateful behavior, but there are no tests that exercise an interrupted run and subsequent resume from a checkpoint. Adding an integration/unit test that validates resume behavior (e.g., completed_keys slicing, completed checkpoint skip, and post-drop reconstruction) would help prevent regressions in this high-risk flow.
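One way to cover the resume flow without orchestrating a live crash is to unit-test the key-slicing invariant directly. `keys_remaining` is a hypothetical helper mirroring what the executor's completed_keys handling must guarantee, not code from this PR:

```python
def keys_remaining(all_keys, completed_keys):
    """What resume must compute: the keys still left to quantize,
    preserving the original enumeration order."""
    done = set(completed_keys)
    return [k for k in all_keys if k not in done]

def test_resume_skips_completed_keys():
    all_keys = ["doc:1", "doc:2", "doc:3", "doc:4"]
    # Partial checkpoint: only the unfinished tail is reprocessed.
    assert keys_remaining(all_keys, ["doc:1", "doc:2"]) == ["doc:3", "doc:4"]
    # Completed checkpoint: quantization is skipped entirely.
    assert keys_remaining(all_keys, all_keys) == []
    # Fresh run (no checkpoint): everything is processed.
    assert keys_remaining(all_keys, []) == all_keys

test_resume_skips_completed_keys()
```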
```python
pipe = client.pipeline()
undo = BatchUndoBuffer()
keys_updated_in_batch: set[str] = set()

try:
```
In _quantize_vectors, the batch loop uses client.hget inside the nested key/field iteration, which creates O(keys * fields) network round-trips and can be extremely slow for large indexes. Consider batching/pipelining the reads (e.g., queue HGETs for the batch) and using pipeline(transaction=False) for the bulk HSETs to match the executor’s other pipeline usage.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 70520ba886
```python
)
checkpoint.save()
total_keys_for_progress = total_keys
```
Validate checkpoint path before dropping source index
Failing to write a new checkpoint can currently break the migration after destructive steps have already run: apply() drops the source index before _quantize_vectors() attempts checkpoint.save(), so an invalid/unwritable checkpoint_path (permission error, directory path, bad mount, etc.) raises here and leaves the source index dropped with no recreated target index. This should be preflighted (or checkpoint created) before the drop phase so path errors fail safely.
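A preflight along these lines could run before the drop phase; `preflight_checkpoint_path` is a hypothetical helper sketching the check, not part of this PR:

```python
import os
import tempfile
from pathlib import Path

def preflight_checkpoint_path(checkpoint_path):
    """Hypothetical pre-drop check: raise before any destructive migration
    step if the checkpoint file could never be written (directory path,
    missing/unwritable parent, bad mount)."""
    path = Path(checkpoint_path)
    if path.is_dir():
        raise ValueError(f"checkpoint_path is a directory: {path}")
    path.parent.mkdir(parents=True, exist_ok=True)
    # Probe with a sibling temp file so an existing checkpoint is untouched.
    fd, probe = tempfile.mkstemp(dir=path.parent)
    os.close(fd)
    os.remove(probe)
```

Calling this at the top of apply() turns a mid-migration crash into a safe early failure.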
```python
resuming_from_checkpoint = True
logger.info(
    "Checkpoint found at %s, skipping source index validation "
    "(index may have been dropped before crash)",
    checkpoint_path,
)
```
Verify source index absence before forcing resume mode
Any non-completed checkpoint immediately flips resuming_from_checkpoint to True without checking whether the source index still exists. In the stale-checkpoint case (for example, source was restored but checkpoint file remains), the executor skips normal validation/drop flow and later hits index-creation failure because the existing source index is still present. This branch should mirror the completed-checkpoint guard and only enter resume mode when the live source index is actually gone.


Summary
Adds the reliability module for crash-safe quantization during migrations. This layer enables vector quantization (e.g., float32 to float16/bfloat16/int8/uint8) with checkpoint-based resume capability, so interrupted quantization jobs can pick up where they left off.
Usage
Quantization is triggered automatically when a migration plan includes vector datatype changes. Interrupted quantization jobs can be resumed by re-running the migration with the same checkpoint path.
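The resume mechanic can be sketched in miniature. The JSON layout and field names below are illustrative assumptions, not the actual QuantizationCheckpoint schema from this PR:

```python
import json
import os

def save_checkpoint(path, index_name, completed_keys):
    # Persist which keys were already converted so a re-run can skip them.
    with open(path, "w") as f:
        json.dump({"index_name": index_name,
                   "completed_keys": sorted(completed_keys)}, f)

def load_completed(path, index_name):
    if not os.path.exists(path):
        return set()  # fresh run: nothing to skip
    with open(path) as f:
        data = json.load(f)
    if data["index_name"] != index_name:
        return set()  # checkpoint belongs to another index: ignore it
    return set(data["completed_keys"])
```

On resume, the executor subtracts the completed set from the enumerated keys and continues from there.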
What is included
- redisvl/migration/reliability.py: QuantizationReliability class with checkpoint writing, crash-safe resume, progress tracking, and per-key error isolation.
- redisvl/migration/executor.py: integration of the reliability layer into the sync executor quantization step, adding a checkpoint_path parameter and resume logic.
- redisvl/migration/__init__.py: exports for the reliability module.
- redisvl/migration/utils.py: helper functions for quantization checkpoint handling.
Details
The reliability layer:
PR Stack
- feat/migrate-core
- feat/migrate-executor
- feat/migrate-async
- feat/migrate-batch
- feat/migrate-wizard
- feat/migrate-cli-docs
- feat/migrate-benchmarks