
feat(migrate): [2/7] Quantization reliability layer#561

Closed
nkanu17 wants to merge 1 commit into feat/migrate-core from feat/migrate-executor

Conversation

nkanu17 (Collaborator) commented Apr 1, 2026

Summary

Adds the reliability module for crash-safe quantization during migrations. This layer enables vector quantization (e.g., float32 to float16/bfloat16/int8/uint8) with checkpoint-based resume capability, so interrupted quantization jobs can pick up where they left off.
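As a minimal standalone illustration of the underlying conversion (using numpy directly, not the migration code itself), quantizing a float32 vector to float16 halves its byte footprint:

```python
# Standalone illustration of vector quantization, independent of redisvl:
# converting a float32 embedding to float16 halves its serialized size.
import numpy as np

vec32 = np.array([0.25, -1.5, 3.0, 0.125], dtype=np.float32)
vec16 = vec32.astype(np.float16)  # lossy narrowing conversion in general

assert vec32.nbytes == 16  # 4 elements x 4 bytes
assert vec16.nbytes == 8   # 4 elements x 2 bytes
# These particular values are exactly representable in float16,
# so the round trip back to float32 happens to be lossless here.
assert np.array_equal(vec16.astype(np.float32), vec32)
```

The same pattern applies to the other target dtypes listed below; int8/uint8 additionally involve scaling, which this sketch omits.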

Usage

Quantization is triggered automatically when a migration plan includes vector type changes. Users can resume interrupted quantization jobs:

# Apply a migration with quantization (auto-detected from plan)
rvl migrate apply --plan migration_plan.yaml

# Resume after a crash (checkpoint file is written during quantization)
rvl migrate apply --plan migration_plan.yaml --resume quantization_checkpoint.yaml

What is included

  • Reliability module (redisvl/migration/reliability.py): QuantizationReliability class with checkpoint writing, crash-safe resume, progress tracking, and per-key error isolation.
  • Executor updates (redisvl/migration/executor.py): Integration of the reliability layer into the sync executor quantization step, adding checkpoint_path parameter and resume logic.
  • Init updates (redisvl/migration/__init__.py): Exports for the reliability module.
  • Utils updates (redisvl/migration/utils.py): Helper functions for quantization checkpoint handling.

Details

The reliability layer:

  • Writes checkpoints after every N keys (configurable batch size) during quantization
  • Tracks which keys have been processed, allowing resume from the last checkpoint
  • Isolates per-key errors so a single bad document does not abort the entire job
  • Supports all quantization types: float16, bfloat16, int8, uint8, binary
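The checkpoint-and-resume pattern described above can be sketched as follows. Class and field names here are hypothetical, not the module's actual API; the real layer persists checkpoints as YAML and tracks richer state.

```python
# Illustrative sketch (hypothetical names, not the redisvl API) of
# checkpointed, per-key-isolated processing with crash-safe resume.
import json
import os
import tempfile

class QuantizationCheckpointSketch:
    def __init__(self, path: str, batch_size: int = 500):
        self.path = path
        self.batch_size = batch_size
        self.completed_keys: list[str] = []

    def save(self) -> None:
        with open(self.path, "w") as f:
            json.dump({"completed_keys": self.completed_keys}, f)

    @classmethod
    def load(cls, path: str, batch_size: int = 500):
        ckpt = cls(path, batch_size)
        if os.path.exists(path):
            with open(path) as f:
                ckpt.completed_keys = json.load(f)["completed_keys"]
        return ckpt

def quantize_keys(keys, ckpt, process_one) -> None:
    done = set(ckpt.completed_keys)          # resume: skip finished keys
    for i, key in enumerate(k for k in keys if k not in done):
        try:
            process_one(key)                 # per-key error isolation:
        except Exception:
            continue                         # one bad document is skipped
        ckpt.completed_keys.append(key)
        if (i + 1) % ckpt.batch_size == 0:
            ckpt.save()                      # checkpoint every N keys
    ckpt.save()

# Demo: first run covers "a" and "b"; the resumed run only touches "c".
path = os.path.join(tempfile.mkdtemp(), "quant_ckpt.json")
quantize_keys(["a", "b"], QuantizationCheckpointSketch(path, 1), lambda k: None)
resumed = QuantizationCheckpointSketch.load(path, 1)
seen: list[str] = []
quantize_keys(["a", "b", "c"], resumed, seen.append)
```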

PR Stack

  • #560 (feat/migrate-core): Core planner, validator, sync executor, and basic CLI
  • #561 (this PR, feat/migrate-executor): Quantization reliability layer
  • #562 (feat/migrate-async): Async planner, executor, and validator
  • #563 (feat/migrate-batch): Batch migration with multi-index support
  • #564 (feat/migrate-wizard): Interactive migration wizard
  • #565 (feat/migrate-cli-docs): Full CLI commands and documentation
  • #566 (feat/migrate-benchmarks): Performance benchmarks

jit-ci bot commented Apr 1, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50cff884e6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot AI (Contributor) left a comment

Pull request overview

Adds the synchronous migration execution layer for redisvl.migration, including crash-safe reliability primitives (checkpointed/resumable vector re-encoding) and integration tests to validate supported migration routes end-to-end.

Changes:

  • Introduces MigrationExecutor implementing drop/recreate migrations with prefix renames, field renames, and vector datatype re-encoding.
  • Adds reliability.py with quantization checkpointing, idempotent dtype detection, bounded undo buffering, and BGSAVE safety snapshot helpers.
  • Adds integration test coverage for the apply+validate flow and a broad set of supported migration operations (including JSON storage and quantized dtypes).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file:
  • redisvl/migration/executor.py: Sync migration executor implementing key enumeration, renames, quantization, drop/recreate, and validation orchestration.
  • redisvl/migration/reliability.py: Crash-safety utilities: dtype detection, checkpoint persistence, BGSAVE wait helpers, and rollback buffer.
  • redisvl/migration/__init__.py: Exposes MigrationExecutor as part of the public migration API.
  • tests/integration/test_migration_v1.py: Adds a basic end-to-end plan→apply→validate integration test.
  • tests/integration/test_migration_routes.py: Adds integration tests for supported algorithm/dtype/distance-metric migration routes.
  • tests/integration/test_migration_comprehensive.py: Adds a comprehensive integration suite covering many supported migration operations and JSON storage cases.


nkanu17 (Collaborator, Author) commented Apr 1, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review
Reviewed commit: 50cff884e6

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.



nkanu17 (Collaborator, Author) commented Apr 1, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review
Reviewed commit: 50cff884e6

nkanu17 added a commit that referenced this pull request Apr 2, 2026
- Fix partial key renames: fail fast on collision instead of continuing
- Fix completed checkpoint: resume from post-drop state for index recreation
- Pass rename_operations to get_vector_datatype_changes for renamed vectors
- Add warning when SCAN falls back to '*' (full keyspace scan)
- Warn when field rename overwrites existing destination field
- Fix has_prefix_change falsy check for empty string prefixes
- Add logger to migration utils
nkanu17 added a commit that referenced this pull request Apr 2, 2026
- Fix _quantize_vectors docstring: 'documents quantized' not 'processed'
- Wrap test_migration_v1 assertions in try/finally for cleanup
nkanu17 force-pushed the feat/migrate-executor branch from 50cff88 to 33f2a40 (April 2, 2026 00:30)
nkanu17 (Collaborator, Author) commented Apr 2, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review
Reviewed commit: 33f2a40b2d

Copilot AI review requested due to automatic review settings April 2, 2026 03:58
nkanu17 added a commit that referenced this pull request Apr 2, 2026
- Remap datatype_changes keys to post-rename field names before quantization
- Only resume from completed checkpoint when source index is actually gone
nkanu17 force-pushed the feat/migrate-executor branch from 33f2a40 to 42aa7bb (April 2, 2026 03:58)
nkanu17 added a commit that referenced this pull request Apr 2, 2026
- Switch from import logging to redisvl.utils.log.get_logger in executor, reliability, utils
- Remove unused import logging from utils.py
- Verify BGSAVE success via rdb_last_bgsave_status in both sync and async versions
nkanu17 added a commit that referenced this pull request Apr 2, 2026
- Handle missing source index in current_source_matches_snapshot (return False)
- Always remap datatype_changes keys on resume (plan uses source field names)
- Delete stale completed checkpoint on fresh run to prevent skipping quantization
Copilot AI review requested due to automatic review settings April 2, 2026 14:49
nkanu17 force-pushed the feat/migrate-executor branch from 1897678 to 63e248f (April 2, 2026 14:49)
nkanu17 (Collaborator, Author) commented Apr 2, 2026

@codex review

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.



Comment on lines 8 to 15
import yaml

from redisvl.utils.log import get_logger

logger = get_logger(__name__)

from redisvl.index import SearchIndex
from redisvl.migration.models import (
Copilot AI commented Apr 2, 2026

Import sorting / module layout: logger = get_logger(__name__) is placed before from redisvl.index import SearchIndex, which leaves an import after executable code. This will fail make check-sort-imports/isort and violates the “imports first” rule. Move all imports (including SearchIndex) above the logger initialization.

Comment on lines +328 to +336
try:
    current_index = SearchIndex.from_existing(
        index_name,
        redis_url=redis_url,
        redis_client=redis_client,
    )
except Exception:
    # Index no longer exists (e.g. already dropped during migration)
    return False
Copilot AI commented Apr 2, 2026

current_source_matches_snapshot() catches Exception broadly and returns False. That will also treat transient Redis connection/auth issues as “schema mismatch / index missing”, which can mislead users and block migrations with the wrong manual action. Narrow the exception handling to the specific “index does not exist” error(s) (e.g., redis.exceptions.ResponseError matching unknown index) and re-raise unexpected exceptions.

Comment on lines +7 to +18
import asyncio
from redisvl.utils.log import get_logger
import os
import tempfile
import time
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

import yaml
from pydantic import BaseModel, Field

from redisvl.migration.models import DTYPE_BYTES
Copilot AI commented Apr 2, 2026

Import sorting: stdlib imports (asyncio, os, tempfile, time, Path, typing) are interleaved with local imports (get_logger). With isort enabled (make check-sort-imports), this file should group stdlib, third-party, then local imports. Reorder these imports accordingly.

Comment on lines +4 to +10

from redisvl.utils.log import get_logger
import time
from typing import Any, Callable, Dict, Generator, List, Optional

from redis.exceptions import ResponseError

Copilot AI commented Apr 2, 2026

Import sorting: import time (stdlib) comes after from redisvl.utils.log import get_logger (local), which will fail isort’s grouping rules. Please reorder imports to the standard (stdlib → third-party → local) layout so make check-sort-imports passes.

Suggested change:

Before:
from redisvl.utils.log import get_logger
import time
from typing import Any, Callable, Dict, Generator, List, Optional
from redis.exceptions import ResponseError

After:
import time
from typing import Any, Callable, Dict, Generator, List, Optional
from redis.exceptions import ResponseError
from redisvl.utils.log import get_logger
Comment on lines +569 to +576
if checkpoint_path and has_same_width_quantization:
    report.validation.errors.append(
        "Crash-safe resume is not supported for same-width datatype "
        "changes (float16<->bfloat16 or int8<->uint8)."
    )
    report.manual_actions.append(
        "Re-run without --resume for same-width vector conversions, or "
        "split the migration to avoid same-width datatype changes."
Copilot AI commented Apr 2, 2026

Same-width dtype conversions (float16↔bfloat16, int8↔uint8) are not idempotent by byte-length detection, so any retry after a partial run (crash, timeout, manual stop) can corrupt already-converted vectors by re-interpreting target bytes as the source dtype. The code currently only blocks --resume for these cases, but a non-resume re-run is still unsafe. Consider blocking same-width conversions entirely, or require an explicit “unsafe” opt-in and add a strong warning/manual action explaining that recovery requires restoring from snapshot (e.g., the BGSAVE) rather than re-running.

Suggested change:

Before:
if checkpoint_path and has_same_width_quantization:
    report.validation.errors.append(
        "Crash-safe resume is not supported for same-width datatype "
        "changes (float16<->bfloat16 or int8<->uint8)."
    )
    report.manual_actions.append(
        "Re-run without --resume for same-width vector conversions, or "
        "split the migration to avoid same-width datatype changes."

After:
if has_same_width_quantization:
    report.validation.errors.append(
        "Same-width vector datatype changes (float16<->bfloat16 or "
        "int8<->uint8) are not supported by this migration because they "
        "are not crash-safe or safe to re-run: any partial run followed "
        "by a retry can corrupt already-converted vectors."
    )
    report.manual_actions.append(
        "Adjust the migration plan to avoid same-width vector datatype "
        "changes (for example, migrate via a different dtype or use "
        "offline tooling to convert embeddings). If this migration has "
        "already run partially, restore the data from a clean snapshot "
        "(e.g., a BGSAVE backup) before attempting a new migration."

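A small standalone demonstration of why byte-length detection cannot tell same-width dtypes apart (shown here with int8/uint8 via numpy; this is an illustration, not the PR's detection code):

```python
# Why same-width conversions defeat idempotent byte-length detection:
# int8 and uint8 vectors of the same dimension serialize to identical
# byte lengths, so a retry cannot tell "already converted" from "not yet".
import numpy as np

src = np.array([200, 10, 255], dtype=np.uint8)
dst = src.view(np.int8)  # reinterpret the same bytes as int8

assert src.tobytes() == dst.tobytes()            # identical on the wire
assert len(src.tobytes()) == len(dst.tobytes())  # width gives no signal
# Reinterpreting a second time silently "converts" converted data:
assert int(dst[0]) == -56  # 200 re-read as a signed int8
```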
nkanu17 added a commit that referenced this pull request Apr 2, 2026
Move the sync MigrationExecutor into the core migration PR.
This version handles: field renames, key renames (prefix change),
drop/recreate index, wait for indexing, and validation.

Quantization reliability (checkpoints, BGSAVE, undo buffers) is
intentionally excluded and will be layered in PR #561.

Also includes:
- utils.py: logger, try/except in current_source_matches_snapshot,
  warnings in build_scan_match_patterns
- Integration tests (non-quantization): comprehensive + v1 flow
nkanu17 added a commit that referenced this pull request Apr 2, 2026
Add reliability module for crash-safe quantization during migrations:
- reliability.py: QuantizationCheckpoint, BatchUndoBuffer, trigger_bgsave_and_wait
- Enhanced executor.py: _quantize_vectors, checkpoint resume, BGSAVE sync
- test_migration_routes.py: quantization-specific integration tests
- test_migration_comprehensive.py: add datatype quantization test
nkanu17 force-pushed the feat/migrate-executor branch from 63e248f to 34c5f95 (April 2, 2026 15:07)
nkanu17 (Collaborator, Author) commented Apr 2, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review
Reviewed commit: 34c5f95d1b

Comment on lines +462 to +466
logger.warning(
    "Checkpoint index '%s' does not match plan index '%s', ignoring",
    existing_checkpoint.index_name,
    plan.source.index_name,
)

P1: Disable mismatched checkpoint before continuing apply

When the checkpoint index does not match the plan index, this branch only logs that it is being ignored, but checkpoint_path is still passed into _quantize_vectors. In quantized migrations, the run can proceed through drop/rename steps and then fail with a ValueError from _quantize_vectors on the same mismatch, leaving the source index dropped and migration incomplete. If a checkpoint is declared ignorable here, it should be detached (or removed) before the execution phase.


Comment on lines +568 to +572
if checkpoint_path and has_same_width_quantization:
    report.validation.errors.append(
        "Crash-safe resume is not supported for same-width datatype "
        "changes (float16<->bfloat16 or int8<->uint8)."
    )

P2: Scope same-width resume guard to hash quantization only

This early return blocks same-width datatype migrations whenever checkpoint_path is provided, even when the index uses JSON storage. For JSON, needs_quantization is false and vectors are re-indexed during recreate, so no in-place quantization/resume is performed. As written, valid JSON migrations like float16→bfloat16 fail unnecessarily if callers pass a checkpoint path.


for key in batch:
    for field_name, change in datatype_changes.items():
        field_data: bytes | None = client.hget(key, field_name)  # type: ignore[misc,assignment]
        if not field_data:

Unpipelined individual HGET reads cause major slowdown

Medium Severity

In _quantize_vectors, each vector field for every key in a batch is read with an individual client.hget() call (one Redis round trip each), while the writes are correctly pipelined. With batch_size=500 and N vector fields, this produces 500×N sequential round trips per batch. The analogous _rename_field_in_hash method pipelines its reads (lines 314–318). For remote Redis or large key counts, this makes quantization orders of magnitude slower than necessary.

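One way to batch those reads, sketched with a hypothetical helper (not the PR's code): queue every HGET on one pipeline so a whole batch costs a single round trip. `FakePipeline`/`FakeClient` below stand in for a redis-py client so the sketch runs without a Redis server.

```python
# Sketch of pipelined batch reads. FakePipeline/FakeClient mimic the small
# slice of the redis-py interface the helper needs (hget, pipeline, execute).

class FakePipeline:
    def __init__(self, store):
        self.store = store
        self.commands = []

    def hget(self, key, field):
        self.commands.append((key, field))  # queued, not executed yet
        return self

    def execute(self):
        # One "round trip": resolve every queued command in order.
        return [self.store.get(k, {}).get(f) for k, f in self.commands]

class FakeClient:
    def __init__(self, store):
        self.store = store

    def pipeline(self, transaction=False):
        return FakePipeline(self.store)

def read_vector_fields(client, batch, field_names):
    """Fetch all (key, field) pairs for a batch via a single pipeline."""
    pipe = client.pipeline(transaction=False)
    pairs = [(key, field) for key in batch for field in field_names]
    for key, field in pairs:
        pipe.hget(key, field)
    return dict(zip(pairs, pipe.execute()))

client = FakeClient({"doc:1": {"vec": b"\x01\x02"}, "doc:2": {"vec": b"\x03\x04"}})
data = read_vector_fields(client, ["doc:1", "doc:2"], ["vec"])
```

With a real redis-py client the same helper works unchanged, turning 500×N sequential HGETs per batch into one pipelined call.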

Copilot AI review requested due to automatic review settings April 2, 2026 15:30
nkanu17 force-pushed the feat/migrate-executor branch from 34c5f95 to 70520ba (April 2, 2026 15:30)
nkanu17 changed the title from "feat(migrator): [2/7] Sync executor with reliability checkpointing, crash-safe resume, and quantization support" to "feat(migrate): [2/7] Quantization reliability layer" on Apr 2, 2026
nkanu17 (Collaborator, Author) commented Apr 2, 2026

@codex review

nkanu17 (Collaborator, Author) commented Apr 2, 2026

Closing in favor of restructured PR stack (Option A ordering).

nkanu17 closed this Apr 2, 2026
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


"split the migration to avoid same-width datatype changes."
)
report.finished_at = timestamp_utc()
return report

Same-width check falsely blocks JSON storage migrations

Low Severity

The guard checkpoint_path and has_same_width_quantization blocks the entire migration without checking needs_quantization. Since needs_quantization is False for JSON storage (line 561), a JSON-backed index with a same-width vector datatype change (e.g., float16→bfloat16) would be unnecessarily rejected when a checkpoint_path is provided, even though no quantization is performed for JSON.


Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.



Comment on lines +453 to +457
# Check if we are resuming from a checkpoint (post-drop crash).
# If so, the source index may no longer exist in Redis, so we
# skip live schema validation and construct from the plan snapshot.
resuming_from_checkpoint = False
if checkpoint_path:
Copilot AI commented Apr 2, 2026

The crash-safe resume path controlled by checkpoint_path/resuming_from_checkpoint adds substantial stateful behavior, but there are no tests that exercise an interrupted run and subsequent resume from a checkpoint. Adding an integration/unit test that validates resume behavior (e.g., completed_keys slicing, completed checkpoint skip, and post-drop reconstruction) would help prevent regressions in this high-risk flow.

Comment on lines +1010 to +1014
pipe = client.pipeline()
undo = BatchUndoBuffer()
keys_updated_in_batch: set[str] = set()

try:
Copilot AI commented Apr 2, 2026

In _quantize_vectors, the batch loop uses client.hget inside the nested key/field iteration, which creates O(keys * fields) network round-trips and can be extremely slow for large indexes. Consider batching/pipelining the reads (e.g., queue HGETs for the batch) and using pipeline(transaction=False) for the bulk HSETs to match the executor’s other pipeline usage.

chatgpt-codex-connector bot left a comment

💡 Codex Review
Reviewed commit: 70520ba886

Comment on lines +1000 to +1002
)
checkpoint.save()
total_keys_for_progress = total_keys

P1: Validate checkpoint path before dropping source index

Failing to write a new checkpoint can currently break the migration after destructive steps have already run: apply() drops the source index before _quantize_vectors() attempts checkpoint.save(), so an invalid/unwritable checkpoint_path (permission error, directory path, bad mount, etc.) raises here and leaves the source index dropped with no recreated target index. This should be preflighted (or checkpoint created) before the drop phase so path errors fail safely.

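One way to preflight that, sketched with invented names (not the PR's API): verify the checkpoint path is a creatable, writable file before any destructive step runs.

```python
# Hypothetical preflight (names invented, not the PR's API): fail fast on a
# bad checkpoint path before the source index is dropped.
import os
import tempfile

def preflight_checkpoint_path(path: str) -> None:
    if os.path.isdir(path):
        raise ValueError(f"checkpoint path is a directory: {path}")
    parent = os.path.dirname(path) or "."
    if not os.path.isdir(parent):
        raise ValueError(f"checkpoint directory does not exist: {parent}")
    # Probe writability by touching the file; a permission error or bad
    # mount surfaces here instead of after the drop phase.
    with open(path, "a"):
        pass

ok_path = os.path.join(tempfile.mkdtemp(), "ckpt.yaml")
preflight_checkpoint_path(ok_path)  # succeeds silently

try:
    preflight_checkpoint_path(tempfile.mkdtemp())  # a directory, not a file
    failed = False
except ValueError:
    failed = True
```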

Comment on lines +495 to +499
resuming_from_checkpoint = True
logger.info(
"Checkpoint found at %s, skipping source index validation "
"(index may have been dropped before crash)",
checkpoint_path,

P2: Verify source index absence before forcing resume mode

Any non-completed checkpoint immediately flips resuming_from_checkpoint to True without checking whether the source index still exists. In the stale-checkpoint case (for example, source was restored but checkpoint file remains), the executor skips normal validation/drop flow and later hits index-creation failure because the existing source index is still present. This branch should mirror the completed-checkpoint guard and only enter resume mode when the live source index is actually gone.

