Lvol migration fresh#1098

Draft
EbiRider wants to merge 243 commits into main from lvol-migration-fresh

@EbiRider

Collaborator

This PR adds the lvol migration feature to sbcli and the web API. It introduces four new calls:
- migration list
- migrate
- pre-create-migration
- migrate-cancel

schmidt-scaled and others added 30 commits June 8, 2026 12:10
…his should be reverted before merging to main
* snapshot: fix delete race that produced stuck snapshots

Three independent fixes that together close the
"Cannot remove snapshot because it is open" / EBUSY (-16) state where
the snapshot ends up with non-zero open_ref but no clone entries and
can only be cleared by restarting the host node.

1. Bump random VUID space from 10k to 1M and dedupe against existing
   CLN_/LVOL_/SNAP_ bdev-name numeric suffixes. With ~10k lvols+snaps
   the legacy 10k range hit ~50% birthday-collision probability,
   producing repeated SPDK "lvol with name already exists" rejections
   that triggered the async-delete-then-reuse sequence below.

2. snapshot_controller.add and .clone reject ops on a target that is
   in pending deletion (lvol STATUS_IN_DELETION; snapshot
   STATUS_IN_DELETION or deleted=True). Closes the window between an
   async delete being issued and a fresh create slipping through
   against the same blob, which left snapshot parent metadata
   partially overwritten by the new clone's lineage.

3. snapshot_controller.delete blocks the snapshot's hard-delete while
   any clone's SPDK-side delete is still in flight. Previously any
   IN_DELETION clone was treated as "already gone" and the snap
   delete proceeded to call SPDK, which returned EBUSY because the
   clone's bdev was still open. Now a clone counts as gone only when
   its deletion_status field has been set (i.e. the leader's
   delete_lvol_from_node returned). Otherwise the snapshot is
   soft-deleted; the clone's own delete-completion path will
   re-trigger the hard delete once SPDK has actually released it.
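
The first fix (a larger random VUID space, deduplicated against existing bdev names) could look roughly like the sketch below. The helper names, the exact name pattern, and the inclusive range starting at 1 are assumptions made for illustration; only the 1M space and the CLN_/LVOL_/SNAP_ prefixes come from the commit message.

```python
import random
import re

# Assumed naming scheme: a CLN_/LVOL_/SNAP_ prefix followed by a numeric suffix.
_SUFFIX_RE = re.compile(r"^(?:CLN|LVOL|SNAP)_(\d+)$")

VUID_SPACE = 1_000_000  # range size named in the commit message


def used_vuids(bdev_names):
    """Collect the numeric suffixes of existing CLN_/LVOL_/SNAP_ bdev names."""
    used = set()
    for name in bdev_names:
        match = _SUFFIX_RE.match(name)
        if match:
            used.add(int(match.group(1)))
    return used


def pick_vuid(bdev_names, space=VUID_SPACE, rng=random):
    """Draw a random VUID in [1, space] that no existing bdev name uses."""
    used = used_vuids(bdev_names)
    taken_in_range = sum(1 for value in used if 1 <= value <= space)
    if taken_in_range >= space:
        raise RuntimeError("VUID space exhausted")
    while True:
        candidate = rng.randint(1, space)
        if candidate not in used:
            return candidate
```

Rejecting a candidate before it reaches SPDK avoids the "lvol with name already exists" round trip that the commit message identifies as the trigger for the race.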

Tests: tests/test_snapshot_delete_race.py covers all three fixes
(10 tests, all green).
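
As an illustration of the third fix, the rule "a clone counts as gone only when its deletion_status field has been set" can be reduced to a small decision function. This is a sketch only: `CloneRecord`, the status string, and the "hard"/"soft" return values are hypothetical, and treating every clone that is not confirmed gone as blocking is an assumption, not something the commit message states.

```python
from dataclasses import dataclass
from typing import Iterable

STATUS_IN_DELETION = "in_deletion"  # assumed value of the status constant


@dataclass
class CloneRecord:
    """Hypothetical stand-in for a clone's database record."""
    status: str
    deletion_status: str = ""  # assumed to be set once the SPDK-side delete returned


def clone_is_gone(clone: CloneRecord) -> bool:
    """A clone is gone only if it is in deletion AND the delete has completed."""
    return clone.status == STATUS_IN_DELETION and bool(clone.deletion_status)


def snapshot_delete_mode(clones: Iterable[CloneRecord]) -> str:
    """Return "hard" if every clone is gone, otherwise "soft" (defer the delete)."""
    return "hard" if all(clone_is_gone(c) for c in clones) else "soft"
```

With this rule, a clone whose delete is still in flight keeps the snapshot soft-deleted, so SPDK is never asked to remove a snapshot whose clone bdev is still open.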

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: defer snapshot deletion if clones are in deletion state

* fix: update snapshot deletion status handling during clone deletion

* fix: remove unnecessary checks for deletion status in snapshot handling

* fix: update Docker image tag for snapshot delete race fix

* fix: reduce randomness range for snapshot ID generation to improve performance

---------

Co-authored-by: schmidt-scaled <schmidt@scaled.cloud>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hamdy-khader and others added 6 commits June 17, 2026 14:07
…ation logic

- Updated `tasks_runner_lvol_migration` to handle cases where no new snapshots are migrated but preexisting snapshots exist on the target node.
- Enhanced logic to determine the correct composite name for the last snapshot on the target.
- Added `subsys_port` assignment to lvol objects in `migration_controller`.
- Added support for recognizing bdev aliases in `node_bdev_names` mapping.
- Improved `check_bdev` logic to validate `top_bdev` if the initial check fails.
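
The alias handling and the `top_bdev` fallback described above might be sketched as follows. The function names and signatures are hypothetical, not the PR's actual code; the only assumption about the data is that each bdev entry is a dict with a `name` and an optional `aliases` list.

```python
def build_bdev_name_map(bdevs):
    """Map every bdev name and every alias to its bdev entry."""
    name_map = {}
    for bdev in bdevs:
        name_map[bdev["name"]] = bdev
        for alias in bdev.get("aliases", []):
            name_map[alias] = bdev
    return name_map


def bdev_present(name_map, bdev_name, top_bdev=None):
    """Check bdev_name first; fall back to top_bdev if that lookup fails."""
    if bdev_name in name_map:
        return True
    return top_bdev is not None and top_bdev in name_map
```

Indexing aliases alongside names lets a lookup succeed whether the caller holds the primary name or an alias of the same bdev.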
noctarius force-pushed the lvol-migration-fresh branch from 8417063 to 2061357 on June 17, 2026 12:14
…nces during cleanup

- Introduced `special_delete` logic to manage snapshots with open references.
- Enhanced deletion process to handle remaining snapshot instances and synchronize cleanup across nodes.
- Updated `rpc_client` to support `special_delete` flag in `delete_lvol` requests.
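
The review excerpts in this PR decide on `special_delete` by reading `open_ref` from the bdev info returned by the node. A defensive version of that check, written as a standalone helper, might look like this; the helper name is hypothetical, and returning False when the field is missing is an assumption rather than the PR's behaviour.

```python
def needs_special_delete(bdev_infos):
    """Return True when the first bdev entry reports more than one open reference.

    bdev_infos is the list returned by a bdev query; a missing or malformed
    entry is treated as "no special delete needed" (an assumption).
    """
    try:
        open_ref = bdev_infos[0]["driver_specific"]["lvol"]["open_ref"]
    except (IndexError, KeyError, TypeError):
        return False
    return open_ref > 1
```

Catching only the lookup errors, rather than a bare `Exception`, keeps unrelated failures such as RPC errors visible to the caller.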
Comment thread simplyblock_core/services/snapshot_monitor.py Fixed
Comment thread simplyblock_core/controllers/snapshot_controller.py Fixed
Comment thread simplyblock_core/services/snapshot_monitor.py Fixed
noctarius and others added 10 commits June 17, 2026 15:38
- Added logic to create a new snapshot if `snap_plan` is empty during migration.
- Integrated snapshot creation via `snapshot_controller` with proper error handling.
- Introduced `pypass_lvol_migration_check` flag in `snapshot_controller` to allow snapshot creation during active migrations.
- Updated `tasks_runner_lvol_migration` and `migration_controller` to utilize the bypass flag for intermediate and migration-related snapshots.
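
The bypass flag can be pictured as a guard in front of snapshot creation. The sketch below is illustrative only: the exception class and function are hypothetical, and the only name taken from the commit message is the flag itself, spelled as it appears there.

```python
class MigrationInProgressError(RuntimeError):
    """Hypothetical error raised when a snapshot is refused during migration."""


def ensure_snapshot_allowed(lvol_in_migration, pypass_lvol_migration_check=False):
    """Refuse snapshot creation on a migrating lvol unless the caller bypasses the check."""
    if lvol_in_migration and not pypass_lvol_migration_check:
        raise MigrationInProgressError("lvol is being migrated; snapshot creation refused")
```

In this picture, user-initiated snapshots leave the flag at its default, while the migration runner passes `True` for the intermediate snapshots it creates itself.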
…flow now creates intermediate snapshots if chain is initially empty
Comment thread simplyblock_core/controllers/migration_controller.py Fixed
Comment thread simplyblock_core/controllers/snapshot_controller.py Fixed
    _src_node = db.get_storage_node_by_id(migration.source_node_id)
    if _src_node.secondary_node_id:
        src_node_ids.add(_src_node.secondary_node_id)
except KeyError:

try:
    sec_rpc.listeners_del(nqn, nic.trtype.lower(),
                          nic.ip4_address, sec_port)
except Exception:

try:
    tgt_rpc.listeners_del(nqn, nic.trtype.lower(),
                          nic.ip4_address, tgt_port)
except Exception:

if tgt_sec_node is None:
    try:
        tgt_sec_node = db.get_storage_node_by_id(tgt_node.secondary_node_id)
    except KeyError:

if tgt_ter_node is None:
    try:
        tgt_ter_node = db.get_storage_node_by_id(tgt_node.tertiary_node_id)
    except KeyError:
Comment thread simplyblock_core/services/snapshot_monitor.py Fixed
Comment thread simplyblock_core/services/snapshot_monitor.py Fixed
    snap_bdev_info = leader_node.rpc_client().get_bdev(snap.snap_bdev)
    if snap_bdev_info[0]["driver_specific"]["lvol"]["open_ref"] > 1:
        special_delete = True
except Exception:

    snap_bdev_info = leader_node.rpc_client().get_bdev(snap.snap_bdev)
    if snap_bdev_info[0]["driver_specific"]["lvol"]["open_ref"] > 1:
        special_delete = True
except Exception:

    snap_bdev_info = rpc_client.rpc_client().get_bdev(snap.snap_bdev)
    if snap_bdev_info[0]["driver_specific"]["lvol"]["open_ref"] > 1:
        special_delete = True
except Exception:
Hamdy-khader and others added 3 commits June 19, 2026 18:07
* fix: refactor migration hub logic and streamline `transfer_hublvol` creation

- Centralized migration hub logic by introducing `transfer_hublvol` creation in `storage_node`.
- Updated `tasks_runner_lvol_migration` to leverage `transfer_hublvol` for unified hub management.
- Refactored `rpc_client` methods to simplify hublvol creation and deletion.

* fix: update `transfer_hublvol` checks and remove redundant task claim logic

- Refined `transfer_hublvol` checks to validate `bdev_name` instead of `uuid`.
- Removed commented-out `claim_task` logic in `tasks_runner_restart`.

* fix: update `get_bdev` calls to `get_bdevs` for accurate bdev info retrieval

- Replaced `get_bdev` with `get_bdevs` in `snapshot_monitor` and `snapshot_controller` to ensure proper invocation.
- Maintained `special_delete` logic to handle snapshots with open references consistently.

* prepare for merge

* Fix linter issues

5 participants