Single-node cluster expansion (FTT1/FTT2)#1097
Conversation
| argument = subcommand.add_argument('--spdk-sys-mem', help='System memory reserved for non-SPDK use (e.g. 2G, 4096M). Overrides the auto-calculated minimum_sys_memory. If not set, the value is derived from the node configuration.', type=str, dest='spdk_sys_mem') | ||
| if self.developer_mode: | ||
| argument = subcommand.add_argument('--spdk-proxy-image', help='The SPDK proxy image URI.', type=str, dest='spdk_proxy_image') | ||
| argument = subcommand.add_argument('--expansion', help='After adding the node, integrate it into the cluster by re-homing existing secondary/tertiary roles. Use this to add a single node to an FTT2 cluster instead of staging FTT+1 nodes at once.', default=False, dest='expansion', action='store_true') |
Hamdy-khader
left a comment
There was a problem hiding this comment.
Looks good, approved.
mxsrc
left a comment
There was a problem hiding this comment.
I couldn't do a thorough review yet, but superficially there are a few things that should be changed before merging this. In addition to the comments on specific files the tests that are introduced do not succeed and their execution takes >2h, making them prohibitively expensive for any reasonable checks during development. The long execution time may be a result of the failure, but if it's not, we should introduce some sort of multi-step testing to make test driven development feasible, relegating long running tests to the last step. The failing unittests are probably a false-positive, I assume the general feature works as intended, still, this should be addressed before integration.
The actual logic for the feature seems to not use the established task mechanism that other asynchronous tasks use, like storage node addition. Is that a deliberate choice? Generally, I think the code should be placed in a shared module, e.g. simplyblock_core/controllers/cluster_expansion/... instead of at the top-level of simplyblock_core. I'll do a full review of the logic tomorrow, right now this is just a high-level comment.
| SIMPLY_BLOCK_DOCKER_IMAGE=public.ecr.aws/simply-block/simplyblock:feature-single-node-expansion | ||
| SIMPLY_BLOCK_SPDK_ULTRA_IMAGE=public.ecr.aws/simply-block/ultra:main-latest |
There was a problem hiding this comment.
This should probably not be part of this PR.
| @@ -0,0 +1,177 @@ | |||
| # Single-node cluster expansion (FTT2) — implementation plan | |||
There was a problem hiding this comment.
This file should not be added to the repo. I think it may make sense to document the changes, but that should be placed in docs/, and not take the form of a planning document.
| @@ -0,0 +1,163 @@ | |||
| # Step 6 design — final | |||
There was a problem hiding this comment.
This should not be introduced to the repository.
| "trsvcid": str(trsvcid), | ||
| "trtype": trtype, | ||
| } | ||
| return self._request2("bdev_nvme_remove_trid", params) |
There was a problem hiding this comment.
New functions should utilize the _request3 helper which uses exception-based error handling.
| class ExpandPlanError(Exception): | ||
| """Raised by ``execute_expand_plan`` when a fatal precondition is not | ||
| met (e.g., trying to start a fresh plan while one is already in | ||
| progress, or resuming an incompatible-schema state). Distinct from the | ||
| underlying executor exception that triggers an abort, which is | ||
| re-raised as-is after the abort is persisted.""" |
There was a problem hiding this comment.
There is a generic PreconditionError in simplyblock_core/exceptions that fits this purpoes.
…nsion - CLI: --expansion on `storage-node add`, integrates newcomer via cluster_expand_executor - Cluster model: expand_state cursor for resumability - RPC: bdev_nvme_remove_trid for surgical path removal - storage_node_ops: teardown_non_leader_lvstore, reattach_sibling_failover Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_cluster Move the FN_NEW_DEV_MIG queueing out of add_node's tail and into the post-integration step in clibase. The OLD path queues per-device migration inside add_node, which (a) runs before the newcomer has any lvstore_stack and (b) relies on cluster status flipping back to ACTIVE later for the runner to pick it up. With --expansion, queue post-integration so tasks are scheduled against the post-rotation lvstore_stack and an ACTIVE cluster. - add_node: new expansion=False kwarg; skips the implicit migration loop when set (caller's responsibility). - clibase: pass expansion=expansion through; after integrate_new_node_into_cluster succeeds, iterate the newcomer's online nvme_devices and call add_new_device_mig_task per device. Scope matches the OLD trigger (newcomer's devs only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: rename secondary node attributes and update lvstore handling for tertiary nodes * fix: streamline hublvol creation and error handling in cluster expansion
Planner: add compute_role_diff_topology / _host_rotation_layout / _validate_topology to derive role moves from host topology; align sec/tert slot naming (lvstore_stack_secondary/tertiary, tertiary_node_id) across executor, planner, and teardown. Tests/tooling: add an offline protocol generator (protocol_generator.py, extract_rpc_timing.py) that expands the planner's move list into the per-move RPC sequence the executor would issue, plus a pytest harness (test_expansion_protocols.py) asserting structural invariants and emitting reviewable RPC-by-RPC protocol traces (protocols/*.txt). No FoundationDB/executor required. Expand planner/executor unit coverage. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
recreate_lvstore_on_non_leader: None-guard the hublvol.nvmf_port derefs when building lvstore_ports. A node with an lvstore but no hublvol (legacy/upgrade or single-node clusters) previously raised AttributeError here — soft-failing secondary/tertiary recreation on node restart and hard-failing cluster activate (cluster_ops only catches LVSRestartRequiredError). Falls back to 0, matching get_hublvol_port()'s sentinel. storage_node_monitor.check_node: drop the dead `elif STATUS_DOWN` branch. The preceding `if status in (UNREACHABLE, DOWN)` already handles DOWN, so the elif was unreachable duplicate (rebase artifact: main and the feature branch independently implemented DOWN->ONLINE recovery). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The --expansion flag was added to the generated cli.py and clibase.py logic but never to cli-reference.yaml (the generator's source of truth), so `generate.sh` stripped it and the cli-up-to-date check failed. Add the flag to the YAML and regenerate cli.py (now emits default=False). Also clear ruff F401/F841: drop unused imports across the expansion test suite and the now-unused RPCClient import in storage_node_ops (teardown switched to donor_node.rpc_client()); remove an unused local in the planner's host-rotation layout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…restart Two regressions surfaced by CI on the shared restart/activate path: 1. recreate_lvstore_on_non_leader connected the hublvol unconditionally near the top AND again in the port-blocked retry block (the latter under `if not activation_mode`). On a normal restart both fired, so the hublvol was attached twice (test_lvs_role_assignment / test_secondary_promotion: connect_to_hublvol called 2x, expected 1x). Gate the early connect under `if activation_mode` so expansion/activate (which skip the retry block) still connect once, while restart connects only in the retry block — restoring main's single-connect behavior. 2. Restore the RPCClient import in storage_node_ops (kept as a test patch target; flagged unused by ruff after teardown switched to donor_node.rpc_client()). Its removal broke 6 teardown tests that patch simplyblock_core.storage_node_ops.RPCClient. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The executor calls storage_node_ops.recreate_lvstore_on_non_leader, but test_spdk_move_executor patched the unrelated recreate_lvstore_on_sec, so the real (DB-touching) function ran and failed on the unmocked FDB. Repoint the 8 patches and update the two assertions to the real call signature (node, primary, primary, activation_mode=True). test_nvmeof_security: the lvstore_ports rebuild now writes the restarting node, so patch sec_node.write_to_db too (previously only primary_node was patched), avoiding the FDB-write SystemExit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rtifacts) Address review comments on the cluster-expansion PR: - Replace custom ExpandPlanError with the generic PreconditionError from simplyblock_core.exceptions (now available after rebasing onto main). - bdev_nvme_remove_trid: use the exception-based _request3 helper instead of _request2. Caller already wraps the call in try/except and ignores the return value, so behavior is unchanged. - Drop the per-branch dev image tag change to simplyblock_core/env_var (restore main's values) — a dev artifact that shouldn't ship in the PR. - Remove the planning documents single_node_expansion_plan.md and step6_design.md from the repo root. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1c7dda9 to
da41ba7
Compare
Address review feedback to use the established task mechanism instead of running the multi-minute role-rebalance inline in the CLI handler. - Add FN_CLUSTER_EXPAND task type + add_cluster_expand_task() wrapper and get_active_cluster_expand_task() dedup (one expansion per cluster). - New services/tasks_runner_cluster_expand.py drives integrate_new_node_into_cluster() in the background, mapping orchestrator outcomes to task done/suspended. Retry-by-resume: an aborted plan is re-armed (planner.expand_state_rearm) so the orchestrator resumes from the persisted cursor rather than recomputing a diff against a half-mutated topology. - clibase --expansion now queues the task and returns immediately; the post-integration new-device-migration trigger moves into the runner's success path (must run after the rotation lands). - Register TasksRunnerClusterExpand in docker-compose-swarm.yml and the collect_logs service lists. Testing: - Fast unit tier (no FDB/SPDK): test_tasks_runner_cluster_expand.py mocks integrate + DB to verify the task lifecycle in milliseconds; expand_state_rearm covered in test_cluster_expand_planner.py. - Sleep-injection seam: an autouse fast_sleep fixture in tests/expansion_sim/conftest.py no-ops time.sleep so the real-FDB sim suite is a logic check (ms) instead of the prior multi-hour wall-clock. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The repo-wide tests/conftest.py installs a minimal `fdb` stub (no api_version) so the unit suite runs without FoundationDB. Because db_controller does a top-level `import fdb`, that stub shadowed the real client and the expansion_sim suite could never connect — the reason the introduced FDB-backed tests failed regardless of a running FDB. _ensure_real_fdb() pops the stub, re-imports the genuine client, and rebinds it on already-imported simplyblock modules. _require_fdb() checks for the cluster file first (so CI without FDB still skips cleanly) and skips with a clear message if the real client is unavailable. The suite must run as its own pytest invocation, separate from the stubbed unit tier — the intended multi-step layout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Adds single-node cluster expansion: integrate a newcomer storage node and rotate sec/tert LVS roles across the cluster, reusing the existing restart/activate primitives.
Feature
--expansionCLI flag + donor-side helpers; orchestrator triggers expansion migration afterintegrate_new_node_into_cluster.compute_role_diff_topology,_host_rotation_layout,_validate_topology); sec/tert slot naming aligned tolvstore_stack_secondary/tertiary+tertiary_node_id.SpdkMoveExecutor) drives create-primary / create-sec/-tert / re-home moves; donor cleanup viateardown_non_leader_lvstore.Shared-path integration & hardening
Expansion reuses
recreate_lvstore_on_non_leader/create_lvstore(called withactivation_mode=True), so a few changes land on the shared restart/activate path:recreate_lvstore_on_non_leader: per-LVSlvstore_portsrebuild + earlyconnect_to_hublvol(the only hublvol connect that runs underactivation_mode).create_lvstore: always create the primary hublvol (newcomer gains secondaries post-creation via rotation).hublvol.nvmf_portderefs in the port rebuild (legacy/single-node nodes with no hublvol previously raisedAttributeError— soft-failing secondary recreation on restart, hard-failing cluster activate). Removed a dead duplicateelif STATUS_DOWNbranch in the monitor.Known residual risk (reviewers please note)
lvstore_portssource-of-truth changed (rebuilt from node's own ports + back-refs vs copied from primary). Expected equivalent for single-secondary restart; worth a restart/activate smoke test.connect_to_hublvolfires unconditionally duringactivation_mode; harmless (try/except) during activate but premature — clean fix would be a dedicated param to separate activate-defer vs expansion-connect-now.Test
pytestsuite running; CI runs the full suite + lint + cli-gen check against this PR.🤖 Generated with Claude Code