Skip to content

Single-node cluster expansion (FTT1/FTT2)#1097

Open
schmidt-scaled wants to merge 14 commits into
mainfrom
feature-single-node-expansion
Open

Single-node cluster expansion (FTT1/FTT2)#1097
schmidt-scaled wants to merge 14 commits into
mainfrom
feature-single-node-expansion

Conversation

@schmidt-scaled

Copy link
Copy Markdown
Contributor

Summary

Adds single-node cluster expansion: integrate a newcomer storage node and rotate sec/tert LVS roles across the cluster, reusing the existing restart/activate primitives.

Feature

  • --expansion CLI flag + donor-side helpers; orchestrator triggers expansion migration after integrate_new_node_into_cluster.
  • Planner derives role moves from host topology (compute_role_diff_topology, _host_rotation_layout, _validate_topology); sec/tert slot naming aligned to lvstore_stack_secondary/tertiary + tertiary_node_id.
  • Executor (SpdkMoveExecutor) drives create-primary / create-sec/-tert / re-home moves; donor cleanup via teardown_non_leader_lvstore.
  • Offline protocol-simulation harness: expands the planner's move list into the per-move RPC sequence and asserts structural invariants, emitting reviewable RPC-by-RPC traces — no FoundationDB/executor required.

Shared-path integration & hardening

Expansion reuses recreate_lvstore_on_non_leader / create_lvstore (called with activation_mode=True), so a few changes land on the shared restart/activate path:

  • recreate_lvstore_on_non_leader: per-LVS lvstore_ports rebuild + early connect_to_hublvol (the only hublvol connect that runs under activation_mode).
  • create_lvstore: always create the primary hublvol (newcomer gains secondaries post-creation via rotation).
  • Hardening: None-guard the hublvol.nvmf_port derefs in the port rebuild (legacy/single-node nodes with no hublvol previously raised AttributeError — soft-failing secondary recreation on restart, hard-failing cluster activate). Removed a dead duplicate elif STATUS_DOWN branch in the monitor.

Known residual risk (reviewers please note)

  • lvstore_ports source-of-truth changed (rebuilt from node's own ports + back-refs vs copied from primary). Expected equivalent for single-secondary restart; worth a restart/activate smoke test.
  • connect_to_hublvol fires unconditionally during activation_mode; harmless (try/except) during activate but premature — clean fix would be a dedicated param to separate activate-defer vs expansion-connect-now.

Test

  • Local pytest suite running; CI runs the full suite + lint + cli-gen check against this PR.

🤖 Generated with Claude Code

Comment thread tests/expansion_sim/_rpc_sim.py Fixed
Comment thread tests/expansion_sim/_scenarios.py Fixed
Comment thread tests/expansion_sim/conftest.py Fixed
Comment thread tests/expansion_sim/extract_rpc_timing.py Fixed
Comment thread tests/expansion_sim/protocol_generator.py Fixed
Comment thread tests/expansion_sim/protocol_generator.py Fixed
Comment thread simplyblock_core/storage_node_ops.py Fixed
Comment thread tests/test_cluster_expand_orchestrator.py Fixed
Comment thread simplyblock_cli/cli.py
argument = subcommand.add_argument('--spdk-sys-mem', help='System memory reserved for non-SPDK use (e.g. 2G, 4096M). Overrides the auto-calculated minimum_sys_memory. If not set, the value is derived from the node configuration.', type=str, dest='spdk_sys_mem')
if self.developer_mode:
argument = subcommand.add_argument('--spdk-proxy-image', help='The SPDK proxy image URI.', type=str, dest='spdk_proxy_image')
argument = subcommand.add_argument('--expansion', help='After adding the node, integrate it into the cluster by re-homing existing secondary/tertiary roles. Use this to add a single node to an FTT2 cluster instead of staging FTT+1 nodes at once.', default=False, dest='expansion', action='store_true')
Comment thread simplyblock_core/storage_node_ops.py Dismissed

@Hamdy-khader Hamdy-khader left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, approved.

@mxsrc mxsrc left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't do a thorough review yet, but superficially there are a few things that should be changed before merging this. In addition to the comments on specific files the tests that are introduced do not succeed and their execution takes >2h, making them prohibitively expensive for any reasonable checks during development. The long execution time may be a result of the failure, but if it's not, we should introduce some sort of multi-step testing to make test driven development feasible, relegating long running tests to the last step. The failing unittests are probably a false-positive, I assume the general feature works as intended, still, this should be addressed before integration.

The actual logic for the feature seems to not use the established task mechanism that other asynchronous tasks use, like storage node addition. Is that a deliberate choice? Generally, I think the code should be placed in a shared module, e.g. simplyblock_core/controllers/cluster_expansion/... instead of at the top-level of simplyblock_core. I'll do a full review of the logic tomorrow, right now this is just a high-level comment.

Comment thread simplyblock_core/env_var Outdated
Comment on lines +4 to +5
SIMPLY_BLOCK_DOCKER_IMAGE=public.ecr.aws/simply-block/simplyblock:feature-single-node-expansion
SIMPLY_BLOCK_SPDK_ULTRA_IMAGE=public.ecr.aws/simply-block/ultra:main-latest

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably not be part of this PR.

Comment thread single_node_expansion_plan.md Outdated
@@ -0,0 +1,177 @@
# Single-node cluster expansion (FTT2) — implementation plan

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should not be added to the repo. I think it may make sense to document the changes, but that should be placed in docs/, and not take the form of a planning document.

Comment thread step6_design.md Outdated
@@ -0,0 +1,163 @@
# Step 6 design — final

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be introduced to the repository.

Comment thread simplyblock_core/rpc_client.py Outdated
"trsvcid": str(trsvcid),
"trtype": trtype,
}
return self._request2("bdev_nvme_remove_trid", params)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New functions should utilize the _request3 helper which uses exception-based error handling.

Comment on lines +91 to +96
class ExpandPlanError(Exception):
"""Raised by ``execute_expand_plan`` when a fatal precondition is not
met (e.g., trying to start a fresh plan while one is already in
progress, or resuming an incompatible-schema state). Distinct from the
underlying executor exception that triggers an abort, which is
re-raised as-is after the abort is persisted."""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a generic PreconditionError in simplyblock_core/exceptions that fits this purpoes.

michixs and others added 12 commits June 15, 2026 22:03
…nsion

- CLI: --expansion on `storage-node add`, integrates newcomer via cluster_expand_executor
- Cluster model: expand_state cursor for resumability
- RPC: bdev_nvme_remove_trid for surgical path removal
- storage_node_ops: teardown_non_leader_lvstore, reattach_sibling_failover

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_cluster

Move the FN_NEW_DEV_MIG queueing out of add_node's tail and into the
post-integration step in clibase. The OLD path queues per-device migration
inside add_node, which (a) runs before the newcomer has any lvstore_stack
and (b) relies on cluster status flipping back to ACTIVE later for the
runner to pick it up. With --expansion, queue post-integration so tasks
are scheduled against the post-rotation lvstore_stack and an ACTIVE
cluster.

- add_node: new expansion=False kwarg; skips the implicit migration loop
  when set (caller's responsibility).
- clibase: pass expansion=expansion through; after
  integrate_new_node_into_cluster succeeds, iterate the newcomer's online
  nvme_devices and call add_new_device_mig_task per device. Scope matches
  the OLD trigger (newcomer's devs only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: rename secondary node attributes and update lvstore handling for tertiary nodes

* fix: streamline hublvol creation and error handling in cluster expansion
Planner: add compute_role_diff_topology / _host_rotation_layout /
_validate_topology to derive role moves from host topology; align
sec/tert slot naming (lvstore_stack_secondary/tertiary, tertiary_node_id)
across executor, planner, and teardown.

Tests/tooling: add an offline protocol generator (protocol_generator.py,
extract_rpc_timing.py) that expands the planner's move list into the
per-move RPC sequence the executor would issue, plus a pytest harness
(test_expansion_protocols.py) asserting structural invariants and
emitting reviewable RPC-by-RPC protocol traces (protocols/*.txt). No
FoundationDB/executor required. Expand planner/executor unit coverage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
recreate_lvstore_on_non_leader: None-guard the hublvol.nvmf_port derefs
when building lvstore_ports. A node with an lvstore but no hublvol
(legacy/upgrade or single-node clusters) previously raised AttributeError
here — soft-failing secondary/tertiary recreation on node restart and
hard-failing cluster activate (cluster_ops only catches
LVSRestartRequiredError). Falls back to 0, matching get_hublvol_port()'s
sentinel.

storage_node_monitor.check_node: drop the dead `elif STATUS_DOWN` branch.
The preceding `if status in (UNREACHABLE, DOWN)` already handles DOWN, so
the elif was unreachable duplicate (rebase artifact: main and the feature
branch independently implemented DOWN->ONLINE recovery).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The --expansion flag was added to the generated cli.py and clibase.py
logic but never to cli-reference.yaml (the generator's source of truth),
so `generate.sh` stripped it and the cli-up-to-date check failed. Add the
flag to the YAML and regenerate cli.py (now emits default=False).

Also clear ruff F401/F841: drop unused imports across the expansion test
suite and the now-unused RPCClient import in storage_node_ops (teardown
switched to donor_node.rpc_client()); remove an unused local in the
planner's host-rotation layout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…restart

Two regressions surfaced by CI on the shared restart/activate path:

1. recreate_lvstore_on_non_leader connected the hublvol unconditionally
   near the top AND again in the port-blocked retry block (the latter
   under `if not activation_mode`). On a normal restart both fired, so
   the hublvol was attached twice (test_lvs_role_assignment /
   test_secondary_promotion: connect_to_hublvol called 2x, expected 1x).
   Gate the early connect under `if activation_mode` so expansion/activate
   (which skip the retry block) still connect once, while restart connects
   only in the retry block — restoring main's single-connect behavior.

2. Restore the RPCClient import in storage_node_ops (kept as a test patch
   target; flagged unused by ruff after teardown switched to
   donor_node.rpc_client()). Its removal broke 6 teardown tests that
   patch simplyblock_core.storage_node_ops.RPCClient.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The executor calls storage_node_ops.recreate_lvstore_on_non_leader, but
test_spdk_move_executor patched the unrelated recreate_lvstore_on_sec, so
the real (DB-touching) function ran and failed on the unmocked FDB. Repoint
the 8 patches and update the two assertions to the real call signature
(node, primary, primary, activation_mode=True).

test_nvmeof_security: the lvstore_ports rebuild now writes the restarting
node, so patch sec_node.write_to_db too (previously only primary_node was
patched), avoiding the FDB-write SystemExit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rtifacts)

Address review comments on the cluster-expansion PR:

- Replace custom ExpandPlanError with the generic PreconditionError from
  simplyblock_core.exceptions (now available after rebasing onto main).
- bdev_nvme_remove_trid: use the exception-based _request3 helper instead
  of _request2. Caller already wraps the call in try/except and ignores
  the return value, so behavior is unchanged.
- Drop the per-branch dev image tag change to simplyblock_core/env_var
  (restore main's values) — a dev artifact that shouldn't ship in the PR.
- Remove the planning documents single_node_expansion_plan.md and
  step6_design.md from the repo root.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@schmidt-scaled schmidt-scaled force-pushed the feature-single-node-expansion branch from 1c7dda9 to da41ba7 Compare June 15, 2026 20:15
schmidt-scaled and others added 2 commits June 15, 2026 23:35
Address review feedback to use the established task mechanism instead of
running the multi-minute role-rebalance inline in the CLI handler.

- Add FN_CLUSTER_EXPAND task type + add_cluster_expand_task() wrapper and
  get_active_cluster_expand_task() dedup (one expansion per cluster).
- New services/tasks_runner_cluster_expand.py drives
  integrate_new_node_into_cluster() in the background, mapping orchestrator
  outcomes to task done/suspended. Retry-by-resume: an aborted plan is
  re-armed (planner.expand_state_rearm) so the orchestrator resumes from the
  persisted cursor rather than recomputing a diff against a half-mutated
  topology.
- clibase --expansion now queues the task and returns immediately; the
  post-integration new-device-migration trigger moves into the runner's
  success path (must run after the rotation lands).
- Register TasksRunnerClusterExpand in docker-compose-swarm.yml and the
  collect_logs service lists.

Testing:
- Fast unit tier (no FDB/SPDK): test_tasks_runner_cluster_expand.py mocks
  integrate + DB to verify the task lifecycle in milliseconds;
  expand_state_rearm covered in test_cluster_expand_planner.py.
- Sleep-injection seam: an autouse fast_sleep fixture in
  tests/expansion_sim/conftest.py no-ops time.sleep so the real-FDB sim
  suite is a logic check (ms) instead of the prior multi-hour wall-clock.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The repo-wide tests/conftest.py installs a minimal `fdb` stub (no
api_version) so the unit suite runs without FoundationDB. Because
db_controller does a top-level `import fdb`, that stub shadowed the real
client and the expansion_sim suite could never connect — the reason the
introduced FDB-backed tests failed regardless of a running FDB.

_ensure_real_fdb() pops the stub, re-imports the genuine client, and
rebinds it on already-imported simplyblock modules. _require_fdb() checks
for the cluster file first (so CI without FDB still skips cleanly) and
skips with a clear message if the real client is unavailable. The suite
must run as its own pytest invocation, separate from the stubbed unit
tier — the intended multi-step layout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants