Reject snapshot-less join against primary holding a more recent snapshot#7844
Reject snapshot-less join against primary holding a more recent snapshot#7844
Conversation
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/d3383e6c-3063-4f30-ba6d-e324856cc41f Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/ee92835a-b9f3-4451-aa71-bfc04a2d774e Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the node join protocol to prevent snapshot-less joiners from being accepted by an “original” primary that has already committed snapshots, avoiding full-ledger replay by requiring a sufficiently recent startup snapshot on the initial join attempt.
Changes:
- Add
retry_countto the node-to-node join request and track it on the joiner to distinguish first-attempt joins from retries after snapshot fetch. - Update primary-side join acceptance logic to compare the joiner’s
startup_seqnoagainst the latest committed snapshot on disk (first attempt only), falling back to the previous check on retries. - Add an E2E regression test covering rejection of snapshot-less joins and successful join after fetching a snapshot; add a changelog entry.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/node/rpc/node_call_types.h |
Adds optional retry_count to join request type. |
src/node/rpc/serialization.h |
Adds retry_count to JSON serialization for join request. |
src/node/node_state.h |
Tracks and sends join_retry_count when retrying join after snapshot fetch. |
src/node/rpc/node_frontend.h |
Implements primary-side min acceptable seqno check using latest on-disk committed snapshot on first attempt. |
tests/reconfiguration.py |
Adds regression test for joining without a snapshot against an original primary holding committed snapshots. |
CHANGELOG.md |
Adds [Unreleased] fixed entry describing the behavior change. |
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/77ef7836-6f94-4a76-a8fd-ec303efa9014 Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/0152a93d-bea4-4acf-8376-7be729e36da8 Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/CCF/sessions/a18950f8-94db-4a15-b9a5-cae39445e222 Co-authored-by: cjen1-msft <190360281+cjen1-msft@users.noreply.github.com>
…napshot-joining-behaviour
|
Was blocked on #7853 now unblocked. |
|
@cjen1-msft the new test does not seem to be passing:
|
Co-authored-by: Amaury Chamayou <amaury@xargs.fr>
…r is legacy or not going to fetch.
The primary should express an opinion about what snapshot is acceptable. Currently this opinion is "at least my startup_seqno".
To fix #7835 we should have a two step system.
On the first try the primary should require the joiner to have a snapshot which is at its latest snapshot number.
On subsequent tries, that primary should fall back to the safety constraint of greater than
startup_seqno.Copilot generated description:
A joiner with
startup_seqno == 0was accepted by an original (alsostartup_seqno == 0) primary even when that primary held committed snapshots, causing the joiner to replay the entire ledger.Changes
retry_countfield onJoinNetworkNodeToNode::In, incremented by the joiner each time aStartupSeqnoIsOldresponse triggers a snapshot fetch + retry.node_frontend.haccepthandler):retry_count == 0: requirestartup_seqno >= max(this_startup_seqno, latest_on_disk_snapshot_seqno). Themaxkeeps the existing behaviour when this node holds no snapshot.retry_count > 0: fall back to the priorstartup_seqno >= this_startup_seqnocheck, so the joiner doesn't chase a snapshot committed mid-fetch.NodeConfigurationSubsystem+snapshots::find_latest_committed_snapshot_in_directory(i.e. what the joiner could actually fetch from this node), not from theSNAPSHOT_STATUSKV table.NodeState): tracksjoin_retry_count, bumped where the existingStartupSeqnoIsOld+fetch_recent_snapshotpath is taken, and sent on the next request.run_join_no_snapshot_against_original_primaryintests/reconfiguration.pystarts a single-node network, generates a committed snapshot, then asserts:from_snapshot=False, fetch_recent_snapshot=Falseis rejected withStartupSeqnoIsOld(previously silently replayed the ledger);from_snapshot=False, fetch_recent_snapshot=Truesucceeds and the joiner reports a non-zerostartup_seqno.[Unreleased] / Fixedentry.