feat: add backup and restore for storage backends #1197
Zhangcy0x3 wants to merge 4 commits into nervosnetwork:develop from
Conversation
force-pushed from 9b09292 to a18c559
Hi @chenyukang and @Officeyutong, I've recently completed the refactoring of fiber-json-types and the abstraction of the storage layer (supporting both RocksDB and SQLite). However, after resolving the latest conflicts in main.rs and integrating the SQLite backend, I've encountered some E2E regressions, specifically around TLC states and zeroed-out hashes. Given the large scope of this PR (35+ files) and the high velocity of the main branch, I've found myself stuck in a cycle of frequent rebases and conflict resolutions.

My request: please confirm the overall design first. Once the design is confirmed, I will proceed with implementing the restore functionality for the SQLite backend. This will help me avoid redundant rebases and let me focus on a finalized design. Thanks for your time and guidance!
OK, we will review it. Thanks for your time.
force-pushed from d73f895 to 4097230
force-pushed from 4097230 to a03e29f
Hi @quake, I've integrated the backup_now and restore logic into the StorageBackend trait to align with the new unified storage architecture.

Hi @chenyukang, I've moved the unit/integration tests from the source files to the dedicated tests/ directory to maintain a cleaner codebase, as requested.
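As a rough illustration of what that unified trait surface might look like (a sketch only: `StoreError` and `MemBackend` are hypothetical stand-ins here, and only the `backup_now`/`restore` signatures are taken from the snippets quoted later in this review):

```rust
use std::path::Path;

// Hypothetical stand-in for the crate's StoreError type.
#[derive(Debug)]
pub struct StoreError(pub String);

// Sketch of how backup_now/restore might hang off the unified
// StorageBackend trait described in this PR.
pub trait StorageBackend {
    /// Take a consistent snapshot of the live database at `path`.
    fn backup_now(&self, path: &Path) -> Result<(), StoreError>;
    /// Replace the database at `db_path` with the snapshot at `restore_path`.
    fn restore(&self, restore_path: &Path, db_path: &Path) -> Result<(), StoreError>;
}

// A trivial no-op backend, just to show the trait shape.
pub struct MemBackend;

impl StorageBackend for MemBackend {
    fn backup_now(&self, _path: &Path) -> Result<(), StoreError> {
        Ok(())
    }
    fn restore(&self, _restore_path: &Path, _db_path: &Path) -> Result<(), StoreError> {
        Ok(())
    }
}
```

Both RocksDB and SQLite backends would then implement this trait, so callers never branch on the concrete storage engine.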
use std::path::Path;
use tracing::info;

pub fn restore(restore_path: &Path, db_path: &Path) -> Result<()> {
Does restore never bring back the backed-up key and sk files?
Done. I've added restore_node_keys to restore the key and SK files.
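For readers following along, a minimal sketch of what a checkpoint-directory restore plus a `restore_node_keys` step could look like (assumptions for illustration only: the `copy_dir_all` helper and the key file names `ckb_key`/`fiber_key` are not the PR's actual code, which also handles RocksDB specifics):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Recursively copy a checkpoint directory over the live db path.
// A minimal sketch; a real restore would also validate the backup first.
fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir_all(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

// Hypothetical helper mirroring the restore_node_keys idea: copy the
// backed-up key files (file names assumed here) into the node directory.
fn restore_node_keys(backup_dir: &Path, node_dir: &Path) -> io::Result<()> {
    for name in ["ckb_key", "fiber_key"] {
        let src = backup_dir.join(name);
        if src.exists() {
            fs::copy(&src, node_dir.join(name))?;
        }
    }
    Ok(())
}
```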
error!(
    "CRITICAL: Recovery rollback detected for channel {}! Backup CN: {}, Peer expects: {}. Blocking channel to prevent penalty.",
    state.get_id(), audit_info.local_commitment_number, reestablish_channel.remote_commitment_number
);
Weird code format here, wrong indent spaces.
// Restore Audit Interception
if let Some(audit_map) = self.store.get_restore_audit_map() {
    if let Some(audit_info) = audit_map.channels.get(&state.get_id()) {
        if reestablish_channel.remote_commitment_number
This doesn't seem to be the right check for backup?
https://github.com/zhangcy0x3/fiber/blob/a03e29f16442a5d95995e80687968e0df08e6cab/crates/fiber-lib/src/fiber/channel.rs#L7067-L7068
A normal reestablish considers +1 to be safe
Thanks for the insight. I've updated the audit check to > local_commitment_number + 1 to allow for the safe lag. I also adjusted the corresponding unit test to simulate a 2-commitment lag (-2) to ensure the audit failure and channel-closing logic remain verified.
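The agreed-upon audit condition reduces to a one-line predicate (a simplification; the real check also depends on the channel's audit state):

```rust
// Sketch of the rollback audit discussed above. A peer's
// remote_commitment_number may legitimately be ahead of our
// local_commitment_number by one (normal reestablish lag), so only
// a gap greater than 1 signals that we restored from a stale backup.
fn is_rollback(peer_remote_cn: u64, local_cn: u64) -> bool {
    peer_remote_cn > local_cn + 1
}
```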
I didn't find this RPC method in crates/fiber-lib/src/rpc
force-pushed from a03e29f to 14cc7b4
force-pushed from 14cc7b4 to 9bf0169
force-pushed from 926c15d to a9705aa
force-pushed from a9705aa to 85d6abb
I've refined the RPC interface to only require backup_path. The internal ckb_key_path and fiber_key_path are now injected into InfoRpcServerImpl during initialization.
quake left a comment:
For channels in Stale state, the node should not proactively send ReestablishChannel to the peer. Currently ChannelActor::pre_start() sends our local_commitment_number and remote_commitment_number immediately. If we restored from backup, these numbers are stale, sending them reveals to the peer that we've lost data, which a malicious peer could exploit by broadcasting an old revoked commitment transaction. Instead, stale channels should wait for the peer to send ReestablishChannel first, then use the peer's remote_commitment_number to audit against our local_commitment_number without exposing our actual state.
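The passive behaviour suggested here might be sketched as a startup decision (names like `StartupAction` are hypothetical; the real logic would live in `ChannelActor::pre_start()`):

```rust
#[derive(Debug, PartialEq)]
enum ChannelState {
    ChannelReady,
    Stale, // restored from backup; local state untrusted
}

#[derive(Debug, PartialEq)]
enum StartupAction {
    SendReestablish, // normal channels announce their commitment numbers
    WaitForPeer,     // stale channels stay silent, per the review above
}

// Never reveal possibly stale commitment numbers: let the peer speak
// first, then audit its remote_commitment_number against our own state.
fn on_startup(state: &ChannelState) -> StartupAction {
    match state {
        ChannelState::Stale => StartupAction::WaitForPeer,
        _ => StartupAction::SendReestablish,
    }
}
```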
}

/// Restore the RocksDB database from a checkpoint.
fn restore(&self, restore_path: &Path, db_path: &Path) -> Result<(), StoreError> {
I think restore should be implemented as a CLI command rather than through RPC. The main issue is that if the current database is already corrupted, the node may fail before the RPC service even starts. In that case, the restore path becomes unavailable exactly when it is needed most. A CLI-based restore flow would be more robust, since it can run independently of the node startup and RPC layer.
Done. Added an fnn restore <path> CLI command.
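A minimal sketch of parsing such a subcommand (the real fnn binary presumably uses its existing argument parser; this standalone helper is for illustration only):

```rust
use std::path::PathBuf;

// Hypothetical parser for `fnn restore <path>`: returns the backup path
// when the args match the restore subcommand, None otherwise.
fn parse_restore(args: &[String]) -> Option<PathBuf> {
    match args {
        [cmd, path] if cmd.as_str() == "restore" => Some(PathBuf::from(path)),
        _ => None,
    }
}
```

Running restore as a CLI command before the node starts means it works even when the corrupted database would prevent the RPC service from ever coming up.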
    results
}

fn backup_now(&self, path: &Path) -> Result<(), StoreError> {

/// Restore Audit Map: Records a snapshot of all channel states at the moment nodes are restored from backup.
#[derive(Serialize, Deserialize, Debug, Clone, Default, PartialEq, Eq)]
pub struct RestoreAuditMap {
Instead of introducing a separate RestoreAuditMap store with KeyPrefix, RestoreAuditStore trait, and propagating trait bounds across the codebase, we can simply add a new variant to ChannelState (e.g. Stale) to mark channels restored from backup.
During restore, set all possible channels to the new Stale state and persist. During reestablishment, check the commitment numbers for stale channels, if the peer's remote_commitment_number is ahead of our local_commitment_number, we have stale state and should handle accordingly. If the audit passes, transition back to ChannelReady and proceed normally.
This eliminates the need for a separate store type, new trait, and all the associated trait bound changes, while the local_commitment_number needed for the audit is already available in ChannelActorData.
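The suggested Stale-variant audit could be sketched as a pure state transition (simplified; the real check would live in the reestablish handler and read `local_commitment_number` from `ChannelActorData`):

```rust
#[derive(Debug, PartialEq)]
enum ChannelState {
    ChannelReady,
    Stale, // marked during restore; persisted with the channel
}

// On receiving the peer's ReestablishChannel for a Stale channel:
// a gap of more than one commitment means our state rolled back,
// so the channel stays blocked; otherwise the audit passes and the
// channel transitions back to ChannelReady.
fn audit_stale_channel(state: ChannelState, peer_remote_cn: u64, local_cn: u64) -> ChannelState {
    match state {
        ChannelState::Stale if peer_remote_cn > local_cn + 1 => ChannelState::Stale,
        ChannelState::Stale => ChannelState::ChannelReady,
        other => other, // non-stale channels are unaffected by the audit
    }
}
```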
I've refined the restore logic. Here's the summary:

- Architecture: Scrapped the `RestoreAuditMap` and trait changes. Instead, I introduced `ChannelState::Stale` as a "security shell" for restored channels.
- Passive Audit: Stale channels now stay silent on restart. The audit is only triggered when the peer sends a `ReestablishChannel`.
- Safety: If the audit fails (rollback detected), the channel remains Stale and blocked. If it passes, it transitions to `ChannelReady` via recursion to reuse existing synchronization logic.
- ShuttingDown Recovery: Channels that were closing first transition to `ChannelReady` to align state/TLCs, then naturally converge back to `ShuttingDown` based on their persisted flags. This avoids duplicating complex sync code.
- Testing: I've added three core integration tests covering Passive Silence, Audit Success (Recovery), and Audit Failure (Blocking).
force-pushed from 0d62ec4 to 621ab42
force-pushed from 621ab42 to dd3922a
feat: Comprehensive Data Loss Protection (DLP) via RocksDB Checkpoint and Restore Audit
Overview
This PR implements a full-cycle Data Loss Protection (DLP) mechanism for Fiber nodes, encompassing consistent backup generation and safety-guaranteed state restoration.
Related Issue: Part of #1086 (Specifically addressing Manual Backup and Restore Safety)
1. Backup Implementation: Approach A (RocksDB Checkpoint)
This PR implements Approach A using RocksDB's native `Checkpoint` feature.

2. Restore & Audit Map Design
The core of fund safety logic is the Restore Audit mechanism. This is designed to detect and mitigate "State Rollback" attacks in scenarios where a node is restored from an outdated backup.
Mechanism & Logic:
During the restoration process, the node populates a Restore Audit Map (persisted with `KeyPrefix = 250`). During the subsequent `ReestablishChannel` flow, we perform a mandatory audit check against the peer's reported state.

Design Philosophy: Audit Map vs. `state.reestablishing`
One might ask why we introduced a separate Audit Map instead of relying on the existing `reestablishing` state. Our reasoning is based on a few core principles:

Persistence & Crash Resilience:
The `state.reestablishing` flag is a transient protocol phase. If the node crashes during the first re-establishment attempt after a restore, the context of "being in a restored state" must not be lost. The Audit Map is persisted in RocksDB (`KeyPrefix` 250), ensuring the safety "latch" survives across node restarts until a successful audit is completed.

Separation of Concerns:
Channel state management handles the standard protocol flow. Data Loss Protection (DLP) is a meta-level safety concern. By using a separate Audit Map, we avoid polluting the core channel state machine with recovery-specific flags, keeping the protocol logic clean and maintainable.
Explicit Intent:
The presence of a record in the Audit Map explicitly signals that the channel's local state is "untrusted" until proven otherwise by the peer's `CommitmentNumber`. This provides a clear, centralized place for the node to manage its security boundary during recovery.

3. Architectural Layering
To maintain compatibility with the recent `fiber-store` and `fiber-types` crate refactors, the implementation is strictly decoupled:

- `crates/fiber-store`
- `crates/fiber-lib` (fnn)

4. Testing & Verification
This PR includes 4 high-fidelity integration tests to ensure the robustness of the DLP flow:

- `test_channel_restore_audit_failure_closes_channel`: Confirmed interception of rollback risks.
- `test_channel_restore_audit_success_resolves`: Verified automatic resolution path.
- `test_channel_restore_audit_peer_lagging_is_safe`: Handled scenarios where the peer is behind.
- `test_channel_restore_audit_multi_channel_isolation`: Confirmed independent logic for multiple channels.

5. Future Work & Roadmap
To fully satisfy the requirements in #1086, the following tasks are planned for subsequent PRs:

- `backup_interval_hours` configuration for automated snapshots.

Checklist
- `fiber-store`, `fiber-type` and `fiber-cli` architecture.