feat: add backup and restore for storage backends #1197

Open
Zhangcy0x3 wants to merge 4 commits into nervosnetwork:develop from Zhangcy0x3:feat/dlp-backup-and-restore

Conversation

@Zhangcy0x3
Contributor

feat: Comprehensive Data Loss Protection (DLP) via RocksDB Checkpoint and Restore Audit

Overview

This PR implements a full-cycle Data Loss Protection (DLP) mechanism for Fiber nodes, covering consistent backup generation and safety-checked state restoration.

Related Issue: Part of #1086 (Specifically addressing Manual Backup and Restore Safety)


1. Backup Implementation: Approach A (RocksDB Checkpoint)

This PR implements Approach A using RocksDB's native Checkpoint feature.

  • Atomicity: It creates a consistent point-in-time snapshot of the entire database, ensuring that channel states, keys, and network data are perfectly synchronized.
  • Performance: The checkpointing process is near-instant and non-blocking. It utilizes hard links to minimize disk I/O and storage overhead, making it suitable for production environments.
  • Manual Trigger: Exposed via a dedicated RPC method backup_now to allow users/operators to trigger manual backups.

2. Restore & Audit Map Design

The core of the fund-safety logic is the Restore Audit mechanism. This is designed to detect and mitigate "State Rollback" attacks in scenarios where a node is restored from an outdated backup.

Mechanism & Logic:

During the restoration process, the node populates a Restore Audit Map (persisted with KeyPrefix = 250).

  • State Capture: It records the last known commitment numbers for all active channels at the exact moment of restoration.
  • Safety Audit: During the ReestablishChannel flow, we perform a mandatory audit check against the peer's reported state:
    $$CN_{remote} \leq CN_{restored} + 1$$
  • Interception:
    • If a peer reports a commitment number that exceeds our restored state (indicating we are using an old state), the channel is marked as at risk.
    • Risky operations are blocked, and the channel is "soft-locked" to prevent broadcasting revoked commitment transactions.
    • Auto-Resolution: If the states align safely, the audit marker is automatically cleared to maintain database hygiene.
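
The audit rule above can be sketched as a small predicate (illustrative only; is_safe_reestablish is a hypothetical name, not the PR's actual function):

```rust
// Safe iff cn_remote <= cn_restored + 1: the +1 allows the normal
// one-commitment lag from an in-flight update at backup time.
fn is_safe_reestablish(cn_remote: u64, cn_restored: u64) -> bool {
    cn_remote <= cn_restored + 1
}

fn main() {
    // Peer agrees with our restored state: safe.
    assert!(is_safe_reestablish(5, 5));
    // Peer is one commitment ahead (in-flight update): still safe.
    assert!(is_safe_reestablish(6, 5));
    // Peer is two or more ahead: our backup is stale, block the channel.
    assert!(!is_safe_reestablish(7, 5));
    println!("audit checks passed");
}
```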

Design Philosophy: Audit Map vs. state.reestablishing

One might ask why we introduced a separate Audit Map instead of relying on the existing reestablishing state. Our reasoning is based on three core principles:

  1. Persistence & Crash Resilience:
    The state.reestablishing is a transient protocol phase. If the node crashes during the first re-establishment attempt after a restore, the context of "being in a restored state" must not be lost. The Audit Map is persisted in RocksDB (KeyPrefix 250), ensuring the safety "latch" survives across node restarts until a successful audit is completed.

  2. Separation of Concerns:
    Channel state management handles the standard protocol flow. Data Loss Protection (DLP) is a meta-level safety concern. By using a separate Audit Map, we avoid polluting the core channel state machine with recovery-specific flags, keeping the protocol logic clean and maintainable.

  3. Explicit Intent:
    The presence of a record in the Audit Map explicitly signals that the channel's local state is "untrusted" until proven otherwise by the peer's CommitmentNumber. This provides a clear, centralized place for the node to manage its security boundary during recovery.


3. Architectural Layering

To maintain compatibility with the recent fiber-store and fiber-types crate refactors, the implementation is strictly decoupled:

| Component | Responsibility | Location |
| --- | --- | --- |
| Low-level Storage | Physical I/O, RocksDB Checkpoints, Atomic copies | crates/fiber-store |
| High-level Logic | Audit state machine, P2P message interception, DLP Policy | crates/fiber-lib (fnn) |

4. Testing & Verification

This PR includes 4 high-fidelity integration tests to ensure the robustness of the DLP flow:

  • test_channel_restore_audit_failure_closes_channel: Confirmed interception of rollback risks.
  • test_channel_restore_audit_success_resolves: Verified automatic resolution path.
  • test_channel_restore_audit_peer_lagging_is_safe: Handled scenarios where the peer is behind.
  • test_channel_restore_audit_multi_channel_isolation: Confirmed independent logic for multiple channels.

5. Future Work & Roadmap

To fully satisfy the requirements in #1086, the following tasks are planned for subsequent PRs:

  • Scheduled Trigger: Implement backup_interval_hours configuration for automated snapshots.
  • Event-driven Trigger: Automatic backup triggers after critical channel state transitions.
  • Remote Storage: Extensible storage backends for S3, GCS, and NFS (currently local RocksDB checkpoints).

Checklist

  • Adapted to the latest fiber-store, fiber-types, and fiber-cli architecture.
  • All 740+ tests passed (excluding known upstream flaky tests).

@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch 6 times, most recently from 9b09292 to a18c559 Compare March 24, 2026 08:00
@Zhangcy0x3
Contributor Author

Zhangcy0x3 commented Mar 27, 2026

Hi @chenyukang and @Officeyutong,

I’ve recently completed the refactoring of fiber-json-types and the abstraction of the storage layer (supporting both RocksDB and SQLite). However, after resolving the latest conflicts in main.rs and integrating the SQLite backend, I've encountered some E2E regressions, specifically regarding TLC states and zeroed-out hashes.

Given the large scope of this PR (35+ files) and the high velocity of the main branch, I've found myself stuck in a cycle of frequent rebases and conflict resolutions.

My Request:
Could you please take a look at the current Trait abstraction design and the overall storage refactoring architecture? I want to ensure that the architectural direction aligns with the project’s expectations before I dive deeper into fixing the specific E2E timing issues.

Once the design is confirmed, I will proceed with implementing the restore functionality for the SQLite backend. This approach will help me avoid redundant rebases and focus on a finalized design.

Thanks for your time and guidance!

@chenyukang
Collaborator

Ok, we will review it, thanks for your time.

Comment thread crates/fiber-lib/src/rpc/info.rs Outdated
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from d73f895 to 4097230 Compare April 9, 2026 10:36
@Zhangcy0x3 Zhangcy0x3 changed the title feat: comprehensive Data Loss Protection (DLP) via RocksDB Checkpoint and Restore Audit feat: add backup and restore for storage backends Apr 9, 2026
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from 4097230 to a03e29f Compare April 9, 2026 10:50
@Zhangcy0x3
Contributor Author

Hi, @quake I've integrated the backup_now and restore logic into the StorageBackend trait to align with the new unified storage architecture.

Hi, @chenyukang I've moved the unit/integration tests from the source files to the dedicated tests/ directory to maintain a cleaner codebase as requested.

Comment thread crates/fiber-lib/src/store/restore.rs Outdated
use std::path::Path;
use tracing::info;

pub fn restore(restore_path: &Path, db_path: &Path) -> Result<()> {
Collaborator


Doesn't restore need to bring back the backed-up key and sk files?

Contributor Author


Done. I've added restore_node_keys to restore the key and SK files.

Comment thread crates/fiber-lib/src/fiber/channel.rs Outdated
error!(
"CRITICAL: Recovery rollback detected for channel {}! Backup CN: {}, Peer expects: {}. Blocking channel to prevent penalty.",
state.get_id(), audit_info.local_commitment_number, reestablish_channel.remote_commitment_number
);
Collaborator


Weird code format here, wrong indent spaces.

Contributor Author


Fixed!

Comment thread crates/fiber-lib/src/fiber/channel.rs Outdated
// Restore Audit Interception
if let Some(audit_map) = self.store.get_restore_audit_map() {
if let Some(audit_info) = audit_map.channels.get(&state.get_id()) {
if reestablish_channel.remote_commitment_number
Collaborator


seems this is not the right check for backup?
https://github.com/zhangcy0x3/fiber/blob/a03e29f16442a5d95995e80687968e0df08e6cab/crates/fiber-lib/src/fiber/channel.rs#L7067-L7068
A normal reestablish considers +1 to be safe

Contributor Author

@Zhangcy0x3 Zhangcy0x3 Apr 11, 2026


Thanks for the insight. I've updated the audit check to > local_commitment_number + 1 to allow for the safe lag. I also adjusted the corresponding unit test to simulate a 2-commitment lag (-2) to ensure the audit failure and channel-closing logic remain verified.

Comment thread crates/fiber-lib/src/fiber/channel.rs Outdated
@quake
Member

quake commented Apr 10, 2026

  • Manual Trigger: Exposed via a dedicated RPC method backup_now to allow users/operators to trigger manual backups.

I didn't find this rpc method in crates/fiber-lib/src/rpc

@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from a03e29f to 14cc7b4 Compare April 11, 2026 07:27
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from 14cc7b4 to 9bf0169 Compare April 11, 2026 07:46
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch 2 times, most recently from 926c15d to a9705aa Compare April 11, 2026 09:36
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from a9705aa to 85d6abb Compare April 11, 2026 10:13
@Zhangcy0x3
Contributor Author

Zhangcy0x3 commented Apr 11, 2026

  • Manual Trigger: Exposed via a dedicated RPC method backup_now to allow users/operators to trigger manual backups.

I didn't find this rpc method in crates/fiber-lib/src/rpc

I've refined the RPC interface to only require backup_path. The internal ckb_key_path and fiber_key_path are now injected into InfoRpcServerImpl during initialization.

Member

@quake quake left a comment


For channels in Stale state, the node should not proactively send ReestablishChannel to the peer. Currently ChannelActor::pre_start() sends our local_commitment_number and remote_commitment_number immediately. If we restored from backup, these numbers are stale; sending them reveals to the peer that we've lost data, which a malicious peer could exploit by broadcasting an old revoked commitment transaction. Instead, stale channels should wait for the peer to send ReestablishChannel first, then use the peer's remote_commitment_number to audit against our local_commitment_number without exposing our actual state.

}

/// Restore the RocksDB database from a checkpoint.
fn restore(&self, restore_path: &Path, db_path: &Path) -> Result<(), StoreError> {
Member


I think restore should be implemented as a CLI command rather than through RPC. The main issue is that if the current database is already corrupted, the node may fail before the RPC service even starts. In that case, the restore path becomes unavailable exactly when it is needed most. A CLI-based restore flow would be more robust, since it can run independently of the node startup and RPC layer.
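
The suggested CLI-first flow could be sketched like this, using only std argument parsing so the restore path never depends on node or RPC startup (the real fnn binary's argument handling will differ; names here are hypothetical):

```rust
use std::path::PathBuf;

// Hypothetical: recognize `restore <path>` before any node init,
// so restore works even when the current database is corrupted.
fn parse_restore(args: &[String]) -> Option<PathBuf> {
    match args {
        [cmd, path] if cmd == "restore" => Some(PathBuf::from(path)),
        _ => None,
    }
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    if let Some(path) = parse_restore(&args) {
        println!("would restore from {}", path.display());
        return; // exit before node startup / RPC layer is touched
    }
    println!("normal node startup");
}
```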

Contributor Author


Done. Added an fnn restore <path> CLI command.

Comment thread crates/fiber-store/src/native.rs Outdated
results
}

fn backup_now(&self, path: &Path) -> Result<(), StoreError> {
Member


Suggest renaming this to backup.

Contributor Author


Fixed!

Comment thread crates/fiber-types/src/audit.rs Outdated

/// Restore Audit Map: Records a snapshot of all channel states at the moment nodes are restored from backup.
#[derive(Serialize, Deserialize, Debug, Clone, Default, PartialEq, Eq)]
pub struct RestoreAuditMap {
Member


Instead of introducing a separate RestoreAuditMap store with KeyPrefix, RestoreAuditStore trait, and propagating trait bounds across the codebase, we can simply add a new variant to ChannelState (e.g. Stale) to mark channels restored from backup.

During restore, set all possible channels to the new Stale state and persist. During reestablishment, check the commitment numbers for stale channels, if the peer's remote_commitment_number is ahead of our local_commitment_number, we have stale state and should handle accordingly. If the audit passes, transition back to ChannelReady and proceed normally.

This eliminates the need for a separate store type, new trait, and all the associated trait bound changes, while the local_commitment_number needed for the audit is already available in ChannelActorData.
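
The suggested design could be sketched roughly like this (types and names are illustrative, not the crate's actual definitions):

```rust
// Hypothetical sketch: fold the audit into the channel state machine
// via a Stale variant instead of a separate audit store.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChannelState {
    ChannelReady,
    Stale { local_commitment_number: u64 },
}

// On reestablish, a Stale channel audits the peer's reported number:
// more than one commitment ahead of our local number means our state
// was rolled back, so the channel stays Stale (blocked); otherwise it
// transitions back to ChannelReady and proceeds normally.
fn audit_on_reestablish(state: ChannelState, peer_remote_cn: u64) -> ChannelState {
    match state {
        ChannelState::Stale { local_commitment_number }
            if peer_remote_cn > local_commitment_number + 1 =>
        {
            state // rollback detected: remain Stale
        }
        ChannelState::Stale { .. } => ChannelState::ChannelReady,
        other => other,
    }
}

fn main() {
    let stale = ChannelState::Stale { local_commitment_number: 10 };
    // Peer one ahead (in-flight update): audit passes, channel recovers.
    assert_eq!(audit_on_reestablish(stale, 11), ChannelState::ChannelReady);
    // Peer two ahead: our backup is stale, stay blocked.
    assert_eq!(audit_on_reestablish(stale, 12), stale);
}
```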

Contributor Author


I’ve refined the restore logic. Here’s the summary:

  • Architecture: Scrapped the RestoreAuditMap and trait changes. Instead, I introduced ChannelState::Stale as a "security shell" for restored channels.

  • Passive Audit: Stale channels now stay silent on restart. The audit is only triggered when the peer sends a ReestablishChannel.

  • Safety: If the audit fails (rollback detected), the channel remains Stale and blocked. If it passes, it transitions to ChannelReady via recursion to reuse existing synchronization logic.

  • ShuttingDown Recovery: For channels that were closing, they first transition to ChannelReady to align state/TLCs, then naturally converge back to ShuttingDown based on their persisted flags. This avoids duplicating complex sync code.

  • Testing: I’ve added three core integration tests covering Passive Silence, Audit Success (Recovery), and Audit Failure (Blocking).

@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from 0d62ec4 to 621ab42 Compare April 14, 2026 14:20
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from 621ab42 to dd3922a Compare April 14, 2026 14:57