feat: add backup and restore for storage backends #1197

Open
Zhangcy0x3 wants to merge 4 commits into nervosnetwork:develop from Zhangcy0x3:feat/dlp-backup-and-restore

Conversation

@Zhangcy0x3
Contributor

feat: Comprehensive Data Loss Protection (DLP) via RocksDB Checkpoint and Restore Audit

Overview

This PR implements a full-cycle Data Loss Protection (DLP) mechanism for Fiber nodes, covering consistent backup generation and safety-checked state restoration.

Related Issue: Part of #1086 (Specifically addressing Manual Backup and Restore Safety)


1. Backup Implementation: Approach A (RocksDB Checkpoint)

This PR implements Approach A using RocksDB's native Checkpoint feature.

  • Atomicity: It creates a consistent point-in-time snapshot of the entire database, ensuring that channel states, keys, and network data are perfectly synchronized.
  • Performance: The checkpointing process is near-instant and non-blocking. It utilizes hard links to minimize disk I/O and storage overhead, making it suitable for production environments.
  • Manual Trigger: Exposed via a dedicated RPC method backup_now to allow users/operators to trigger manual backups.

2. Restore & Audit Map Design

The core of the fund-safety logic is the Restore Audit mechanism. This is designed to detect and mitigate "State Rollback" attacks in scenarios where a node is restored from an outdated backup.

Mechanism & Logic:

During the restoration process, the node populates a Restore Audit Map (persisted with KeyPrefix = 250).

  • State Capture: It records the last known commitment numbers for all active channels at the exact moment of restoration.
  • Safety Audit: During the ReestablishChannel flow, we perform a mandatory audit check against the peer's reported state:
    $$CN_{remote} \leq CN_{restored} + 1$$
  • Interception:
    • If a peer reports a commitment number that exceeds our restored state (indicating we are using an old state), the channel is marked as at risk.
    • Risky operations are blocked, and the channel is "soft-locked" to prevent broadcasting revoked commitment transactions.
    • Auto-Resolution: If the states align safely, the audit marker is automatically cleared to maintain database hygiene.
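
The audit rule above can be sketched as a small predicate (illustrative only; is_safe_reestablish is a hypothetical name, not the PR's actual function):

```rust
// Safe iff cn_remote <= cn_restored + 1: the +1 allows the normal
// one-commitment lag from an in-flight update at backup time.
fn is_safe_reestablish(cn_remote: u64, cn_restored: u64) -> bool {
    cn_remote <= cn_restored + 1
}

fn main() {
    // Peer agrees with our restored state: safe.
    assert!(is_safe_reestablish(5, 5));
    // Peer is one commitment ahead (in-flight update): still safe.
    assert!(is_safe_reestablish(6, 5));
    // Peer is two or more ahead: our backup is stale, block the channel.
    assert!(!is_safe_reestablish(7, 5));
    println!("audit checks passed");
}
```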

Design Philosophy: Audit Map vs. state.reestablishing

One might ask why we introduced a separate Audit Map instead of relying on the existing reestablishing state. Our reasoning is based on three core principles:

  1. Persistence & Crash Resilience:
    The state.reestablishing is a transient protocol phase. If the node crashes during the first re-establishment attempt after a restore, the context of "being in a restored state" must not be lost. The Audit Map is persisted in RocksDB (KeyPrefix 250), ensuring the safety "latch" survives across node restarts until a successful audit is completed.

  2. Separation of Concerns:
    Channel state management handles the standard protocol flow. Data Loss Protection (DLP) is a meta-level safety concern. By using a separate Audit Map, we avoid polluting the core channel state machine with recovery-specific flags, keeping the protocol logic clean and maintainable.

  3. Explicit Intent:
    The presence of a record in the Audit Map explicitly signals that the channel's local state is "untrusted" until proven otherwise by the peer's CommitmentNumber. This provides a clear, centralized place for the node to manage its security boundary during recovery.


3. Architectural Layering

To maintain compatibility with the recent fiber-store and fiber-types crate refactors, the implementation is strictly decoupled:

| Component | Responsibility | Location |
| --- | --- | --- |
| Low-level Storage | Physical I/O, RocksDB Checkpoints, Atomic copies | crates/fiber-store |
| High-level Logic | Audit state machine, P2P message interception, DLP Policy | crates/fiber-lib (fnn) |

4. Testing & Verification

This PR includes 4 high-fidelity integration tests to ensure the robustness of the DLP flow:

  • test_channel_restore_audit_failure_closes_channel: Confirmed interception of rollback risks.
  • test_channel_restore_audit_success_resolves: Verified automatic resolution path.
  • test_channel_restore_audit_peer_lagging_is_safe: Handled scenarios where the peer is behind.
  • test_channel_restore_audit_multi_channel_isolation: Confirmed independent logic for multiple channels.

5. Future Work & Roadmap

To fully satisfy the requirements in #1086, the following tasks are planned for subsequent PRs:

  • Scheduled Trigger: Implement backup_interval_hours configuration for automated snapshots.
  • Event-driven Trigger: Automatic backup triggers after critical channel state transitions.
  • Remote Storage: Extensible storage backends for S3, GCS, and NFS (currently local RocksDB checkpoints).

Checklist

  • Adapted to the latest fiber-store, fiber-types, and fiber-cli architecture.
  • All 740+ tests passed (excluding known upstream flaky tests).

@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch 6 times, most recently from 9b09292 to a18c559 Compare March 24, 2026 08:00
@Zhangcy0x3
Contributor Author

Zhangcy0x3 commented Mar 27, 2026

Hi @chenyukang and @Officeyutong,

I’ve recently completed the refactoring of fiber-json-types and the abstraction of the storage layer (supporting both RocksDB and SQLite). However, after resolving the latest conflicts in main.rs and integrating the SQLite backend, I've encountered some E2E regressions, specifically regarding TLC states and zeroed-out hashes.

Given the large scope of this PR (35+ files) and the high velocity of the main branch, I've found myself stuck in a cycle of frequent rebases and conflict resolutions.

My Request:
Could you please take a look at the current Trait abstraction design and the overall storage refactoring architecture? I want to ensure that the architectural direction aligns with the project’s expectations before I dive deeper into fixing the specific E2E timing issues.

Once the design is confirmed, I will proceed with implementing the restore functionality for the SQLite backend. This approach will help me avoid redundant rebases and focus on a finalized design.

Thanks for your time and guidance!

@chenyukang
Collaborator

Ok, we will review it, thanks for your time.

Comment thread crates/fiber-lib/src/rpc/info.rs Outdated
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from d73f895 to 4097230 Compare April 9, 2026 10:36
@Zhangcy0x3 Zhangcy0x3 changed the title feat: comprehensive Data Loss Protection (DLP) via RocksDB Checkpoint and Restore Audit feat: add backup and restore for storage backends Apr 9, 2026
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from 4097230 to a03e29f Compare April 9, 2026 10:50
@Zhangcy0x3
Contributor Author

Hi, @quake I've integrated the backup_now and restore logic into the StorageBackend trait to align with the new unified storage architecture.

Hi, @chenyukang I've moved the unit/integration tests from the source files to the dedicated tests/ directory to maintain a cleaner codebase as requested.

Comment thread crates/fiber-lib/src/store/restore.rs Outdated
use std::path::Path;
use tracing::info;

pub fn restore(restore_path: &Path, db_path: &Path) -> Result<()> {
Collaborator


Doesn't restore need to bring back the backed-up key and sk files?

Contributor Author


Done. I've added restore_node_keys to restore the key and SK files.

Comment thread crates/fiber-lib/src/fiber/channel.rs Outdated
error!(
"CRITICAL: Recovery rollback detected for channel {}! Backup CN: {}, Peer expects: {}. Blocking channel to prevent penalty.",
state.get_id(), audit_info.local_commitment_number, reestablish_channel.remote_commitment_number
);
Collaborator


Weird code format here, wrong indent spaces.

Contributor Author


Fixed!

Comment thread crates/fiber-lib/src/fiber/channel.rs Outdated
// Restore Audit Interception
if let Some(audit_map) = self.store.get_restore_audit_map() {
if let Some(audit_info) = audit_map.channels.get(&state.get_id()) {
if reestablish_channel.remote_commitment_number
Collaborator


seems this is not the right check for backup?
https://github.com/zhangcy0x3/fiber/blob/a03e29f16442a5d95995e80687968e0df08e6cab/crates/fiber-lib/src/fiber/channel.rs#L7067-L7068
A normal reestablish considers +1 to be safe

Contributor Author

@Zhangcy0x3 Zhangcy0x3 Apr 11, 2026


Thanks for the insight. I've updated the audit check to > local_commitment_number + 1 to allow for the safe lag. I also adjusted the corresponding unit test to simulate a 2-commitment lag (-2) to ensure the audit failure and channel-closing logic remain verified.

Comment thread crates/fiber-lib/src/fiber/channel.rs Outdated
@quake
Member

quake commented Apr 10, 2026

  • Manual Trigger: Exposed via a dedicated RPC method backup_now to allow users/operators to trigger manual backups.

I didn't find this rpc method in crates/fiber-lib/src/rpc

@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from a03e29f to 14cc7b4 Compare April 11, 2026 07:27
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from 14cc7b4 to 9bf0169 Compare April 11, 2026 07:46
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch 2 times, most recently from 926c15d to a9705aa Compare April 11, 2026 09:36
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from a9705aa to 85d6abb Compare April 11, 2026 10:13
@Zhangcy0x3
Contributor Author

Zhangcy0x3 commented Apr 11, 2026

  • Manual Trigger: Exposed via a dedicated RPC method backup_now to allow users/operators to trigger manual backups.

I didn't find this rpc method in crates/fiber-lib/src/rpc

I've refined the RPC interface to only require backup_path. The internal ckb_key_path and fiber_key_path are now injected into InfoRpcServerImpl during initialization.

Member

@quake quake left a comment


For channels in Stale state, the node should not proactively send ReestablishChannel to the peer. Currently ChannelActor::pre_start() sends our local_commitment_number and remote_commitment_number immediately. If we restored from backup, these numbers are stale; sending them reveals to the peer that we've lost data, which a malicious peer could exploit by broadcasting an old revoked commitment transaction. Instead, stale channels should wait for the peer to send ReestablishChannel first, then use the peer's remote_commitment_number to audit against our local_commitment_number without exposing our actual state.

}

/// Restore the RocksDB database from a checkpoint.
fn restore(&self, restore_path: &Path, db_path: &Path) -> Result<(), StoreError> {
Member


I think restore should be implemented as a CLI command rather than through RPC. The main issue is that if the current database is already corrupted, the node may fail before the RPC service even starts. In that case, the restore path becomes unavailable exactly when it is needed most. A CLI-based restore flow would be more robust, since it can run independently of the node startup and RPC layer.
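
The suggested CLI-first flow could be sketched like this, using only std argument parsing so the restore path never depends on node or RPC startup (the real fnn binary's argument handling will differ; names here are hypothetical):

```rust
use std::path::PathBuf;

// Hypothetical: recognize `restore <path>` before any node init,
// so restore works even when the current database is corrupted.
fn parse_restore(args: &[String]) -> Option<PathBuf> {
    match args {
        [cmd, path] if cmd == "restore" => Some(PathBuf::from(path)),
        _ => None,
    }
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    if let Some(path) = parse_restore(&args) {
        println!("would restore from {}", path.display());
        return; // exit before node startup / RPC layer is touched
    }
    println!("normal node startup");
}
```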

Contributor Author


Done. Added an fnn restore <path> CLI command.

Comment thread crates/fiber-store/src/native.rs Outdated
results
}

fn backup_now(&self, path: &Path) -> Result<(), StoreError> {
Member


Suggest renaming this to backup.

Contributor Author


Fixed!

Comment thread crates/fiber-types/src/audit.rs Outdated

/// Restore Audit Map: Records a snapshot of all channel states at the moment nodes are restored from backup.
#[derive(Serialize, Deserialize, Debug, Clone, Default, PartialEq, Eq)]
pub struct RestoreAuditMap {
Member


Instead of introducing a separate RestoreAuditMap store with KeyPrefix, RestoreAuditStore trait, and propagating trait bounds across the codebase, we can simply add a new variant to ChannelState (e.g. Stale) to mark channels restored from backup.

During restore, set all possible channels to the new Stale state and persist. During reestablishment, check the commitment numbers for stale channels, if the peer's remote_commitment_number is ahead of our local_commitment_number, we have stale state and should handle accordingly. If the audit passes, transition back to ChannelReady and proceed normally.

This eliminates the need for a separate store type, new trait, and all the associated trait bound changes, while the local_commitment_number needed for the audit is already available in ChannelActorData.
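
The suggested design could be sketched roughly like this (types and names are illustrative, not the crate's actual definitions):

```rust
// Hypothetical sketch: fold the audit into the channel state machine
// via a Stale variant instead of a separate audit store.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChannelState {
    ChannelReady,
    Stale { local_commitment_number: u64 },
}

// On reestablish, a Stale channel audits the peer's reported number:
// more than one commitment ahead of our local number means our state
// was rolled back, so the channel stays Stale (blocked); otherwise it
// transitions back to ChannelReady and proceeds normally.
fn audit_on_reestablish(state: ChannelState, peer_remote_cn: u64) -> ChannelState {
    match state {
        ChannelState::Stale { local_commitment_number }
            if peer_remote_cn > local_commitment_number + 1 =>
        {
            state // rollback detected: remain Stale
        }
        ChannelState::Stale { .. } => ChannelState::ChannelReady,
        other => other,
    }
}

fn main() {
    let stale = ChannelState::Stale { local_commitment_number: 10 };
    // Peer one ahead (in-flight update): audit passes, channel recovers.
    assert_eq!(audit_on_reestablish(stale, 11), ChannelState::ChannelReady);
    // Peer two ahead: our backup is stale, stay blocked.
    assert_eq!(audit_on_reestablish(stale, 12), stale);
}
```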

Contributor Author


I’ve refined the restore logic. Here’s the summary:

  • Architecture: Scrapped the RestoreAuditMap and trait changes. Instead, I introduced ChannelState::Stale as a "security shell" for restored channels.

  • Passive Audit: Stale channels now stay silent on restart. The audit is only triggered when the peer sends a ReestablishChannel.

  • Safety: If the audit fails (rollback detected), the channel remains Stale and blocked. If it passes, it transitions to ChannelReady via recursion to reuse existing synchronization logic.

  • ShuttingDown Recovery: For channels that were closing, they first transition to ChannelReady to align state/TLCs, then naturally converge back to ShuttingDown based on their persisted flags. This avoids duplicating complex sync code.

  • Testing: I’ve added three core integration tests covering Passive Silence, Audit Success (Recovery), and Audit Failure (Blocking).

@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from 0d62ec4 to 621ab42 Compare April 14, 2026 14:20
@Zhangcy0x3 Zhangcy0x3 force-pushed the feat/dlp-backup-and-restore branch from 621ab42 to dd3922a Compare April 14, 2026 14:57