# Raft Implementation & Production Configuration

This guide details the Raft consensus implementation in `ev-node`, used for High Availability (HA) of the Sequencer/Aggregator. It is aimed at experienced DevOps engineers and developers configuring production environments.

## Overview

`ev-node` uses the [HashiCorp Raft](https://github.com/hashicorp/raft) implementation to manage leader election and state replication when running in **Aggregator Mode**.

* **Role**: Ensures only one active Aggregator (Leader) produces blocks at a time.
* **Failover**: Automatically elects a new leader if the current leader fails.
* **Safety**: Synchronizes block production state to prevent double-signing or fork divergence.

### Architecture

* **Transport**: TCP-based transport for inter-node communication.
* **Storage**: [BoltDB](https://github.com/etcd-io/bbolt) is used for both the Raft Log (`raft-log.db`) and the Stable Store (`raft-stable.db`). Snapshots are stored as files.
* **FSM (Finite State Machine)**: Applies `RaftBlockState` messages (Protobuf) containing the latest block height, hash, and timestamp.
* **Safety Checks**:
  * **Startup**: Nodes check for divergence between the local block store and the Raft state.
  * **Leadership Transfer**: Before becoming leader, a node waits for its FSM to catch up (`waitForMsgsLanded`) to avoid proposing blocks from a stale state.
  * **Shutdown**: The leader attempts to transfer leadership gracefully before shutting down to minimize downtime.

## Configuration

Raft is configured via CLI flags or the `config.toml` file, under the `[raft]` (or `[rollkit.raft]`) section.

### Essential Flags

| Flag | Config Key | Description | Production Value |
|------|------------|-------------|------------------|
| `--evnode.raft.enable` | `raft.enable` | Enable Raft consensus. | `true` |
| `--evnode.raft.node_id` | `raft.node_id` | **Unique** identifier for the node. | e.g., `node-01` |
| `--evnode.raft.raft_addr` | `raft.raft_addr` | TCP address for the Raft transport. | `0.0.0.0:5001` (bind to a private IP) |
| `--evnode.raft.raft_dir` | `raft.raft_dir` | Directory for Raft data. | `/data/raft` (must be persistent) |
| `--evnode.raft.peers` | `raft.peers` | Comma-separated list of peer addresses in the format `nodeID@host:port`. | `node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001` |
| `--evnode.raft.bootstrap` | `raft.bootstrap` | Bootstrap the cluster. **Required** for initial setup. | `true` (see Limitations) |
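The same settings can live in `config.toml`. A minimal sketch, assuming the section and key names shown in the table above (verify the exact keys against your `ev-node` version):

```toml
# config.toml — Raft section (key names assumed from the flag table above)
[raft]
enable = true
node_id = "node-1"
raft_addr = "0.0.0.0:5001"
raft_dir = "/data/raft"
bootstrap = true
peers = "node-1@10.0.0.1:5001,node-2@10.0.0.2:5001,node-3@10.0.0.3:5001"
```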

### Timeout Tuning

Raft timeouts should be tuned relative to your **Block Time** (`--evnode.node.block_time`) to take advantage of fast failover without causing instability.

| Flag | Default | Recommended Tuning |
|------|---------|--------------------|
| `--evnode.raft.heartbeat_timeout` | `1s` | **10-30% of the Leader Lease**. For sub-second block times, lower to `50ms`-`100ms`. |
| `--evnode.raft.leader_lease_timeout` | `500ms` | **Must be < Election Timeout**. Use `500ms` for 1s block times. For slower chains (e.g., 10s blocks), increase to `1s`-`2s` to tolerate network jitter. |
| `--evnode.raft.send_timeout` | `1s` | Should be `> 2x RTT`. |

**Relation to Block Time**:
Ideally, a failover should complete within `2 * BlockTime` to minimize user impact.
* **Fast chain (BlockTime < 1s)**: Tighten timeouts: heartbeat `50ms`, lease `250ms`.
* **Standard chain (BlockTime = 1s)**: Heartbeat `100ms`, lease `500ms`.
* **Slow chain (BlockTime > 5s)**: Defaults are usually sufficient (`1s` heartbeat).

> **Warning**: Setting timeouts too low (below RTT + jitter) will cause leadership flapping and halted block production.
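
As a sketch, the "standard chain" recommendation above translates into flags like the following (values are examples for a 1s block time, not universal defaults):

```bash
# Example timeout tuning for a 1s block time (values from the table above)
./ev-node start \
  --evnode.node.block_time=1s \
  --evnode.raft.heartbeat_timeout=100ms \
  --evnode.raft.leader_lease_timeout=500ms \
  --evnode.raft.send_timeout=1s \
  ...other flags
```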
| 55 | +
## Production Deployment Principles

### 1. Static Peering & Bootstrap
The current implementation requires **Bootstrap Mode** (`--evnode.raft.bootstrap=true`) on all nodes participating in cluster initialization.
* **All nodes** should list the full set of peers in `--evnode.raft.peers`.
* The `peers` list format is strict: `NodeID@Host:Port`.
* **Limitation**: Dynamic addition of peers (run-time membership changes) via RPC/CLI is not currently exposed. Cluster membership is static, fixed by the initial bootstrap configuration.

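Because the `NodeID@Host:Port` format is strict, it can be worth validating the peer list in a wrapper script before launching the node. A minimal sketch (the regex is an assumption about what the parser accepts, not a guarantee; check against your version):

```bash
#!/bin/sh
# Hypothetical pre-flight check: every comma-separated entry must match NodeID@Host:Port.
PEERS="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001"

# grep -Ev selects entries that do NOT match; -q makes it exit 0 if any such entry exists.
if echo "$PEERS" | tr ',' '\n' | grep -Evq '^[A-Za-z0-9._-]+@[^@:,]+:[0-9]+$'; then
  echo "invalid entry in peers list: $PEERS" >&2
  exit 1
fi
echo "peers list OK"
```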
### 2. Infrastructure Requirements
* **Encrypted Network (CRITICAL)**: Raft traffic is **unencrypted** (plain TCP). You **MUST** run the cluster inside a private network, VPN, or encrypted mesh (e.g., WireGuard, Tailscale). **Never expose Raft ports to the public internet**; doing so allows attackers to hijack cluster consensus.
* **Cluster Size**: Run an **odd number** of nodes (3 or 5) to tolerate failures (3 nodes tolerate 1 failure; 5 nodes tolerate 2).
* **Storage**: The `--evnode.raft.raft_dir` **MUST** be mounted on persistent storage. Loss of this directory causes the node to lose its identity and commit history, effectively removing it from the cluster.
* **Network**: Raft requires low-latency, reliable connectivity. Ensure firewall rules allow TCP traffic on `raft_addr`.

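For example, with `iptables` you might restrict the Raft port to the cluster's private subnet (the subnet and port below are illustrative; adapt them to your topology):

```bash
# Allow Raft TCP traffic (port 5001) only from the private cluster subnet,
# and drop it from everywhere else. Subnet and port are examples.
iptables -A INPUT -p tcp --dport 5001 -s 10.0.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 5001 -j DROP
```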
### 3. P2P Interaction & Catch-Up
Raft and P2P work in parallel to ensure reliability:
* **Hot Replication (Raft)**: New blocks produced by the leader are replicated via the Raft transport (Header + Data) to all followers. This ensures low-latency propagation of the chain tip.
* **Catch-Up (P2P)**: If a node falls behind (e.g., it was disconnected for longer than the Raft log retention), it receives a **Raft Snapshot** that updates its consensus state to the latest head. The *historical blocks* between its local state and the new head, however, are fetched via the **P2P Network** (or DA).
  * **Implication**: You must configure P2P connectivity (`--p2p.listen_address` and `--p2p.peers`) even on Raft nodes, so they can backfill missing data from peers.

### 4. Lifecycle Management
* **Rolling Restarts**: You can restart nodes one by one. The `ev-node` implementation handles graceful shutdown (leadership transfer) to minimize impact.
* **State Divergence**: If a node falls too far behind, or its local store conflicts with the Raft state (e.g., after a catastrophic disk failure), it may panic on startup to protect safety. In such cases, manual recovery (wiping state and re-syncing) may be required.

### 5. Monitoring
Monitor the following metrics (exported via Prometheus if enabled):
* **Leadership Changes**: Frequent changes indicate network instability or overloaded nodes.
* **Applied Index vs. Commit Index**: A growing lag indicates the FSM cannot keep up.

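As an illustration, alerts on these two signals might look like the PromQL below. The metric names (`raft_state_leader`, `raft_commit_index`, `raft_applied_index`) and thresholds are hypothetical placeholders; substitute the names your build actually exports:

```
# Hypothetical metric names — check what your ev-node build exports.

# Leadership flapping: more than 3 leader changes in 10 minutes.
changes(raft_state_leader[10m]) > 3

# FSM lag: applied index trailing commit index by more than 100 entries.
raft_commit_index - raft_applied_index > 100
```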
## Example Command

```bash
./ev-node start \
  --evnode.node.aggregator \
  --evnode.raft.enable \
  --evnode.raft.node_id="node-1" \
  --evnode.raft.raft_addr="0.0.0.0:5001" \
  --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
  --evnode.raft.bootstrap=true \
  --evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
  --p2p.listen_address="/ip4/0.0.0.0/tcp/26656" \
  ...other flags
```