**Value** The intrinsic node ID of the chosen node. This will be either the node this node voted for, or the node that it has received an `IAmOpen` message from.
+
+``self_healing_open.votes``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Key** Intrinsic node ID of the node which has voted for this node to be opened.
+
+``self_healing_open.sm_state``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Value** State machine state of the self-healing open protocol.
**Value** Timeout state machine state of the self-healing open protocol. Ticks based on `failover_timeout` and advances `self_healing_open.sm_state` if it falls behind.
+
+See :cpp:enum:`ccf::self_healing_open::StateMachine` above.
+
+``self_healing_open.open_kind``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Value** The kind of recovery that was performed, either `Quorum`-based which guarantees that there is at most one recovered service using this path, or `Failover`-based which could allow multiple services to recover.
doc/host_config_schema/cchost_config.json (45 additions, 0 deletions)
@@ -360,6 +360,51 @@
 "previous_sealed_ledger_secret_location": {
   "type": ["string"],
   "description": "Path to the sealed ledger secret folder, the ledger secrets for the recovered service will be unsealed from here instead of reconstructed from recovery shares."
+},
+"self_healing_open": {
+  "type": "object",
+  "properties": {
+    "identity": {
+      "type": "object",
+      "properties": {
+        "intrinsic_id": {
+          "type": "string",
+          "description": "Intrinsic identifier of this node, used to identify it in the self-healing-open protocol"
+        },
+        "published_address": {
+          "type": "string",
+          "description": "Published address (host:port) of this node, used to identify it in the self-healing-open protocol"
+        }
+      }
+    },
+    "cluster_identities": {
+      "type": "array",
+      "items": {
+        "type": "object",
+        "properties": {
+          "intrinsic_id": {
+            "type": "string",
+            "description": "Intrinsic identifier of the node, used to identify it in the self-healing-open protocol"
+          },
+          "published_address": {
+            "type": "string",
+            "description": "Published address (host:port) of the node, used for communication during the self-healing-open protocol"
+          }
+        }
+      },
+      "description": "List of identities for all nodes in the cluster"
+    },
+    "retry_timeout": {
+      "type": "string",
+      "default": "100ms",
+      "description": "Interval (time string) at which the node re-sends self-healing-open messages. This should be significantly less than 'failover_timeout'"
+    },
+    "failover_timeout": {
+      "type": "string",
+      "default": "2000ms",
+      "description": "Interval (time string) after which the node forcibly advances to the next phase of the self-healing-open protocol"
doc/operations/recovery.rst (92 additions, 2 deletions)
@@ -113,8 +113,8 @@ Summary Diagram

 Once operators have established a recovered crash-fault tolerant public network, the existing members of the consortium :ref:`must vote to accept the recovery of the network and submit their recovery shares <governance/accept_recovery:Accepting Recovery and Submitting Shares>`.

-Local Sealing Recovery
-----------------------
+Local Sealing Recovery (Experimental)
+-------------------------------------

 SNP provides the `DERIVED_KEY` guest message which derives a key from the CPU's VCEK (or VLEK), TCB version and the guest's measurement and host_data (policy), thus any change to the CPU, measurement or policy, or a rolled-back TCB version, will prevent the key from being reconstructed.
 If configured, the node will unseal the secrets it previously sealed instead of waiting for recovery shares from members after `transition_to_open` is triggered.
@@ -145,6 +145,96 @@ Which of these two paths is taken is noted in the `public:ccf.internal.last_reco
In environments with limited orchestration or limited operator access, it is desirable to allow an automated disaster recovery without operator intervention.
+At a high level, Self-Healing-Open recovery allows recovering replicas to discover which node has the most up-to-date ledger and automatically recover the network using that ledger.
+The protocol completes with a node choosing to `transition-to-open`, and so requires another mechanism to recover the private ledger.
+If it is likely that the nodes will restart on the same hardware, local sealing recovery (see above) can be used to recover the private ledger automatically, and bring the service fully online.
+
+There are two paths: an election path and a very-high-availability failover path.
+The election path ensures that if all nodes restart with full network connectivity, the on-disk ledgers of a majority of nodes contain every committed transaction, and no timeouts trigger, then there will be only one recovered network and all committed transactions will be persisted.
+However, the election path can become stuck, in which case the failover path is designed to ensure progress.
+
+In the election path, nodes first gossip with each other, learning about the ledgers of other nodes.
+Once they have heard from every node, they vote for the node with the best ledger.
+If a node receives votes from a majority of nodes, it invokes `transition-to-open` and notifies the other nodes to restart and join it.
+This path is illustrated below, and is guaranteed to succeed if all nodes can communicate and no timeouts trigger.
+
+.. mermaid::
+
+    sequenceDiagram
+        participant N1
+        participant N2
+        participant N3
+
+        Note over N1, N3: Gossip
+
+        N1 ->> N2: Gossip(Tx=1)
+        N1 ->> N3: Gossip(Tx=1)
+        N2 ->> N3: Gossip(Tx=2)
+        N3 ->> N2: Gossip(Tx=3)
+
+        Note over N1, N3: Vote
+        N2 ->> N3: Vote
+        N3 ->> N3: Vote
+
+        Note over N1, N3: Open/Join
+        N3 ->> N1: IAmOpen
+        N3 ->> N2: IAmOpen
+
+        Note over N1, N2: Restart
+
+        Note over N3: Transition-to-open
+
+        Note over N3: Local unsealing
+
+        Note over N3: Open
+
+        N1 ->> N3: Join
+        N2 ->> N3: Join
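
The election path described above can be sketched as a small simulation. This is an illustrative model only, not the CCF implementation: it assumes that the "best ledger" is the one with the highest transaction number (as suggested by the `Gossip(Tx=...)` labels in the diagram), that ties are broken by node ID, and that every node has heard from every other node. The names `choose_vote` and `elect` are hypothetical.

```python
from typing import Dict, Optional, Set


def choose_vote(gossip: Dict[str, int]) -> str:
    """Return the node with the best ledger among those heard from.

    Assumption: "best" means highest transaction number, with ties
    broken by the lexicographically largest node ID.
    """
    return max(gossip, key=lambda node: (gossip[node], node))


def elect(ledgers: Dict[str, int]) -> Optional[str]:
    """Simulate one election round with full connectivity.

    `ledgers` maps node ID to the last transaction number on its disk.
    Returns the node that collects votes from a majority, or None.
    """
    votes: Dict[str, Set[str]] = {}
    for voter in ledgers:
        # With full connectivity every voter has the same gossip view.
        candidate = choose_vote(ledgers)
        votes.setdefault(candidate, set()).add(voter)

    majority = len(ledgers) // 2 + 1
    for candidate, voters in votes.items():
        if len(voters) >= majority:
            return candidate  # this node would invoke transition-to-open
    return None


if __name__ == "__main__":
    print(elect({"N1": 1, "N2": 2, "N3": 3}))
```

Because all nodes share the same view, they all vote for the same candidate, which is why this path yields a single recovered network when no timeouts trigger.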
+
+In the failover path, each phase has a timeout that skips to the next phase if a failure has occurred.
+For example, the election path requires all nodes to communicate to advance from the gossip phase to the vote phase.
+However, if any node fails to recover, the election path is stuck.
+In this case, after a timeout, nodes will advance to the vote phase regardless of whether they have heard from all nodes, and vote for the best ledger they have heard of at that point.
+
+Unfortunately, this can lead to multiple forks of the service if different nodes cannot communicate with each other and time out.
+Hence, we recommend setting the timeout substantially higher than the highest expected recovery time, to minimise the chance of this happening.
+The `public:ccf.gov.selfhealingopen.failover_open` table records whether timeouts were used to open the service, so that this can be audited.
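
The timeout recommendation above, together with the schema note that `retry_timeout` should be significantly less than `failover_timeout`, could be checked before deployment with a small script. The following is a minimal sketch under stated assumptions: the accepted unit suffixes and the factor of 10 used for "significantly less" are illustrative choices rather than values defined by CCF, and `parse_time_ms` and `check_timeouts` are hypothetical helpers. The defaults `100ms` and `2000ms` are taken from the schema.

```python
import re
from typing import Dict

# Assumed unit suffixes; the schema only says "time string" and shows
# values such as "100ms" and "2000ms".
_UNITS_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0, "min": 60_000.0, "h": 3_600_000.0}


def parse_time_ms(value: str) -> float:
    """Convert a time string such as "100ms" to milliseconds."""
    match = re.fullmatch(r"(\d+)(us|ms|s|min|h)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised time string: {value!r}")
    return int(match.group(1)) * _UNITS_MS[match.group(2)]


def check_timeouts(cfg: Dict[str, str], min_ratio: float = 10.0) -> bool:
    """Return True if failover_timeout is at least min_ratio x retry_timeout.

    `cfg` stands for the "self_healing_open" object of the node config.
    `min_ratio` is an illustrative threshold, not a CCF requirement.
    """
    retry = parse_time_ms(cfg.get("retry_timeout", "100ms"))
    failover = parse_time_ms(cfg.get("failover_timeout", "2000ms"))
    return failover >= min_ratio * retry


if __name__ == "__main__":
    print(check_timeouts({"retry_timeout": "100ms", "failover_timeout": "2000ms"}))
```

A check like this only compares the two intervals with each other; whether `failover_timeout` exceeds the highest expected recovery time still has to be judged per deployment.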
+
+This failover path is illustrated below.
+
+.. mermaid::
+
+    sequenceDiagram
+        participant N1
+        participant N2
+        participant N3
+
+        Note over N1, N3: Gossip
+
+        N2 ->> N3: Gossip(Tx=2)
+        N3 ->> N2: Gossip(Tx=3)
+
+        Note over N1: Timeout
+        Note over N3: Timeout
+
+        Note over N1, N3: Vote
+
+        N1 ->> N1: Vote
+        N3 ->> N3: Vote
+        N2 ->> N3: Vote
+
+        Note over N1, N3: Open/Join
+
+        Note over N1: Transition-to-open
+        Note over N3: Transition-to-open
+
+
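
The fork risk in the failover path can be made concrete with a short simulation of the scenario in the diagram. This is a sketch, not the CCF implementation. It assumes that each node votes for the highest transaction number it has heard of when the gossip timeout fires, and that any node holding at least one vote when the vote phase times out goes on to open; the second assumption is inferred from the diagram, where N1 opens on its own vote, and is not stated in the text. The names `best_known` and `failover_round` are hypothetical.

```python
from typing import Dict, Set


def best_known(view: Dict[str, int]) -> str:
    """Best ledger a node has heard of (assumed: highest Tx, ties by node ID)."""
    return max(view, key=lambda node: (view[node], node))


def failover_round(views: Dict[str, Dict[str, int]]) -> Set[str]:
    """Return the set of nodes that open after the timeouts fire.

    `views[n]` holds the ledgers node n knows about, including its own,
    at the moment its gossip timeout fires.
    """
    votes: Dict[str, Set[str]] = {}
    for voter, view in views.items():
        votes.setdefault(best_known(view), set()).add(voter)
    # Assumption: on vote-phase timeout, every node with at least one vote opens.
    return set(votes)


if __name__ == "__main__":
    # N1 is partitioned and only knows its own ledger; N2 and N3 hear each other.
    partitioned = {
        "N1": {"N1": 1},
        "N2": {"N2": 2, "N3": 3},
        "N3": {"N2": 2, "N3": 3},
    }
    print(sorted(failover_round(partitioned)))
```

In the partitioned case both N1 and N3 open, giving two forks of the service, whereas with full views only one node would open.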
+If the network fails during reconfiguration, each node will use its latest known configuration to recover. Since reconfiguration requires votes from a majority of nodes, the latest configuration should recover using the election path. However, nodes in the previous configuration may recover using the failover path.