Commit 337d41d

cjen1-msft, eddyashton, achamayou, and Copilot authored
Self healing open (#7189)
Co-authored-by: Eddy Ashton <ashton.eddy@gmail.com>
Co-authored-by: Amaury Chamayou <amaury@xargs.fr>
Co-authored-by: Amaury Chamayou <amchamay@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 94cfea3 commit 337d41d

38 files changed

Lines changed: 2953 additions & 68 deletions

.github/workflows/ci-verification.yml

Lines changed: 22 additions & 0 deletions
@@ -239,3 +239,25 @@ jobs:
           name: tlc-trace-validation-consensus
           path: |
             tla/traces/*
+
+  model-checking-self-healing-open:
+    name: Model Checking - Self-Healing Open
+    runs-on: [self-hosted, 1ES.Pool=gha-vmss-d16av5-ci]
+    container:
+      image: mcr.microsoft.com/azurelinux/base/core:3.0
+      options: --user root --publish-all --cap-add NET_ADMIN --cap-add NET_RAW --cap-add SYS_PTRACE
+
+    steps:
+      - name: "Checkout dependencies"
+        shell: bash
+        run: |
+          gpg --import /etc/pki/rpm-gpg/MICROSOFT-RPM-GPG-KEY
+          tdnf -y update
+          tdnf -y install ca-certificates git
+
+      - uses: actions/checkout@v5
+      - name: Install Stateright dependencies
+        run: |
+          tdnf install -y cargo
+
+      - run: cd tla/disaster-recovery && cargo run check

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -5,6 +5,14 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
+## [7.0.0-dev7]
+
+[7.0.0-dev7]: https://github.com/microsoft/CCF/releases/tag/ccf-7.0.0-dev7
+
+### Added
+
+- Experimental self-healing-open protocol for automatically transitioning-to-open during a disaster recovery without operator intervention. (#7189)
+
 ## [7.0.0-dev6]
 
 [7.0.0-dev6]: https://github.com/microsoft/CCF/releases/tag/ccf-7.0.0-dev6

CMakeLists.txt

Lines changed: 19 additions & 4 deletions
@@ -366,6 +366,7 @@ endif()
 set(CCF_IMPL_SOURCE
     ${CCF_DIR}/src/enclave/main.cpp ${CCF_DIR}/src/enclave/thread_local.cpp
     ${CCF_DIR}/src/node/quote.cpp ${CCF_DIR}/src/node/uvm_endorsements.cpp
+    ${CCF_DIR}/src/node/self_healing_open_impl.cpp
 )
 
 add_ccf_static_library(
@@ -736,7 +737,9 @@ if(BUILD_TESTS)
   add_unit_test(
     frontend_test
     ${CMAKE_CURRENT_SOURCE_DIR}/src/node/rpc/test/frontend_test.cpp
-    ${CCF_DIR}/src/node/quote.cpp ${CCF_DIR}/src/node/uvm_endorsements.cpp
+    ${CCF_DIR}/src/node/quote.cpp
+    ${CCF_DIR}/src/node/uvm_endorsements.cpp
+    ${CCF_DIR}/src/node/self_healing_open_impl.cpp
   )
   target_link_libraries(
     frontend_test
@@ -747,6 +750,8 @@ if(BUILD_TESTS)
             ccfcrypto
             ccf_kv
             ccf_tasks
+            curl
+            uv
   )
 
   add_unit_test(
@@ -772,11 +777,21 @@ if(BUILD_TESTS)
   add_unit_test(
     node_frontend_test
     ${CMAKE_CURRENT_SOURCE_DIR}/src/node/rpc/test/node_frontend_test.cpp
-    ${CCF_DIR}/src/node/quote.cpp ${CCF_DIR}/src/node/uvm_endorsements.cpp
+    ${CCF_DIR}/src/node/quote.cpp
+    ${CCF_DIR}/src/node/uvm_endorsements.cpp
+    ${CCF_DIR}/src/node/self_healing_open_impl.cpp
   )
   target_link_libraries(
-    node_frontend_test PRIVATE ${CMAKE_THREAD_LIBS_INIT} http_parser ccf_js
-                               ccf_endpoints ccfcrypto ccf_kv
+    node_frontend_test
+    PRIVATE ${CMAKE_THREAD_LIBS_INIT}
+            http_parser
+            ccf_js
+            ccf_endpoints
+            ccfcrypto
+            ccf_kv
+            uv
+            curl
+            ccf_tasks
   )
 
   add_unit_test(

doc/audit/builtin_maps.rst

Lines changed: 55 additions & 0 deletions
@@ -564,4 +564,59 @@ While the contents themselves are encrypted, the table is public so as to be acc
 **Value** The mechanism by which the ledger secret was recovered.
 
 .. doxygenenum:: ccf::RecoveryType
+   :project: CCF
+
+``self_healing_open.nodes``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Key** Intrinsic node ID: A string which is unique to a particular node role within a cluster.
+
+**Value**
+
+.. doxygenstruct:: ccf::self_healing_open::NodeInfo
+   :project: CCF
+   :members:
+
+``self_healing_open.gossip``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Key** Intrinsic node ID of the source of the gossip message.
+
+**Value**
+
+.. doxygenstruct:: ccf::self_healing_open::GossipRequest
+   :project: CCF
+   :members:
+
+``self_healing_open.chosen_node``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Value** The intrinsic node ID of the chosen node. This will either be the node this node voted for, or the node that it has received an `IAmOpen` message from.
+
+``self_healing_open.votes``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Key** Intrinsic node ID of the node which has voted for this node to be opened.
+
+``self_healing_open.sm_state``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Value** State machine state of the self-healing open protocol.
+
+.. doxygenenum:: ccf::self_healing_open::StateMachine
+   :project: CCF
+
+``self_healing_open.timeout_sm_state``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Value** Timeout state machine state of the self-healing open protocol. Ticks based on `failover_timeout` and advances `self_healing_open.sm_state` if it falls behind.
+
+See :cpp:enum:`ccf::self_healing_open::StateMachine` above.
+
+``self_healing_open.open_kind``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Value** The kind of recovery that was performed, either `Quorum`-based which guarantees that there is at most one recovered service using this path, or `Failover`-based which could allow multiple services to recover.
+
+.. doxygenenum:: ccf::self_healing_open::OpenKinds
    :project: CCF
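
Reviewer's sketch of the `timeout_sm_state` description above: one possible reading of "ticks based on `failover_timeout` and advances `self_healing_open.sm_state` if it falls behind". The `Phase` enumerators, `PhaseTracker`, and both method names are hypothetical and borrowed from the phase labels in the sequence diagrams; the real `ccf::self_healing_open::StateMachine` values are not shown in this part of the diff and may differ.

```cpp
// Hypothetical phase names; not the real StateMachine enumerators.
enum class Phase
{
  Gossip = 0,
  Vote = 1,
  Open = 2
};

// Returns the phase that follows p; Open is terminal in this sketch.
inline Phase next_phase(Phase p)
{
  switch (p)
  {
    case Phase::Gossip:
      return Phase::Vote;
    case Phase::Vote:
      return Phase::Open;
    case Phase::Open:
      return Phase::Open;
  }
  return p;
}

struct PhaseTracker
{
  Phase sm_state = Phase::Gossip; // driven by protocol messages
  Phase timeout_sm_state = Phase::Gossip; // driven by the failover timer

  // Assumed hook: the protocol itself completed the current phase.
  void on_protocol_progress()
  {
    sm_state = next_phase(sm_state);
  }

  // Assumed hook: called each time failover_timeout elapses.
  void on_failover_timeout()
  {
    timeout_sm_state = next_phase(timeout_sm_state);
    if (sm_state < timeout_sm_state)
    {
      // The protocol state has fallen behind the timer, so drag it forward.
      sm_state = timeout_sm_state;
    }
  }
};
```

In this reading, a node that is already keeping pace with the protocol is unaffected by a timer tick, while a stuck node is forced forward by exactly one phase per `failover_timeout`.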

doc/host_config_schema/cchost_config.json

Lines changed: 45 additions & 0 deletions
@@ -360,6 +360,51 @@
             "previous_sealed_ledger_secret_location": {
               "type": ["string"],
               "description": "Path to the sealed ledger secret folder, the ledger secrets for the recovered service will be unsealed from here instead of reconstructed from recovery shares."
+            },
+            "self_healing_open": {
+              "type": "object",
+              "properties": {
+                "identity": {
+                  "type": "object",
+                  "properties": {
+                    "intrinsic_id": {
+                      "type": "string",
+                      "description": "Intrinsic identifier of this node, used to identify it in the self-healing-open protocol"
+                    },
+                    "published_address": {
+                      "type": "string",
+                      "description": "Published address (host:port) of this node, used to identify it in the self-healing-open protocol"
+                    }
+                  }
+                },
+                "cluster_identities": {
+                  "type": "array",
+                  "items": {
+                    "type": "object",
+                    "properties": {
+                      "intrinsic_id": {
+                        "type": "string",
+                        "description": "Intrinsic identifier of the node, used to identify it in the self-healing-open protocol"
+                      },
+                      "published_address": {
+                        "type": "string",
+                        "description": "Published address (host:port) of the node, used for communication during the self-healing-open protocol"
+                      }
+                    }
+                  },
+                  "description": "List of identities for all nodes in the cluster"
+                },
+                "retry_timeout": {
+                  "type": "string",
+                  "default": "100ms",
+                  "description": "Interval (time string) at which the node re-sends self-healing-open messages. This should be significantly less than 'failover_timeout'"
+                },
+                "failover_timeout": {
+                  "type": "string",
+                  "default": "2000ms",
+                  "description": "Interval (time string) after which the node forcibly advances to the next phase of the self-healing-open protocol"
+                }
+              }
             }
           },
           "required": ["previous_service_identity_file"],
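
Reviewer's sketch of the `retry_timeout` / `failover_timeout` relationship described in the schema above: how many re-sends fit into one failover window. `parse_time_string` and `retries_per_failover_window` are hypothetical helpers written for this note, not CCF APIs, and the parser deliberately handles only the `ms` and `s` suffixes rather than the full time-string grammar.

```cpp
#include <chrono>
#include <cstddef>
#include <optional>
#include <string>

// Parses a time string such as "100ms" or "2s" into milliseconds.
// Sketch only: accepts just the "ms" and "s" suffixes and returns
// std::nullopt for anything else.
inline std::optional<std::chrono::milliseconds> parse_time_string(
  const std::string& s)
{
  std::size_t pos = 0;
  while (pos < s.size() && s[pos] >= '0' && s[pos] <= '9')
  {
    ++pos;
  }
  if (pos == 0 || pos > 9)
  {
    // No leading digits, or too many digits to convert safely.
    return std::nullopt;
  }
  const long long value = std::stoll(s.substr(0, pos));
  const std::string unit = s.substr(pos);
  if (unit == "ms")
  {
    return std::chrono::milliseconds(value);
  }
  if (unit == "s")
  {
    return std::chrono::milliseconds(std::chrono::seconds(value));
  }
  return std::nullopt;
}

// Number of whole retry intervals that fit into one failover window.
inline long long retries_per_failover_window(
  std::chrono::milliseconds retry, std::chrono::milliseconds failover)
{
  if (retry.count() <= 0)
  {
    return 0;
  }
  return failover.count() / retry.count();
}
```

With the schema defaults (`100ms` and `2000ms`), this gives 20 re-send opportunities per phase before the failover timer fires, which is one way to sanity-check that `retry_timeout` is "significantly less" than `failover_timeout` in a given deployment.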

doc/operations/recovery.rst

Lines changed: 92 additions & 2 deletions
@@ -113,8 +113,8 @@ Summary Diagram
 
 Once operators have established a recovered crash-fault tolerant public network, the existing members of the consortium :ref:`must vote to accept the recovery of the network and submit their recovery shares <governance/accept_recovery:Accepting Recovery and Submitting Shares>`.
 
-Local Sealing Recovery
-----------------------
+Local Sealing Recovery (Experimental)
+-------------------------------------
 
 SNP provides the `DERIVED_KEY` guest message which derives a key from the CPU's VCEK (or VLEK), TCB version and the guest's measurement and host_data (policy), thus any change to the CPU, measurement or policy, or a rolled-back TCB version, will prevent the key from being reconstructed.
 If configured, the node will unseal the secrets it previously sealed instead of waiting for recovery shares from members after `transition_to_open` is triggered.
@@ -145,6 +145,96 @@ Which of these two paths is taken is noted in the `public:ccf.internal.last_reco
     ...
     $ /opt/ccf/bin/js_generic --config /path/to/config/file
 
+Self-Healing-Open recovery (Experimental)
+-----------------------------------------
+
+In environments with limited orchestration or limited operator access, it is desirable to allow an automated disaster recovery without operator intervention.
+At a high level, Self-Healing-Open recovery allows recovering replicas to discover which node has the most up-to-date ledger and automatically recover the network using that ledger.
+The protocol completes with a node choosing to `transition-to-open`, and so requires another mechanism to recover the private ledger.
+If it is likely that the nodes will restart on the same hardware, local sealing recovery (see above) can be used to recover the private ledger automatically, and bring the service fully online.
+
+There are two paths: an election path and a very-high-availability failover path.
+The election path ensures that if all nodes restart and have full network connectivity, a majority of nodes' on-disk ledgers contain every committed transaction, and no timeouts trigger, then there will be only one recovered network and all committed transactions will be persisted.
+However, the election path can become stuck, in which case the failover path is designed to ensure progress.
+
+In the election path, nodes first gossip with each other, learning of the ledgers of other nodes.
+Once they have heard from every node, they vote for the node with the best ledger.
+If a node receives votes from a majority of nodes, it invokes `transition-to-open` and notifies the other nodes to restart and join it.
+This path is illustrated below, and is guaranteed to succeed if all nodes can communicate and no timeouts trigger.
+
+.. mermaid::
+
+    sequenceDiagram
+        participant N1
+        participant N2
+        participant N3
+
+        Note over N1, N3: Gossip
+
+        N1 ->> N2: Gossip(Tx=1)
+        N1 ->> N3: Gossip(Tx=1)
+        N2 ->> N3: Gossip(Tx=2)
+        N3 ->> N2: Gossip(Tx=3)
+
+        Note over N1, N3: Vote
+        N2 ->> N3: Vote
+        N3 ->> N3: Vote
+
+        Note over N1, N3: Open/Join
+        N3 ->> N1: IAmOpen
+        N3 ->> N2: IAmOpen
+
+        Note over N1, N2: Restart
+
+        Note over N3: Transition-to-open
+
+        Note over N3: Local unsealing
+
+        Note over N3: Open
+
+        N1 ->> N3: Join
+        N2 ->> N3: Join
+
+In the failover path, each phase has a timeout to skip to the next phase if a failure has occurred.
+For example, the election path requires all nodes to communicate to advance from the gossip phase to the vote phase.
+However, if any node fails to recover, the election path is stuck.
+In this case, after a timeout, nodes will advance to the vote phase regardless of whether they have heard from all nodes, and vote for the best ledger they have heard of at that point.
+
+Unfortunately, this can lead to multiple forks of the service if different nodes cannot communicate with each other and time out.
+Hence, we recommend setting the timeout substantially higher than the highest expected recovery time, to minimise the chance of this happening.
+To audit whether timeouts were used to open the service, consult the `public:ccf.gov.selfhealingopen.failover_open` table.
+
+This failover path is illustrated below.
+
+.. mermaid::
+
+    sequenceDiagram
+        participant N1
+        participant N2
+        participant N3
+
+        Note over N1, N3: Gossip
+
+        N2 ->> N3: Gossip(Tx=2)
+        N3 ->> N2: Gossip(Tx=3)
+
+        Note over N1: Timeout
+        Note over N3: Timeout
+
+        Note over N1, N3: Vote
+
+        N1 ->> N1: Vote
+        N3 ->> N3: Vote
+        N2 ->> N3: Vote
+
+        Note over N1, N3: Open/Join
+
+        Note over N1: Transition-to-open
+        Note over N3: Transition-to-open
+
+
+If the network fails during reconfiguration, each node will use its latest known configuration to recover. Since reconfiguration requires votes from a majority of nodes, the latest configuration should recover using the election path; however, nodes in the previous configuration may recover using the failover path.
+
 Notes
 -----
 
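
Reviewer's sketch of the election path described in the recovery documentation above: pick the best ledger from received gossip, then open only on a strict majority of votes. All names here (`LedgerSummary`, `choose_vote`, `heard_from_all`, `has_majority`) are hypothetical. Ranking ledgers by a single `last_tx` number mirrors the `Gossip(Tx=N)` labels in the diagram, and the tie-break on intrinsic ID is an assumption; the implementation in `self_healing_open_impl.cpp` is not shown in this part of the diff and may compare ledgers differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for what a node learns from one gossip message.
struct LedgerSummary
{
  // Assumption: ledgers are ranked by a single number, as in Gossip(Tx=N).
  uint64_t last_tx = 0;
};

// Returns the intrinsic ID to vote for, or std::nullopt if no gossip has
// been received. The highest last_tx wins; ties go to the larger intrinsic
// ID so that every node with the same view makes the same choice.
inline std::optional<std::string> choose_vote(
  const std::map<std::string, LedgerSummary>& gossips)
{
  std::optional<std::string> best;
  uint64_t best_tx = 0;
  for (const auto& [id, summary] : gossips)
  {
    // std::map iterates in ascending key order, so >= lets larger IDs
    // win ties.
    if (!best.has_value() || summary.last_tx >= best_tx)
    {
      best = id;
      best_tx = summary.last_tx;
    }
  }
  return best;
}

// Election path: gossip only completes once every configured node is heard.
inline bool heard_from_all(std::size_t gossips_received, std::size_t cluster_size)
{
  return gossips_received >= cluster_size;
}

// A node may open once it holds votes from a strict majority of the cluster.
inline bool has_majority(std::size_t votes, std::size_t cluster_size)
{
  return votes > cluster_size / 2;
}
```

Applied to the first diagram, N3 advertises the highest Tx, so it is the choice of any node that has heard all three gossips, and the two votes it receives (from N2 and itself) form a majority of three. In the failover path, `heard_from_all` is simply bypassed by the timeout, which is why two nodes with different partial views can both reach the open phase.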

include/ccf/node/startup_config.h

Lines changed: 11 additions & 0 deletions
@@ -11,6 +11,7 @@
 #include "ccf/service/service_config.h"
 #include "ccf/service/tables/host_data.h"
 #include "ccf/service/tables/members.h"
+#include "ccf/service/tables/self_healing_open.h"
 
 #include <optional>
 #include <string>
@@ -102,6 +103,15 @@ namespace ccf
     Snapshots snapshots = {};
   };
 
+  struct SelfHealingOpenConfig
+  {
+    self_healing_open::Identity identity;
+    std::vector<self_healing_open::Identity> cluster_identities;
+    ccf::ds::TimeString retry_timeout = {"100ms"};
+    ccf::ds::TimeString failover_timeout = {"2000ms"};
+    bool operator==(const SelfHealingOpenConfig&) const = default;
+  };
+
   struct StartupConfig : CCFConfig
   {
     StartupConfig() = default;
@@ -146,6 +156,7 @@ namespace ccf
         std::nullopt;
       std::optional<std::string> previous_sealed_ledger_secret_location =
         std::nullopt;
+      std::optional<SelfHealingOpenConfig> self_healing_open = std::nullopt;
     };
     Recover recover = {};
   };
