feat: CON-1445 Start remote DKG as soon as request appears in state#10055
Conversation
pierugo-dfinity
left a comment
There was a problem hiding this comment.
Something that crosses my mind when reading
remote
NiDkgConfigs are no longer stored in summary blocks
Can something break on upgrading the subnet if the old replica version included a NiDkgConfig in the CUP's summary block? From what I can tell, it looks like we do not re-validate the CUP's block when starting up (which would fail on the new replica version) but directly insert it into our validated pool instead. So it appears to look fine. Also from NiDKG's PoV, it does not make a difference whether the remote config comes from the state or the summary since merge_configs unifies them.
But more generally, couldn't having an invalid block in our validated pool break stuff elsewhere?
This is not something we usually pay attention to. Technically, I guess every upgrade summary is "invalid" on the new version, because it contains the wrong replica version. The only place where the upgrade CUP is validated again after the upgrade should be when it is fetched by an orchestrator of a different node that is still on the old version. But in that case we only validate the threshold signature. There, what's more important is that the new replica still returns the exact same bytes that the CUP was created with on the old version, which is why we store the original bytes separately in the pool. Otherwise the byte representation could change if an old CUP was deserialized and serialized again using the new version. |
Good point. Then, this addresses my concern.
Good to know, right, thanks! |
…ty#10250) Instead of the verbose `(NiDkgId, CallbackId, Result<NiDkgTranscript, String>)` triple, this PR introduces a dedicated struct for remote NiDKG transcript creation results. Additionally, we remove all mentions of "early" remote transcripts. After dfinity#10055, "early" transcript creation (i.e. as part of data blocks) is the only way for remote transcripts to be created, since they can no longer be part of summary blocks. Therefore, we can drop the "early" prefix. Note that, although we change the type of the summary payload, this change is backward/forward compatible (see passing `cup_compatibility_test`). This is because the hash function of the triple is equal to the one of the struct, therefore no migration is necessary.
Background
The
NiDkgConfigcontains the data necessary for a subnet to run one instance of the NiDKG protocol. For remote DKG (where a subnet generates key material for a different subnet), these configs are derived from request contexts stored in the replicated state.Problem
Before this PR, these remote
NiDkgConfigs were only created and included as part of summary blocks. This means, a remote DKG request had to wait in the replicated state until consensus reaches the next summary block, at which point the config is created and the remote DKG protocol is started. This increases the worst case latency of remote DKG requests (i.e. subnet/engine creations) by up to one DKG interval.Proposed Changes
Instead of waiting until the next summary, we can create the
NiDkgConfigs on demand, as soon as the corresponding request appears in the replicated state. We continue to use thestart_heightof the most recent summary block for these configs, essentially treating them "as if" they had already existed when the last summary was created. The only difference is that the number of remaining blocks to complete the remote DKG may be reduced, since it may be started at any point throughout the interval.For simplicity, this PR removes the handling of remote DKG as part of summary blocks entirely, which implies that remote
NiDkgConfigs are no longer stored in summary blocks. Therefore, when creating/validating dealings and transcripts, we now iterate over the union of local configs in the summary and remote configs derived from the replicated state.Additionally, this means that config creation errors and timeouts are now also handled by data blocks:
NiDkgConfigs, such that the same request isn't handled multiple times.Future Work
Currently, we may start a remote DKG request near the end of the interval, without enough data blocks left to include enough dealings to finish all required transcripts. This means the request will be retried at the start of the next interval. To improve efficiency, we should not create and include dealings for remote DKGs that were received with an insufficient number of data blocks left. Instead, we should only start working on the request at the start of the next interval.
Consider increasing the number of dealings per block and the number of prioritized remote DKGs per interval.