Support only requiring 2/3 downstairs to activate #10677
Open
jmpesp wants to merge 1 commit into
Historically, Crucible required all three downstairs in order to activate. This presents a challenge when one sled is offline: any Upstairs with a downstairs on that sled cannot activate until that sled comes back. However, if Crucible had already activated, it _could_ have tolerated one downstairs going away and coming back just fine.

Crucible has been changed to require only 2/3 downstairs to activate (see RFD 542), but there is a region-replacement wrinkle to solve before shipping that. Previously an Upstairs would only activate when all three downstairs were in sync, so when a region replacement was done while there was no Upstairs (or the Upstairs was stopped), activation was taken as a signal that reconciliation had completed successfully. Now that only 2/3 downstairs need to be in sync, activation is no longer a reliable signal that the whole region set is in sync.

Richer Volume health status was added to expose more detail and let Nexus determine whether a reconciliation or repair is in progress. This is the signal Nexus now has to use to consider a post-region-replacement reconciliation successful. This PR switches to that signal and removes some out-of-date comments.
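To make the distinction concrete, here is a minimal sketch of why activation and "region set in sync" diverge under 2/3 activation. Everything in it is hypothetical and for illustration only: `DownstairsHealth`, `can_activate`, and `region_set_in_sync` are made-up names, not the actual Crucible or Nexus types, and the real Volume health status may carry different or richer states.

```rust
/// Hypothetical per-downstairs state, for illustration only.
/// This is NOT the actual Crucible / Nexus health type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DownstairsHealth {
    /// Caught up with the rest of the region set.
    InSync,
    /// Reconciliation is still in progress.
    Reconciling,
    /// Repair is still in progress.
    Repairing,
    /// Not reachable (for example, its sled is offline).
    Offline,
}

/// Assumed 2/3 rule: the Upstairs may activate once at least two of the
/// three downstairs are in sync.
fn can_activate(set: &[DownstairsHealth; 3]) -> bool {
    set.iter()
        .filter(|d| **d == DownstairsHealth::InSync)
        .count()
        >= 2
}

/// What a post-region-replacement check would need: every downstairs in
/// sync, with no reconciliation or repair outstanding.
fn region_set_in_sync(set: &[DownstairsHealth; 3]) -> bool {
    set.iter().all(|d| *d == DownstairsHealth::InSync)
}
```

With a set of `[InSync, InSync, Repairing]`, `can_activate` returns `true` while `region_set_in_sync` returns `false`, which is the case where treating activation as "replacement finished" would be wrong.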
Contributor
Some notes that came up during in-office discussions with @jmpesp: for 2/3 activation, we have to be careful with region replacement. Consider this set of steps:
We now have downstairs 1 and the replacement for downstairs 2 online.