nexus: add failure reasons to Failed VMMs using a VmmState domain type#10425
nexus: add failure reasons to Failed VMMs using a VmmState domain type#10425hawkw wants to merge 13 commits into
Failed VMMs using a VmmState domain type#10425Conversation
Failed VMMs using a VmmState domain type
hawkw
left a comment
There was a problem hiding this comment.
Some notes to hopefully make this loud and awkward PR a bit more comfortable to review. Also, I'm tagging in @sunshowers in the hopes that they will help to defend my decision to have three versions of these types in case anyone else gets mad at me for that :)
| self.outcome = CheckOutcome::Failure(Failure::SledExpunged); | ||
| return mk_failed(); | ||
| return mk_failed(VmmFailureReason::SledExpunged); |
There was a problem hiding this comment.
In future, I would like to replace the Failure type which is used internally to the instance_watcher task with just using VmmFailureReason --- I didn't do that in this PR because the diff has gotten big enough already, but will do that in a subsequent change if y'all decide you don't totally hate what I've proposed thus far.
| #[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)] | ||
| pub struct SledVmmState { | ||
| pub vmm_state: VmmRuntimeState, | ||
| /// The current state of any inbound migration to this VMM. | ||
| pub migration_in: Option<sled_agent::MigrationRuntimeState>, | ||
|
|
||
| /// The state of any outbound migration from this VMM. | ||
| pub migration_out: Option<sled_agent::MigrationRuntimeState>, | ||
| } |
There was a problem hiding this comment.
this is the "domain type" version of SledVmmState
| impl SledVmmState { | ||
| pub fn migrations(&self) -> Migrations<'_> { | ||
| Migrations { | ||
| migration_in: self.migration_in.as_ref(), | ||
| migration_out: self.migration_out.as_ref(), | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This just moved from its previous home in sled_agent_types::instance and has not otherwise changed.
| /// The dynamic runtime properties of an individual VMM process. | ||
| #[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)] | ||
| pub struct VmmRuntimeState { | ||
| /// The last state reported by this VMM. | ||
| pub state: VmmState, | ||
| /// The generation number for this VMM's state. | ||
| #[serde(rename = "gen")] | ||
| pub generation: Generation, | ||
| /// Timestamp for the VMM's state. | ||
| pub time_updated: DateTime<Utc>, | ||
| } |
There was a problem hiding this comment.
This is the same as the VmmRuntimeState in sled_agent_types but using the domain type version of VmmState, which has a failure reason...
| /// Transitions the VMM state from the current state to `new_state`, | ||
| /// returning a `VmmRuntimeState` representing the VMM after transitioning | ||
| /// to the new state. | ||
| /// | ||
| /// If `new_state` is the same as the current state, returns a copy of | ||
| /// `self` without any changes. | ||
| pub fn transition(&self, new_state: VmmState) -> Self { | ||
| if new_state == self.state { | ||
| return self.clone(); | ||
| } | ||
| Self { | ||
| state: new_state, | ||
| generation: self.generation.next(), | ||
| time_updated: Utc::now(), | ||
| } | ||
| } |
There was a problem hiding this comment.
This is a helper for reducing the boilerplate of doing these state transitions, which I thought was nice --- especially because it enforces the property that you bump the generation when doing a state transition, which you previously had to remember to do.
| /// Returns `true` if this is a terminal VMM state. | ||
| /// | ||
| /// A VMM in a terminal state will never transition to another state. | ||
| pub fn is_terminal(&self) -> bool { | ||
| matches!(self, VmmState::Destroyed | VmmState::Failed(_)) | ||
| } | ||
|
|
||
| /// Returns `true` if the VMM is in a `Failed` state. | ||
| pub fn is_failed(&self) -> bool { | ||
| matches!(self, VmmState::Failed(_)) | ||
| } | ||
|
|
||
| /// Returns `true` if the VMM is in a state where it is safe to | ||
| /// deallocate its sled resources and mark it as deleted. | ||
| pub fn is_destroyable(&self) -> bool { | ||
| matches!( | ||
| self, | ||
| VmmState::Destroyed | VmmState::Failed(_) | VmmState::SagaUnwound | ||
| ) | ||
| } | ||
|
|
||
| /// Returns `true` if the VMM exists on a sled. | ||
| /// | ||
| /// This is `false` for VMMs that have not yet been created, have been | ||
| /// destroyed, or were produced by an unwound saga and will never exist. | ||
| pub fn exists_on_sled(&self) -> bool { | ||
| !matches!( | ||
| self, | ||
| VmmState::Creating | VmmState::SagaUnwound | VmmState::Destroyed | ||
| ) | ||
| } |
There was a problem hiding this comment.
These predicates previously existed in the nexus_db_model version of the type; I've moved them here.
| impl From<sled_agent::VmmState> for VmmState { | ||
| fn from(state: sled_agent::VmmState) -> Self { | ||
| match state { | ||
| sled_agent::VmmState::Starting => VmmState::Starting, | ||
| sled_agent::VmmState::Running => VmmState::Running, | ||
| sled_agent::VmmState::Stopping => VmmState::Stopping, | ||
| sled_agent::VmmState::Stopped => VmmState::Stopped, | ||
| sled_agent::VmmState::Rebooting => VmmState::Rebooting, | ||
| sled_agent::VmmState::Migrating => VmmState::Migrating, | ||
| sled_agent::VmmState::Failed => { | ||
| // If we are converting a state received from a sled-agent that | ||
| // indicates that the VMM is failed, the failure reason is | ||
| // implicitly "from_sled_agent" for now. | ||
| VmmState::Failed(VmmFailureReason::FromSledAgent) | ||
| } | ||
| sled_agent::VmmState::Destroyed => VmmState::Destroyed, | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This is where we now synthesize the FromSledAgent failure reason when receiv9ing things from the sled-agent.
| impl From<VmmState> for sled_agent::VmmState { | ||
| fn from(state: VmmState) -> Self { | ||
| match state { | ||
| // The `Creating` state is internal to Nexus; the outside world | ||
| // should treat it as equivalent to `Starting`. | ||
| VmmState::Starting | VmmState::Creating => { | ||
| sled_agent::VmmState::Starting | ||
| } | ||
| VmmState::Running => sled_agent::VmmState::Running, | ||
| VmmState::Stopping => sled_agent::VmmState::Stopping, | ||
| VmmState::Stopped => sled_agent::VmmState::Stopped, | ||
| VmmState::Rebooting => sled_agent::VmmState::Rebooting, | ||
| VmmState::Migrating => sled_agent::VmmState::Migrating, | ||
| VmmState::Failed(_) => sled_agent::VmmState::Failed, | ||
|
|
||
| // The `SagaUnwound` state is internal to Nexus; the outside world | ||
| // should treat it as equivalent to `Destroyed`. | ||
| VmmState::Destroyed | VmmState::SagaUnwound => { | ||
| sled_agent::VmmState::Destroyed | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This conversion already existed but was defined on the nexus_db_model type rather than here.
| impl From<VmmState> for omicron_common::api::external::InstanceState { | ||
| fn from(value: VmmState) -> Self { | ||
| use omicron_common::api::external::InstanceState as Output; | ||
|
|
||
| match value { | ||
| // An instance with a VMM which is in the `Creating` state maps to | ||
| // `InstanceState::Starting`, rather than `InstanceState::Creating`. | ||
| // If we are still creating the VMM, this is because we are | ||
| // attempting to *start* the instance; instances may be created | ||
| // without creating a VMM to run them, and then started later. | ||
| VmmState::Creating | VmmState::Starting => Output::Starting, | ||
| VmmState::Running => Output::Running, | ||
| VmmState::Stopping => Output::Stopping, | ||
| // `SagaUnwound` should map to `Stopped` so that an `instance_view` | ||
| // API call that produces an instance with an unwound VMM will appear to | ||
| // be `Stopped`. This is because instances with unwound VMMs can | ||
| // be started by a subsequent instance-start saga, just like | ||
| // instances whose internal state actually is `Stopped`. | ||
| VmmState::Stopped | VmmState::SagaUnwound => Output::Stopped, | ||
| VmmState::Rebooting => Output::Rebooting, | ||
| VmmState::Migrating => Output::Migrating, | ||
| VmmState::Failed(_) => Output::Failed, | ||
| VmmState::Destroyed => Output::Destroyed, | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This conversion already existed but was defined on the nexus_db_model type.
| #[derive(Copy, Clone, Debug, Default)] | ||
| pub struct Migrations<'state> { | ||
| pub migration_in: Option<&'state sled_agent::MigrationRuntimeState>, | ||
| pub migration_out: Option<&'state sled_agent::MigrationRuntimeState>, | ||
| } | ||
|
|
||
| impl Migrations<'_> { | ||
| pub fn empty() -> Self { | ||
| Self { migration_in: None, migration_out: None } | ||
| } | ||
| } |
There was a problem hiding this comment.
This used to be defined in sled-agent-types but wasn't actually an API type at all, it just lived in the same module as the API types. It lives here now.
this is slightly sad because i prefered the `NoSuchVmm` naming, but would like the `IntoStaticStr` to remain `no_such_instance` so that a subsequent change to make the reason types in `instance_watcher`'s metrics use the new `VmmFailureReason` type, and it already uses this string, and i don't wanna change that...
Okay, so this is a large and unpleasant change, and I apologize from the bottom of my heart to those of you who are being forced to review it. Nonetheless, I do feel that this is justified and perhaps the best of the many bad solutions to the problem. And, I think that the actual change is not ultimately that complex, despite the number of files which had to be touched.
In summary, this is an attempt to solve the same problem as #10285: when a VMM is moved to the
Failedstate, it sure would be nice if we could write down why that happened somewhere where it is possible to see that information ever again. Unfortunately, that change ran afoul of the present cyclic dependency between the Nexus internal and sled-agent APIs: we cannot easily add additional data to theSledVmmStatetype insled_agent_types, because it is part of both the server-side versioned sled-agent API and the client-side versioned Nexus API. My previous attempt to resolve this stickiness, #10303, was reasonably deemed somewhat too sketchy, and rather than moving forwards with that, I think we are better off waiting for client-side versioning to be a thing which is possible.Therefore, this change punts on including more specific failure details for failed VMMs reported by sled-agent. Instead, when the sled-agent reports that a VMM has failed, we just synthesize a
VmmFailureReason::FromSledAgent. When the failure is detected by Nexus without the sled-agent reporting the failure, we record one of several other failure reasons. This is implemented by adding a new layer of type indirection between the sled-agent API types and the DB model types: anexus_types::instance::SledVmmState. This way, when Nexus would like to internally record a failed VMM state, it can provide a failure reason, and at the boundary, when a failed state is received from a sled-agent, theFrom<sled_agent_types::instance::VmmInstanceState> for nexus_types::instance::SledVmmStatecan just fill in the failure reason.The other thing that this indirection gives us, which is either good or bad (?), is that we can make the
nexus_types::instance::VmmStateenum have aVmmFailureReasonin itsFailedstate, so the shape of the domain type enforces that there is a failure reason if the VMM is failed, and there is not a failure reason otherwise. Since these are separate columns in the database, the DB model type can represent a bunch of invalid states, which I didn't love. Unfortunately, having an additional domain type forVmmStateis part of why this change is so massive, as it requires an additional layer of type conversions and such, and results in touching way more files. I could be convinced this is not the right thing, but I still kind of loosely think it's probably worth it.Most of this change is just updating everything to use the new types.
Closes #10285