Skip to content

nexus: add failure reasons to Failed VMMs using a VmmState domain type#10425

Open
hawkw wants to merge 13 commits into
mainfrom
eliza/failure-reason-2
Open

nexus: add failure reasons to Failed VMMs using a VmmState domain type#10425
hawkw wants to merge 13 commits into
mainfrom
eliza/failure-reason-2

Conversation

@hawkw
Copy link
Copy Markdown
Member

@hawkw hawkw commented May 8, 2026

Okay, so this is a large and unpleasant change, and I apologize from the bottom of my heart to those of you who are being forced to review it. Nonetheless, I do feel that this is justified and perhaps the best of the many bad solutions to the problem. And, I think that the actual change is not ultimately that complex, despite the number of files which had to be touched.

In summary, this is an attempt to solve the same problem as #10285: when a VMM is moved to the Failed state, it sure would be nice if we could write down why that happened somewhere where it is possible to see that information ever again. Unfortunately, that change ran afoul of the present cyclic dependency between the Nexus internal and sled-agent APIs: we cannot easily add additional data to the SledVmmState type in sled_agent_types, because it is part of both the server-side versioned sled-agent API and the client-side versioned Nexus API. My previous attempt to resolve this stickiness, #10303, was reasonably deemed somewhat too sketchy, and rather than moving forwards with that, I think we are better off waiting for client-side versioning to be a thing which is possible.

Therefore, this change punts on including more specific failure details for failed VMMs reported by sled-agent. Instead, when the sled-agent reports that a VMM has failed, we just synthesize a VmmFailureReason::FromSledAgent. When the failure is detected by Nexus without the sled-agent reporting the failure, we record one of several other failure reasons. This is implemented by adding a new layer of type indirection between the sled-agent API types and the DB model types: a nexus_types::instance::SledVmmState. This way, when Nexus would like to internally record a failed VMM state, it can provide a failure reason, and at the boundary, when a failed state is received from a sled-agent, the From<sled_agent_types::instance::VmmInstanceState> for nexus_types::instance::SledVmmState can just fill in the failure reason.

The other thing that this indirection gives us, which is either good or bad (?), is that we can make the nexus_types::instance::VmmState enum have a VmmFailureReason in its Failed state, so the shape of the domain type enforces that there is a failure reason if the VMM is failed, and there is not a failure reason otherwise. Since these are separate columns in the database, the DB model type can represent a bunch of invalid states, which I didn't love. Unfortunately, having an additional domain type for VmmState is part of why this change is so massive, as it requires an additional layer of type conversions and such, and results in touching way more files. I could be convinced this is not the right thing, but I still kind of loosely think it's probably worth it.

Most of this change is just updating everything to use the new types.

Closes #10285

@hawkw hawkw changed the title [WIP] GIANT HORRIBLE VMM STATE REFACTOR nexus: add failure reasons to Failed VMMs using a VmmState domain type May 11, 2026
@hawkw hawkw marked this pull request as ready for review May 11, 2026 22:12
Copy link
Copy Markdown
Member Author

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes to hopefully make this loud and awkward PR a bit more comfortable to review. Also, I'm tagging in @sunshowers in the hopes that they will help to defend my decision to have three versions of these types in case anyone else gets mad at me for that :)

Comment on lines 227 to +228
self.outcome = CheckOutcome::Failure(Failure::SledExpunged);
return mk_failed();
return mk_failed(VmmFailureReason::SledExpunged);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future, I would like to replace the Failure type which is used internally to the instance_watcher task with just using VmmFailureReason --- I didn't do that in this PR because the diff has gotten big enough already, but will do that in a subsequent change if y'all decide you don't totally hate what I've proposed thus far.

Comment on lines +11 to +19
#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
pub struct SledVmmState {
pub vmm_state: VmmRuntimeState,
/// The current state of any inbound migration to this VMM.
pub migration_in: Option<sled_agent::MigrationRuntimeState>,

/// The state of any outbound migration from this VMM.
pub migration_out: Option<sled_agent::MigrationRuntimeState>,
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the "domain type" version of SledVmmState

Comment on lines +29 to +36
impl SledVmmState {
pub fn migrations(&self) -> Migrations<'_> {
Migrations {
migration_in: self.migration_in.as_ref(),
migration_out: self.migration_out.as_ref(),
}
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just moved from its previous home in sled_agent_types::instance and has not otherwise changed.

Comment on lines +38 to +48
/// The dynamic runtime properties of an individual VMM process.
#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
pub struct VmmRuntimeState {
/// The last state reported by this VMM.
pub state: VmmState,
/// The generation number for this VMM's state.
#[serde(rename = "gen")]
pub generation: Generation,
/// Timestamp for the VMM's state.
pub time_updated: DateTime<Utc>,
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as the VmmRuntimeState in sled_agent_types but using the domain type version of VmmState, which has a failure reason...

Comment on lines +51 to +66
/// Transitions the VMM state from the current state to `new_state`,
/// returning a `VmmRuntimeState` representing the VMM after transitioning
/// to the new state.
///
/// If `new_state` is the same as the current state, returns a copy of
/// `self` without any changes.
pub fn transition(&self, new_state: VmmState) -> Self {
if new_state == self.state {
return self.clone();
}
Self {
state: new_state,
generation: self.generation.next(),
time_updated: Utc::now(),
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a helper for reducing the boilerplate of doing these state transitions, which I thought was nice --- especially because it enforces the property that you bump the generation when doing a state transition, which you previously had to remember to do.

Comment on lines +128 to +158
/// Returns `true` if this is a terminal VMM state.
///
/// A VMM in a terminal state will never transition to another state.
pub fn is_terminal(&self) -> bool {
matches!(self, VmmState::Destroyed | VmmState::Failed(_))
}

/// Returns `true` if the VMM is in a `Failed` state.
pub fn is_failed(&self) -> bool {
matches!(self, VmmState::Failed(_))
}

/// Returns `true` if the VMM is in a state where it is safe to
/// deallocate its sled resources and mark it as deleted.
pub fn is_destroyable(&self) -> bool {
matches!(
self,
VmmState::Destroyed | VmmState::Failed(_) | VmmState::SagaUnwound
)
}

/// Returns `true` if the VMM exists on a sled.
///
/// This is `false` for VMMs that have not yet been created, have been
/// destroyed, or were produced by an unwound saga and will never exist.
pub fn exists_on_sled(&self) -> bool {
!matches!(
self,
VmmState::Creating | VmmState::SagaUnwound | VmmState::Destroyed
)
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These predicates previously existed in the nexus_db_model version of the type; I've moved them here.

Comment on lines +167 to +185
impl From<sled_agent::VmmState> for VmmState {
fn from(state: sled_agent::VmmState) -> Self {
match state {
sled_agent::VmmState::Starting => VmmState::Starting,
sled_agent::VmmState::Running => VmmState::Running,
sled_agent::VmmState::Stopping => VmmState::Stopping,
sled_agent::VmmState::Stopped => VmmState::Stopped,
sled_agent::VmmState::Rebooting => VmmState::Rebooting,
sled_agent::VmmState::Migrating => VmmState::Migrating,
sled_agent::VmmState::Failed => {
// If we are converting a state received from a sled-agent that
// indicates that the VMM is failed, the failure reason is
// implicitly "from_sled_agent" for now.
VmmState::Failed(VmmFailureReason::FromSledAgent)
}
sled_agent::VmmState::Destroyed => VmmState::Destroyed,
}
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we now synthesize the FromSledAgent failure reason when receiv9ing things from the sled-agent.

Comment on lines +187 to +209
impl From<VmmState> for sled_agent::VmmState {
fn from(state: VmmState) -> Self {
match state {
// The `Creating` state is internal to Nexus; the outside world
// should treat it as equivalent to `Starting`.
VmmState::Starting | VmmState::Creating => {
sled_agent::VmmState::Starting
}
VmmState::Running => sled_agent::VmmState::Running,
VmmState::Stopping => sled_agent::VmmState::Stopping,
VmmState::Stopped => sled_agent::VmmState::Stopped,
VmmState::Rebooting => sled_agent::VmmState::Rebooting,
VmmState::Migrating => sled_agent::VmmState::Migrating,
VmmState::Failed(_) => sled_agent::VmmState::Failed,

// The `SagaUnwound` state is internal to Nexus; the outside world
// should treat it as equivalent to `Destroyed`.
VmmState::Destroyed | VmmState::SagaUnwound => {
sled_agent::VmmState::Destroyed
}
}
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion already existed but was defined on the nexus_db_model type rather than here.

Comment on lines +211 to +236
impl From<VmmState> for omicron_common::api::external::InstanceState {
fn from(value: VmmState) -> Self {
use omicron_common::api::external::InstanceState as Output;

match value {
// An instance with a VMM which is in the `Creating` state maps to
// `InstanceState::Starting`, rather than `InstanceState::Creating`.
// If we are still creating the VMM, this is because we are
// attempting to *start* the instance; instances may be created
// without creating a VMM to run them, and then started later.
VmmState::Creating | VmmState::Starting => Output::Starting,
VmmState::Running => Output::Running,
VmmState::Stopping => Output::Stopping,
// `SagaUnwound` should map to `Stopped` so that an `instance_view`
// API call that produces an instance with an unwound VMM will appear to
// be `Stopped`. This is because instances with unwound VMMs can
// be started by a subsequent instance-start saga, just like
// instances whose internal state actually is `Stopped`.
VmmState::Stopped | VmmState::SagaUnwound => Output::Stopped,
VmmState::Rebooting => Output::Rebooting,
VmmState::Migrating => Output::Migrating,
VmmState::Failed(_) => Output::Failed,
VmmState::Destroyed => Output::Destroyed,
}
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion already existed but was defined on the nexus_db_model type.

Comment on lines +280 to +290
#[derive(Copy, Clone, Debug, Default)]
pub struct Migrations<'state> {
pub migration_in: Option<&'state sled_agent::MigrationRuntimeState>,
pub migration_out: Option<&'state sled_agent::MigrationRuntimeState>,
}

impl Migrations<'_> {
pub fn empty() -> Self {
Self { migration_in: None, migration_out: None }
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be defined in sled-agent-types but wasn't actually an API type at all, it just lived in the same module as the API types. It lives here now.

@hawkw hawkw requested review from iximeow and smklein May 11, 2026 22:26
@hawkw hawkw self-assigned this May 11, 2026
hawkw added 4 commits May 11, 2026 15:28
this is slightly sad because i prefered the `NoSuchVmm` naming, but
would like the `IntoStaticStr` to remain `no_such_instance` so that a
subsequent change to make the reason types in `instance_watcher`'s
metrics use the new `VmmFailureReason` type, and it already uses this
string, and i don't wanna change that...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant