
mvp vm attestation #1091

Merged
jordanhendricks merged 38 commits into master from jhendricks/rfd-605 on Apr 6, 2026
Conversation

@jordanhendricks (Contributor) commented Mar 27, 2026

closes #1067

TODO:

Testing notes

Happy path

I tested using an image @flihp built. (I've put the image for posterity in the lab at /staff/jordan/propolis-1091).

I made the instance with that image and a RO boot disk. For example:

#!/usr/bin/env bash

set -euo pipefail

export OXIDE_HOST="https://recovery.sys.dublin.eng.oxide.computer/"
export PROFILE="dublin"

# instance name, referenced as $NAME below (placeholder; pick any name)
NAME="attest-demo"

# image UUID
IMAGE="3fb74969-8981-437f-b4b5-2dc82d3ca191"

cat <<EOF > "./request.json"
{
  "name": "$NAME",
  "description": "demo instance",
  "hostname": "$NAME",
  "memory": 2147483648,
  "ncpus": 2,
  "start": false,
  "boot_disk": {
    "type": "create",
    "description": "a disk to run instance with name $NAME",
    "disk_backend": {
      "type": "distributed",
      "disk_source": {
        "type": "image",
        "image_id": "$IMAGE",
        "read_only": true
      }
    },
    "name": "disk-for-$NAME",
    "size": 2147483648
  }
}
EOF

$HOME/src/oxide.rs/target/debug/oxide instance create \
    --profile "$PROFILE" \
    --project "jordan" \
    --json-body "./request.json"

Inside the VM, edit vm-instance-cfg.json to contain the instance UUID and the sha256sum of the image (for the demo image, that's 06ff4481c775ffce878e722927985bffaa1fc7de5d3c2e231bea2adecd22615f).

Then in the VM, for a racklette (username root, password password), run:

$ appraiser --root-cert platform-id-staging.pem --reference-measurements sp-v1.0.64-1.0.64_corim.cbor --reference-measurements rot-1-v1.0.38-1.0.38_corim.cbor --vm-instance-cfg vm-instance-cfg.json vm-instance-rot vsock 605 -vvv

If everything works, you should see:

[2026-04-06T21:04:08Z INFO  appraiser] metadata from Oxide VM Instance RoT appraised
[2026-04-06T21:04:08Z INFO  appraiser] appraised attestation from VmInstanceRot over vsock

No boot disk

Steps to test: create an instance, stop it (or don't auto-start it), then remove the disk's designation as the boot disk. Send a challenge from inside the guest.

Result: attestation server used just the instance UUID for qualifying data

21:24:25.538Z INFO propolis-server (vm_state_driver): vm conf is ready = VmInstanceConf { uuid: 1f1ec2e3-c5cf-4eaf-8a19-aa25ec1f6895, boot_digest: None }
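The fallback above can be sketched as follows. `VmInstanceConf` mirrors the fields in the log line, but the `qualifying_data` helper, the simplified field types, and the byte layout are invented for illustration; this is not the propolis implementation.

```rust
// Sketch (not the propolis code) of the fallback described above: when
// there is no read-only boot disk, the qualifying data reduces to the
// instance UUID alone.
#[derive(Debug)]
struct VmInstanceConf {
    uuid: [u8; 16],               // stand-in for uuid::Uuid
    boot_digest: Option<[u8; 32]>, // sha256 of the boot disk, if any
}

// Hypothetical helper: serialize whatever identity data is present.
fn qualifying_data(conf: &VmInstanceConf) -> Vec<u8> {
    let mut data = conf.uuid.to_vec();
    if let Some(digest) = &conf.boot_digest {
        data.extend_from_slice(digest);
    }
    data
}

fn main() {
    let conf = VmInstanceConf { uuid: [0u8; 16], boot_digest: None };
    // With boot_digest = None, only the 16 UUID bytes remain.
    assert_eq!(qualifying_data(&conf).len(), 16);
}
```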

Cargo.toml Outdated
# Attestation
#dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] }
dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] }
vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false }
Member:

most of the Cargo.lock weirdness comes from dice-verifier -> sled-agent-client -> omicron-common (some previous rev), and that's where the later API dependency stuff we saw in Omicron comes up when building the tuf. sled-agent-client re-exports items out of propolis-client, which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's Weird.

i'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! we're just using one small part of the API for the RoT calls. but sled-agent and propolis are (i think?) updated in the same deployment unit so the cyclic dependency is fine.

@jordanhendricks jordanhendricks marked this pull request as ready for review April 2, 2026 00:08
@jordanhendricks (Contributor Author):

I want to add some comments in the attestation module but from a code-structure perspective @iximeow and I are happy with this. Ready for review!

@jordanhendricks jordanhendricks requested a review from hawkw April 2, 2026 00:41
@jordanhendricks jordanhendricks self-assigned this Apr 2, 2026
@hawkw (Member) left a comment:

Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?

Member:

not super important but this string could be better probably

Contributor Author:

done in 014950e

Some(backend.clone_volume())
} else {
// Disk must be read-only to be used for attestation.
slog::info!(self.log, "boot disk is not read-only");
Member:

maybe this should explicitly state that this means it will not be attested?

Contributor Author:

took a crack at this in 014950e

Comment on lines +42 to +118
#[derive(Debug)]
enum AttestationInitState {
Preparing {
vm_conf_send: oneshot::Sender<VmInstanceConf>,
},
/// A transient state while we're getting the initializer ready, having
/// taken `Preparing` and its `vm_conf_send`, but before we've got a
/// `JoinHandle` to track as running.
Initializing,
Running {
init_task: JoinHandle<()>,
},
}

/// This struct manages providing the requisite data for a corresponding
/// `AttestationSock` to become fully functional.
pub struct AttestationSockInit {
log: slog::Logger,
vm_conf_send: oneshot::Sender<VmInstanceConf>,
uuid: uuid::Uuid,
volume_ref: Option<crucible::Volume>,
}

impl AttestationSockInit {
/// Do any remaining work of collecting VM RoT measurements in support
/// of this VM's attestation server.
pub async fn run(self) {
let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self;

let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None };

if let Some(volume) = volume_ref {
// TODO(jph): make propolis issue, link to #1078 and add a log line
// TODO: load-bearing sleep: we have a Crucible volume, but we can
// be here and chomping at the bit to get a digest calculation
// started well before the volume has been activated; in
// `propolis-server` we need to wait for at least a subsequent
// instance start. Similar to the scrub task for Crucible disks,
// delay some number of seconds in the hopes that activation is done
// promptly.
//
// This should be replaced by awaiting for some kind of actual
// "activated" signal.
tokio::time::sleep(std::time::Duration::from_secs(10)).await;

let boot_digest =
match crate::attestation::boot_digest::boot_disk_digest(
volume, &log,
)
.await
{
Ok(digest) => digest,
Err(e) => {
// a panic here is unfortunate, but helps us debug for
// now; if the digest calculation fails it may be some
// retryable issue that a guest OS would survive. but
// panicking here means we've stopped Propolis at the
// actual error, rather than noticing the
// `vm_conf_sender` having dropped elsewhere.
panic!("failed to compute boot disk digest: {e:?}");
}
};

vm_conf.boot_digest = Some(boot_digest);
} else {
slog::warn!(log, "not computing boot disk digest");
}

let send_res = vm_conf_send.send(vm_conf);
if let Err(_) = send_res {
slog::error!(
log,
"attestation server is not listening for its config?"
);
}
}
}
Member:

Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?

I'd be happy to throw together a patch that does that refactoring if it's too annoying.
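The suggested refactor might look roughly like the sketch below, shown with std::thread instead of tokio so it stands alone; the struct is a pared-down stand-in, not the propolis types.

```rust
use std::thread;

// Stand-in for the real VmInstanceConf; field types simplified.
#[derive(Debug, PartialEq)]
struct VmInstanceConf {
    uuid: u128,
    boot_digest: Option<[u8; 32]>,
}

// Instead of spawning a task that sends its result over a oneshot
// channel and yields a JoinHandle<()>, spawn one whose JoinHandle
// yields the config directly.
fn spawn_init(uuid: u128) -> thread::JoinHandle<VmInstanceConf> {
    thread::spawn(move || {
        // ...compute the boot disk digest here...
        VmInstanceConf { uuid, boot_digest: None }
    })
}

fn main() {
    let conf = spawn_init(42).join().expect("init task panicked");
    assert_eq!(conf.uuid, 42);
}
```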

Contributor Author:

That's fair. The JoinHandle was from a previous iteration of how we would structure things that looked more like the way we presently handle the VNC server. I'll take a look at how hard this is to remove.

Member:

Since this and also the change in this module that I suggested in #1091 (comment) are kinda just refactoring/tidying things up, I would be fine with leaving a lot of this as-is and then merge some refactoring later --- I'd be happy to open a follow-up PR after this has merged, if that makes life easier for you?

let mut buffer =
Buffer::new(this_block_count as usize, block_size as usize);

// TODO(jph): We don't want to panic in the case of a failed read. How
Contributor Author:

I still need to do this and test on dublin.

@hawkw (Member) left a comment:

The Crucible retry stuff seems pretty much correct, I commented on some minor nitpicks. I think it's fine to defer some of the async refactoring to a subsequent PR, as there isn't anything wrong there, I just think we could maybe make the code a bit simpler. Beyond that, I think that pending whatever testing you need to do, I have no major concerns.
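For context, a retry loop with exponential backoff of the kind under review might look like this; the function name, error type, and delays are invented for illustration and are not the Crucible or propolis code.

```rust
use std::thread;
use std::time::Duration;

// Illustrative retry-with-backoff around a fallible read; names invented.
fn read_with_retries<F>(mut read: F, n_retries: u32) -> Result<Vec<u8>, String>
where
    F: FnMut() -> Result<Vec<u8>, String>,
{
    let mut delay = Duration::from_millis(10);
    for attempt in 0..n_retries {
        match read() {
            Ok(buf) => return Ok(buf),
            Err(e) => {
                eprintln!("read failed (attempt {attempt}): {e}");
                thread::sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }
    Err(format!("failed to read boot disk in {n_retries} tries"))
}

fn main() {
    // A read that succeeds on the third attempt.
    let mut calls = 0;
    let res = read_with_retries(
        || {
            calls += 1;
            if calls < 3 { Err("transient".into()) } else { Ok(vec![0u8; 4]) }
        },
        5,
    );
    assert_eq!(res.unwrap().len(), 4);
}
```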

Comment on lines +89 to +95
slog::error!(
log,
"read failed: {e:?}.
offset={offset},
this_block_cout={this_block_count},
block_size={block_size},
end_block={end_block}"
Member:

super weird formatting here, can we do something about that? also perhaps these ought to be structured fields...

Member:

and also, perhaps this ought to include the retry count?

Contributor Author:

both done in c096720

jordanhendricks and others added 2 commits April 5, 2026 15:06
Co-authored-by: Eliza Weisman <eliza@elizas.website>
@hawkw (Member) left a comment:

New comments and backoff look good to me, I commented on a couple very tiny nits

Comment on lines +80 to +82
"failed to read boot disk in {n_retries} tries \
aborting hash of boot digest"
);
Member:

more weird and unpleasant line wrapping (sorry):

Suggested change
"failed to read boot disk in {n_retries} tries \
aborting hash of boot digest"
);
"failed to read boot disk in {n_retries} tries \
aborting hash of boot digest"
);

Comment on lines +115 to +116
"hash of volume {:?} took {:?} ms",
vol_uuid,
Member:

turbo nit, sorry: mayhaps vol_uuid ought to be a structured field here as well? maybe we could build a child log context that includes it so that it's also added to the errors we log above?


if let Err(e) = res {
error!(log,
"read failed: {e:?}";
Member:

i'm not totally sure what the type of e is here, but are we sure that fmt::Debug is the best way to log it? not a huge deal...

//! If there is no boot disk, or the boot disk is not read-only, only the
//! instance ID is used as identifying data.
//!
//! If there is a read-only disk, the attestation server will fail challenge
Member:

nit: maybe this should say

Suggested change
//! If there is a read-only disk, the attestation server will fail challenge
//! If there is a read-only boot disk, the attestation server will fail challenge

since currently, this sentence in isolation suggests we would hash any read-only disk...but, this is not a huge deal as i would kind of expect the reader could infer the correct thing from surrounding context...

//! there.)
//!
//! * Guest software submits a 32-byte nonce to a known attestation port.
//! * This port is backed by a vsock device in propolis.
Member:

turbo nit: maybe:

Suggested change
//! * This port is backed by a vsock device in propolis.
//! - This port is backed by a vsock device in propolis.

//! sent to the attestation server once all of the VM identity data is done
//! (so, in practice, when the boot disk is hashed).
//! * Until the VM conf is ready, the attestation server fails challenges.
//! Once the VM conf is ready, these challenges are passed through to the
Member:

nit: this feels like maybe it should be its own bullet point?

Suggested change
//! Once the VM conf is ready, these challenges are passed through to the
//! * Once the VM conf is ready, these challenges are passed through to the

@jordanhendricks jordanhendricks merged commit fe47987 into master Apr 6, 2026
12 checks passed
@jordanhendricks jordanhendricks deleted the jhendricks/rfd-605 branch April 6, 2026 21:26


Development

Successfully merging this pull request may close these issues.

mvp vm attestation support in propolis-server (rfd 605)

4 participants