Skip to content

storage: Checking and fixing system config for NBDE#17796

Merged
mvollmer merged 1 commit intocockpit-project:mainfrom
mvollmer:storage-nbde-help
Jan 13, 2023
Merged

storage: Checking and fixing system config for NBDE#17796
mvollmer merged 1 commit intocockpit-project:mainfrom
mvollmer:storage-nbde-help

Conversation

@mvollmer
Copy link
Copy Markdown
Member

@mvollmer mvollmer commented Oct 6, 2022

Outdated demo: https://www.youtube.com/watch?v=mbv-syeLHsc


Release note:

Storage: Set up a system to use NBDE

It is possible (and unfortunately likely) that a system is misconfigured and can not mount filesystems during boot that rely on Clevis to unlock them. Cockpit can now fix this when adding a keyserver to a encrypted filesystem.

image

@mvollmer mvollmer temporarily deployed to cockpit-dist October 6, 2022 12:20 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 7, 2022 11:37 Inactive
@mvollmer mvollmer added the no-test For doc/workflow changes, or experiments which don't need a full CI run, label Oct 11, 2022
@mvollmer mvollmer temporarily deployed to cockpit-dist October 11, 2022 10:09 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 11, 2022 12:24 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 12, 2022 14:25 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 13, 2022 15:38 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 14, 2022 09:47 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 17, 2022 07:57 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 17, 2022 08:20 Inactive
@mvollmer mvollmer force-pushed the storage-nbde-help branch 2 times, most recently from ee93d26 to 038db1b Compare October 17, 2022 10:07
@mvollmer mvollmer temporarily deployed to cockpit-dist October 17, 2022 10:12 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 17, 2022 10:22 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist October 17, 2022 12:01 Inactive
@garrett
Copy link
Copy Markdown
Member

garrett commented Oct 17, 2022

Awesome work!

I have a bunch of design-related feedback:

  1. The location is not correct. (You know this, and you said it.)

    • This is a system-level option, right? Yet you have it under a partition on the encryption tab. We need to promote it higher up in the hierarchy. This means it should primarily be on the top-level storage page somewhere, somehow.
    • I'm assuming that you normally would only run the check (and fix, if needed) once per system. Is there a way to save (on a system level, not in-browser) the check to know if it's been done yet or not? We could, for example, also show a message on an unchecked system but not on a passing system. (Near where it currently is, as NBDE needs support for it.)
    • Perhaps we should make something like a two-step dialog, similar to a wizard, where the first page is checking for NBDE and fixing it if we aren't sure the system supports it. The second "page" of the modal would then be using NBDE within Cockpit. If we know the system supports NBDE correctly, then we'd skip directly to the second page of the modal and only show that. (It wouldn't look like a wizard.) We do this elsewhere in Cockpit (but I don't remember where) — or at least I know I've made mockups for this concept before (perhaps this isn't implemented yet?). In this way, we wouldn't need to show any messaging about NBDE unless the user is actively going to use it.
  2. Actions on modals should always be a button styled button (primary, secondary, warn, danger, etc.) and never should be a link styled button, except for "Cancel" and the cancel action must always be paired with an action button. "Close" is not "Cancel". Close always needs to be a secondary button style.

  3. We probably don't need to show a passing checklist. We could show the current step and problems. And we would show a success if everything passes. We don't need to show details of things that are OK. (If we're only showing the errors and them being fixed, then it's OK to show them in an OK state after fixing them, I guess.)

  4. "Fix" should probably elaborate what it is fixing, I think?

  5. If it's running a medium-length process that depends on the modal being there, then you shouldn't be able to close. And you probably shouldn't be able to cancel, unless there will be no ill effects. A longer process must always allow for the modal to be closed an have status information in the page, however. (Unless everything else on the page depends on the process, such as the update process on the Software Updates page, as an example.)

@mvollmer
Copy link
Copy Markdown
Member Author

mvollmer commented Oct 18, 2022

I have a bunch of design-related feedback:

Thanks! This is really helpful.

This is a system-level option, right?

Mostly, yes, and we should move it up in the hierarchy. However, there are two sets of checks-and-fixes: One for the root filesystem, and another one for everything else. Currently, the buttons in the filesystem tab start one of these two flows, depending on whether the filesystem in question is the root filesystem or not.

However, the checks and fixes for a non-root filesystem are much lighter than for the root filesystem. They might not justify the full blown dialog that is implemented here. In any case, NBDE for the root filesystem is the main thing we are addressing here, so let's ignore non-root filesystems for now.

Thus, yes, this is a system-level option and should be somewhere higher up.

I'm assuming that you normally would only run the check (and fix, if needed) once per system.

Yes. However, it is always possible that the system gets broken later on. Hmm. The best would be to find a very fast reliable way to figure out if anything needs to be fixed that can be done every time the Storage page loads. I will have a try at this. (We would need to check whether the root filesystem uses NBDE, and whether the initrd includes support for this.)

However, this is starting to move us outside of the scope of Cockpit, isn't it? Isn't this something that Insights should be doing? (Checking your system in the background for things that need fixing, alerting you about them, and offering ways to actually fix them.)

Cockpit could certainly show the alerts and let people carry out the fixes, but I'd say we shouldn't go and run the actual checks in the background.

So, what if, as a first step, we only do the check-and-fix flow when adding NBDE to the root filesystem, and forget about a global button? (If I find a quick way to check whether to show this button, we can reconsider this.)

Perhaps we should make something like a two-step dialog, similar to a wizard, where the first page is checking for NBDE and fixing it if we aren't sure the system supports it.

Adding a NBDE key is already a wizard: First you enter the contact information for the server, then you need to compare fingerprints and give the final OK. In my current implementation, the check-and-fix then runs as a third step. I agree that it is much better to run it first: If the fixing fails, you should probably not add the NBDE key.

But the first step in the existing wizard includes selecting whether to add a regular passphrase, or a NBDE key. So we would have step 1: Select NBDE, enter server address; step 2: check system for NBDE support; step 3: compare and confirm fingerprints. Is that okay?

As you propose, I think we can do the check when clicking "Apply" in step 1, and immediately skip to step 3 when nothing needs fixing. Step 2 would only happen when there is something to fix.

@garrett
Copy link
Copy Markdown
Member

garrett commented Oct 18, 2022

So, what if, as a first step, we only do the check-and-fix flow when adding NBDE to the root filesystem, and forget about a global button? (If I find a quick way to check whether to show this button, we can reconsider this.)

Sure.

But the first step in the existing wizard includes selecting whether to add a regular passphrase, or a NBDE key. So we would have step 1: Select NBDE, enter server address; step 2: check system for NBDE support; step 3: compare and confirm fingerprints. Is that okay?

Wizards are for interactive input. Checking the system is not interactive; it's something that does not require input from the user. Therefore, it should not be a step in a wizard.

And this doesn't sound like it should be a wizard. A wizard is a particular UI pattern. You're talking about a process.

But as the verification step does require user input (accepting the key or not), it could be a wizard, I suppose. That's better than two sequential dialogs. But it's not as good as a dialog that changes based on context. It could be implemented as a wizard, but we don't need (nor want) a sidebar, a big header, or next/back buttons that PF wizards provide.

Here's a mockup of how it could work. It definitely needs updates and iteration.

nbde-add excalidraw

(I made this in ExcaliDraw, so you could just open this PNG there if you want to edit it and it would show you the vector form, as it's embedded.)

@garrett
Copy link
Copy Markdown
Member

garrett commented Oct 18, 2022

Also: Since we'll check for a number of things for NBDE support and it takes a little while, the checking step could have a progress bar, as we can just divide it by # of steps and when each completes, advance the bar.

@garrett
Copy link
Copy Markdown
Member

garrett commented Oct 18, 2022

Here's a revised flow based on IRC feedback:

nbde-add-flow excalidraw

@mvollmer mvollmer removed blocked Don't land until something else happens first (see task list) no-test For doc/workflow changes, or experiments which don't need a full CI run, labels Nov 8, 2022
@mvollmer mvollmer temporarily deployed to cockpit-dist November 8, 2022 08:05 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist November 8, 2022 10:39 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist November 9, 2022 09:29 Inactive
@mvollmer mvollmer temporarily deployed to cockpit-dist November 9, 2022 13:59 Inactive
@mvollmer
Copy link
Copy Markdown
Member Author

The final reboot sometimes (always?) times out on rhel-8... it completes in 20 seconds here locally. Let's debug this some.

@mvollmer mvollmer temporarily deployed to cockpit-dist November 14, 2022 13:44 Inactive
@mvollmer
Copy link
Copy Markdown
Member Author

The final reboot sometimes (always?) times out on rhel-8... it completes in 20 seconds here locally. Let's debug this some.

Now it succeeded on non-rhos. The console screenshots have been taken and can be found in the results directory. For example, https://cockpit-logs.us-east-1.linodeobjects.com/pull-17796-20221114-133844-c33e9cff-rhel-8-8/failed-reboot.ppm

Comment thread test/verify/check-storage-luks Outdated

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable

Local variable 'root_row' may be used before it is initialized.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no "/" this will blow up, I guess that's intendend?

@mvollmer mvollmer temporarily deployed to cockpit-dist November 24, 2022 08:17 Inactive
@mvollmer mvollmer marked this pull request as ready for review November 24, 2022 10:31
@mvollmer mvollmer temporarily deployed to cockpit-dist November 24, 2022 10:36 Inactive
@mvollmer
Copy link
Copy Markdown
Member Author

mvollmer commented Nov 25, 2022

@garrett, what do you think about these open issues?

  • Should we ask the user to verify the key fingerprint before starting the fixing?
  • Should we offer a way to skip the fixing and still add the key?

@jelly
Copy link
Copy Markdown
Member

jelly commented Nov 25, 2022

image

Minor nitpick, maybe we should have some more spacing after the SHA1 hash.

jelly
jelly previously approved these changes Nov 25, 2022
Copy link
Copy Markdown
Member

@jelly jelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some minor nitpicks/questions.

Comment thread test/verify/check-storage-luks Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool! Would be nice if this was more generic and maybe by default in wait_reboot. But then we'd have to add a wrapper in testlib.py so self.wait_reboot as that can call attach.

Comment thread test/verify/check-storage-luks Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arch is waiting on latchset/clevis#374 I'll try to poke the PR

Comment thread pkg/storaged/crypto-keyslots.jsx Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little brittle, as also lsinitrd -m also prints early CPIO image and stuff. But there doesn't seem to be a better way.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I understand that. But it also prints other stuff. It was more a wish that more tools would have parseable output with for example --json.

Comment thread pkg/storaged/crypto-keyslots.jsx Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this might be a silly question, but doesn't grubby also re-generate initramfs. So in theory we could first install clevis-dracut. But that's a real nitpick I feel :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grubby does not regenerate the initrd. At least it never did for me.

Comment thread test/verify/check-storage-luks Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no "/" this will blow up, I guess that's intendend?

@mvollmer
Copy link
Copy Markdown
Member Author

If there is no "/" this will blow up, I guess that's intendend?

It's tolerated. :-) We could produce a better error message, but this really shouldn't happen at all.

Comment on lines +273 to +275
if (p.waiting) {
text = _("Waiting for other software management operations to finish");
} else if (p.package) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 added lines are not executed by any test. Details

} else if (p.package) {
let fmt;
if (p.info == PkEnum.INFO_DOWNLOADING)
fmt = _("Downloading $0");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added line is not executed by any test. Details

if (p.info == PkEnum.INFO_DOWNLOADING)
fmt = _("Downloading $0");
else if (p.info == PkEnum.INFO_REMOVING)
fmt = _("Removing $0");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added line is not executed by any test. Details

title: cockpit.format(_("The $0 package must be installed."), package_name),
func: progress => {
if (data.remove_names.length > 0)
return Promise.reject(cockpit.format(_("Installing $0 would remove $1."), name, data.remove_names[0]));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added line is not executed by any test. Details

if (data.remove_names.length > 0)
return Promise.reject(cockpit.format(_("Installing $0 would remove $1."), name, data.remove_names[0]));
else if (data.unavailable_names.length > 0)
return Promise.reject(cockpit.format(_("The $0 package is not available from any repository."), name));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added line is not executed by any test. Details


function ensure_initrd_clevis_support(steps, progress, package_name) {
const task = cockpit.spawn(["lsinitrd", "-m"], { superuser: true, err: "message" });
progress(_("Checking for NBDE support in the initrd"), () => task.close());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added line is not executed by any test. Details

const fsys_options = fsys_config && parse_options(decode_filename(fsys_config[1].opts.v));

if (!fsys_options || fsys_options.indexOf(option) >= 0)
return Promise.resolve();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added line is not executed by any test. Details

const crypto_config = array_find(block.Configuration, function (c) { return c[0] == "crypttab" });
const crypto_options = crypto_config && parse_options(decode_filename(crypto_config[1].options.v));
if (!crypto_options || crypto_options.indexOf(option) >= 0)
return Promise.resolve();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This added line is not executed by any test. Details

Comment on lines +409 to +411
} else
return Promise.resolve();
} else
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 added lines are not executed by any test. Details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants