storage: Checking and fixing system config for NBDE by mvollmer · Pull Request #17796 · cockpit-project/cockpit

mvollmer · 2022-10-06T12:14:22Z

Outdated demo: https://www.youtube.com/watch?v=mbv-syeLHsc

Release note:

Storage: Set up a system to use NBDE

It is possible (and unfortunately likely) that a system is misconfigured and cannot mount filesystems during boot that rely on Clevis to unlock them. Cockpit can now fix this when adding a keyserver to an encrypted filesystem.

garrett · 2022-10-17T14:47:56Z

Awesome work!

I have a bunch of design-related feedback:

The location is not correct. (You know this, and you said it.)
- This is a system-level option, right? Yet you have it under a partition on the encryption tab. We need to promote it higher up in the hierarchy. This means it should primarily be on the top-level storage page somewhere, somehow.
- I'm assuming that you normally would only run the check (and fix, if needed) once per system. Is there a way to save (on a system level, not in-browser) the check to know if it's been done yet or not? We could, for example, also show a message on an unchecked system but not on a passing system. (Near where it currently is, as NBDE needs support for it.)
- Perhaps we should make something like a two-step dialog, similar to a wizard, where the first page is checking for NBDE and fixing it if we aren't sure the system supports it. The second "page" of the modal would then be using NBDE within Cockpit. If we know the system supports NBDE correctly, then we'd skip directly to the second page of the modal and only show that. (It wouldn't look like a wizard.) We do this elsewhere in Cockpit (but I don't remember where) — or at least I know I've made mockups for this concept before (perhaps this isn't implemented yet?). In this way, we wouldn't need to show any messaging about NBDE unless the user is actively going to use it.
Actions on modals should always be a button-styled button (primary, secondary, warn, danger, etc.) and should never be a link-styled button, except for "Cancel", and the cancel action must always be paired with an action button. "Close" is not "Cancel". Close always needs to be a secondary button style.
We probably don't need to show a passing checklist. We could show the current step and problems. And we would show a success if everything passes. We don't need to show details of things that are OK. (If we're only showing the errors and them being fixed, then it's OK to show them in an OK state after fixing them, I guess.)
"Fix" should probably elaborate what it is fixing, I think?
If it's running a medium-length process that depends on the modal being there, then you shouldn't be able to close. And you probably shouldn't be able to cancel, unless there will be no ill effects. A longer process must always allow for the modal to be closed and have status information in the page, however. (Unless everything else on the page depends on the process, such as the update process on the Software Updates page, as an example.)

mvollmer · 2022-10-18T08:05:52Z

> I have a bunch of design-related feedback:

Thanks! This is really helpful.

> This is a system-level option, right?

Mostly, yes, and we should move it up in the hierarchy. However, there are two sets of checks-and-fixes: One for the root filesystem, and another one for everything else. Currently, the buttons in the filesystem tab start one of these two flows, depending on whether the filesystem in question is the root filesystem or not.

However, the checks and fixes for a non-root filesystem are much lighter than for the root filesystem. They might not justify the full-blown dialog that is implemented here. In any case, NBDE for the root filesystem is the main thing we are addressing here, so let's ignore non-root filesystems for now.

Thus, yes, this is a system-level option and should be somewhere higher up.

> I'm assuming that you normally would only run the check (and fix, if needed) once per system.

Yes. However, it is always possible that the system gets broken later on. Hmm. The best option would be a fast, reliable check for whether anything needs fixing, one that can run every time the Storage page loads. I will have a try at this. (We would need to check whether the root filesystem uses NBDE, and whether the initrd includes support for it.)

However, this is starting to move us outside of the scope of Cockpit, isn't it? Isn't this something that Insights should be doing? (Checking your system in the background for things that need fixing, alerting you about them, and offering ways to actually fix them.)

Cockpit could certainly show the alerts and let people carry out the fixes, but I'd say we shouldn't go and run the actual checks in the background.

So, what if, as a first step, we only do the check-and-fix flow when adding NBDE to the root filesystem, and forget about a global button? (If I find a quick way to check whether to show this button, we can reconsider this.)

> Perhaps we should make something like a two-step dialog, similar to a wizard, where the first page is checking for NBDE and fixing it if we aren't sure the system supports it.

Adding an NBDE key is already a wizard: First you enter the contact information for the server, then you need to compare fingerprints and give the final OK. In my current implementation, the check-and-fix then runs as a third step. I agree that it is much better to run it first: If the fixing fails, you should probably not add the NBDE key.

But the first step in the existing wizard includes selecting whether to add a regular passphrase or an NBDE key. So we would have step 1: Select NBDE, enter server address; step 2: check system for NBDE support; step 3: compare and confirm fingerprints. Is that okay?

As you propose, I think we can do the check when clicking "Apply" in step 1, and immediately skip to step 3 when nothing needs fixing. Step 2 would only happen when there is something to fix.
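To make the proposed flow concrete, here is a minimal sketch of the branching after "Apply" in step 1. The step names and the shape of the `problems` list are assumptions made for illustration; they are not taken from the Cockpit source.

```javascript
// Hypothetical step identifiers for the dialog; not Cockpit's actual names.
const Step = Object.freeze({
    ENTER_SERVER: "enter-server",               // step 1: select NBDE, enter server address
    FIX_SYSTEM: "fix-system",                   // step 2: only shown when something needs fixing
    CONFIRM_FINGERPRINT: "confirm-fingerprint", // step 3: compare and confirm fingerprints
});

// Decide which step follows "Apply" in step 1.
// `problems` is an assumed array describing what the system check found.
function nextStepAfterApply(problems) {
    return problems.length > 0 ? Step.FIX_SYSTEM : Step.CONFIRM_FINGERPRINT;
}
```

With an empty `problems` array the dialog would go straight to the fingerprint confirmation, so a correctly configured system never sees step 2.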

garrett · 2022-10-18T09:34:27Z

> So, what if, as a first step, we only do the check-and-fix flow when adding NBDE to the root filesystem, and forget about a global button? (If I find a quick way to check whether to show this button, we can reconsider this.)

Sure.

> But the first step in the existing wizard includes selecting whether to add a regular passphrase or an NBDE key. So we would have step 1: Select NBDE, enter server address; step 2: check system for NBDE support; step 3: compare and confirm fingerprints. Is that okay?

Wizards are for interactive input. Checking the system is not interactive; it's something that does not require input from the user. Therefore, it should not be a step in a wizard.

And this doesn't sound like it should be a wizard. A wizard is a particular UI pattern. You're talking about a process.

But as the verification step does require user input (accepting the key or not), it could be a wizard, I suppose. That's better than two sequential dialogs. But it's not as good as a dialog that changes based on context. It could be implemented as a wizard, but we don't need (nor want) a sidebar, a big header, or next/back buttons that PF wizards provide.

Here's a mockup of how it could work. It definitely needs updates and iteration.

(I made this in ExcaliDraw, so you could just open this PNG there if you want to edit it and it would show you the vector form, as it's embedded.)

garrett · 2022-10-18T09:39:37Z

Also: Since we'll check a number of things for NBDE support and it takes a little while, the checking step could have a progress bar. We can just divide the bar by the number of checks and advance it when each one completes.
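A rough sketch of that progress calculation, assuming each check counts equally. The `makeProgress` helper is hypothetical and only illustrates the arithmetic, not how Cockpit would wire it to a progress bar component.

```javascript
// Hypothetical helper: tracks how many of `totalChecks` equally weighted checks are done.
function makeProgress(totalChecks) {
    let done = 0;
    return {
        // Current position of the bar as a whole-number percentage.
        percent() {
            return totalChecks === 0 ? 100 : Math.round((100 * done) / totalChecks);
        },
        // Mark one more check as complete (never beyond the total) and return the new percentage.
        completeOne() {
            done = Math.min(done + 1, totalChecks);
            return this.percent();
        },
    };
}
```

For four checks the bar would advance in steps of 25; for a number that does not divide 100 evenly, the percentage is rounded.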

garrett · 2022-10-18T09:52:47Z

Here's a revised flow based on IRC feedback:

mvollmer · 2022-11-14T13:39:27Z

The final reboot sometimes (always?) times out on rhel-8... it completes in 20 seconds here locally. Let's debug this some.

mvollmer · 2022-11-15T07:27:59Z

> The final reboot sometimes (always?) times out on rhel-8... it completes in 20 seconds here locally. Let's debug this some.

Now it succeeded on non-rhos. The console screenshots have been taken and can be found in the results directory. For example, https://cockpit-logs.us-east-1.linodeobjects.com/pull-17796-20221114-133844-c33e9cff-rhel-8-8/failed-reboot.ppm

jelly · 2022-11-25T17:47:55Z

If there is no "/", this will blow up. I guess that's intended?

mvollmer · 2022-11-25T11:53:42Z

@garrett, what do you think about these open issues?

Should we ask the user to verify the key fingerprint before starting the fixing?
Should we offer a way to skip the fixing and still add the key?

jelly · 2022-11-25T16:01:00Z

Minor nitpick, maybe we should have some more spacing after the SHA1 hash.

jelly

Looks good, some minor nitpicks/questions.

jelly · 2022-11-25T14:32:17Z

This is cool! It would be nice if this were more generic, and maybe the default in wait_reboot. But then we'd have to add a wrapper in testlib.py (so self.wait_reboot), as that can call attach.

jelly · 2022-11-25T15:43:40Z

Arch is waiting on latchset/clevis#374; I'll try to poke the PR.

jelly · 2022-11-25T16:29:52Z

This feels a little brittle, as lsinitrd -m also prints the early CPIO image and other stuff. But there doesn't seem to be a better way.

This is what the RHEL documentation recommends: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/security_hardening/configuring-automated-unlocking-of-encrypted-volumes-using-policy-based-decryption_security-hardening#configuring-automated-unlocking-using-a-tang-key-in-the-web-console_configuring-automated-unlocking-of-encrypted-volumes-using-policy-based-decryption

Right, I understand that. But it also prints other stuff. It was more a wish that more tools had parseable output, for example with --json.
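To illustrate why parsing this output is brittle, here is a small sketch that pulls the module list out of text shaped like `lsinitrd -m` output. The sample text, the `dracut modules:` header, the `=` separator lines, and the `clevis` module name are all assumptions about the output format, which may differ between distributions and dracut versions. The function names are hypothetical.

```javascript
// Illustrative text only; not captured from a real system.
const sampleOutput = [
    "Image: /boot/initramfs-example.img: 30M",
    "========================================",
    "Early CPIO image",
    "========================================",
    "Version: dracut-example",
    "",
    "dracut modules:",
    "bash",
    "systemd",
    "clevis",
    "clevis-pin-tang",
    "========================================",
].join("\n");

// Collect the lines between the assumed "dracut modules:" header and the next separator.
function dracutModules(output) {
    const lines = output.split("\n").map((line) => line.trim());
    const start = lines.indexOf("dracut modules:");
    if (start === -1) return [];
    const modules = [];
    for (const line of lines.slice(start + 1)) {
        if (line.startsWith("=")) break;     // separator line ends the module list
        if (line !== "") modules.push(line); // skip blank lines
    }
    return modules;
}

// True if the module list contains an entry named exactly "clevis".
function initrdHasClevis(output) {
    return dracutModules(output).includes("clevis");
}
```

Everything outside the module list (image header, early CPIO section, version line) has to be skipped by convention, which is exactly the kind of guesswork a structured output format would avoid.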

jelly · 2022-11-25T17:01:05Z

So this might be a silly question, but doesn't grubby also regenerate the initramfs? So in theory we could first install clevis-dracut. But that's a real nitpick, I feel :)

grubby does not regenerate the initrd. At least it never did for me.

mvollmer · 2022-11-28T07:22:23Z

> If there is no "/", this will blow up. I guess that's intended?

It's tolerated. :-) We could produce a better error message, but this really shouldn't happen at all.

cockpituous · 2023-01-10T11:56:03Z

+            if (p.waiting) {
+                text = _("Waiting for other software management operations to finish");
+            } else if (p.package) {


These 3 added lines are not executed by any test.

cockpituous · 2023-01-10T11:56:04Z

+            } else if (p.package) {
+                let fmt;
+                if (p.info == PkEnum.INFO_DOWNLOADING)
+                    fmt = _("Downloading $0");