Skip to content

[fm] mark ereports included in a sitrep as "seen"#10156

Open
hawkw wants to merge 18 commits intomainfrom
eliza/ereports-seen
Open

[fm] mark ereports included in a sitrep as "seen"#10156
hawkw wants to merge 18 commits intomainfrom
eliza/ereports-seen

Conversation

@hawkw
Copy link
Member

@hawkw hawkw commented Mar 25, 2026

In order to implement the FM analysis preparation phase (#10073), we must have a mechanism to load ereports from the database that are considered "new" (i.e. they were not included as inputs to prior sitreps). The design for detecting which ereports to load described in this comment involves having the execution phase eventually mark the database records for those ereports as having been seen so that they are not considered "new" in subsequent preparation phases, which is what this branch implements.

Since this is part of the execution phase, it is not guaranteed to have occurred by the time a new preparation phase has begun. This means that the marking of seen ereports is eventually consistent. This is acceptable in the design proposed in #10073, as the marking is really intended as an optimization. When we actually assemble the inputs to analysis, we will additionally filter out any ereports which are part of the parent sitrep, since they have already been included. The marking is intended as an optimization to allow those ereports to not be held in memory forever, and as long as it will eventually happen, it is okay for it to lag behind the loading of ereports in the preparation phase.

@hawkw hawkw changed the title [fm] start ereport seenness marking [fm] mark ereports included in a sitrep as "seen" Mar 26, 2026
@hawkw hawkw marked this pull request as ready for review March 26, 2026 19:28
@hawkw hawkw requested a review from smklein March 26, 2026 19:28
@hawkw hawkw self-assigned this Mar 26, 2026
@hawkw hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Mar 26, 2026
@hawkw hawkw added this to the 20 milestone Mar 26, 2026
@hawkw hawkw requested a review from mergeconflict March 26, 2026 19:29
Comment on lines 69 to +72
// TODO(eliza): as we start doing other things (i.e. requesting support
// bundles, updating problems), consider spawning these in their own tasks...
let alerts = self.create_requested_alerts(&sitrep, opctx).await;
let marking = self.mark_ereports_seen(&sitrep, opctx).await;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we are actually doing multiple things (and will be doing even more multiple things when @mergeconflict's work on support bundle requests land) we should probably actually spawn these. However, I think I'm going to do that in a follow-up PR...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where a bunch of the important parts of the change live, so of course, GitHub has decided to collapse it by default 🙃

Comment on lines +7224 to +7227
* if this is `true`, the ereport has *definitely* been seen by at least
* one committed sitrep at some point in time. if it is `false`, the
* ereport may or may not have been included in a sitrep, and you will
* have to actually check the sitrep to find out.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Comments say "true"/"false", but this is a UUID - should say nullable.

Also would be nice to identify if this is supposed to be the sitrep "executing" or the sitrep which "first marked the ereport as seen". I expect they'll often, but not always, be the same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! And, yeah, it is the sitrep that was executing when this row was marked seen, which may not actually be the first sitrep it appeared in. That should be written down here.

opctx: &OpContext,
sitrep_id: SitrepUuid,
ereport_ids: impl IntoIterator<Item = EreportId>,
) -> Result<usize, Error> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the be an opctx check on the ability to modify things here?

(looking at parity with the other pub fn to list them...)

///
/// If any ereports have already been marked as seen with another sitrep ID,
/// they are unmodified. Otherwise, this query sets the `marked_seen_in`
/// column to the provided sitrep ID.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a note about the return value meaning?

//
// This bit is kindas screwy: unfortunately, Postgres serialization
// does not support bind parameters which are arrays of tuples, so
// we must bind two separate arrays. Trust me on this one.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aight IF YOU SAY SO

enas.push(DbEna::from(ena));
}

// Raw SQL is necessary here as `diesel`'s `.eq_any` doesn't work with
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little paranoid of me, but: do we have a test for this checking that it doesn't fall apart when the ererport_ids iterator is empty?

.await
.expect("failed to insert second sitrep");

// Delete the original sitrep
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment looks like it was added here, but it doesn't look like we're doing deletion here?

" {NOT_ALREADY_MARKED:<WIDTH$}{ereports_not_marked_in_sitrep:>NUM_WIDTH$}"
);
println!(" {MARKED_SEEN:<WIDTH$}{ereports_marked_seen:>NUM_WIDTH$}");
let already_marked = ereports_not_marked_in_sitrep - ereports_marked_seen;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're sure this won't underflow?

self.fm_sitrep_metadata_read_on_conn(id, &conn).await?.into();

Ok(Sitrep { metadata, cases })
Ok(Sitrep { metadata, cases, ereports_by_id: ereports })
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ereports_by_id derivable from cases?

  • If no: How does it differ?
  • If yes: would it be a problem if they differ? Should we be checking sitreps for internal consistency?

(We have a blueprint checker called "blippy" for this reason, might be time to make a "slippy")

pub inventory: &'a inventory::Collection,
pub parent_sitrep: Option<&'a fm::Sitrep>,
pub sitrep_id: SitrepUuid,
pub cases: case::AllCases,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure I have a solid understanding about how this builder is intended to be used, especially with cases being a pub field right now.

Here's my main concern: conceptually, I'm observing three different phases of sitrep construction + usage:

  1. Preparation (loading inputs from the database)
  2. Planning (building the next sitrep)
  3. Execution (enacting the sitrep)

Within the context of "Marking ereports as seen", I can see:

  1. Planning couples the set of "seen" ereports tightly with cases
  2. Execution tries to mark the erports as "seen" within the database

But I'm not seeing the Preparation phase here, which would load inputs from the database, and use them in the construction of the next builder.

It's possible that "this doesn't exist yet, because we aren't yet using the SitrepBuilder, but I wanted to flag this early: We'll need a way to verify that ereports which should be marked as seen in the database actually have been seen! It's possible that the fm_rendezvous task gets stalled for some reason - we cannot assume that the ereports have been marked "seen" when the next sitrep is created.

Anyway, I bring this up because: with cases being public, it seems wrong to mark the case closed, or to remove an ereport from a case, until it has been marked seen. Otherwise, we could risk the following race condition:

  1. ereport E is processed by sitrep S1
  2. sitrep S1 is the target, but doesn't get enough time to execute fully (or fails to rendezvous)
  3. a new sitrep S2 is the new target, drops the reference to ereport E
  4. ereport E no longer exists in the sitrep -- so we no longer try to mark ereport E as "seen" -- and it gets loaded and evaluated as new

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants