Skip to content

[multicast] M2P forwarding, OPTE port subscription, and sled-agent propagation#10070

Open
zeeshanlakhani wants to merge 8 commits intomulticast-e2efrom
zl/multicast-m2p-forwarding
Open

[multicast] M2P forwarding, OPTE port subscription, and sled-agent propagation#10070
zeeshanlakhani wants to merge 8 commits intomulticast-e2efrom
zl/multicast-m2p-forwarding

Conversation

@zeeshanlakhani
Copy link
Collaborator

Complete the multicast data path by adding per-sled M2P (multicast-to- physical) mapping, forwarding entry management, and OPTE port subscription for multicast group members.

Sled-agent:

  • Add multicast_subscribe / multicast_unsubscribe endpoints (API v29) that configure M2P, forwarding, and OPTE port subscription for a VMM
  • OPTE port_manager gains set/clear operations for M2P and forwarding
  • Port subscription cleanup on PortTicket release

Nexus:

  • New sled.rs (MulticastSledClient) encapsulating all sled-agent multicast interactions: subscribe/unsubscribe, M2P/forwarding propagation and teardown
  • Groups RPW propagates M2P and forwarding entries to all member sleds after DPD configuration, with convergent retry on failure
  • Members RPW uses MemberReconcileCtx to thread shared reconciliation state. This handles subscribe on join, unsubscribe on leave, and re-subscribe on migration
  • Dataplane client updated for bifurcated replication groups

Tests:

  • Integration tests for M2P/forwarding/subscribe lifecycle
  • Instance migration multicast re-convergence

@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch 4 times, most recently from 3efe7fa to 98e9742 Compare March 17, 2026 05:42
@zeeshanlakhani
Copy link
Collaborator Author

zeeshanlakhani commented Mar 17, 2026

Note: Both helios/deploy check-opte-ver/check-opte-ver in CI will fail for now. After R19 ships and OPTE PR #924 merges, we can switch from branch = "zl/filter-mcast-srcs" to rev = "<merged-sha>" and bump tools/opte_version + deploy.sh target to 0.40.

@zeeshanlakhani zeeshanlakhani marked this pull request as ready for review March 17, 2026 09:02
@zeeshanlakhani zeeshanlakhani self-assigned this Mar 23, 2026
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from 98e9742 to b510a9f Compare March 24, 2026 02:40
zeeshanlakhani and others added 7 commits March 24, 2026 18:03
…opagation

Complete the multicast data path by adding per-sled M2P (multicast-to-
physical) mapping, forwarding entry management, and OPTE port subscription
for multicast group members.

Sled-agent:
- Add `multicast_subscribe` / `multicast_unsubscribe` endpoints (API v29)
  that configure M2P, forwarding, and OPTE port subscription for a VMM
- OPTE port_manager gains set/clear operations for M2P and forwarding
- Port subscription cleanup on PortTicket release

Nexus:
- New `sled.rs` (MulticastSledClient) encapsulating all sled-agent
  multicast interactions: subscribe/unsubscribe, M2P/forwarding
  propagation and teardown
- Groups RPW propagates M2P and forwarding entries to all member sleds
  after DPD configuration, with convergent retry on failure
- Members RPW uses MemberReconcileCtx to thread shared reconciliation
  state. This handles subscribe on join, unsubscribe on leave, and
  re-subscribe on migration
- Dataplane client updated for bifurcated replication groups

Tests:
- Integration tests for M2P/forwarding/subscribe lifecycle
- Instance migration multicast re-convergence
Move multicast networking types (Mcast2PhysMapping, McastForwardingEntry,
etc.) from omicron-common into a new network-types crate with versioned
structure. This breaks the dependency cycle between illumos-utils and
sled-agent-types-versions (issue #10139).

Structure:
- network-types/versions/: versioned type definitions (initial v1)
- network-types/: re-exports latest versions
- Consumers use network_types::multicast::* for current types
Ereports are uniquely identified by the tuple of their reporter's
restart ID and the ereport's ENA, as described in [RFD 520][1]. ENAs are
ordered within the restart of the reporter, but reporter IDs are
randomly-generated UUIDs and do not have a temporal ordering. As I
described in [this comment][2] on issue #10073, the approach I'd like to
take for loading new ereports to use as inputs to fault management
analysis relies on having an ordering of not only the ENA component of
the ereport ID but also the ordering of reporter restarts (i.e. "is this
restart newer than or older than the one I currently believe to be the
latest?"). In order to do this, we need to track the sequence of restart
IDs per reporter location. This is done on a location-by-location basis
because, well, the laws of physics prevent two sleds/switches/PSCs from
occupying the same physical location in the rack at the same time,
making their history of reporter restarts inherently orderable in a way
that other things, like serial numbers, may be less orderable.

This PR adds a table for constructing that ordering and modifies the
ereport insert query to actually use it. There's some additional
tidiness and refactoring I'd like to do here later (and an OMDB command
for printing the ordering table would be nice), but for now, I'd like to
be able to actually use this.

[1]: https://rfd.shared.oxide.computer/rfd/0520
[2]:
#10073 (comment)
As discussed in [this comment][1], it turns out that the approach to
sitrep input loading which required the ereporter restart table that I
added in #10132 will actually just straight up not work. Unfortunately,
we didn't figure that out until after the PR adding that table had
merged. We've now found a different way to do this that doesn't require
the ordering table. Additionally, the ordering table orders restarts by
the order in which ereports were _added to the database_, not the order
in which the restarts were actually first observed, which is of
debatable usefulness at best. Therefore, we should just get rid of it.

Unfortunately, because there was a database migration, we cannot simply
push a commit that reverts eb0bb82 and
removes all evidence of my shame. Instead, we must have a new migration
that just kinda undoes the migration I wrote yesterday, in case anyone
has installed a build of the control plane containing
eb0bb82 some time in approximately the
last two hours. So, undoing the change requires somewhat more ceremony
than it would if the database schema has not been touched. This PR does
that.

[1]:
#10073 (comment)
I have another PR that will follow this one to make the set default
logic for subnet pools work like IP pools (demote and replace existing
default instead of erroring if there's already a default), and as part
of that I started doing some test cleanup. I moved that into its own
separate PR to make the real PR easier to review.

* Replace inline
`NexusRequest::object_get`/`object_put`/`object_delete`/`expect_failure`/etc.
with the shared `object_get`, `object_put`, `object_delete`,
`objects_list_page_authz`, and `*_error helpers` from resource_helpers.
* Add `assert_silos_for_pool` and `assert_pools_for_silo` helpers to
reduce repeated assertion boilerplate.
* Remove stale TODO comments referencing #9453 (not in the tests but
just a comment change)
- Remove the footgunny method for fetching "the" default IP Pool for a
silo, which is hard to use correctly without specifying an IP version.
Also fix call sites.
- Add method for returning all the default IP Pools for a Silo, of which
there are at most 2.
- In the VPC creation saga, ensure that we create an internet gateway
pool for all the IP Pools currently linked to the silo as a default.
- Add a data migration that renames any existing "default" gateway IP
Pools as "default-v4" or "default-v6", as we're now specifying the
version.
- Add regression test for #10117.
- This fixes #10117.
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from b510a9f to 1402d4d Compare March 26, 2026 06:42
@zeeshanlakhani
Copy link
Collaborator Author

zeeshanlakhani commented Mar 26, 2026

@jgallagher I started down the better path for network types in relation to #10139, where we could expand on the initial version in a separate PR. For the moment, I'm going to move the types back to omicron_common until #10158 finalizes.

@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from 1402d4d to 102d7fb Compare March 26, 2026 09:23
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from 102d7fb to 043af28 Compare March 26, 2026 11:51
Copy link
Contributor

@FelixMcFelix FelixMcFelix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to be posting comments in chunks given the apparent size of the PR, so please bear with me there if I miss something that's covered by a later file. Apart from that I'm seeing various changes from main interspersed here, so I don't know if something is up with the branch targeting.

Comment on lines +73 to +87

#[error("Failed to set multicast-to-physical mapping: {0}")]
SetMcastM2p(String),

#[error("Failed to clear multicast-to-physical mapping: {0}")]
ClearMcastM2p(String),

#[error("Failed to set multicast forwarding: {0}")]
SetMcastFwd(String),

#[error("Failed to clear multicast forwarding: {0}")]
ClearMcastFwd(String),

#[error("Failed to list multicast forwarding: {0}")]
ListMcastFwd(String),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not quite the right way to think about the error type here. What's fundamentally going wrong in most of these these cases is that MulticastUnderlay::new failed, so it's the same class of parameter validation failure (and so one error variant). ListMcastFwd is obviously the exception, but I've written on that below. The other thing is that I think String is probably the wrong error type in OPTE itself for a validation failure.

Comment on lines +1039 to +1073
/// Install CIDR allow rules for multicast traffic on a port.
fn install_mcast_cidr_allows(
&self,
hdl: &Handle,
port_name: &str,
) -> Result<(), Error> {
for (cidr, dir) in mcast_cidr_allow_set() {
debug!(
self.inner.log,
"installing multicast CIDR allow rule";
"port" => port_name,
"cidr" => %cidr,
"direction" => ?dir,
);
hdl.allow_cidr(port_name, cidr, dir)?;
}
Ok(())
}

/// Remove CIDR allow rules for multicast traffic from a port.
fn remove_mcast_cidr_allows(
&self,
hdl: &Handle,
port_name: &str,
) -> Result<(), Error> {
for (cidr, dir) in mcast_cidr_allow_set() {
debug!(
self.inner.log,
"removing multicast CIDR allow rule";
"port" => port_name,
"cidr" => %cidr,
"direction" => ?dir,
);
hdl.remove_cidr(port_name, cidr, dir)?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The add/remove_cidr ioctls are intended for transit IPs, not this. But also this logic shouldn't be necessary? We've already set up the gateway layer in OPTE to handle this according to your description here: https://github.com/oxidecomputer/opte/blob/04c3d5d37d7b919cbf01019d2a17b93ff2df2eb8/lib/oxide-vpc/src/engine/gateway/mod.rs#L43-L56.

I think if we were converting them to individual/aggregated subnet exceptions, we would leave that to OPTE based on the group subscription info it has.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, you're right. I had this as a placeholder, but dealt with it in the gateway code.

_: Direction,
) -> Result<NoResp, OpteError> {
unimplemented!("Not yet used in tests")
Ok(NO_RESPONSE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should still be marked as unimplemented, I think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary with removal based on previous comment.

Comment on lines +273 to +293
// Parse tools/opte_version_override for OPTE_COMMIT and
// OPTE_PKG_FALLBACK_VERSION. When OPTE_COMMIT is set, the OPTE package
// isn't published to the helios pkg repo yet, so we install the fallback
// version via pkg and overlay the xde binary and opteadm from buildomat.
let opte_override = parse_opte_version_override(
&WORKSPACE_DIR.join("tools/opte_version_override"),
)
.await?;
let effective_opte_version = match &opte_override {
Some(ov) => {
info!(
logger,
"OPTE override active: commit={}, fallback pkg version={}",
ov.commit,
ov.fallback_version,
);
&ov.fallback_version
}
None => opte_version.trim(),
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is useful, but I think ideally we'd want to pursue this via an override p5p from a given commit, which OPTE's CI does generate. I know Ry looked into this a while back with illumos/image-builder#1, but I think that the extra_packages machinery when building image_cmd might be able to help us?

If we're trying to greater enshrine this so we can run tests more reliably, we should also modify .github/workflows/check-opte-ver.yml (or add a new stage) such that overall CI fails if any of these override flags are set. I don't want us to be able to merge any version workarounds if possible.

Comment on lines +157 to +158
mcast_subscriptions:
Mutex<HashMap<(Uuid, NetworkInterfaceKind), McastPortState>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to keep this state, and that of eip_gateways, in ports. Stuffing it into Port/PortInner might be tricky, since that type is shared with the Zones for a few reasons, so I don't know if something like this will suffice:

struct MutablePortState {
    fixed_data: Port,
    eip_gateways: HashSet<Uuid>,
    mcast: McastPortState,
}

I also, one day, want to move EIP bindings down from the instance manager to here to make those pathways less circuitous.

Sorry for kicking off the bad pattern there (mainly born from expediency around that immutable Port type), but I think we should nix it while we can. This should also drastically simplify the need to reason about lock order.

(Your VnicUuid comment also aplies to ports and eip_gateways, that would be nice but it's not necessary here.)

Comment on lines +170 to +190
/// The set of (CIDR, direction) pairs for multicast allow rules.
/// Covers [`IPV4_MULTICAST_RANGE`] (224.0.0.0/4) and
/// [`IPV6_MULTICAST_RANGE`] (ff00::/8) in both directions.
fn mcast_cidr_allow_set() -> [(IpCidr, Direction); 4] {
let v4 = IPV4_MULTICAST_RANGE;
let v6 = IPV6_MULTICAST_RANGE;
let ipv4 = IpCidr::Ip4(Ipv4Cidr::new(
oxide_vpc::api::Ipv4Addr::from(v4.addr()),
oxide_vpc::api::Ipv4PrefixLen::new(v4.width()).unwrap(),
));
let ipv6 = IpCidr::Ip6(Ipv6Cidr::new(
oxide_vpc::api::Ipv6Addr::from(v6.addr()),
oxide_vpc::api::Ipv6PrefixLen::new(v6.width()).unwrap(),
));
[
(ipv4, Direction::In),
(ipv4, Direction::Out),
(ipv6, Direction::In),
(ipv6, Direction::Out),
]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed assuming the comments in OPTE are correct.

Comment on lines +967 to +970
hdl.mcast_unsubscribe(&McastUnsubscribeReq {
port_name: port_name.clone(),
group: (*group_ip).into(),
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe comment here that this is effectively infallible (we've verified the IPs are multicast at least implicitly, the operation is idempotent, the port exists).

Comment on lines +993 to +997
hdl.mcast_subscribe(&McastSubscribeReq {
port_name: port_name.clone(),
group: (*group_ip).into(),
filter: filter.clone(),
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto to above, mainly since we don't want to give the impression that this is something that can fail partially.

Comment on lines +1352 to +1358
oxide_vpc::api::Replication::Reserved => {
return Err(Error::ListMcastFwd(
"OPTE returned Reserved replication \
mode, which has no defined semantics"
.to_string(),
));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is ever set this way, it would have to be from an operator manually playing with opteadm I assume. Should we be bailing here if that's the case? We should either be listing the system state unconditionally or flagging an error log at worst, such that consumers like converge_forwarding don't get wedged trying to reconcile system state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants