[multicast] M2P forwarding, OPTE port subscription, and sled-agent propagation#10070
[multicast] M2P forwarding, OPTE port subscription, and sled-agent propagation#10070zeeshanlakhani wants to merge 8 commits intomulticast-e2efrom
Conversation
3efe7fa to
98e9742
Compare
|
Note: Both |
98e9742 to
b510a9f
Compare
…opagation Complete the multicast data path by adding per-sled M2P (multicast-to- physical) mapping, forwarding entry management, and OPTE port subscription for multicast group members. Sled-agent: - Add `multicast_subscribe` / `multicast_unsubscribe` endpoints (API v29) that configure M2P, forwarding, and OPTE port subscription for a VMM - OPTE port_manager gains set/clear operations for M2P and forwarding - Port subscription cleanup on PortTicket release Nexus: - New `sled.rs` (MulticastSledClient) encapsulating all sled-agent multicast interactions: subscribe/unsubscribe, M2P/forwarding propagation and teardown - Groups RPW propagates M2P and forwarding entries to all member sleds after DPD configuration, with convergent retry on failure - Members RPW uses MemberReconcileCtx to thread shared reconciliation state. This handles subscribe on join, unsubscribe on leave, and re-subscribe on migration - Dataplane client updated for bifurcated replication groups Tests: - Integration tests for M2P/forwarding/subscribe lifecycle - Instance migration multicast re-convergence
Move multicast networking types (Mcast2PhysMapping, McastForwardingEntry, etc.) from omicron-common into a new network-types crate with versioned structure. This breaks the dependency cycle between illumos-utils and sled-agent-types-versions (issue #10139). Structure: - network-types/versions/: versioned type definitions (initial v1) - network-types/: re-exports latest versions - Consumers use network_types::multicast::* for current types
Ereports are uniquely identified by the tuple of their reporter's restart ID and the ereport's ENA, as described in [RFD 520][1]. ENAs are ordered within the restart of the reporter, but reporter IDs are randomly-generated UUIDs and do not have a temporal ordering. As I described in [this comment][2] on issue #10073, the approach I'd like to take for loading new ereports to use as inputs to fault management analysis relies on having an ordering of not only the ENA component of the ereport ID but also the ordering of reporter restarts (i.e. "is this restart newer than or older than the one I currently believe to be the latest?"). In order to do this, we need to track the sequence of restart IDs per reporter location. This is done on a location-by-location basis because, well, the laws of physics prevent two sleds/switches/PSCs from occupying the same physical location in the rack at the same time, making their history of reporter restarts inherently orderable in a way that other things, like serial numbers, may be less orderable. This PR adds a table for constructing that ordering and modifies the ereport insert query to actually use it. There's some additional tidiness and refactoring I'd like to do here later (and an OMDB command for printing the ordering table would be nice), but for now, I'd like to be able to actually use this. [1]: https://rfd.shared.oxide.computer/rfd/0520 [2]: #10073 (comment)
As discussed in [this comment][1], it turns out that the approach to sitrep input loading which required the ereporter restart table that I added in #10132 will actually just straight up not work. Unfortunately, we didn't figure that out until after the PR adding that table had merged. We've now found a different way to do this that doesn't require the ordering table. Additionally, the ordering table orders restarts by the order in which ereports were _added to the database_, not the order in which the restarts were actually first observed, which is of debatable usefulness at best. Therefore, we should just get rid of it. Unfortunately, because there was a database migration, we cannot simply push a commit that reverts eb0bb82 and removes all evidence of my shame. Instead, we must have a new migration that just kinda undoes the migration I wrote yesterday, in case anyone has installed a build of the control plane containing eb0bb82 some time in approximately the last two hours. So, undoing the change requires somewhat more ceremony than it would if the database schema has not been touched. This PR does that. [1]: #10073 (comment)
I have another PR that will follow this one to make the set default logic for subnet pools work like IP pools (demote and replace existing default instead of erroring if there's already a default), and as part of that I started doing some test cleanup. I moved that into its own separate PR to make the real PR easier to review. * Replace inline `NexusRequest::object_get`/`object_put`/`object_delete`/`expect_failure`/etc. with the shared `object_get`, `object_put`, `object_delete`, `objects_list_page_authz`, and `*_error helpers` from resource_helpers. * Add `assert_silos_for_pool` and `assert_pools_for_silo` helpers to reduce repeated assertion boilerplate. * Remove stale TODO comments referencing #9453 (not in the tests but just a comment change)
- Remove the footgunny method for fetching "the" default IP Pool for a silo, which is hard to use correctly without specifying an IP version. Also fix call sites. - Add method for returning all the default IP Pools for a Silo, of which there are at most 2. - In the VPC creation saga, ensure that we create an internet gateway pool for all the IP Pools currently linked to the silo as a default. - Add a data migration that renames any existing "default" gateway IP Pools as "default-v4" or "default-v6", as we're now specifying the version. - Add regression test for #10117. - This fixes #10117.
b510a9f to
1402d4d
Compare
|
|
1402d4d to
102d7fb
Compare
102d7fb to
043af28
Compare
FelixMcFelix
left a comment
There was a problem hiding this comment.
I'm going to be posting comments in chunks given the apparent size of the PR, so please bear with me there if I miss something that's covered by a later file. Apart from that I'm seeing various changes from main interspersed here, so I don't know if something is up with the branch targeting.
|
|
||
| #[error("Failed to set multicast-to-physical mapping: {0}")] | ||
| SetMcastM2p(String), | ||
|
|
||
| #[error("Failed to clear multicast-to-physical mapping: {0}")] | ||
| ClearMcastM2p(String), | ||
|
|
||
| #[error("Failed to set multicast forwarding: {0}")] | ||
| SetMcastFwd(String), | ||
|
|
||
| #[error("Failed to clear multicast forwarding: {0}")] | ||
| ClearMcastFwd(String), | ||
|
|
||
| #[error("Failed to list multicast forwarding: {0}")] | ||
| ListMcastFwd(String), |
There was a problem hiding this comment.
I think this is not quite the right way to think about the error type here. What's fundamentally going wrong in most of these these cases is that MulticastUnderlay::new failed, so it's the same class of parameter validation failure (and so one error variant). ListMcastFwd is obviously the exception, but I've written on that below. The other thing is that I think String is probably the wrong error type in OPTE itself for a validation failure.
| /// Install CIDR allow rules for multicast traffic on a port. | ||
| fn install_mcast_cidr_allows( | ||
| &self, | ||
| hdl: &Handle, | ||
| port_name: &str, | ||
| ) -> Result<(), Error> { | ||
| for (cidr, dir) in mcast_cidr_allow_set() { | ||
| debug!( | ||
| self.inner.log, | ||
| "installing multicast CIDR allow rule"; | ||
| "port" => port_name, | ||
| "cidr" => %cidr, | ||
| "direction" => ?dir, | ||
| ); | ||
| hdl.allow_cidr(port_name, cidr, dir)?; | ||
| } | ||
| Ok(()) | ||
| } | ||
|
|
||
| /// Remove CIDR allow rules for multicast traffic from a port. | ||
| fn remove_mcast_cidr_allows( | ||
| &self, | ||
| hdl: &Handle, | ||
| port_name: &str, | ||
| ) -> Result<(), Error> { | ||
| for (cidr, dir) in mcast_cidr_allow_set() { | ||
| debug!( | ||
| self.inner.log, | ||
| "removing multicast CIDR allow rule"; | ||
| "port" => port_name, | ||
| "cidr" => %cidr, | ||
| "direction" => ?dir, | ||
| ); | ||
| hdl.remove_cidr(port_name, cidr, dir)?; | ||
| } |
There was a problem hiding this comment.
The add/remove_cidr ioctls are intended for transit IPs, not this. But also this logic shouldn't be necessary? We've already set up the gateway layer in OPTE to handle this according to your description here: https://github.com/oxidecomputer/opte/blob/04c3d5d37d7b919cbf01019d2a17b93ff2df2eb8/lib/oxide-vpc/src/engine/gateway/mod.rs#L43-L56.
I think if we were converting them to individual/aggregated subnet exceptions, we would leave that to OPTE based on the group subscription info it has.
There was a problem hiding this comment.
Ah yeah, you're right. I had this as a placeholder, but dealt with it in the gateway code.
| _: Direction, | ||
| ) -> Result<NoResp, OpteError> { | ||
| unimplemented!("Not yet used in tests") | ||
| Ok(NO_RESPONSE) |
There was a problem hiding this comment.
This should still be marked as unimplemented, I think?
There was a problem hiding this comment.
Not necessary with removal based on previous comment.
| // Parse tools/opte_version_override for OPTE_COMMIT and | ||
| // OPTE_PKG_FALLBACK_VERSION. When OPTE_COMMIT is set, the OPTE package | ||
| // isn't published to the helios pkg repo yet, so we install the fallback | ||
| // version via pkg and overlay the xde binary and opteadm from buildomat. | ||
| let opte_override = parse_opte_version_override( | ||
| &WORKSPACE_DIR.join("tools/opte_version_override"), | ||
| ) | ||
| .await?; | ||
| let effective_opte_version = match &opte_override { | ||
| Some(ov) => { | ||
| info!( | ||
| logger, | ||
| "OPTE override active: commit={}, fallback pkg version={}", | ||
| ov.commit, | ||
| ov.fallback_version, | ||
| ); | ||
| &ov.fallback_version | ||
| } | ||
| None => opte_version.trim(), | ||
| }; | ||
|
|
There was a problem hiding this comment.
I think this is useful, but I think ideally we'd want to pursue this via an override p5p from a given commit, which OPTE's CI does generate. I know Ry looked into this a while back with illumos/image-builder#1, but I think that the extra_packages machinery when building image_cmd might be able to help us?
If we're trying to greater enshrine this so we can run tests more reliably, we should also modify .github/workflows/check-opte-ver.yml (or add a new stage) such that overall CI fails if any of these override flags are set. I don't want us to be able to merge any version workarounds if possible.
| mcast_subscriptions: | ||
| Mutex<HashMap<(Uuid, NetworkInterfaceKind), McastPortState>>, |
There was a problem hiding this comment.
I think it would be easier to keep this state, and that of eip_gateways, in ports. Stuffing it into Port/PortInner might be tricky, since that type is shared with the Zones for a few reasons, so I don't know if something like this will suffice:
struct MutablePortState {
fixed_data: Port,
eip_gateways: HashSet<Uuid>,
mcast: McastPortState,
}I also, one day, want to move EIP bindings down from the instance manager to here to make those pathways less circuitous.
Sorry for kicking off the bad pattern there (mainly born from expediency around that immutable Port type), but I think we should nix it while we can. This should also drastically simplify the need to reason about lock order.
(Your VnicUuid comment also aplies to ports and eip_gateways, that would be nice but it's not necessary here.)
| /// The set of (CIDR, direction) pairs for multicast allow rules. | ||
| /// Covers [`IPV4_MULTICAST_RANGE`] (224.0.0.0/4) and | ||
| /// [`IPV6_MULTICAST_RANGE`] (ff00::/8) in both directions. | ||
| fn mcast_cidr_allow_set() -> [(IpCidr, Direction); 4] { | ||
| let v4 = IPV4_MULTICAST_RANGE; | ||
| let v6 = IPV6_MULTICAST_RANGE; | ||
| let ipv4 = IpCidr::Ip4(Ipv4Cidr::new( | ||
| oxide_vpc::api::Ipv4Addr::from(v4.addr()), | ||
| oxide_vpc::api::Ipv4PrefixLen::new(v4.width()).unwrap(), | ||
| )); | ||
| let ipv6 = IpCidr::Ip6(Ipv6Cidr::new( | ||
| oxide_vpc::api::Ipv6Addr::from(v6.addr()), | ||
| oxide_vpc::api::Ipv6PrefixLen::new(v6.width()).unwrap(), | ||
| )); | ||
| [ | ||
| (ipv4, Direction::In), | ||
| (ipv4, Direction::Out), | ||
| (ipv6, Direction::In), | ||
| (ipv6, Direction::Out), | ||
| ] | ||
| } |
There was a problem hiding this comment.
This shouldn't be needed assuming the comments in OPTE are correct.
| hdl.mcast_unsubscribe(&McastUnsubscribeReq { | ||
| port_name: port_name.clone(), | ||
| group: (*group_ip).into(), | ||
| })?; |
There was a problem hiding this comment.
Maybe comment here that this is effectively infallible (we've verified the IPs are multicast at least implicitly, the operation is idempotent, the port exists).
| hdl.mcast_subscribe(&McastSubscribeReq { | ||
| port_name: port_name.clone(), | ||
| group: (*group_ip).into(), | ||
| filter: filter.clone(), | ||
| })?; |
There was a problem hiding this comment.
Ditto to above, mainly since we don't want to give the impression that this is something that can fail partially.
| oxide_vpc::api::Replication::Reserved => { | ||
| return Err(Error::ListMcastFwd( | ||
| "OPTE returned Reserved replication \ | ||
| mode, which has no defined semantics" | ||
| .to_string(), | ||
| )); | ||
| } |
There was a problem hiding this comment.
If this is ever set this way, it would have to be from an operator manually playing with opteadm I assume. Should we be bailing here if that's the case? We should either be listing the system state unconditionally or flagging an error log at worst, such that consumers like converge_forwarding don't get wedged trying to reconcile system state.
Complete the multicast data path by adding per-sled M2P (multicast-to- physical) mapping, forwarding entry management, and OPTE port subscription for multicast group members.
Sled-agent:
multicast_subscribe/multicast_unsubscribeendpoints (API v29) that configure M2P, forwarding, and OPTE port subscription for a VMMNexus:
sled.rs(MulticastSledClient) encapsulating all sled-agent multicast interactions: subscribe/unsubscribe, M2P/forwarding propagation and teardownTests: