Which component are you using?
/area cluster-autoscaler
/area core-autoscaler
Problem description:
Before Cluster Autoscaler was integrated with DRA, `ClusterSnapshot` would directly store the scheduler-framework `NodeInfo` objects, and CA code would operate on these scheduler-framework objects.
In order to support DRA, CA had to start tracking ResourceSlices and ResourceClaims in addition to just Nodes and Pods. The scheduler `NodeInfo` type doesn't contain them, and back then it was just a struct, not an interface - so we couldn't directly extend it. Instead, we introduced internal `NodeInfo` and `PodInfo` types - wrapping the scheduler structs, and adding CA-specific fields on top:
```go
type NodeInfo struct {
	// ... (truncated in the original snippet embed)
}
```
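For context, the wrapper types look roughly like this - a hedged sketch with simplified fields, not the exact definitions from `cluster-autoscaler/simulator/framework` (the DRA API import version may also differ):

```go
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// PodInfo wraps a Pod and adds the DRA objects CA needs to track for it.
type PodInfo struct {
	Pod *apiv1.Pod
	// NeededResourceClaims are the ResourceClaims referenced by the Pod.
	NeededResourceClaims []*resourceapi.ResourceClaim
}

// NodeInfo wraps the scheduler-framework NodeInfo and adds CA-specific fields.
type NodeInfo struct {
	*schedulerframework.NodeInfo

	// LocalResourceSlices are the node-local ResourceSlices exposed by the Node.
	LocalResourceSlices []*resourceapi.ResourceSlice
	// Pods shadows the embedded scheduler Pods field with the internal PodInfos
	// (in this simplified sketch; the real type exposes them differently).
	Pods []*PodInfo
}
```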
Scheduler `NodeInfo`s are tracked in `ClusterSnapshotStore`:

```go
type DeltaSnapshotStore struct {
	// ... (truncated in the original snippet embed)
}
```
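Roughly, the delta store layers copy-on-write modifications over a base snapshot, so `Fork()`/`Commit()`/`Revert()` during simulations stay cheap. A hedged sketch - the field names are illustrative, not the exact `DeltaSnapshotStore` internals:

```go
package store

import (
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// deltaLayer is one copy-on-write layer; reads fall through to baseData when
// a node isn't overridden in this layer.
type deltaLayer struct {
	baseData         *deltaLayer
	addedNodeInfoMap map[string]*schedulerframework.NodeInfo
	deletedNodeInfos map[string]bool
}

// DeltaSnapshotStore keeps the top of the layer stack: Fork() pushes a new
// layer, Revert() pops it, Commit() squashes it into the base.
type DeltaSnapshotStore struct {
	data *deltaLayer
}
```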
`ClusterSnapshotStore` also contains `dynamicresources/snapshot.Snapshot`, which tracks the DRA objects:

```go
type Snapshot struct {
	// ... (truncated in the original snippet embed)
}
```
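What the DRA snapshot tracks, roughly - a hedged sketch, with illustrative field names rather than the exact `dynamicresources/snapshot` definitions. Note that some of the tracked objects aren't bound to any particular Node or Pod, which matters later in this proposal:

```go
package snapshot

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// Snapshot tracks the DRA objects; field names are illustrative.
type Snapshot struct {
	resourceClaims           map[types.NamespacedName]*resourceapi.ResourceClaim
	resourceSlicesByNodeName map[string][]*resourceapi.ResourceSlice
	// Some tracked objects aren't bound to any Node or Pod:
	nonNodeLocalResourceSlices []*resourceapi.ResourceSlice
	deviceClasses              map[string]*resourceapi.DeviceClass
}
```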
Whenever scheduler plugins list (the scheduler version of) `NodeInfo`s or DRA objects through the scheduler framework, they are listed directly from `ClusterSnapshotStore` and `dynamicresources/snapshot.Snapshot`, with no memory allocations.
Cluster Autoscaler code, on the other hand, always lists/gets the CA-specific version of `NodeInfo`. Since parts of the object live in different places (`ClusterSnapshotStore` and `dynamicresources/snapshot.Snapshot`), the internal `NodeInfo`s always get computed on the fly, wrapping both parts:
autoscaler/cluster-autoscaler/simulator/clustersnapshot/predicate/predicate_snapshot.go (line 54 at commit 91bae6b):

```go
func (s *PredicateSnapshot) GetNodeInfo(nodeName string) (*framework.NodeInfo, error) {
	// ... (truncated in the original snippet embed)
}
```
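Continuing the sketch above, the on-the-fly wrapping looks roughly like this. The DRA-snapshot accessors (`ResourceSlicesForNode`, `ClaimsForPod`) are hypothetical stand-ins, not the real `dynamicresources/snapshot` API:

```go
// draLookups is a stand-in for the DRA snapshot's read API; the method names
// are hypothetical.
type draLookups interface {
	ResourceSlicesForNode(nodeName string) []*resourceapi.ResourceSlice
	ClaimsForPod(pod *apiv1.Pod) []*resourceapi.ResourceClaim
}

// wrapSchedulerNodeInfo correlates a stored scheduler NodeInfo with the DRA
// snapshot. A fresh internal NodeInfo, plus one internal PodInfo per pod, is
// allocated on every call - the allocation hot spot described below.
func wrapSchedulerNodeInfo(schedNodeInfo *schedulerframework.NodeInfo, dra draLookups) *NodeInfo {
	nodeInfo := &NodeInfo{
		NodeInfo:            schedNodeInfo,
		LocalResourceSlices: dra.ResourceSlicesForNode(schedNodeInfo.Node().Name),
	}
	for _, schedPodInfo := range schedNodeInfo.Pods {
		nodeInfo.Pods = append(nodeInfo.Pods, &PodInfo{
			Pod:                  schedPodInfo.Pod,
			NeededResourceClaims: dra.ClaimsForPod(schedPodInfo.Pod),
		})
	}
	return nodeInfo
}
```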
This results in allocating memory for the internal `NodeInfo` every time it's accessed from `ClusterSnapshot`. After we added parallelism to binpacking in #8729, benchmarking showed that these memory allocations are now a performance bottleneck.
The scheduler-framework `NodeInfo` was changed to an interface in K8s 1.34, so we can now return the internal `NodeInfo`s directly to scheduler code - as long as they implement the interface. Since the internal `NodeInfo` wraps the scheduler one, most of the interface methods can just be passed through to the wrapped scheduler struct.
Proposed solution:
High-level overview:
- Make the internal `NodeInfo` implement the scheduler-framework `NodeInfo` interface, so that it can be directly returned and used by the scheduler framework.
- Change the "storage version" in `ClusterSnapshotStore` from the scheduler `NodeInfo`/`PodInfo` to the CA-internal `NodeInfo`/`PodInfo`. Have `ClusterSnapshotStore` directly provide the CA-internal objects to both CA and scheduler-framework code without any wrapping/memory allocations.
- Modify the `ClusterSnapshotStore` interface and implementations so that any time internal `NodeInfo`/`PodInfo` objects are allocated during its methods, they have the CA-specific fields correctly set from the DRA snapshot before being stored.
- Modify the `ClusterSnapshot` interface and its `PredicateSnapshot` implementation to delegate `NodeInfo` getting/listing to the embedded `ClusterSnapshotStore`.
Detailed plan:
- Implement the scheduler-framework `NodeInfo` interface for the CA-internal `NodeInfo` struct, passing most methods through to the embedded scheduler-framework `NodeInfo` implementation (see the sketch after this list).
  - Handling `PodInfo` will be tricky - the internal `NodeInfo.Pods()` should return a list of internal `PodInfo` objects without allocating them on the fly, so we can't just pass this method through to the scheduler `NodeInfo.Pods()`. We have to actually embed the scheduler `PodInfo` in the internal `PodInfo` and implement the `PodInfo` interface, keep a list of internal `PodInfo`s in the internal `NodeInfo`, and return that from the internal `NodeInfo.Pods()`.
- Modify the `ClusterSnapshotStore` interface and both implementations (`BasicSnapshotStore` and `DeltaSnapshotStore`):
  - Store the internal `NodeInfo`s instead of the scheduler version in the implementations. The `ClusterSnapshotStore.NodeInfos()` method exposing the `NodeInfo`s to scheduler code should keep working as-is, because the internal `NodeInfo` implements the correct interface. `AddSchedulerNodeInfo` and `RemoveSchedulerNodeInfo` should be changed to operate on the internal `NodeInfo`, and renamed to something like `ForceAddNodeInfo`/`ForceRemoveNodeInfo`. The internal `NodeInfo` provided to `ForceAddNodeInfo()` should already have all the CA-specific parts filled in, so it can just be stored as-is.
  - Move the logic that allocates internal `NodeInfo` objects, including all the CA-specific parts, from on-the-fly in `PredicateSnapshot.GetNodeInfo()`/`ListNodeInfos()` to once-per-loop in `ClusterSnapshotStore.SetClusterState()`.
  - Change `ForceAddPod()` to allocate an internal `PodInfo` and include all the DRA-related information (taken from the DRA snapshot) before storing it in a `NodeInfo`.
- Modify the `ClusterSnapshot` interface and implementation (`PredicateSnapshot`):
  - Move the `GetNodeInfo()` and `ListNodeInfos()` methods from `ClusterSnapshot` to `ClusterSnapshotStore`. Implement the new methods in both `ClusterSnapshotStore` implementations by just getting/listing the stored internal `NodeInfo`s without any wrapping.
  - Change `AddNodeInfo()` to call the new `ClusterSnapshotStore.ForceAddNodeInfo()` and pass the whole internal `NodeInfo` to it.
  - Change `RemoveNodeInfo()` to call the new `ClusterSnapshotStore.ForceRemoveNodeInfo()`.
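To make the first step concrete, here is a minimal sketch of the target shape. The two interfaces below are illustrative stand-ins for the real scheduler-framework interfaces from `k8s.io/kube-scheduler/framework` - the real method sets are much larger, but every extra method would pass through to the embedded scheduler struct the same way. The unexported `pods` slice replaces the exported `Pods` field from the earlier sketch:

```go
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// Illustrative stand-ins for the scheduler-framework interfaces; not the
// actual k8s.io/kube-scheduler/framework definitions.
type SchedulerPodInfo interface {
	GetPod() *apiv1.Pod
}

type SchedulerNodeInfo interface {
	Node() *apiv1.Node
	GetPods() []SchedulerPodInfo
}

// PodInfo now embeds the scheduler PodInfo and implements the interface.
type PodInfo struct {
	*schedulerframework.PodInfo
	NeededResourceClaims []*resourceapi.ResourceClaim
}

func (p *PodInfo) GetPod() *apiv1.Pod { return p.Pod }

// NodeInfo embeds the scheduler NodeInfo, which provides Node() and most
// other interface methods via plain pass-through.
type NodeInfo struct {
	*schedulerframework.NodeInfo

	LocalResourceSlices []*resourceapi.ResourceSlice
	// pods is kept pre-converted to the interface-typed slice, so GetPods()
	// can return it without allocating anything on the fly.
	pods []SchedulerPodInfo
}

func (n *NodeInfo) GetPods() []SchedulerPodInfo { return n.pods }

// Compile-time check against the (stand-in) interface.
var _ SchedulerNodeInfo = &NodeInfo{}
```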
Mental model changes
Currently, a good mental model for ClusterSnapshot and DRA is something like this:
- `ClusterSnapshotStore` is the single storage source-of-truth for Nodes and Pods.
- `ClusterSnapshotStore.DraSnapshot()` is the single storage source-of-truth for DRA objects.
- `ClusterSnapshot.GetNodeInfo()`/`ListNodeInfos()` correlates Nodes and Pods with the DRA objects, and returns CA-internal `NodeInfo` objects combining both. The result is "ephemeral": read-only, only valid until the next `ClusterSnapshot` mutation, and shouldn't be stored.
- Any time `ClusterSnapshot` is mutated, `ClusterSnapshotStore` and the embedded `ClusterSnapshotStore.DraSnapshot()` are mutated independently. The previous results of `GetNodeInfo()`/`ListNodeInfos()` are now invalid, so we don't care about "synchronizing" them with the mutation. For each tracked object, we only change the single source of truth it's stored in.
The proposed changes make the mental model more complex:
- `ClusterSnapshotStore.DraSnapshot()` still stores the DRA objects and is conceptually still the source-of-truth, but it's no longer the only place where the DRA objects are stored.
- `ClusterSnapshotStore` stores fully-formed CA-internal `NodeInfo` objects - so Nodes, Pods, and DRA objects. It's still the single storage source-of-truth for Nodes and Pods.
- DRA objects are now stored in 2 places - `ClusterSnapshotStore.DraSnapshot()`, and the CA-internal `NodeInfo`s tracked in `ClusterSnapshotStore`. We can't get away from having a separate DRA snapshot, because there are objects to track that aren't bound to a particular Node or Pod. We also can't get away from having the DRA objects in the internal `NodeInfo`s stored by `ClusterSnapshotStore` - either they're persisted there, or we're back to allocating new `NodeInfo` objects on the fly.
- We still don't care about synchronizing mutations to the ephemeral results of `GetNodeInfo()`/`ListNodeInfos()`, but we now need to synchronize any mutations between `ClusterSnapshotStore.DraSnapshot()` and the CA-internal `NodeInfo`s tracked in `ClusterSnapshotStore` (see the sketch after this list). In return, any time we get a `NodeInfo` from `ClusterSnapshotStore`, we can just return it with no extra logic.
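For instance, a pod-adding mutation now has to touch both places in one operation. A conceptual sketch, reusing the stand-in types from the sketches above (all store and DRA-snapshot names here are hypothetical; `fmt` import elided):

```go
// snapshotStore is a stand-in for a ClusterSnapshotStore implementation.
type snapshotStore struct {
	nodeInfos   map[string]*NodeInfo // stored internal NodeInfos
	draSnapshot draLookups           // the conceptual DRA source-of-truth
}

// ForceAddPod must keep the two stores in sync: the DRA snapshot remains the
// conceptual source-of-truth, and the stored internal NodeInfo is updated in
// the same operation, so later reads can return it with no extra logic.
func (s *snapshotStore) ForceAddPod(pod *apiv1.Pod, nodeName string) error {
	nodeInfo, found := s.nodeInfos[nodeName]
	if !found {
		return fmt.Errorf("node %q not found in snapshot", nodeName)
	}
	// The internal PodInfo is allocated once here, at mutation time,
	// instead of on every later read.
	nodeInfo.pods = append(nodeInfo.pods, &PodInfo{
		PodInfo:              &schedulerframework.PodInfo{Pod: pod},
		NeededResourceClaims: s.draSnapshot.ClaimsForPod(pod),
	})
	return nil
}
```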
I couldn't come up with a solution that avoids making the mental model more complex, and at least in the proposed solution the new complexity is limited to the `ClusterSnapshot` and `ClusterSnapshotStore` implementations. The expectations for the rest of the code base shouldn't change.
Additional context:
- #8721 (Add csinode limit awareness in cluster-autoscaler) adds more fields to the CA-internal `NodeInfo`, and a new `CSISnapshot` to `ClusterSnapshotStore` for filling them. It'll probably land before this issue is fixed, so the parts that mention "DRA" in the implementation plan above will become "DRA and CSINode". Both work the same way, so the same solution should be fine.