
CA DRA: remove NodeInfo wrapping #8965

@towca

Description


Which component are you using?

/area cluster-autoscaler
/area core-autoscaler

Problem description:

Before Cluster Autoscaler was integrated with DRA, ClusterSnapshot would directly store the scheduler-framework NodeInfo objects, and CA code would operate on these scheduler-framework objects.

In order to support DRA, CA had to start tracking ResourceSlices and ResourceClaims in addition to just Nodes and Pods. The scheduler NodeInfo type doesn't contain them, and back then it was just a struct, not an interface - so we couldn't directly extend it. Instead, we introduced internal NodeInfo and PodInfo types - wrapping the scheduler structs and adding CA-specific fields on top:

Scheduler NodeInfos are tracked in ClusterSnapshotStore.

ClusterSnapshotStore also contains dynamicresources/snapshot.Snapshot, which tracks the DRA objects.

Whenever scheduler plugins list (the scheduler version of) NodeInfos or DRA objects through the scheduler framework, they are listed directly from ClusterSnapshotStore and dynamicresources/snapshot.Snapshot with no memory allocations.

On the other hand, Cluster Autoscaler code always lists/gets the CA-specific version of NodeInfo. Since parts of the object live in different places (ClusterSnapshotStore and dynamicresources/snapshot.Snapshot), the internal NodeInfos always get computed on-the-fly, wrapping both parts:

func (s *PredicateSnapshot) GetNodeInfo(nodeName string) (*framework.NodeInfo, error) {

This results in allocating memory for the internal NodeInfo every time it's accessed from ClusterSnapshot. After we added parallelism to binpacking in #8729, benchmarking showed that these memory allocations are now a performance bottleneck.

The scheduler-framework NodeInfo was changed to an interface in K8s 1.34, so we can now return the internal NodeInfos directly to scheduler code - as long as they implement the interface. Since the internal NodeInfo wraps the scheduler one, most of the interface methods can just be passed through to the wrapped scheduler struct.

Proposed solution:

High-level overview:

  • Make internal NodeInfo implement the scheduler-framework NodeInfo interface, so that it can be directly returned and used by the scheduler framework.
  • Change the "storage version" in ClusterSnapshotStore from scheduler NodeInfo/PodInfo to CA-internal NodeInfo/PodInfo. Have ClusterSnapshotStore directly provide the CA-internal objects to both CA and scheduler-framework without any wrapping/memory allocations.
  • Modify the ClusterSnapshotStore interface and implementations so that whenever internal NodeInfo/PodInfo objects are allocated in their methods, the CA-specific fields are correctly set from the DRA snapshot before storing.
  • Modify ClusterSnapshot interface and its PredicateSnapshot implementation to delegate NodeInfo getting/listing to the embedded ClusterSnapshotStore.

Detailed plan:

  • Implement the scheduler-framework NodeInfo interface for the CA-internal NodeInfo struct, passing most methods through to the embedded scheduler-framework NodeInfo implementation.
    • Handling PodInfo will be tricky: the internal NodeInfo.Pods() should return a list of internal PodInfo objects without allocating them on the fly, so we can't just pass this method through to the scheduler NodeInfo.Pods(). Instead, we have to embed the scheduler PodInfo in the internal PodInfo, implement the PodInfo interface on it, keep a list of internal PodInfos in the internal NodeInfo, and return that list from the internal NodeInfo.Pods().
  • Modify ClusterSnapshotStore interface and both implementations (BasicSnapshotStore and DeltaSnapshotStore):
    • Store the internal NodeInfos instead of the scheduler version in the implementations. The ClusterSnapshotStore.NodeInfos() method exposing the NodeInfos to scheduler code should keep working as-is, because the internal NodeInfo implements the correct interface.
    • AddSchedulerNodeInfo and RemoveSchedulerNodeInfo should be changed to operate on the internal NodeInfo, and renamed to something like ForceAddNodeInfo/ForceRemoveNodeInfo. The internal NodeInfo provided to ForceAddNodeInfo() should already have all the CA-specific parts filled in, so it can just be stored as-is.
    • Move the logic that allocates internal NodeInfo objects (including all the CA-specific parts) so that it runs once per loop in ClusterSnapshotStore.SetClusterState() instead of on-the-fly in PredicateSnapshot.GetNodeInfo()/ListNodeInfos().
    • Change ForceAddPod() to allocate an internal PodInfo and include all the DRA-related information (taken from the DRA snapshot) before storing it in a NodeInfo.
  • Modify ClusterSnapshot interface and implementation (PredicateSnapshot):
    • Move GetNodeInfo() and ListNodeInfos() methods from ClusterSnapshot to ClusterSnapshotStore. Implement the new methods in both ClusterSnapshotStore implementations by just getting/listing the stored internal NodeInfos without any wrapping.
    • Change AddNodeInfo() to call the new ClusterSnapshotStore.ForceAddNodeInfo() and pass the whole internal NodeInfo to it.
    • Change RemoveNodeInfo to call the new ClusterSnapshotStore.ForceRemoveNodeInfo().

Mental model changes

Currently, a good mental model for Cluster Snapshot and DRA is something like this:

  • ClusterSnapshotStore is the single storage source-of-truth for Nodes and Pods.
  • ClusterSnapshotStore.DraSnapshot() is the single storage source-of-truth for DRA objects.
  • ClusterSnapshot.GetNodeInfo()/ListNodeInfos() correlates Nodes and Pods with the DRA objects, and returns CA-internal NodeInfo objects combining both. The result is "ephemeral": read-only, only valid until the next ClusterSnapshot mutation, and shouldn't be stored.
  • Any time ClusterSnapshot is mutated, ClusterSnapshotStore and the embedded ClusterSnapshotStore.DraSnapshot() are mutated independently. The previous results of GetNodeInfo()/ListNodeInfos() are now invalid, so we don't care about "synchronizing" them with the mutation. For each tracked object, we only change the single source of truth it's stored in.

The proposed changes make the mental model more complex:

  • ClusterSnapshotStore.DraSnapshot() still stores the DRA objects and is still the source-of-truth conceptually, but it's not the only place where the DRA objects are stored.
  • ClusterSnapshotStore stores fully-formed CA-internal NodeInfo objects - so Nodes, Pods, and DRA objects. It's still the single storage source-of-truth for Nodes and Pods.
  • DRA objects are stored in 2 places now - ClusterSnapshotStore.DraSnapshot(), and the CA-internal NodeInfos tracked in ClusterSnapshotStore. We can't get away from having a separate DRA snapshot because there are objects to track that aren't bound to a particular Node or Pod. We also can't get away from having the DRA objects in the internal NodeInfos stored by ClusterSnapshotStore - either they're persisted there, or we're back to allocating new NodeInfo objects on the fly.
  • We still don't care about synchronizing mutations to the ephemeral results of GetNodeInfo()/ListNodeInfos(), but we now need to synchronize any mutations between ClusterSnapshotStore.DraSnapshot() and the CA-internal NodeInfos tracked in ClusterSnapshotStore. In return, any time we get a NodeInfo from ClusterSnapshotStore, we can just return it with no extra logic.

I couldn't come up with a solution that doesn't make the mental model more complex at all, and at least in the proposed solution the new complexity is limited to ClusterSnapshot and ClusterSnapshotStore implementations. The expectations for the rest of the code base shouldn't change.

Additional context:

  • Add csinode limit awareness in cluster-autoscaler #8721 adds more fields to the CA-internal NodeInfo, and a new CSISnapshot to ClusterSnapshotStore for filling them. It'll probably land before this issue is fixed, so the parts of the implementation plan above that mention "DRA" will become "DRA and CSINode". Both work the same way, so the same solution should apply.

Labels

area/cluster-autoscaler
area/core-autoscaler - denotes an issue that is related to the core autoscaler and is not specific to any provider.
kind/feature - categorizes issue or PR as related to a new feature.
