Which component are you using?
/area cluster-autoscaler
/area core-autoscaler
Problem description:
Before Cluster Autoscaler was integrated with DRA, `ClusterSnapshot` would directly store the scheduler-framework `NodeInfo` objects, and CA code would operate on these scheduler-framework objects.
In order to support DRA, CA had to start tracking ResourceSlices and ResourceClaims in addition to just Nodes and Pods. The scheduler `NodeInfo` type doesn't contain them, and back then it was just a struct, not an interface - so we couldn't directly extend it. Instead, we introduced internal `NodeInfo` and `PodInfo` types - wrapping the scheduler structs, and adding CA-specific fields on top:
```go
type NodeInfo struct {
	// ... (truncated in the original snippet embed)
}
```
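For context, the wrapper types look roughly like this - a hedged sketch with simplified fields, not the exact definitions from `cluster-autoscaler/simulator/framework` (the DRA API import version may also differ):

```go
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// PodInfo wraps a Pod and adds the DRA objects CA needs to track for it.
type PodInfo struct {
	Pod *apiv1.Pod
	// NeededResourceClaims are the ResourceClaims referenced by the Pod.
	NeededResourceClaims []*resourceapi.ResourceClaim
}

// NodeInfo wraps the scheduler-framework NodeInfo and adds CA-specific fields.
type NodeInfo struct {
	*schedulerframework.NodeInfo

	// LocalResourceSlices are the node-local ResourceSlices exposed by the Node.
	LocalResourceSlices []*resourceapi.ResourceSlice
	// Pods shadows the embedded scheduler Pods field with the internal PodInfos
	// (in this simplified sketch; the real type exposes them differently).
	Pods []*PodInfo
}
```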
Scheduler `NodeInfo`s are tracked in `ClusterSnapshotStore`:

```go
type DeltaSnapshotStore struct {
	// ... (truncated in the original snippet embed)
}
```
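Roughly, the delta store layers copy-on-write modifications over a base snapshot, so `Fork()`/`Commit()`/`Revert()` during simulations stay cheap. A hedged sketch - the field names are illustrative, not the exact `DeltaSnapshotStore` internals:

```go
package store

import (
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// deltaLayer is one copy-on-write layer; reads fall through to baseData when
// a node isn't overridden in this layer.
type deltaLayer struct {
	baseData         *deltaLayer
	addedNodeInfoMap map[string]*schedulerframework.NodeInfo
	deletedNodeInfos map[string]bool
}

// DeltaSnapshotStore keeps the top of the layer stack: Fork() pushes a new
// layer, Revert() pops it, Commit() squashes it into the base.
type DeltaSnapshotStore struct {
	data *deltaLayer
}
```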
`ClusterSnapshotStore` also contains `dynamicresources/snapshot.Snapshot`, which tracks the DRA objects:

```go
type Snapshot struct {
	// ... (truncated in the original snippet embed)
}
```
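What the DRA snapshot tracks, roughly - a hedged sketch, with illustrative field names rather than the exact `dynamicresources/snapshot` definitions. Note that some of the tracked objects aren't bound to any particular Node or Pod, which matters later in this proposal:

```go
package snapshot

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// Snapshot tracks the DRA objects; field names are illustrative.
type Snapshot struct {
	resourceClaims           map[types.NamespacedName]*resourceapi.ResourceClaim
	resourceSlicesByNodeName map[string][]*resourceapi.ResourceSlice
	// Some tracked objects aren't bound to any Node or Pod:
	nonNodeLocalResourceSlices []*resourceapi.ResourceSlice
	deviceClasses              map[string]*resourceapi.DeviceClass
}
```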
Whenever scheduler plugins list (the scheduler version of) `NodeInfo`s or DRA objects through the scheduler framework, they are listed directly from `ClusterSnapshotStore` and `dynamicresources/snapshot.Snapshot`, with no memory allocations.
Cluster Autoscaler code, on the other hand, always lists/gets the CA-specific version of `NodeInfo`. Since parts of the object live in different places (`ClusterSnapshotStore` and `dynamicresources/snapshot.Snapshot`), the internal `NodeInfo`s always get computed on the fly, wrapping both parts:
autoscaler/cluster-autoscaler/simulator/clustersnapshot/predicate/predicate_snapshot.go (line 54 at commit 91bae6b):

```go
func (s *PredicateSnapshot) GetNodeInfo(nodeName string) (*framework.NodeInfo, error) {
	// ... (truncated in the original snippet embed)
}
```
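Continuing the sketch above, the on-the-fly wrapping looks roughly like this. The DRA-snapshot accessors (`ResourceSlicesForNode`, `ClaimsForPod`) are hypothetical stand-ins, not the real `dynamicresources/snapshot` API:

```go
// draLookups is a stand-in for the DRA snapshot's read API; the method names
// are hypothetical.
type draLookups interface {
	ResourceSlicesForNode(nodeName string) []*resourceapi.ResourceSlice
	ClaimsForPod(pod *apiv1.Pod) []*resourceapi.ResourceClaim
}

// wrapSchedulerNodeInfo correlates a stored scheduler NodeInfo with the DRA
// snapshot. A fresh internal NodeInfo, plus one internal PodInfo per pod, is
// allocated on every call - the allocation hot spot described below.
func wrapSchedulerNodeInfo(schedNodeInfo *schedulerframework.NodeInfo, dra draLookups) *NodeInfo {
	nodeInfo := &NodeInfo{
		NodeInfo:            schedNodeInfo,
		LocalResourceSlices: dra.ResourceSlicesForNode(schedNodeInfo.Node().Name),
	}
	for _, schedPodInfo := range schedNodeInfo.Pods {
		nodeInfo.Pods = append(nodeInfo.Pods, &PodInfo{
			Pod:                  schedPodInfo.Pod,
			NeededResourceClaims: dra.ClaimsForPod(schedPodInfo.Pod),
		})
	}
	return nodeInfo
}
```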
This results in allocating memory for the internal `NodeInfo` every time it's accessed from `ClusterSnapshot`. After we added parallelism to binpacking in #8729, benchmarking showed that these memory allocations are now a performance bottleneck.
The scheduler-framework `NodeInfo` was changed to an interface in K8s 1.34, so we can now return the internal `NodeInfo`s directly to scheduler code - as long as they implement the interface. Since the internal `NodeInfo` wraps the scheduler one, most of the interface methods can just be passed through to the wrapped scheduler struct.
Proposed solution:
High-level overview:
- Make the internal `NodeInfo` implement the scheduler-framework `NodeInfo` interface, so that it can be directly returned and used by the scheduler framework.
- Change the "storage version" in `ClusterSnapshotStore` from the scheduler `NodeInfo`/`PodInfo` to the CA-internal `NodeInfo`/`PodInfo`. Have `ClusterSnapshotStore` directly provide the CA-internal objects to both CA and scheduler-framework code without any wrapping/memory allocations.
- Modify the `ClusterSnapshotStore` interface and implementations so that any time internal `NodeInfo`/`PodInfo` objects are allocated during its methods, they have the CA-specific fields correctly set from the DRA snapshot before being stored.
- Modify the `ClusterSnapshot` interface and its `PredicateSnapshot` implementation to delegate `NodeInfo` getting/listing to the embedded `ClusterSnapshotStore`.
Detailed plan:
- Implement the scheduler-framework `NodeInfo` interface for the CA-internal `NodeInfo` struct, passing most methods through to the embedded scheduler-framework `NodeInfo` implementation (see the sketch after this list).
  - Handling `PodInfo` will be tricky - the internal `NodeInfo.Pods()` should return a list of internal `PodInfo` objects without allocating them on the fly, so we can't just pass this method through to the scheduler `NodeInfo.Pods()`. We have to actually embed the scheduler `PodInfo` in the internal `PodInfo` and implement the `PodInfo` interface, keep a list of internal `PodInfo`s in the internal `NodeInfo`, and return that from the internal `NodeInfo.Pods()`.
- Modify the `ClusterSnapshotStore` interface and both implementations (`BasicSnapshotStore` and `DeltaSnapshotStore`):
  - Store the internal `NodeInfo`s instead of the scheduler version in the implementations. The `ClusterSnapshotStore.NodeInfos()` method exposing the `NodeInfo`s to scheduler code should keep working as-is, because the internal `NodeInfo` implements the correct interface. `AddSchedulerNodeInfo` and `RemoveSchedulerNodeInfo` should be changed to operate on the internal `NodeInfo`, and renamed to something like `ForceAddNodeInfo`/`ForceRemoveNodeInfo`. The internal `NodeInfo` provided to `ForceAddNodeInfo()` should already have all the CA-specific parts filled in, so it can just be stored as-is.
  - Move the logic that allocates internal `NodeInfo` objects, including all the CA-specific parts, from on-the-fly in `PredicateSnapshot.GetNodeInfo()`/`ListNodeInfos()` to once-per-loop in `ClusterSnapshotStore.SetClusterState()`.
  - Change `ForceAddPod()` to allocate an internal `PodInfo` and include all the DRA-related information (taken from the DRA snapshot) before storing it in a `NodeInfo`.
- Modify the `ClusterSnapshot` interface and implementation (`PredicateSnapshot`):
  - Move the `GetNodeInfo()` and `ListNodeInfos()` methods from `ClusterSnapshot` to `ClusterSnapshotStore`. Implement the new methods in both `ClusterSnapshotStore` implementations by just getting/listing the stored internal `NodeInfo`s without any wrapping.
  - Change `AddNodeInfo()` to call the new `ClusterSnapshotStore.ForceAddNodeInfo()` and pass the whole internal `NodeInfo` to it.
  - Change `RemoveNodeInfo()` to call the new `ClusterSnapshotStore.ForceRemoveNodeInfo()`.
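To make the first step concrete, here is a minimal sketch of the target shape. The two interfaces below are illustrative stand-ins for the real scheduler-framework interfaces from `k8s.io/kube-scheduler/framework` - the real method sets are much larger, but every extra method would pass through to the embedded scheduler struct the same way. The unexported `pods` slice replaces the exported `Pods` field from the earlier sketch:

```go
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// Illustrative stand-ins for the scheduler-framework interfaces; not the
// actual k8s.io/kube-scheduler/framework definitions.
type SchedulerPodInfo interface {
	GetPod() *apiv1.Pod
}

type SchedulerNodeInfo interface {
	Node() *apiv1.Node
	GetPods() []SchedulerPodInfo
}

// PodInfo now embeds the scheduler PodInfo and implements the interface.
type PodInfo struct {
	*schedulerframework.PodInfo
	NeededResourceClaims []*resourceapi.ResourceClaim
}

func (p *PodInfo) GetPod() *apiv1.Pod { return p.Pod }

// NodeInfo embeds the scheduler NodeInfo, which provides Node() and most
// other interface methods via plain pass-through.
type NodeInfo struct {
	*schedulerframework.NodeInfo

	LocalResourceSlices []*resourceapi.ResourceSlice
	// pods is kept pre-converted to the interface-typed slice, so GetPods()
	// can return it without allocating anything on the fly.
	pods []SchedulerPodInfo
}

func (n *NodeInfo) GetPods() []SchedulerPodInfo { return n.pods }

// Compile-time check against the (stand-in) interface.
var _ SchedulerNodeInfo = &NodeInfo{}
```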
Mental model changes
Currently, a good mental model for ClusterSnapshot and DRA is something like this:
- `ClusterSnapshotStore` is the single storage source-of-truth for Nodes and Pods.
- `ClusterSnapshotStore.DraSnapshot()` is the single storage source-of-truth for DRA objects.
- `ClusterSnapshot.GetNodeInfo()`/`ListNodeInfos()` correlates Nodes and Pods with the DRA objects, and returns CA-internal `NodeInfo` objects combining both. The result is "ephemeral": read-only, only valid until the next `ClusterSnapshot` mutation, and shouldn't be stored.
- Any time `ClusterSnapshot` is mutated, `ClusterSnapshotStore` and the embedded `ClusterSnapshotStore.DraSnapshot()` are mutated independently. The previous results of `GetNodeInfo()`/`ListNodeInfos()` are now invalid, so we don't care about "synchronizing" them with the mutation. For each tracked object, we only change the single source of truth it's stored in.
The proposed changes make the mental model more complex:
- `ClusterSnapshotStore.DraSnapshot()` still stores the DRA objects and is conceptually still the source-of-truth, but it's no longer the only place where the DRA objects are stored.
- `ClusterSnapshotStore` stores fully-formed CA-internal `NodeInfo` objects - so Nodes, Pods, and DRA objects. It's still the single storage source-of-truth for Nodes and Pods.
- DRA objects are now stored in 2 places - `ClusterSnapshotStore.DraSnapshot()`, and the CA-internal `NodeInfo`s tracked in `ClusterSnapshotStore`. We can't get away from having a separate DRA snapshot, because there are objects to track that aren't bound to a particular Node or Pod. We also can't get away from having the DRA objects in the internal `NodeInfo`s stored by `ClusterSnapshotStore` - either they're persisted there, or we're back to allocating new `NodeInfo` objects on the fly.
- We still don't care about synchronizing mutations to the ephemeral results of `GetNodeInfo()`/`ListNodeInfos()`, but we now need to synchronize any mutations between `ClusterSnapshotStore.DraSnapshot()` and the CA-internal `NodeInfo`s tracked in `ClusterSnapshotStore` (see the sketch after this list). In return, any time we get a `NodeInfo` from `ClusterSnapshotStore`, we can just return it with no extra logic.
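For instance, a pod-adding mutation now has to touch both places in one operation. A conceptual sketch, reusing the stand-in types from the sketches above (all store and DRA-snapshot names here are hypothetical; `fmt` import elided):

```go
// snapshotStore is a stand-in for a ClusterSnapshotStore implementation.
type snapshotStore struct {
	nodeInfos   map[string]*NodeInfo // stored internal NodeInfos
	draSnapshot draLookups           // the conceptual DRA source-of-truth
}

// ForceAddPod must keep the two stores in sync: the DRA snapshot remains the
// conceptual source-of-truth, and the stored internal NodeInfo is updated in
// the same operation, so later reads can return it with no extra logic.
func (s *snapshotStore) ForceAddPod(pod *apiv1.Pod, nodeName string) error {
	nodeInfo, found := s.nodeInfos[nodeName]
	if !found {
		return fmt.Errorf("node %q not found in snapshot", nodeName)
	}
	// The internal PodInfo is allocated once here, at mutation time,
	// instead of on every later read.
	nodeInfo.pods = append(nodeInfo.pods, &PodInfo{
		PodInfo:              &schedulerframework.PodInfo{Pod: pod},
		NeededResourceClaims: s.draSnapshot.ClaimsForPod(pod),
	})
	return nil
}
```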
I couldn't come up with a solution that avoids making the mental model more complex, and at least in the proposed solution the new complexity is limited to the `ClusterSnapshot` and `ClusterSnapshotStore` implementations. The expectations for the rest of the code base shouldn't change.
Additional context:
- #8721 (Add csinode limit awareness in cluster-autoscaler) adds more fields to the CA-internal `NodeInfo`, and a new `CSISnapshot` to `ClusterSnapshotStore` for filling them. It'll probably land before this issue is fixed, so the parts that mention "DRA" in the implementation plan above will become "DRA and CSINode". Both work the same way, so the same solution should be fine.