Management_Benchmark by ashishsuneja · Pull Request #6690 · GoogleCloudPlatform/PerfKitBenchmarker

ashishsuneja · 2026-05-22T16:24:32Z

Implements the GKE Management Plane Operations benchmark across EKS, GKE, and AKS. Three scenarios are covered: Scenario A — 100 pools created, upgraded (N-1→N minor version), and deleted concurrently across 3 AZs; Scenario B — a NodePool Create fired 3 seconds into an ongoing ClusterUpdate to measure control-plane serialization behavior; Scenario C — 100 pools streamed continuously (max 50 at a time) to measure large-scale provisioning throughput.
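As a rough illustration of the "max 50 at a time" streaming in Scenario C, a bounded thread pool keeps a fixed number of operations in flight and starts the next one as soon as a slot frees up. This is a minimal sketch under stated assumptions, not the PR's code: `stream_pools` and `create_pool` are hypothetical names, and the sleep stands in for a real provider call.

```python
import concurrent.futures
import threading
import time


def stream_pools(total: int, max_in_flight: int) -> int:
  """Runs `total` fake pool creations, at most `max_in_flight` at once.

  Returns the highest concurrency actually observed.
  """
  lock = threading.Lock()
  in_flight = 0
  peak = 0

  def create_pool(index: int) -> None:
    # Hypothetical stand-in for a real node-pool create call.
    nonlocal in_flight, peak
    with lock:
      in_flight += 1
      peak = max(peak, in_flight)
    time.sleep(0.01)  # Simulated provisioning latency.
    with lock:
      in_flight -= 1

  with concurrent.futures.ThreadPoolExecutor(max_workers=max_in_flight) as ex:
    # list() forces iteration so any worker exception is re-raised here.
    list(ex.map(create_pool, range(total)))
  return peak
```

Calling `stream_pools(100, 50)` would mirror the numbers in the description; the pool size is the only knob that bounds concurrency.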

- Rename k8s_management_benchmark.py to kubernetes_management_benchmark.py
- elastic_kubernetes_service.py: add capacityReservationSpecification with preference 'open' to target EC2 capacity reservations in CreateNodePoolAsync. This ensures reserved t3.medium instances are used by EKS nodegroups instead of competing for on-demand capacity.

google-cla · 2026-05-22T16:24:55Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

mahesh8842 · 2026-05-26T09:41:33Z

+  sorted_lats = sorted(latencies)
+  meta = {'sample_count': str(n)}
+
+  def _Percentile(p):


Can we use statistics.quantiles() instead of a custom percentile implementation here?

Hi @mahesh8842, thank you for pointing this out. I've updated the code with your suggestion.
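For reference, a minimal sketch of what the `statistics.quantiles()` approach might look like. The function name `percentile_metadata`, the choice of p50/p90/p99, and the `inclusive` method are assumptions for illustration; the PR's actual metadata keys are not shown in this thread.

```python
import statistics


def percentile_metadata(latencies: list[float]) -> dict[str, str]:
  """Returns p50/p90/p99 of `latencies` as string metadata."""
  if len(latencies) < 2:
    raise ValueError('need at least two data points to compute quantiles')
  # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
  cuts = statistics.quantiles(latencies, n=100, method='inclusive')
  return {
      'sample_count': str(len(latencies)),
      'p50': str(cuts[49]),
      'p90': str(cuts[89]),
      'p99': str(cuts[98]),
  }
```

The `inclusive` method treats the data as covering the full population range, so no percentile falls outside the observed minimum and maximum; the default `exclusive` method can extrapolate beyond them on small samples.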

mahesh8842 · 2026-05-26T09:44:44Z

+
+  def __init__(self):
+    self._lock = threading.Lock()
+    self.entries: list[tuple[str, float, float, Exception | None]] = []


Use a dataclass or a named tuple to improve readability.
This allows access to fields by name instead of by index.

@dataclasses.dataclass
class entry:
  name: str
  init_dur: float
  e2e_dur: float
  error: Exception | None = None

Hi @mahesh8842 , updated the suggested changes in https://github.com/ashishsuneja/PerfKitBenchmarker/blob/971455276f7730702d7fbd73f89421c26a8356f5/perfkitbenchmarker/linux_benchmarks/kubernetes_management_benchmark.py#L152

mahesh8842 · 2026-05-26T09:46:58Z

-  # /19 is narrowest CIDR range GKE supports
-  return min(cidr_size, 19)
+  # /17 is narrowest CIDR range GKE supports
+  return min(cidr_size, 16)


According to the comment on line 58, /17 is the narrowest supported range, but the code on line 59 uses 16.

Please check and keep the comment and implementation in sync.

@mahesh8842 Updated the value in the comment.

mahesh8842 · 2026-05-26T09:50:07Z

+      try:
+        yield importlib.import_module(modname)
+      except Exception as e:  # pylint: disable=broad-except
+        logging.warning('Skipping module %s due to import error: %s', modname, e)


Please prefer logging.exception over logging.warning here.

logging.exception captures the full stack trace.

Hi @mahesh8842, please ignore this one; I reverted the changes.
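Although this change was reverted, the difference the reviewer describes can be sketched as follows. `try_import` is a hypothetical helper, not a function from the PR.

```python
import importlib
import logging


def try_import(modname: str):
  """Imports `modname`; on failure logs the full traceback and returns None."""
  try:
    return importlib.import_module(modname)
  except Exception:  # pylint: disable=broad-except
    # logging.exception logs at ERROR level and appends the traceback,
    # whereas logging.warning would record only the formatted message.
    logging.exception('Skipping module %s due to import error', modname)
    return None
```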

mahesh8842 · 2026-05-26T10:07:35Z

+  e2e_lat  = total wall time including wait. On kickoff failure both are set
+             to elapsed time at failure point.
+  """
+  init_start = time.time()


Prefer time.monotonic() over time.time().

At line 613:

init_dur = time.monotonic() - init_start

Hi @mahesh8842, thanks. Updated with the suggested changes: https://github.com/ashishsuneja/PerfKitBenchmarker/blob/971455276f7730702d7fbd73f89421c26a8356f5/perfkitbenchmarker/linux_benchmarks/kubernetes_management_benchmark.py#L615
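The reason for the suggestion is that `time.time()` follows the wall clock, which can jump when the system clock is adjusted, while `time.monotonic()` never goes backwards. A minimal sketch of timing an operation this way, where `timed` is a hypothetical helper:

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


def timed(fn: Callable[[], T]) -> tuple[T, float]:
  """Runs `fn` and returns (result, elapsed_seconds)."""
  start = time.monotonic()
  result = fn()
  # A difference of two monotonic readings is never negative.
  return result, time.monotonic() - start
```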

mahesh8842 · 2026-05-26T10:08:48Z

+def _OpSamples(
+    metric_prefix: str,
+    results: list[tuple[str, float, float, Exception | None]],
+    attempted_ops: int = None,


int | None = None

Hi @mahesh8842, updated with the suggested changes: https://github.com/ashishsuneja/PerfKitBenchmarker/blob/971455276f7730702d7fbd73f89421c26a8356f5/perfkitbenchmarker/linux_benchmarks/kubernetes_management_benchmark.py#L672

mahesh8842 · 2026-05-26T10:13:44Z

+  def add(self, name: str, init_dur: float, e2e_dur: float,
+          err: Exception | None) -> None:
+    with self._lock:
+      self.entries.append((name, init_dur, e2e_dur, err))


You can use the same dataclass here:

result = entry(name, init_dur, e2e_dur, err)
with self._lock:
  self.entries.append(result)

Hi @mahesh8842 , Updated the changes.

hubatish · 2026-05-26T17:23:34Z

+  )
+  if rc:
+    logging.warning(
+        'Sleep workload deploy returned rc=%d (non-fatal; continuing)', rc)


Generally we should just fail rather than silently continue & potentially have bad results. There are perhaps some errors/warnings which are truly ignorable, but that should be the minority. If there are some that are worth ignoring, I'd also expect those to have something like:

out, err, code = IssueCommand..
if code:
  if 'expected but ignorable warning message' in err:
    log "this is fine"
    return
  else:
    raise

Hi @hubatish , Updated code to fail instead of warning

Great. Expand to pretty much all uses of logging.warning in providers as well.
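A runnable version of the pattern described above might look like this. It is a sketch only: `check_command_result` and its `ignorable` parameter are hypothetical, and the function takes the stderr text and return code directly rather than assuming anything about the command helper's signature.

```python
import logging


def check_command_result(
    stderr: str, retcode: int, ignorable: tuple[str, ...] = ()
) -> None:
  """Raises on a nonzero return code unless stderr matches a known warning."""
  if retcode == 0:
    return
  for message in ignorable:
    if message in stderr:
      # Explicitly allow-listed failure: record it and carry on.
      logging.info('Ignoring expected warning: %s', message)
      return
  # Anything else fails the run instead of producing suspect results.
  raise RuntimeError(f'Command failed with rc={retcode}: {stderr.strip()}')
```

Keeping the allow-list explicit means new failure modes surface as errors by default, rather than being swallowed by a blanket warning.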

hubatish · 2026-05-26T17:27:02Z

-      self.event_poller.StartPolling()
+      try:
+        self.event_poller.StartPolling()
+      except Exception as exc:  # pylint: disable=broad-except


Are you encountering this error when using eg run_stage=provision,prepare --run_uri=foo & then later --run_stage=run --run_uri=foo on the same cluster, or just every time when running any benchmark in py 3.14? Either way please separate into a different PR & raise the issue in chat as we will likely need a good solution here.

Hi @hubatish , Reverted the changes. We will review again and raise in separate PR

hubatish · 2026-05-26T17:28:46Z

+    """Initiates node-pool delete; returns opaque op handle. Does NOT wait."""
+    raise NotImplementedError
+
+  def UpdateClusterAsync(self) -> str:


Why do we have sync vs async variants? Is this due to cloud variance? We shouldn't have both - stick to one pattern or the other. If eg all clouds don't support async (you were talking about a GKE bug here), then perhaps just go with everything using a sync variant.

Hi @hubatish, the sync/async split is intentional, to support the concurrent and overlapping scenarios. The benchmark's core scenarios require concurrency (Scenario A fires N creates simultaneously, Scenario B overlaps a create with a cluster update), which isn't possible with blocking calls. The sync variants (CreateNodePool, DeleteNodePool) are kept only for non-timed setup/teardown paths (Prepare, _CleanStartSweep) where blocking is correct.

UpdateCluster (sync) appears to be unused.

Ok - sort of makes sense. However, I took a look at the provider specific implementations & had a hard time telling what exactly was different. What is different about them currently?
/ actual comment/change: the sync & async versions should share most of their code. A simple way to do this might be to always call the async version, but then wait for the result in the sync version. With a nice, sufficiently abstract presentation, I bet you could do this waiting for the async version from the sync in the parent class. Cleanly that might look like:

def UpdateClusterSync(self):
  op = self.UpdateClusterAsync()
  self.WaitForOperation(op)

Also consider moving _RunAsync & _TimedAsync to kubernetes_cluster (unsure of this one).
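To make the suggestion concrete, here is a minimal sketch of a parent class whose blocking method is built on the async one, so providers implement only the async call and the wait. `BaseCluster` and `FakeCluster` are hypothetical classes for illustration, and the `WaitForOperation` signature is an assumption taken from the snippet above, not from the PR's code.

```python
import abc


class BaseCluster(abc.ABC):
  """Parent class: the sync variant is derived from the async one."""

  @abc.abstractmethod
  def UpdateClusterAsync(self) -> str:
    """Starts a cluster update and returns an opaque operation handle."""

  @abc.abstractmethod
  def WaitForOperation(self, op: str) -> None:
    """Blocks until the operation identified by `op` completes."""

  def UpdateCluster(self) -> None:
    # Shared by every provider: kick off the async call, then wait on it.
    op = self.UpdateClusterAsync()
    self.WaitForOperation(op)


class FakeCluster(BaseCluster):
  """Toy provider that records calls instead of talking to a cloud."""

  def __init__(self):
    self.calls: list[str] = []

  def UpdateClusterAsync(self) -> str:
    self.calls.append('start')
    return 'op-1'

  def WaitForOperation(self, op: str) -> None:
    self.calls.append(f'wait:{op}')
```

With this shape, the timed scenarios call the async method directly, while setup and teardown paths use the inherited blocking method without any provider-specific duplication.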

hubatish · 2026-05-26T17:29:48Z

  ):
    if not is_pkg:
-      yield importlib.import_module(modname)
+      try:


yeah you just shouldn't be causing errors in nor editing this file. ie I'm not sure why you need this but you should fix in some other fashion.

Hi @hubatish , Reverted the changes.

hubatish · 2026-05-26T17:59:00Z

    """Adds an additional nodepool with the given name to the cluster."""
    pass

+  def CreateNodePool(


We already have AddNodepool which is used by kubernetes nodepool provisioning. Why do we need this one too? This quite possibly might be the better implementation, but please refactor the old one to use the new one as well. I can send a sample provision node pools benchmark run command for you to test with.

Hi @hubatish, AddNodepool delegates to CreateNodePool, so CreateNodePool will be the single source of truth for standard clusters. Please let me know if this is what you expected.

There are client-specific implementations for Azure / EKS. Please remove those in favor of just calling the CreateNodepool implementation as done here & give it a test. + maybe update the provision node pool benchmark. It looks like the naming wrapper here should work, but tbh that benchmark is the only caller & using it should work. Also the "BaseNodePoolConfig" used here with only a name is certainly wrong - again look at the code in provision_node_pools_benchmark.py & update it to work with this logic instead.

Here are some sample args to try with:

--cloud=AWS --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Karpenter'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.AWS.machine_type='"'"'m6i.xlarge'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.AWS.machine_type='"'"'m6i.xlarge'"'"'' --metadata=cloud:AWS --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=us-east-1a --timeout_minutes=236

--cloud=Azure --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Kubernetes'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.Azure.machine_type='"'"'Standard_D4s_v3'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.Azure.machine_type='"'"'Standard_D4s_v3'"'"'' --metadata=cloud:Azure --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=eastus2-1 --timeout_minutes=236

--cloud=GCP --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Kubernetes'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.GCP.machine_type='"'"'c4-standard-4'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.GCP.machine_type='"'"'c4-standard-4'"'"'' --metadata=cloud:GCP --project=p3rf-gke --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=europe-west4-a --timeout_minutes=236

cagataygurturk · 2026-05-27T08:07:21Z

+          ],
+          raise_on_failure=False,
+      )
+      self.cluster_version = ver_out.strip() if ver_rc == 0 and ver_out.strip() else '1.34'


Hardcoding 1.34 will quickly age out once 1.34 is EOL. Maybe throw an exception instead of silently setting 1.34?

Fixed. The code now raises a CreationError if the cluster version can't be determined from describe-cluster, instead of silently falling back to '1.34'.
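The fix described here replaces a silent default with an explicit failure. A minimal sketch of that shape, assuming the command's stdout and return code are available as plain values; `parse_cluster_version` is a hypothetical helper and the `CreationError` class below is a local stand-in for whichever error type the PR raises.

```python
class CreationError(Exception):
  """Local stand-in for the error type raised when creation fails."""


def parse_cluster_version(stdout: str, retcode: int) -> str:
  """Returns the reported cluster version, or raises if it is unavailable."""
  version = stdout.strip()
  if retcode != 0 or not version:
    # No hardcoded fallback: a stale default would go unnoticed once
    # that version reaches end of life.
    raise CreationError(
        f'Could not determine cluster version (rc={retcode}).'
    )
  return version
```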

cagataygurturk · 2026-05-27T08:19:42Z

+    # Reserve enough capacity per AZ for 100 pools:
+    # ~67 pools per AZ × 2 nodes = 134 instances max per AZ (Scenario A)
+    # Plus default nodegroup (2) + buffer = 80 minimum for 10 pools, 150 for 100 pools
+    concurrent = getattr(FLAGS, 'k8s_mgmt_concurrent_nodepools', 10)


This is in EksCluster._Create, which is the default EKS class. Every benchmark that runs on AWS with a Kubernetes cluster goes through here, not just k8s_management. So all the new behavior (capacity reservations per AZ, launch templates, the forced 3-AZ subnet layout, and the default nodegroup getting pinned to control_plane_zones[0] over in _RenderNodeGroupJson) gets applied to every EKS run.

A few things that make this concretely bad for unrelated benchmarks:

The reservations are hardcoded to t3.medium. Something like kubernetes_nginx uses m6i.xlarge, so its nodegroups won't consume the reservation at all, and the user just gets billed for reserved capacity they can't use, plus the on-demand they actually wanted. Same story for most of the other AWS k8s benchmarks (kubernetes_redis_memtier, kubernetes_mongodb_ycsb, container_netperf, etc.). I count ~14 of them that hit this path.

The getattr(FLAGS, 'k8s_mgmt_concurrent_nodepools', 10) on this line is a tell that this code knows it's in the wrong place. That flag only exists when your benchmark module is loaded; for any other benchmark you silently fall back to 10 and still create 80 reservations × 3 AZs per run.

Pinning the default nodegroup to a single AZ is also a quiet behavior change. Anything that relied on multi-AZ default placement (HA, anti-affinity, latency tests) loses it without warning.

Could you gate this? Either an eks_reserve_capacity_per_az flag in providers/aws/flags.py defaulted off, or move the reservation/launch-template setup into your benchmark's Prepare() and expose a small helper on EksCluster for it. The bar I'd want to hit before this lands: running provision_node_pools or kubernetes_nginx on EKS produces the same AWS resources as on master, with no extra reservations, no extra launch templates, and default nodegroup placement unchanged.

I've addressed this by:

Adding a new --eks_reserve_capacity_per_az flag in providers/aws/flags.py (default: False).

Gating all capacity reservation creation/cleanup and launch template lookup in _Create, _Delete, CreateNodePoolAsync, and UpgradeNodePoolAsync behind this flag.

When the flag is False (the default), the code path is identical to master: no reservations created, no launch templates, no extra API calls, and default nodegroup placement unchanged.

The kubernetes_management benchmark explicitly passes --eks_reserve_capacity_per_az=true to opt in.
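The gating described above can be pictured with the following sketch. It deliberately avoids the real flag machinery: `EksOptions`, `plan_create_steps`, and the step names are all hypothetical, and a plain boolean stands in for the --eks_reserve_capacity_per_az flag.

```python
import dataclasses


@dataclasses.dataclass
class EksOptions:
  # Stand-in for the opt-in flag; off by default so other benchmarks
  # see no behavior change.
  reserve_capacity_per_az: bool = False


def plan_create_steps(options: EksOptions, zones: list[str]) -> list[str]:
  """Returns the ordered resource-creation steps for a cluster."""
  steps = ['create-cluster', 'create-default-nodegroup']
  if options.reserve_capacity_per_az:
    # Only opted-in runs pay for the extra per-AZ reservations.
    steps = [f'reserve-capacity:{zone}' for zone in zones] + steps
  return steps
```

The property the reviewer asked for corresponds to the default case: with the option off, the step list is exactly the baseline.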

hubatish · 2026-05-27T17:12:19Z

+)
+_SCENARIOS = flags.DEFINE_list(
+    'k8s_mgmt_scenarios',
+    ['A', 'B', 'C'],


These should have helpful names and should probably use DEFINE_enum.
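One way to picture "helpful names" is an enumeration whose values describe what each scenario does, so an invalid value is rejected at parse time. The sketch below uses only the standard library; the member names and string values are hypothetical suggestions, not names from the PR.

```python
import enum


class Scenario(enum.Enum):
  """Descriptive replacements for the scenario letters A, B, and C."""
  CONCURRENT_LIFECYCLE = 'concurrent_lifecycle'  # Was scenario A.
  CREATE_DURING_CLUSTER_UPDATE = 'create_during_cluster_update'  # Was B.
  STREAMING_PROVISIONING = 'streaming_provisioning'  # Was C.


def parse_scenarios(values: list[str]) -> list[Scenario]:
  """Converts raw flag strings to Scenario members.

  Raises ValueError for any value that is not a known scenario.
  """
  return [Scenario(value) for value in values]
```

An enum-style flag definition gives the same validation at the command line, with the allowed values listed in the flag's help text.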

hubatish · 2026-05-27T17:14:12Z

      result['gce_local_ssd_count'] = self.default_nodepool.max_local_disks
      result['gce_local_ssd_interface'] = self.default_nodepool.ssd_interface
    result['gke_nccl_fast_socket'] = self.enable_nccl_fast_socket
+    if 'nccl' in self.nodepools:


I think these got synced out (sync to head & they're no longer there).

hubatish · 2026-05-27T17:22:57Z

+# The kubernetes_management benchmark does not use persistent volumes, so
+# EBS CSI setup (OIDC + IAM role + addon install) is unnecessary and adds
+# ~3 minutes to every run. Set to True to skip it and save time.
+# Defined before FLAGS = flags.FLAGS so it is registered at import time


this sounds overly complicated & weird. your code should not care about import order etc.

hubatish · 2026-05-27T17:28:44Z

+      concurrent = getattr(FLAGS, 'k8s_mgmt_concurrent_nodepools', 10)
+      nodes_per_az = max(80, concurrent * 2 + 20)
+      # Fetch cluster CA and endpoint for bootstrap user data
+      import json as _json


all imports should go at top of files

Ashish Suneja and others added 14 commits May 13, 2026 12:48

management_plane_benchmarking

1a81f30

Merge branch 'GoogleCloudPlatform:master' into Management_Plane_Ashish

d6ea3a1

management_plane_benchmarking

43bf72d

Merge branch 'Management_Plane_Ashish' of https://github.com/ashishsu…

3cbc14d

…neja/PerfKitBenchmarker into Management_Plane_Ashish

Merge branch 'GoogleCloudPlatform:master' into Management_Plane_Ashish

42d4a7d

fix: correct BENCHMARK_NAME to k8s_management, add abc import, t3.med…

ba13361

…ium, poll 5s

azure flag issue fixes

6c28163

GCP overlapping issue fixes

2425aac

GCP overlapping issue fixes

169b7f7

GCP logging update

d993325

Merge branch 'GoogleCloudPlatform:master' into Management_Plane_Srika…

f4c5fc1

…nt_v2

fix: 3-AZ round-robin nodegroup distribution for EKS

4966105

merge: combine Ashish and Srikant_v2 branches

83486e5

mahesh8842 reviewed May 26, 2026

hubatish reviewed May 26, 2026

EKS

8338677

ashishsuneja force-pushed the Management_Plane_Combined branch from bc904ab to 8338677 Compare May 26, 2026 18:14

cagataygurturk reviewed May 27, 2026

cagataygurturk suggested changes May 27, 2026

DevVegeta force-pushed the Management_Plane_Combined branch from e1f248c to d993325 Compare May 27, 2026 08:59

Merge branch 'GoogleCloudPlatform:master' into Management_Plane_Combined

51cffd3

DevVegeta and others added 8 commits May 27, 2026 15:40

GCP: Upgrade command fixes and Test Cases update

f9e72e5

Removed duplicate file

d658cf9

Merge remote-tracking branch 'origin/Management_Plane_Combined' into …

3828cef

…Management_Plane_Combined

EKS: re-apply AWS fixes on top of merged remote

57ba381

PR comments fixes

5f49bef

Merge branch 'Management_Plane_Combined' of https://github.com/ashish…

de4acae

…suneja/PerfKitBenchmarker into Management_Plane_Combined

Azure Report fixes: scenario running before previous one finished

79fc26d

EKS: gate capacity reservations + launch templates behind flag

73ec550

ashishsuneja force-pushed the Management_Plane_Combined branch from 79fc26d to 73ec550 Compare May 27, 2026 12:32

Ashish Suneja and others added 4 commits May 27, 2026 12:53

EKS: raise exception instead of hardcoding k8s version 1.34

d07c14e

ManagementPlane: Azure Report and Comment fixes

6a7747f

Merge branch 'Management_Plane_Combined' of https://github.com/ashish…

9714552

…suneja/PerfKitBenchmarker into Management_Plane_Combined

ManagementPlane: event poller kubernetes clueter revert

1c12d9c

hubatish reviewed May 27, 2026
