Skip to content

Management_Benchmark#6690

Open
ashishsuneja wants to merge 28 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:Management_Plane_Combined
Open

Management_Benchmark#6690
ashishsuneja wants to merge 28 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:Management_Plane_Combined

Conversation

@ashishsuneja
Copy link
Copy Markdown

@ashishsuneja ashishsuneja commented May 22, 2026

Implements the GKE Management Plane Operations benchmark across EKS, GKE, and AKS. Three scenarios are covered: Scenario A — 100 pools created, upgraded (N-1→N minor version), and deleted concurrently across 3 AZs; Scenario B — a NodePool Create fired 3 seconds into an ongoing ClusterUpdate to measure control-plane serialization behavior; Scenario C — 100 pools streamed continuously (max 50 at a time) to measure large-scale provisioning throughput.

@google-cla
Copy link
Copy Markdown

google-cla Bot commented May 22, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

sorted_lats = sorted(latencies)
meta = {'sample_count': str(n)}

def _Percentile(p):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use statistics.quantiles() instead of using custom percentile implementation here?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mahesh8842 , Thank you for pointing this. Updated code with your suggestion.


def __init__(self):
self._lock = threading.Lock()
self.entries: list[tuple[str, float, float, Exception | None]] = []
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use dataclass or named_tuple to improve readability
this allows access to fields by name instead of index

@dataclasses.dataclass
class entry:
name: str
init_dur: float
e2e_dur: float
error: Exception | None = None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# /19 is narrowest CIDR range GKE supports
return min(cidr_size, 19)
# /17 is narrowest CIDR range GKE supports
return min(cidr_size, 16)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to line 58, 17 is the narrowest supported range
but in line 59, 16 is mentioned.

Please check and keep the comment and implementation in sync

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mahesh8842 Updated the value in comment

Comment thread perfkitbenchmarker/import_util.py Outdated
try:
yield importlib.import_module(modname)
except Exception as e: # pylint: disable=broad-except
logging.warning('Skipping module %s due to import error: %s', modname, e)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prefer logging.exception instead of logging.warning

logging.exception would capture the full stack trace

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mahesh8842 , Please ignore , reverted the changes

e2e_lat = total wall time including wait. On kickoff failure both are set
to elapsed time at failure point.
"""
init_start = time.time()
Copy link
Copy Markdown

@mahesh8842 mahesh8842 May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer time.monotonic() instead of time.time()

at line 613

init_dur = time.monotonic() - init_start

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def _OpSamples(
metric_prefix: str,
results: list[tuple[str, float, float, Exception | None]],
attempted_ops: int = None,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int | None = None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def add(self, name: str, init_dur: float, e2e_dur: float,
err: Exception | None) -> None:
with self._lock:
self.entries.append((name, init_dur, e2e_dur, err))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the same dataclass here

result = entry(name,init_dur,e2e_dur,err)
with self._lock:
self.entries.append(result)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mahesh8842 , Updated the changes.

)
if rc:
logging.warning(
'Sleep workload deploy returned rc=%d (non-fatal; continuing)', rc)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we should just fail rather than silently continue & potentially have bad results. There are perhaps some errors/warning which are truly ignorable, but that should be the minority. If there are some that are worth ignoring, I'd also expect those to have something like "out, err, code = IssueCommand..

if code:
  if 'expected but ignorable warning message' in err:
    log "this is fine"
    return
  else:
    raise

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hubatish , Updated code to fail instead of warning

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Expand to pretty much all uses of logging.warning in providers as well.

self.event_poller.StartPolling()
try:
self.event_poller.StartPolling()
except Exception as exc: # pylint: disable=broad-except
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you encountering this error when using eg run_stage=provision,prepare --run_uri=foo & then later --run_stage=run --run_uri=foo on the same cluster, or just every time when running any benchmark in py 3.14? Either way please separate into a different PR & raise the issue in chat as we will likely need a good solution here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hubatish , Reverted the changes. We will review again and raise in separate PR

"""Initiates node-pool delete; returns opaque op handle. Does NOT wait."""
raise NotImplementedError

def UpdateClusterAsync(self) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have sync vs async variants? Is this due to cloud variance? We shouldn't have both - stick to one pattern or the other. If eg all clouds don't support async (you were talking about a GKE bug here), then perhaps just go with everything using a sync variant.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hubatish , The sync/async split is intentional for concurrency and overlapping scenarios. the benchmark's core scenarios require concurrency (Scenario A fires N creates simultaneously, Scenario B overlaps a create with a cluster update), which isn't possible with blocking calls. The sync variants (CreateNodePool, DeleteNodePool) are kept only for non-timed setup/teardown paths (Prepare, _CleanStartSweep) where blocking is correct.

And UpdateCluster (sync) seems is unused.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - sort of makes sense. However, I took a look at the provider specific implementations & had a hard time telling what exactly was different. What is different about them currently?
/ actual comment/change: the sync & async versions should share most of their code. A simple way to do this might be to always call the async version, but then wait for the result in the sync version. With a nice, sufficiently abstract presentation, I bet you could do this waiting for the async version from the sync in the parent class. Cleanly that might look like:

def UpdateClusterSync(self):
  op = self.UpdateClusterAsync()
  self.WaitForOperation(op)

Also consider moving _RunAsync & _TimedAsync to kubernetes_cluster (unsure of this one).

Comment thread perfkitbenchmarker/import_util.py Outdated
):
if not is_pkg:
yield importlib.import_module(modname)
try:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you just shouldn't be causing errors in nor editing this file. ie I'm not sure why you need this but you should fix in some other fashion.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hubatish , Reverted the changes.

"""Adds an additional nodepool with the given name to the cluster."""
pass

def CreateNodePool(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have AddNodepool which is used by kubernetes nodepool provisioning. Why do we need this one too? This quite possibly might be the better implementation, but please refactor the old one to use the new one as well. I can send a sample provision node pools benchmark run command for you to test with.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hubatish, AddNodepool delegate to CreateNodePool so CreateNodePool will be single source of truth for standard clustersc. Please let me know if this is expected.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are client specific implementations for Azure / EKS. Please remove those in favor of just callin ghte CreateNodepool implementation as done here & give it a test. + maybe update the provision node pool benchmark. It looks like the naming wrapper here should work, but tbh that benchmark is the only caller & using it should work. Also the "BaseNodePoolConfig" used here with only a name is certainly wrong - again look at the code in provision_node_pools_benchmark.py & update it to work with this logic instead.

Here are some sample args to try with:

 --cloud=AWS --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Karpenter'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.AWS.machine_type='"'"'m6i.xlarge'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.AWS.machine_type='"'"'m6i.xlarge'"'"'' --metadata=cloud:AWS --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=us-east-1a --timeout_minutes=236
 --cloud=Azure --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Kubernetes'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.Azure.machine_type='"'"'Standard_D4s_v3'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.Azure.machine_type='"'"'Standard_D4s_v3'"'"'' --metadata=cloud:Azure --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=eastus2-1 --timeout_minutes=236
--cloud=GCP --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Kubernetes'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.GCP.machine_type='"'"'c4-standard-4'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.GCP.machine_type='"'"'c4-standard-4'"'"'' --metadata=cloud:GCP --project=p3rf-gke --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=europe-west4-a --timeout_minutes=236

@ashishsuneja ashishsuneja force-pushed the Management_Plane_Combined branch from bc904ab to 8338677 Compare May 26, 2026 18:14
],
raise_on_failure=False,
)
self.cluster_version = ver_out.strip() if ver_rc == 0 and ver_out.strip() else '1.34'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoding 1.34 will quickly age out once 1.34 is EOL. Maybe throw an exception instead of silently setting 1.34?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. Now raises a CreationError if the cluster version can't be determined from describe-cluster instead of silently falling back to '1.34'

# Reserve enough capacity per AZ for 100 pools:
# ~67 pools per AZ × 2 nodes = 134 instances max per AZ (Scenario A)
# Plus default nodegroup (2) + buffer = 80 minimum for 10 pools, 150 for 100 pools
concurrent = getattr(FLAGS, 'k8s_mgmt_concurrent_nodepools', 10)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in EksCluster._Create, which is the default EKS class. Every benchmark that runs on AWS with a Kubernetes cluster goes through here, not just k8s_management. So all the new behavior (capacity reservations per AZ, launch templates, the forced 3-AZ subnet layout, and the default nodegroup getting pinned to control_plane_zones[0] over in _RenderNodeGroupJson) gets applied to every EKS run.

A few things that make this concretely bad for unrelated benchmarks:

The reservations are hardcoded to t3.medium. Something like kubernetes_nginx uses m6i.xlarge, so its nodegroups won't consume the reservation at all, and the user just gets billed for reserved capacity they can't use, plus the on-demand they actually wanted. Same story for most of the other AWS k8s benchmarks (kubernetes_redis_memtier, kubernetes_mongodb_ycsb, container_netperf, etc.). I count ~14 of them that hit this path.

The getattr(FLAGS, 'k8s_mgmt_concurrent_nodepools', 10) on this line is a tell that this code knows it's in the wrong place. That flag only exists when your benchmark module is loaded; for any other benchmark you silently fall back to 10 and still create 80 reservations × 3 AZs per run.

Pinning the default nodegroup to a single AZ is also a quiet behavior change. Anything that relied on multi-AZ default placement (HA, anti-affinity, latency tests) loses it without warning.

Could you gate this? Either an eks_reserve_capacity_per_az flag in providers/aws/flags.py defaulted off, or move the reservation/launch-template setup into your benchmark's Prepare() and expose a small helper on EksCluster for it. The bar I'd want to hit before this lands: running provision_node_pools or kubernetes_nginx on EKS produces the same AWS resources as on master, with no extra reservations, no extra launch templates, and default nodegroup placement unchanged.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed this by:

  1. By Adding a new --eks_reserve_capacity_per_az flag in providers/aws/flags.py (default: False)

  2. Gating all capacity reservation creation/cleanup and launch template lookup in _Create, _Delete, CreateNodePoolAsync, and UpgradeNodePoolAsync behind this flag.

  3. When the flag is False (default), the code path is identical to master — no reservations created, no launch templates, no extra API calls, default nodegroup placement unchanged
    The kubernetes_management benchmark explicitly passes --eks_reserve_capacity_per_az=true to opt in

@DevVegeta DevVegeta force-pushed the Management_Plane_Combined branch from e1f248c to d993325 Compare May 27, 2026 08:59
@ashishsuneja ashishsuneja force-pushed the Management_Plane_Combined branch from 79fc26d to 73ec550 Compare May 27, 2026 12:32
"""Adds an additional nodepool with the given name to the cluster."""
pass

def CreateNodePool(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are client specific implementations for Azure / EKS. Please remove those in favor of just callin ghte CreateNodepool implementation as done here & give it a test. + maybe update the provision node pool benchmark. It looks like the naming wrapper here should work, but tbh that benchmark is the only caller & using it should work. Also the "BaseNodePoolConfig" used here with only a name is certainly wrong - again look at the code in provision_node_pools_benchmark.py & update it to work with this logic instead.

Here are some sample args to try with:

 --cloud=AWS --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Karpenter'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.AWS.machine_type='"'"'m6i.xlarge'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.AWS.machine_type='"'"'m6i.xlarge'"'"'' --metadata=cloud:AWS --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=us-east-1a --timeout_minutes=236
 --cloud=Azure --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Kubernetes'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.Azure.machine_type='"'"'Standard_D4s_v3'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.Azure.machine_type='"'"'Standard_D4s_v3'"'"'' --metadata=cloud:Azure --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=eastus2-1 --timeout_minutes=236
--cloud=GCP --benchmarks=provision_node_pools --config_override='provision_node_pools.container_cluster.type='"'"'Kubernetes'"'"'' --config_override='provision_node_pools.container_cluster.vm_spec.GCP.machine_type='"'"'c4-standard-4'"'"'' --config_override='provision_node_pools.container_cluster.nodepools.fibpool.vm_spec.GCP.machine_type='"'"'c4-standard-4'"'"'' --metadata=cloud:GCP --project=p3rf-gke --provision_node_pools_init_batch=1 --provision_node_pools_test_batch=2 --zone=europe-west4-a --timeout_minutes=236

)
_SCENARIOS = flags.DEFINE_list(
'k8s_mgmt_scenarios',
['A', 'B', 'C'],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should have helpful names & probably use define enum

result['gce_local_ssd_count'] = self.default_nodepool.max_local_disks
result['gce_local_ssd_interface'] = self.default_nodepool.ssd_interface
result['gke_nccl_fast_socket'] = self.enable_nccl_fast_socket
if 'nccl' in self.nodepools:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these got synced out (sync to head & they're no longer there).

)
if rc:
logging.warning(
'Sleep workload deploy returned rc=%d (non-fatal; continuing)', rc)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Expand to pretty much all uses of logging.warning in providers as well.

# The kubernetes_management benchmark does not use persistent volumes, so
# EBS CSI setup (OIDC + IAM role + addon install) is unnecessary and adds
# ~3 minutes to every run. Set to True to skip it and save time.
# Defined before FLAGS = flags.FLAGS so it is registered at import time
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds overly complicated & weird. your code should not care about import order etc.

concurrent = getattr(FLAGS, 'k8s_mgmt_concurrent_nodepools', 10)
nodes_per_az = max(80, concurrent * 2 + 20)
# Fetch cluster CA and endpoint for bootstrap user data
import json as _json
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all imports should go at top of files

"""Initiates node-pool delete; returns opaque op handle. Does NOT wait."""
raise NotImplementedError

def UpdateClusterAsync(self) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - sort of makes sense. However, I took a look at the provider specific implementations & had a hard time telling what exactly was different. What is different about them currently?
/ actual comment/change: the sync & async versions should share most of their code. A simple way to do this might be to always call the async version, but then wait for the result in the sync version. With a nice, sufficiently abstract presentation, I bet you could do this waiting for the async version from the sync in the parent class. Cleanly that might look like:

def UpdateClusterSync(self):
  op = self.UpdateClusterAsync()
  self.WaitForOperation(op)

Also consider moving _RunAsync & _TimedAsync to kubernetes_cluster (unsure of this one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants