Skip to content

Add AWS Inferentia2/Neuron support to k8s AI inference#6678

Open
kiryl-filatau wants to merge 11 commits into
GoogleCloudPlatform:masterfrom
kiryl-filatau:aws-tpu
Open

Add AWS Inferentia2/Neuron support to k8s AI inference#6678
kiryl-filatau wants to merge 11 commits into
GoogleCloudPlatform:masterfrom
kiryl-filatau:aws-tpu

Conversation

@kiryl-filatau
Copy link
Copy Markdown
Collaborator

@kiryl-filatau kiryl-filatau commented May 19, 2026

Summary

Enables running the kubernetes_ai_inference benchmark on AWS EKS Auto with AWS Inferentia2 (Neuron): PKB provisions the cluster, deploys vLLM via wg-serving, loads weights from S3, and runs the same benchmark flow as GKE + TPU.

Companion PR: kubernetes-sigs/wg-serving #71 (aws-neuron/ overlays). Until it merges: --wg_serving_repo_url=https://github.com/vofish/wg-serving.git, --wg_serving_repo_branch=aws-tpu, catalog_provider=aws-neuron.


Main changes

  • kubernetes_ai_inference on AWS Neuron: EKS Auto (type=Auto), S3-backed weights, Neuron node pool, wg-serving aws-neuron catalog; log/event parsing for Neuron vLLM startup.

  • EKS Auto: Optional Mountpoint S3 CSI addon (--eks_install_s3_csi_addon) and Neuron device plugin (--eks_install_neuron_device_plugin). nodePools: [general-purpose, system] so the wg-serving catalog Job can schedule.

  • S3 weights: --k8s_inference_server_s3_bucket / _region, PV/PVC; model files under /data/models (HF snapshot in S3 before the run).

  • wg-serving inference server: catalog_provider=aws-neuron tells serving-catalog-cli to render vllm/<model>/aws-neuron/ (from kubernetes-sigs/wg-serving #71 ) instead of the GPU aws/ path. PKB adds a Karpenter NodePool for Inf / Trn instance families so the inference pod runs on a Neuron node. HF token may be a plain hf_… value or an s3://… URI.

  • Benchmark: Model-load parser refactor; K8s v1.35 "Container started" events; Neuron log patterns. GetEvents() tolerates empty kubectl output.

  • Tests: wg_serving_inference_server_test, elastic_kubernetes_service_test, kubernetes_ai_inference_benchmark_test.


Models and weights

  • S3: upload HF weights before the run. Bucket and read access (IAM / Pod Identity for the CSI driver) are prerequisites; in test E2E run they were set up manually, not by PKB.
  • Neuron compile: first pod start compiles for the chip even when weights are on S3. Pre-compiled Neuron artifacts on NEURON_COMPILED_ARTIFACTS can skip re-compile (not automated in this PR).
  • Model compatibility: not every HuggingFace checkpoint compiles on Neuron; check architecture and Neuron SDK support before adding a catalog overlay. Example we hit: meta-llama/Llama-3.2-1B (llama32-1b) failed during Neuron compile with NCC_ITEN404 (model uses "llama3" rope_type, Neuron SDK 2.28.0 in the image). Overlay removed from wg-serving kubernetes-sigs/wg-serving #71.
  • Neuron vLLM image from AWS ECR, not vllm/vllm-openai.
  • E2E validated: tinyllama-1b only (see below).
  • Not tested: llama3-8b. Overlay for it exist in kubernetes-sigs/wg-serving #71.

Tested with

tinyllama-1b, weights pre-loaded to S3, us-east-1:

./pkb.py \
  --benchmarks=kubernetes_ai_inference \
  --cloud=AWS \
  --config_override=kubernetes_ai_inference.container_cluster.type='Auto' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.model_server='vllm' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.model_name='tinyllama-1b' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.extra_deployment_args.pvc-name='s3-csi-static-pvc' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.catalog_components='1-Inf2,s3' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.catalog_provider='aws-neuron' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.deployment_timeout=3600 \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.extra_deployment_args.model-path='vllm_models/tinyllama-1b' \
  --config_override=kubernetes_ai_inference.container_cluster.inference_server.extra_deployment_args.container-image='public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.13.0-neuronx-py312-sdk2.28.0-ubuntu24.04' \
  --k8s_ai_inference_hf_token=<hf_token> \
  --k8s_ai_inference_request_rate=5 \
  --k8s_inference_server_image_repo=<image_repo> \
  --metadata=cloud:AWS \
  --zone=us-east-1a \
  --k8s_inference_server_s3_bucket=<bucket> \
  --k8s_inference_server_s3_region=us-east-1 \
  --eks_install_s3_csi_addon=true \
  --eks_install_neuron_device_plugin=true \
  --wg_serving_repo_url=https://github.com/vofish/wg-serving.git \
  --wg_serving_repo_branch=aws-neuron

@kiryl-filatau kiryl-filatau changed the title Add AWS Inferentia2/Neuron support to k8s AI inference: EKS S3 CSI addon, Neuron device plugin installation, and Neuron-specific vLLM model load time metric parsing Add AWS Inferentia2/Neuron support to k8s AI inference May 19, 2026
storage_service.PrepareService()
# Pick the storage backend by URI scheme so existing gs:// usage is
# untouched while s3:// URIs work for AWS-only runs.
storage_cloud = (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this part was just for getting the huggingface token? / can we not just always pull the huggingface token from GCS, or does that effect other usages of the storage class somewhere?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hubatish

Yes, this block is only for the HuggingFace token in _InjectDefaultHuggingfaceToken.

If the flag value starts with hf_, we skip object storage and put the token straight into the K8s secret.

If it is a URI, we download the file once with Copy and then create the secret. Model weights are not loaded here (they use S3 CSI / GCS via other flags and the wg-serving catalog).

Before this PR we always used the GCP storage class (for gs:// tokens). On AWS runs the token can be in S3, so I pick AWS vs GCP from the URI scheme. I did not want to always use GCS because that needs GCP credentials even when the benchmark runs only on AWS.

This does not change how the storage class is used elsewhere in PKB. We create a new service instance only inside this function.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, it does not, it just adds support. All right, sounds reasonable. Maybe I should add this to object_storage_service logic.

# parameterised. Only the values differ for Neuron (instance family,
# taint, NodePool name).
kubernetes_commands.ApplyManifest(
'container/kubernetes_ai_inference/aws-gpu-nodepool.yaml.j2',
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a follow up, but in #6687 I started moving us away from "configure the gpu nodepool in one-off benchmarks" & to "configure the gpu nodepool during _Create/_PostCreate phases based off spec values". eg using nodepools & --config_override=kubernetes_scale.container_cluster.nodepools.pool1.vm_spec.AWS.machine_type='g6.2xlarge' in addition to a default nodepool.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, and I saw the direction in #6687.

Right now _ProvisionGPUNodePool() runs from the inference server _Create() and applies aws-gpu-nodepool.yaml.j2. That was already the pattern for AWS GPU in this benchmark; I reused the same template for Neuron (different instance family + aws.amazon.com/neuron taint).

For EKS Auto we only get general-purpose and system from the cluster setup. The wg-serving pod still needs a pool that can schedule on Inf/Trn with the Neuron taint, so I kept this extra NodePool here for now.

I agree it would be cleaner to move this to cluster _Create / _PostCreate and spec nodepools like in #6687. However, for now I propose to leave that as a follow-up.

if 'gcsfuse' in self.spec.catalog_components:
self._ApplyGCSFusePVC()
elif 's3' in self.spec.catalog_components:
self._ApplyS3PVC()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as for Vlodymyr's #6654 (review)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed to change after #6654 is merged

from perfkitbenchmarker.resources.container_service import kubectl
# Imported to register --k8s_inference_server_s3_bucket so flagsaver can set
# it from this test file.
from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server # pylint: disable=unused-import
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the flags over to aws/flags; I don't think you need them in wg_serving_inference_server (esp if you also move the ApplyFusePVC function)

Copy link
Copy Markdown
Collaborator Author

@kiryl-filatau kiryl-filatau May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to aws/


def _InstallNeuronDevicePlugin(self):
"""Applies the AWS Neuron Device Plugin DaemonSet to the cluster."""
# Non-empty kwargs force Jinja render (see ReadAndRenderJinja2Template); image
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does forcing a jinja render matter? What happens if you don't?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. In PKB, ReadAndRenderJinja2Template only runs Jinja when kwargs is non-empty. If we call ApplyManifest with no kwargs, the file is applied as raw text and the DaemonSet still has the literal {{ neuron_device_plugin_image | ... }} in the image field, so apply fails.

I pass neuron_device_plugin_image=... to trigger rendering. The value matches the template default for now. I do not think it is worth changing vm_util behavior in this PR.

I updated the comment in code to explain this.

def _InstallS3CsiAddon(self):
"""Installs the S3 CSI Driver and the IAM glue (Role/Policy + PIA)."""
# Local import: the inference-server module owns the bucket flag.
from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server # pylint: disable=g-import-not-at-top
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see other comments about moving flags here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to aws/

vllm_start_timestamp = _ParseInferenceServerTimeStamp(line)
break
if container_init_timestamp is None:
raise ValueError('Container init timestamp is not found in the logs.')
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You talked about sharing some code. It doesn't look too bad but some ideas:

  1. Maybe have a shared _ParseModelLoadTimeMetrics which does some helper pieces. Rename _ParseModelLoadTimeMetrics below to _ParseRegularModelLoadTimeMetrics or add Shared/Common to the shared one somewhere.
  2. This "if _ is None, raise, otherwise return X-Y" piece seems like it could be shared pretty easily. Have each of the individual functions return Tuple[float | None, float | None, float | None, float | None] & have the Shared/Common one the validation + subtraction & return Tuple[float,float,float,float].
  3. The shared one could also call cluster.GetEvents & result_stdout.splitlines() while the individual ones could take those variables preprocessed.
  4. A fancier refactor would possibly make 4 classes - a base class & 3 implementations. Then eg "FindStartupEvent" could be one function taking events while "find other timestamps" could be a second function taking logs.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hubatish thanks for the suggestions.

I already moved some shared logic out:

  • _GetContainerInitTimestamp: when the inference container started (from K8s events, same for all backends).
  • _ComputeModelLoadDurations: takes 4 times (container start + 3 from logs: model load start, load end, vLLM ready) and returns the 3 numbers we publish as metrics (time until load starts, load duration, time until API server is up).

The three parsers only scan logs for different strings (GPU vs TPU vs Neuron). Then they call _ComputeModelLoadDurations.

I could still do the other ideas (rename default parser, one shared for line in logs loop). Do you think we need that for this PR, or is the current split enough?

About 4 classes: I would prefer not to overcomplicate this. We have 3 backends and the difference is mostly which log substring to match. Classes would mean more code than we need for something that is already clear in one function.

@hubatish
Copy link
Copy Markdown
Collaborator

Looks like the test failure is in tests/providers/gcp/gce_managed_instance_group_test.py . If you can't see a relationship between your code & the failure try repulling/syncing & running through checks again

@kiryl-filatau kiryl-filatau marked this pull request as ready for review May 26, 2026 16:10
storage_service.PrepareService()
# Pick the storage backend by URI scheme so existing gs:// usage is
# untouched while s3:// URIs work for AWS-only runs.
storage_cloud = (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, it does not, it just adds support. All right, sounds reasonable. Maybe I should add this to object_storage_service logic.

)
application_start_time = GetTimeDifference(
model_load_end_timestamp, vllm_start_timestamp
return _ComputeModelLoadDurations(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty for making this refactor

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the review

@hubatish hubatish self-assigned this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants