| title | Dynamo Benchmarking |
|---|---|
| subtitle | Benchmark and compare performance across Dynamo deployment configurations |
This guide shows how to benchmark Dynamo deployments using AIPerf, a comprehensive tool for measuring generative AI inference performance. AIPerf provides detailed metrics, real-time dashboards, and automatic visualization — you call it directly against your endpoints.
You can benchmark any combination of:
- DynamoGraphDeployments
- External HTTP endpoints (vLLM, llm-d, AIBrix, etc.)
Client-side runs benchmarks on your local machine via port-forwarding. Server-side runs benchmarks directly within the Kubernetes cluster using internal service URLs.
TL;DR: Need high-performance or high-load testing? Use server-side. Just doing quick testing or comparison? Client-side is enough.

Choose client-side benchmarking when:
- You want to quickly test deployments
- You want immediate access to results on your local machine
- You're comparing external services or deployments (not necessarily just Dynamo deployments)
- You need to run benchmarks from your laptop/workstation
→ Go to Client-Side Benchmarking (Local)
Choose server-side benchmarking when:
- You have a development environment with kubectl access
- You're doing performance validation with high load/speed requirements
- You're experiencing timeouts or performance issues with client-side benchmarking
- You want optimal network performance (no port-forwarding overhead)
- You're running automated CI/CD pipelines
- You need isolated execution environments
- You want persistent result storage in the cluster
→ Go to Server-Side Benchmarking (In-Cluster)
| Feature | Client-Side | Server-Side |
|---|---|---|
| Location | Your local machine | Kubernetes cluster |
| Network | Port-forwarding required | Direct service DNS |
| Setup | Quick and simple | Requires cluster resources |
| Performance | Limited by local resources, may timeout under high load | Optimal cluster performance, handles high load |
| Isolation | Shared environment | Isolated job execution |
| Results | Local filesystem | Persistent volumes |
| Best for | Light load | High load |
AIPerf is a standalone benchmarking tool available on PyPI. It is pre-installed in Dynamo container images. Key features:
- Measures latency, throughput, TTFT, inter-token latency, and more
- Multiple load modes: concurrency, request-rate, trace replay
- Automatic visualization with `aiperf plot` (Pareto curves, time series, GPU telemetry)
- Interactive dashboard mode for real-time exploration
- Arrival patterns (Poisson, constant, gamma) for realistic traffic simulation
- Warmup phases, gradual ramping, and multi-URL load balancing
Important: The `--model` parameter must match the model deployed at the endpoint.
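If you're unsure what name the endpoint serves the model under, you can usually query it first (assuming the server exposes the standard OpenAI-compatible `/v1/models` route, which OpenAI-compatible frontends, including Dynamo's, typically do):

```bash
# List the model name(s) served by the endpoint
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```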
For full documentation, see the AIPerf docs.
Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.
- Dynamo container environment - You must be running inside a Dynamo container with AIPerf pre-installed, or install it locally:
  ```bash
  pip install aiperf
  ```
- HTTP endpoints - Ensure you have HTTP endpoints available for benchmarking. These can be:
  - DynamoGraphDeployments exposed via HTTP endpoints
  - External services (vLLM, llm-d, AIBrix, etc.)
  - Any HTTP endpoint serving OpenAI-compatible models
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform following the installation guide. Then deploy your DynamoGraphDeployments using the deployment documentation.
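Once applied, you can confirm the deployment resources and their pods are up before moving on (a quick check; the custom resource kind is DynamoGraphDeployment and `<namespace>` is wherever you deployed it):

```bash
# List DynamoGraphDeployments and their pods in the target namespace
kubectl get dynamographdeployments -n <namespace>
kubectl get pods -n <namespace>
```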
Wait for model readiness. Before benchmarking, ensure your deployment has fully loaded the model. Check pod logs or hit the health endpoint (`curl http://localhost:8000/health`) — it should return `200 OK` before you proceed.
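If you want to script the wait, a minimal polling loop (assuming the port-forward from the next step is already running) looks like this:

```bash
# Poll the health endpoint until the deployment reports healthy
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "Waiting for the model to finish loading..."
  sleep 5
done
echo "Deployment is ready."
```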
```bash
# Port-forward the frontend service
kubectl port-forward -n <namespace> svc/<frontend-service-name> 8000:8000 > /dev/null 2>&1 &
# Run a single benchmark
aiperf profile \
--model <your-model-name> \
--url http://localhost:8000 \
--endpoint-type chat \
--streaming \
--concurrency 10 \
--request-count 100 \
--synthetic-input-tokens-mean 2000 \
--output-tokens-mean 256
```

This produces results in `artifacts/` and prints a summary table to the console:

```text
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token │ 234.56 │ 189.23 │ 298.45 │ 289.34 │ 267.12 │ 231.12 │ 28.45 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Latency │ 1234.56 │ 987.34 │ 1567.89 │ 1534.23 │ 1456.78 │ 1223.45 │ 156.78 │
│ (ms) │ │ │ │ │ │ │ │
│ Inter Token Latency │ 15.67 │ 12.34 │ 19.45 │ 19.01 │ 18.23 │ 15.45 │ 1.89 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Throughput │ 31.45 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests/sec) │ │ │ │ │ │ │ │
└─────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
```

Actual numbers will vary based on model size, hardware, batch size, and network conditions. Client-side benchmarks include port-forwarding overhead — use server-side benchmarking for accurate performance measurement.
To stop the port-forward when done: `kill %1` (or `kill <PID>`).
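If you've lost track of the background job, you can also stop it by process name (note this kills every `kubectl port-forward` running on the machine):

```bash
# Stop all kubectl port-forward processes
pkill -f "kubectl port-forward"
```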
To understand how your deployment behaves across load levels, run a concurrency sweep. Each concurrency level sends enough requests for stable measurements (max(c*3, 10)):
```bash
MODEL="<your-model-name>"
URL="http://localhost:8000"
for c in 1 2 5 10 50 100; do
aiperf profile \
--model "$MODEL" \
--url "$URL" \
--endpoint-type chat \
--streaming \
--concurrency $c \
--request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
--synthetic-input-tokens-mean 2000 \
--output-tokens-mean 256 \
--artifact-dir "artifacts/deployment-a/c$c"
done
```

Note: Adjust concurrency levels to match your deployment's capacity. Very high concurrency on a small deployment (e.g., c250 on a single GPU) will cause server errors. Start with lower values and increase until you find the saturation point.
Tear down deployment A and deploy deployment B with a different configuration. Kill the previous port-forward (`kill %1`), then repeat:
```bash
kubectl port-forward -n <namespace> svc/<frontend-service-b> 8000:8000 > /dev/null 2>&1 &
for c in 1 2 5 10 50 100; do
aiperf profile \
--model "$MODEL" \
--url "$URL" \
--endpoint-type chat \
--streaming \
--concurrency $c \
--request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
--synthetic-input-tokens-mean 2000 \
--output-tokens-mean 256 \
--artifact-dir "artifacts/deployment-b/c$c"
done
```

Once both sweeps are complete, compare the results with `aiperf plot`:

```bash
# Compare all runs — auto-detects multi-run directories
aiperf plot artifacts/deployment-a artifacts/deployment-b
# Or compare all subdirectories under a parent
aiperf plot artifacts/
# Launch interactive dashboard for exploration
aiperf plot artifacts/ --dashboard
```

AIPerf automatically generates plots based on available data:
- TTFT vs Throughput — find the sweet spot between responsiveness and capacity (always generated for multi-run comparisons)
- Pareto Curves — throughput per GPU vs latency and interactivity (only generated when GPU telemetry data is available — add `--gpu-telemetry` during profiling if DCGM is running)
- Time series — per-request TTFT, ITL, and latency over time (generated for single-run analysis)
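To make the Pareto plots available, collect GPU telemetry during profiling. A minimal sketch, assuming DCGM is running and reachable from where AIPerf executes (see the GPU Telemetry entry in the feature table at the end of this guide for full configuration details):

```bash
# Re-run a profile with GPU telemetry so aiperf plot can compute throughput per GPU
aiperf profile \
  --model "$MODEL" \
  --url "$URL" \
  --endpoint-type chat \
  --streaming \
  --concurrency 10 \
  --request-count 100 \
  --gpu-telemetry
```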
Here is an example Pareto frontier from a concurrency sweep of Qwen3-0.6B on 8x H200 with vLLM, showing the tradeoff between user experience (tokens/sec per user) and resource efficiency (tokens/sec per GPU):
See the AIPerf Visualization Guide for full details on plot customization, experiment classification, and themes.
Typical comparison scenarios include:
- Compare DynamoGraphDeployments (e.g., aggregated vs disaggregated configurations)
- Compare different backends (e.g., SGLang vs TensorRT-LLM vs vLLM)
- Compare Dynamo vs other platforms (e.g., Dynamo vs llm-d vs AIBrix)
- Compare different models (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B)
- Compare different hardware configurations (e.g., H100 vs A100 vs H200)
- Compare different parallelization strategies (e.g., different GPU counts or memory configurations)
```text
aiperf profile [OPTIONS]
REQUIRED:
--model MODEL Model name (must match the deployed model)
--url URL Endpoint URL (e.g., http://localhost:8000)
COMMON OPTIONS:
--endpoint-type TYPE Endpoint type: chat, completions, embeddings (default: chat)
--streaming Enable streaming responses
--concurrency N Number of concurrent requests
--request-rate N Target requests per second (alternative to --concurrency)
--request-count N Total number of requests to send
--benchmark-duration N Run for N seconds instead of a fixed request count
--synthetic-input-tokens-mean N Average input sequence length in tokens
--output-tokens-mean N Average output sequence length in tokens
--artifact-dir DIR Output directory for results (default: artifacts/)
--warmup-request-count N Warmup requests before measurement
--ui TYPE UI mode: dashboard, simple, none (default: dashboard)
```

For the complete CLI reference, see `aiperf profile --help` or the CLI docs.
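For example, you can hold a fixed request rate for a fixed duration instead of sweeping concurrency, using only the flags listed above (a sketch; substitute your own model name and token lengths):

```bash
# Hold ~5 requests/second for 120 seconds, after 20 warmup requests
aiperf profile \
  --model <your-model-name> \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming \
  --request-rate 5 \
  --benchmark-duration 120 \
  --warmup-request-count 20 \
  --synthetic-input-tokens-mean 2000 \
  --output-tokens-mean 256
```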
To enforce a specific output length, pass `ignore_eos` and `min_tokens` via `--extra-inputs`:
```bash
aiperf profile \
--model <model> \
--url http://localhost:8000 \
--endpoint-type chat \
--streaming \
--concurrency 10 \
--output-tokens-mean 256 \
--extra-inputs max_tokens:256 \
--extra-inputs min_tokens:256 \
--extra-inputs ignore_eos:true
```

Each `aiperf profile` run produces an artifact directory containing:
- `profile_export_aiperf.json` — Structured metrics (latency, throughput, TTFT, ITL, etc.)
- `profile_export.jsonl` — Per-request raw data
- `profile_export_aiperf.csv` — CSV format metrics
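For a quick ad-hoc look at a single run, you can pretty-print the structured metrics file directly (exact field names depend on your AIPerf version, so treat this as exploratory):

```bash
# Pretty-print the structured metrics for one run of a sweep
python3 -m json.tool artifacts/deployment-a/c10/profile_export_aiperf.json | less
```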
Results are organized by the `--artifact-dir` you specify. For concurrency sweeps, a common pattern is:

```text
artifacts/
├── deployment-a/
│ ├── c1/
│ │ ├── profile_export_aiperf.json
│ │ └── profile_export.jsonl
│ ├── c10/
│ ├── c50/
│ └── c100/
├── deployment-b/
│ ├── c1/
│ ├── c10/
│ ├── c50/
│ └── c100/
└── plots/ # Generated by aiperf plot
├── ttft_vs_throughput.png
├── pareto_curve_throughput_per_gpu_vs_latency.png # If GPU telemetry available
└── pareto_curve_throughput_per_gpu_vs_interactivity.png # If GPU telemetry available
```

Server-side benchmarking runs directly within the Kubernetes cluster, eliminating port-forwarding overhead and enabling high-load testing.
- Kubernetes cluster with NVIDIA GPUs and Dynamo namespace setup (see Dynamo Kubernetes Platform docs)
- Storage: PersistentVolumeClaim configured with appropriate permissions (see deploy/utils README)
- Docker image containing AIPerf (Dynamo runtime images include it)
Deploy using the deployment documentation. Ensure it has a frontend service exposed and the model is fully loaded before running benchmarks — check pod logs or verify the health endpoint returns `200 OK`.
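One way to verify this from inside the cluster, without any port-forwarding, is a throwaway curl pod (a sketch; substitute your namespace and frontend service name):

```bash
# Confirm the frontend service exists
kubectl get svc -n <namespace>

# Hit the health endpoint from inside the cluster via a temporary pod
kubectl run health-check --rm -i --restart=Never -n <namespace> \
  --image=curlimages/curl -- curl -sf http://<frontend-service-name>:8000/health
```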
First, edit `benchmarks/incluster/benchmark_job.yaml` to match your deployment:
- Model name: Update the `MODEL` variable
- Service URL: Update the `URL` variable (use `<svc_name>.<namespace>.svc.cluster.local:port` for cross-namespace access)
- Concurrency levels: Adjust the `for c in ...` loop
- Docker image: Update the `image` field if needed
Then deploy:
```bash
export NAMESPACE=benchmarking
# Deploy the benchmark job
kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE
# Monitor the job
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
```
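If you'd rather block until the job finishes instead of tailing logs, you can wait on the Job's completion condition (adjust the timeout to the length of your sweep):

```bash
# Block until the benchmark job completes (or the timeout expires)
kubectl wait --for=condition=complete job/dynamo-benchmark -n $NAMESPACE --timeout=2h
```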
Once the job completes, retrieve the results from the PVC:

```bash
# Create access pod (skip if already running)
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
# Download the results
kubectl cp $NAMESPACE/pvc-access-pod:/data/results ./results
# Cleanup
kubectl delete pod pvc-access-pod -n $NAMESPACE
```

Then visualize the downloaded results locally:

```bash
aiperf plot ./results
```

When referencing services in other namespaces, use full Kubernetes DNS:
```bash
# Same namespace
--url http://vllm-agg-frontend:8000
# Different namespace
--url http://vllm-agg-frontend.production.svc.cluster.local:8000
```

To check on or debug the benchmark job:

```bash
# Check job status
kubectl describe job dynamo-benchmark -n $NAMESPACE
# Follow logs
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
# Check pod status
kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark
# Debug failed pod
kubectl describe pod <pod-name> -n $NAMESPACE
```

Common issues:
- Service not found: Ensure your DynamoGraphDeployment frontend service is running
- PVC access: Check that `dynamo-pvc` is properly configured and accessible
- Image pull issues: Ensure the Docker image is accessible from the cluster
- Resource constraints: Adjust resource limits if the job is being evicted
To investigate further:

```bash
# Check PVC status
kubectl get pvc dynamo-pvc -n $NAMESPACE
# Verify service exists and has endpoints
kubectl get svc -n $NAMESPACE
kubectl get endpoints <service-name> -n $NAMESPACE
```

For development and testing purposes, Dynamo provides a mocker backend that simulates LLM inference without requiring actual GPU resources. This is useful for:
- Testing deployments without expensive GPU infrastructure
- Developing and debugging router, planner, or frontend logic
- CI/CD pipelines that need to validate infrastructure without model execution
- Benchmarking framework validation to ensure your setup works before using real backends
The mocker backend mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference.
See the mocker directory for usage examples and configuration options.
AIPerf has many capabilities beyond basic profiling. Here are some particularly useful for Dynamo benchmarking:
| Feature | Description | Docs |
|---|---|---|
| Trace Replay | Replay production traces for deterministic benchmarking | Trace Replay |
| Arrival Patterns | Poisson, constant, gamma traffic distributions | Arrival Patterns |
| Gradual Ramping | Smooth ramp-up of concurrency and request rate | Ramping |
| Warmup Phase | Eliminate cold-start effects from measurements | Warmup |
| Multi-URL Load Balancing | Distribute requests across multiple endpoints | Multi-URL |
| GPU Telemetry | Collect DCGM metrics during benchmarking | GPU Telemetry |
| Goodput Analysis | SLO-based throughput measurement | Goodput |
| Timeslice Analysis | Per-timeslice performance breakdown | Timeslices |
| Multi-Turn Conversations | Benchmark multi-turn chat workloads | Multi-Turn |
| Experiment Classification | Baseline vs treatment semantic colors in plots | Plotting |
