serving/kserve-keda-autoscaling/README.md (new file, +79 lines)

# KServe Autoscaling with KEDA and Custom Prometheus Metrics

This example demonstrates autoscaling a KServe InferenceService using
[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
It scales on total token throughput (prompt plus generation tokens) rather than
request count, a signal better suited to LLM inference workloads.

For full documentation, see the
[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).

## Why Token Throughput?

LLM requests vary wildly in duration depending on prompt and output length.
Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
Token throughput stays elevated as long as the model is under pressure,
making it a stable scaling signal.
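
Concretely, the scaling query used by this example (defined in `scaled-object.yaml`)
sums the per-second rate of prompt and generation tokens reported by vLLM:

```
sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+ sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
```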

## Prerequisites

- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`)
- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor; see the sketch below for other clusters)
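
If your cluster does not already scrape vLLM metrics, a PodMonitor along these lines
is one option. This is a sketch, not part of this example: it assumes the Prometheus
Operator CRDs are installed, that KServe labels predictor pods with
`serving.kserve.io/inferenceservice`, and that metrics are served on the `user-port`
declared in `inference-service.yaml`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: opt-125m-vllm-metrics   # hypothetical name
  namespace: default
spec:
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: opt-125m
  podMetricsEndpoints:
    - port: user-port   # container port name from inference-service.yaml
      path: /metrics    # vLLM's Prometheus endpoint
```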

## Files

| File | Description |
|------|-------------|
| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |

## Quick Start

```bash
export NAMESPACE="default"

# 1. Deploy the InferenceService
kubectl apply -n $NAMESPACE -f inference-service.yaml

# 2. Wait for it to become ready
kubectl get isvc opt-125m -n $NAMESPACE -w

# 3. Deploy the KEDA ScaledObject
kubectl apply -n $NAMESPACE -f scaled-object.yaml

# 4. Verify
kubectl get scaledobject -n $NAMESPACE
kubectl get hpa -n $NAMESPACE
```
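
To watch scaling happen, you can generate sustained load and observe the HPA that
KEDA creates. The snippet below is a sketch: it assumes the predictor Service
exposes port 80 and that the KServe Hugging Face runtime serves an OpenAI-compatible
`/openai/v1/completions` endpoint; adjust the URL and payload for your setup.

```bash
# Port-forward the predictor locally (assumes the Service exposes port 80)
kubectl port-forward -n $NAMESPACE svc/opt-125m-predictor 8080:80 &

# Send a steady stream of completion requests (hypothetical payload; adjust for your model)
while true; do
  curl -s http://localhost:8080/openai/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "opt-125m", "prompt": "Hello", "max_tokens": 128}' > /dev/null
done

# In a second terminal, watch the replica count react
kubectl get hpa -n $NAMESPACE -w
```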

## Customization

**Namespace and model name**: replace `default` and `opt-125m` in the
Prometheus queries inside `scaled-object.yaml`.
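
One way to do the substitution, as a sketch (`my-namespace` and `my-model` are
placeholders; review the file before applying):

```bash
# Hypothetical substitution using GNU sed; double-check scaled-object.yaml afterwards
sed -i \
  -e 's/namespace="default"/namespace="my-namespace"/g' \
  -e 's/model_name="opt-125m"/model_name="my-model"/g' \
  scaled-object.yaml
```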

**Threshold**: the `threshold: "5"` value means "scale up when each replica
handles more than 5 tokens/second on average" (`AverageValue` divides the
query result by replica count). Tune this based on load testing for your
model and hardware.
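
For example, if the query returns 30 tokens/second in total, the HPA computes
`ceil(30 / 5) = 6` desired replicas and caps that at `maxReplicaCount: 3`; when the
total rate drops to around 10 tokens/second, it can scale back to 2 replicas
(ignoring the HPA's built-in tolerance and subject to the scale-down policy).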

**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
from the InferenceService args, add GPU resource requests, and consider
adding a second trigger for GPU KV-cache utilization:

```yaml
# Add to scaled-object.yaml triggers list
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
query: >-
avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
metricType: AverageValue
threshold: "0.75"
```
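
For the resource change, a sketch of what the predictor's `resources` block might
look like on a single GPU (the exact values depend on your hardware;
`nvidia.com/gpu` assumes the NVIDIA device plugin is installed):

```yaml
# Replace the resources block in inference-service.yaml for GPU serving
resources:
  requests:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: 16Gi
    nvidia.com/gpu: "1"
```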

## References

- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)

serving/kserve-keda-autoscaling/inference-service.yaml (new file, +40 lines)

# KServe InferenceService for OPT-125M with vLLM backend.
# Uses RawDeployment mode — required when scaling with KEDA.
#
# This example runs on CPU. For GPU, remove --dtype=float32 and
# --max-model-len, and adjust resources to request nvidia.com/gpu.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: opt-125m
annotations:
# RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
serving.kserve.io/deploymentMode: "RawDeployment"
# Tell KServe not to create its own HPA (KEDA will manage scaling).
serving.kserve.io/autoscalerClass: "external"
spec:
predictor:
minReplicas: 1
maxReplicas: 3
model:
modelFormat:
name: huggingface
args:
- --model_name=opt-125m
- --model_id=facebook/opt-125m
- --backend=vllm
- --dtype=float32
- --max-model-len=512
# Explicit port declaration is required in RawDeployment mode
# for the cluster-wide PodMonitor to discover the metrics endpoint.
ports:
- name: user-port
containerPort: 8080
protocol: TCP
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi

serving/kserve-keda-autoscaling/scaled-object.yaml (new file, +44 lines)

# KEDA ScaledObject for KServe InferenceService with vLLM backend.
# Scales based on total token throughput (prompt + generation) from Prometheus.
#
# Prerequisites:
# - KEDA installed (https://keda.sh/docs/deploy/)
# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
#
# Before deploying, replace:
# - "default" in the Prometheus queries with your namespace
# - "opt-125m" in model_name with your --model_name value
# - The serverAddress if your Prometheus uses a different URL
#
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: opt-125m-scaledobject
spec:
scaleTargetRef:
# In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
name: opt-125m-predictor
minReplicaCount: 1
maxReplicaCount: 3
pollingInterval: 15 # how often KEDA checks the metric (seconds)
  cooldownPeriod: 120 # seconds KEDA waits after the last trigger activation before scaling to zero; scale-down between min and max is governed by the HPA behavior below
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 0
scaleDown:
stabilizationWindowSeconds: 120
policies:
- type: Pods
value: 1 # remove at most 1 replica per minute
periodSeconds: 60
triggers:
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
query: >-
sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+ sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
metricType: AverageValue
threshold: "5"