serving/kserve-keda-autoscaling/README.md (new file, +79 lines)

# KServe Autoscaling with KEDA and Custom Prometheus Metrics

This example demonstrates autoscaling a KServe InferenceService using
[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
It scales on total token throughput (prompt plus generation tokens) rather than
request count, a signal better suited to LLM inference workloads.

For full documentation, see the
[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).

## Why Token Throughput?

LLM requests vary wildly in duration depending on prompt and output length.
Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
Token throughput stays elevated as long as the model is under pressure,
making it a stable scaling signal.
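
Concretely, the scaling query used by this example (defined in `scaled-object.yaml`)
sums the per-second rate of prompt and generation tokens reported by vLLM:

```
sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+ sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
```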

## Prerequisites

- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`)
- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor; see the sketch below for other clusters)
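
If your cluster does not already scrape vLLM metrics, a PodMonitor along these lines
is one option. This is a sketch, not part of this example: it assumes the Prometheus
Operator CRDs are installed, that KServe labels predictor pods with
`serving.kserve.io/inferenceservice`, and that metrics are served on the `user-port`
declared in `inference-service.yaml`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: opt-125m-vllm-metrics   # hypothetical name
  namespace: default
spec:
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: opt-125m
  podMetricsEndpoints:
    - port: user-port   # container port name from inference-service.yaml
      path: /metrics    # vLLM's Prometheus endpoint
```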

## Files

| File | Description |
|------|-------------|
| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |

## Quick Start

```bash
export NAMESPACE="default"

# 1. Deploy the InferenceService
kubectl apply -n $NAMESPACE -f inference-service.yaml

# 2. Wait for it to become ready
kubectl get isvc opt-125m -n $NAMESPACE -w

# 3. Deploy the KEDA ScaledObject
kubectl apply -n $NAMESPACE -f scaled-object.yaml

# 4. Verify
kubectl get scaledobject -n $NAMESPACE
kubectl get hpa -n $NAMESPACE
```
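
To watch scaling happen, you can generate sustained load and observe the HPA that
KEDA creates. The snippet below is a sketch: it assumes the predictor Service
exposes port 80 and that the KServe Hugging Face runtime serves an OpenAI-compatible
`/openai/v1/completions` endpoint; adjust the URL and payload for your setup.

```bash
# Port-forward the predictor locally (assumes the Service exposes port 80)
kubectl port-forward -n $NAMESPACE svc/opt-125m-predictor 8080:80 &

# Send a steady stream of completion requests (hypothetical payload; adjust for your model)
while true; do
  curl -s http://localhost:8080/openai/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "opt-125m", "prompt": "Hello", "max_tokens": 128}' > /dev/null
done

# In a second terminal, watch the replica count react
kubectl get hpa -n $NAMESPACE -w
```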

## Customization

**Namespace and model name**: replace `default` and `opt-125m` in the
Prometheus queries inside `scaled-object.yaml`.
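
One way to do the substitution, as a sketch (`my-namespace` and `my-model` are
placeholders; review the file before applying):

```bash
# Hypothetical substitution using GNU sed; double-check scaled-object.yaml afterwards
sed -i \
  -e 's/namespace="default"/namespace="my-namespace"/g' \
  -e 's/model_name="opt-125m"/model_name="my-model"/g' \
  scaled-object.yaml
```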

**Threshold**: the `threshold: "5"` value means "scale up when each replica
handles more than 5 tokens/second on average" (`AverageValue` divides the
query result by replica count). Tune this based on load testing for your
model and hardware.
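
For example, if the query returns 30 tokens/second in total, the HPA computes
`ceil(30 / 5) = 6` desired replicas and caps that at `maxReplicaCount: 3`; when the
total rate drops to around 10 tokens/second, it can scale back to 2 replicas
(ignoring the HPA's built-in tolerance and subject to the scale-down policy).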

**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
from the InferenceService args, add GPU resource requests, and consider
adding a second trigger for GPU KV-cache utilization:

```yaml
# Add to scaled-object.yaml triggers list
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
query: >-
avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
metricType: AverageValue
threshold: "0.75"
```
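
For the resource change, a sketch of what the predictor's `resources` block might
look like on a single GPU (the exact values depend on your hardware;
`nvidia.com/gpu` assumes the NVIDIA device plugin is installed):

```yaml
# Replace the resources block in inference-service.yaml for GPU serving
resources:
  requests:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: "1"
  limits:
    cpu: "8"
    memory: 16Gi
    nvidia.com/gpu: "1"
```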

## References

- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)

serving/kserve-keda-autoscaling/inference-service.yaml (new file, +40 lines)

# KServe InferenceService for OPT-125M with vLLM backend.
# Uses RawDeployment mode — required when scaling with KEDA.
#
# This example runs on CPU. For GPU, remove --dtype=float32 and
# --max-model-len, and adjust resources to request nvidia.com/gpu.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: opt-125m
annotations:
# RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
serving.kserve.io/deploymentMode: "RawDeployment"
# Tell KServe not to create its own HPA (KEDA will manage scaling).
serving.kserve.io/autoscalerClass: "external"
spec:
predictor:
minReplicas: 1
maxReplicas: 3
model:
modelFormat:
name: huggingface
args:
- --model_name=opt-125m
- --model_id=facebook/opt-125m
- --backend=vllm
- --dtype=float32
- --max-model-len=512
# Explicit port declaration is required in RawDeployment mode
# for the cluster-wide PodMonitor to discover the metrics endpoint.
ports:
- name: user-port
containerPort: 8080
protocol: TCP
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi

serving/kserve-keda-autoscaling/scaled-object.yaml (new file, +44 lines)

# KEDA ScaledObject for KServe InferenceService with vLLM backend.
# Scales based on total token throughput (prompt + generation) from Prometheus.
#
# Prerequisites:
# - KEDA installed (https://keda.sh/docs/deploy/)
# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
#
# Before deploying, replace:
# - "default" in the Prometheus queries with your namespace
# - "opt-125m" in model_name with your --model_name value
# - The serverAddress if your Prometheus uses a different URL
#
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: opt-125m-scaledobject
spec:
scaleTargetRef:
# In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
name: opt-125m-predictor
minReplicaCount: 1
maxReplicaCount: 3
pollingInterval: 15 # how often KEDA checks the metric (seconds)
  cooldownPeriod: 120 # seconds KEDA waits after the last trigger activation before scaling to zero; scale-down between min and max is governed by the HPA behavior below
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 0
scaleDown:
stabilizationWindowSeconds: 120
policies:
- type: Pods
value: 1 # remove at most 1 replica per minute
periodSeconds: 60
triggers:
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
query: >-
sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
+ sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
metricType: AverageValue
threshold: "5"