
Add KServe KEDA autoscaling example with custom metrics#41

Open
hsteude wants to merge 15 commits into main from feature/kserve-keda-autoscaling

Conversation


@hsteude hsteude commented Feb 16, 2026

Summary

Adds an example for autoscaling KServe InferenceServices using KEDA with custom Prometheus metrics from vLLM. This addresses the need for better LLM autoscaling beyond simple request-based scaling.

Features

  • Time To First Token (TTFT) scaling: Scale based on P95 TTFT latency
  • GPU KV-cache utilization scaling: Scale based on vLLM's cache usage (for GPU deployments)
  • Running requests fallback: Traditional request-based scaling as backup
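
A minimal sketch of what such a KEDA ScaledObject could look like, assuming KEDA's `prometheus` scaler and vLLM's TTFT histogram; the target name, Prometheus address, and threshold here are placeholders, not the exact values in this PR:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-ttft-scaler
spec:
  scaleTargetRef:
    name: <predictor-deployment>   # deployment backing the InferenceService predictor
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: <your-prometheus-url>
        # P95 Time To First Token over the last 2 minutes
        query: >-
          histogram_quantile(0.95,
            sum by (le) (rate({__name__="vllm:time_to_first_token_seconds_bucket"}[2m])))
        threshold: "1"   # scale out when P95 TTFT exceeds 1 second
```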

Files Added

  • serving/kserve-keda-autoscaling/inference-service.yaml - Example InferenceService with OPT-125M model
  • serving/kserve-keda-autoscaling/scaled-object.yaml - KEDA ScaledObject with multiple triggers
  • serving/kserve-keda-autoscaling/service-monitor.yaml - PodMonitor and PrometheusRules for metrics
  • serving/kserve-keda-autoscaling/README.md - Documentation

Testing

Tested on prokube cluster:

  • Deployed InferenceService with vLLM backend
  • Configured PodMonitor for Prometheus scraping
  • Verified KEDA ScaledObject triggers on TTFT metric
  • Successfully observed scale-up from 1 to 5 replicas under load
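
The test flow above can be reproduced roughly as follows (a sketch; only the file paths are taken from this PR):

```bash
kubectl apply -f serving/kserve-keda-autoscaling/inference-service.yaml
kubectl apply -f serving/kserve-keda-autoscaling/service-monitor.yaml
kubectl apply -f serving/kserve-keda-autoscaling/scaled-object.yaml

# Confirm Prometheus is scraping and KEDA picked up the ScaledObject
kubectl get podmonitor,scaledobject
kubectl get hpa        # KEDA materializes the ScaledObject as an HPA

# Watch replicas grow under load (1 -> 5 in the test above)
kubectl get pods -w
```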

References

- InferenceService for vLLM-based model serving
- KEDA ScaledObject with multiple scaling strategies (token throughput, GPU, power)
- ServiceMonitor and PrometheusRules for metrics collection
- README with setup instructions and troubleshooting
- Switch from DistilBERT to OPT-125M model with vLLM backend
- Fix Prometheus serverAddress to include /prometheus routePrefix
- Fix metric queries to handle vLLM's colon-namespaced metrics
- Simplify ScaledObject to focus on running/waiting requests
- Update PodMonitor and PrometheusRules for vLLM metrics

Tested on cluster: autoscaling triggers correctly when load increases
- Add Time To First Token (TTFT) P95 as primary scaling metric
- Add GPU KV-cache utilization scaling (for GPU deployments)
- Keep running requests as fallback metric
- Update README to match other examples in repo
- Replace hardcoded namespace with <your-namespace> placeholder
- Fix Prometheus URL to include /prometheus prefix for prokube
- Document vLLM's colon-namespaced metrics (vllm:*)
Copy link
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds a comprehensive example for autoscaling KServe InferenceServices using KEDA with custom Prometheus metrics from vLLM. The example addresses the limitation of traditional request-based autoscaling for LLM workloads by implementing scaling based on Time To First Token (TTFT), GPU KV-cache utilization, and running request count.

Changes:

  • Adds InferenceService configuration for OPT-125M model with vLLM backend on CPU
  • Implements KEDA ScaledObject with three Prometheus-based autoscaling triggers
  • Configures Prometheus monitoring with PodMonitor and recording rules for vLLM metrics
  • Provides comprehensive documentation covering deployment, testing, and troubleshooting

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| serving/kserve-keda-autoscaling/inference-service.yaml | Defines KServe InferenceService for OPT-125M model with vLLM backend and CPU deployment |
| serving/kserve-keda-autoscaling/scaled-object.yaml | Configures KEDA ScaledObject with three Prometheus-based triggers for autoscaling |
| serving/kserve-keda-autoscaling/service-monitor.yaml | Sets up PodMonitor for metrics collection and PrometheusRule for recording rules |
| serving/kserve-keda-autoscaling/README.md | Provides comprehensive documentation with deployment instructions, examples, and troubleshooting |
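
A PodMonitor for scraping vLLM metrics from the predictor pods might look roughly like this (the label selector and port name are assumptions, not the PR's exact values):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: <inference-service-name>
  podMetricsEndpoints:
    - port: <metrics-port-name>   # container port exposing /metrics
      path: /metrics
```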


- Remove unused PrometheusRules (vLLM metrics use colons natively)
- Fix trailing whitespace in scaled-object.yaml
- Clarify that vLLM uses colons in metric names (unusual but correct)
- Add note about minReplicas/maxReplicas when using KEDA
- Add step to find predictor service name before load testing
- Remove prokube-specific reference in troubleshooting
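
The minReplicas/maxReplicas note above likely concerns the fact that replica bounds move to the ScaledObject once KEDA owns scaling; a sketch using KEDA's own fields (names are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-autoscaler
spec:
  scaleTargetRef:
    name: <predictor-deployment>
  minReplicaCount: 1   # lower bound, instead of minReplicas on the workload
  maxReplicaCount: 5   # upper bound, instead of maxReplicas on the workload
```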

hsteude commented Feb 16, 2026

Response to Copilot Review

Thanks for the review! I've addressed most of the feedback, but want to clarify one point where Copilot's suggestion was incorrect:

vLLM metric naming (colons vs underscores)

Copilot suggested that vLLM uses underscores in metric names (e.g., vllm_num_requests_running) and that the colons are from recording rules. This is incorrect.

vLLM actually uses colons in its raw metric names. I verified this directly from the running pod:

```bash
$ kubectl exec -n developer1 $POD -c kserve-container -- curl -s localhost:8080/metrics | grep "^vllm"
vllm:num_requests_running{model_name="facebook/opt-125m"} 0.0
vllm:num_requests_waiting{model_name="facebook/opt-125m"} 0.0
vllm:gpu_cache_usage_perc{model_name="facebook/opt-125m"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="facebook/opt-125m"} 0.0
...
```

This is unusual (the Prometheus convention is underscores, with colons reserved for recording rules), but it's how vLLM implements it. That's why the queries need the `{__name__="vllm:..."}` syntax.

As a result, I've:

  • Removed the PrometheusRules (they were trying to aggregate non-existent underscore metrics)
  • Kept the colon-based metric names in the queries
  • Added clarifying comments that this is unusual but correct
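
For example, a trigger query can select the colon-named metric through the `__name__` label (a sketch; the label values are illustrative):

```yaml
query: sum({__name__="vllm:num_requests_running", model_name="facebook/opt-125m"})
```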


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.



- Fix KEDA trigger description (evaluates all, uses highest replica count)
- Make Prometheus URL configurable (<your-prometheus-url> placeholder)
- Add pod selector to queries to avoid cross-InferenceService metric aggregation
- Update README with additional configuration steps
@tmvfb tmvfb self-requested a review February 18, 2026 08:41
@tmvfb tmvfb force-pushed the feature/kserve-keda-autoscaling branch from 84b79f0 to b69e04d on March 12, 2026 13:22
@tmvfb tmvfb requested a review from Copilot March 12, 2026 13:22
@tmvfb tmvfb previously approved these changes Mar 12, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



AverageValue divides total token throughput by replica count, which means
the per-replica value halves after a scale-up event. With stabilizationWindowSeconds: 0
this could cause flapping near the threshold. Setting it to 30s requires the
metric to stay above threshold for two consecutive polling intervals before
a scale-up is committed, while the existing 120s scaleDown window prevents
premature scale-down.
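
The suggestion above maps onto KEDA's `advanced` block, which is passed through to the underlying HPA (a sketch of just the relevant fields):

```yaml
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30    # metric must persist before a scale-up commits
        scaleDown:
          stabilizationWindowSeconds: 120   # existing window against premature scale-down
```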

@hsteude hsteude left a comment


Thanks Igor, left comments here and there :)

## Quick Start

```bash
export NAMESPACE="default"
```

@hsteude (author): From within a notebook, this section would also work without specifying the namespace, which would make it slightly easier to run. However, I'm not sure whether the Prometheus query can be adjusted accordingly (see below).

@tmvfb tmvfb Mar 13, 2026: adjusted and fixed for simplicity

```yaml
query: >-
  sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
  + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
metricType: AverageValue
```

@hsteude (author): Can we find a way to do this without specifying the namespace here? Ideally a new user doesn't have to edit the files at all.

@tmvfb tmvfb Mar 13, 2026: adjusted and fixed for simplicity. The caveat is that if multiple users run this example at the same time, we will get problems, so maybe we should recommend adjusting the queries, or think about some other approach.

UPD: Added a comment to the README.

```yaml
metricType: AverageValue
threshold: "0.75"
```

@hsteude (author): So what do I do next to see this in action? As a user I'd like to know (a) how to send requests, (b) how to send enough of them that it actually scales up, and (c) how to see that it actually did scale up :)

@tmvfb: added some instructions
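
A hedged sketch of such load-test instructions (the service and model names are placeholders; the actual predictor service name is cluster-specific):

```bash
# Find the predictor service created by KServe
kubectl get svc -n "$NAMESPACE" | grep predictor

# Port-forward and send a burst of requests to vLLM's OpenAI-compatible endpoint
kubectl port-forward -n "$NAMESPACE" svc/<predictor-service> 8080:80 &
for i in $(seq 1 200); do
  curl -s localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 64}' >/dev/null &
done

# Watch the scale-up: KEDA materializes the ScaledObject as an HPA
kubectl get hpa -n "$NAMESPACE" -w
kubectl get pods -n "$NAMESPACE" -w
```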

@tmvfb tmvfb force-pushed the feature/kserve-keda-autoscaling branch from 2a5f07c to 81e4b82 on March 12, 2026 17:21

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



@tmvfb tmvfb force-pushed the feature/kserve-keda-autoscaling branch from 6feebd2 to 35ef20d Compare March 13, 2026 17:35