
Add KServe KEDA autoscaling example with custom metrics#41

Open
hsteude wants to merge 15 commits into main from feature/kserve-keda-autoscaling

Conversation


@hsteude hsteude commented Feb 16, 2026

Summary

Adds an example for autoscaling KServe InferenceServices using KEDA with custom Prometheus metrics from vLLM. This addresses the need for better LLM autoscaling beyond simple request-based scaling.

Features

  • Time To First Token (TTFT) scaling: Scale based on P95 TTFT latency
  • GPU KV-cache utilization scaling: Scale based on vLLM's cache usage (for GPU deployments)
  • Running requests fallback: Traditional request-based scaling as backup
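
A minimal sketch of what such a KEDA ScaledObject could look like, assuming KEDA's `prometheus` scaler and vLLM's TTFT histogram; the target name, Prometheus address, and threshold here are placeholders, not the exact values in this PR:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-ttft-scaler
spec:
  scaleTargetRef:
    name: <predictor-deployment>   # deployment backing the InferenceService predictor
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: <your-prometheus-url>
        # P95 Time To First Token over the last 2 minutes
        query: >-
          histogram_quantile(0.95,
            sum by (le) (rate({__name__="vllm:time_to_first_token_seconds_bucket"}[2m])))
        threshold: "1"   # scale out when P95 TTFT exceeds 1 second
```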

Files Added

  • serving/kserve-keda-autoscaling/inference-service.yaml - Example InferenceService with OPT-125M model
  • serving/kserve-keda-autoscaling/scaled-object.yaml - KEDA ScaledObject with multiple triggers
  • serving/kserve-keda-autoscaling/service-monitor.yaml - PodMonitor and PrometheusRules for metrics
  • serving/kserve-keda-autoscaling/README.md - Documentation

Testing

Tested on prokube cluster:

  • Deployed InferenceService with vLLM backend
  • Configured PodMonitor for Prometheus scraping
  • Verified KEDA ScaledObject triggers on TTFT metric
  • Successfully observed scale-up from 1 to 5 replicas under load
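
The test flow above can be reproduced roughly as follows (a sketch; only the file paths are taken from this PR):

```bash
kubectl apply -f serving/kserve-keda-autoscaling/inference-service.yaml
kubectl apply -f serving/kserve-keda-autoscaling/service-monitor.yaml
kubectl apply -f serving/kserve-keda-autoscaling/scaled-object.yaml

# Confirm Prometheus is scraping and KEDA picked up the ScaledObject
kubectl get podmonitor,scaledobject
kubectl get hpa        # KEDA materializes the ScaledObject as an HPA

# Watch replicas grow under load (1 -> 5 in the test above)
kubectl get pods -w
```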

References

- InferenceService for vLLM-based model serving
- KEDA ScaledObject with multiple scaling strategies (token throughput, GPU, power)
- ServiceMonitor and PrometheusRules for metrics collection
- README with setup instructions and troubleshooting
- Switch from DistilBERT to OPT-125M model with vLLM backend
- Fix Prometheus serverAddress to include /prometheus routePrefix
- Fix metric queries to handle vLLM's colon-namespaced metrics
- Simplify ScaledObject to focus on running/waiting requests
- Update PodMonitor and PrometheusRules for vLLM metrics

Tested on cluster: autoscaling triggers correctly when load increases
- Add Time To First Token (TTFT) P95 as primary scaling metric
- Add GPU KV-cache utilization scaling (for GPU deployments)
- Keep running requests as fallback metric
- Update README to match other examples in repo
- Replace hardcoded namespace with <your-namespace> placeholder
- Fix Prometheus URL to include /prometheus prefix for prokube
- Document vLLM's colon-namespaced metrics (vllm:*)
Copy link
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds a comprehensive example for autoscaling KServe InferenceServices using KEDA with custom Prometheus metrics from vLLM. The example addresses the limitation of traditional request-based autoscaling for LLM workloads by implementing scaling based on Time To First Token (TTFT), GPU KV-cache utilization, and running request count.

Changes:

  • Adds InferenceService configuration for OPT-125M model with vLLM backend on CPU
  • Implements KEDA ScaledObject with three Prometheus-based autoscaling triggers
  • Configures Prometheus monitoring with PodMonitor and recording rules for vLLM metrics
  • Provides comprehensive documentation covering deployment, testing, and troubleshooting

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| serving/kserve-keda-autoscaling/inference-service.yaml | Defines KServe InferenceService for OPT-125M model with vLLM backend and CPU deployment |
| serving/kserve-keda-autoscaling/scaled-object.yaml | Configures KEDA ScaledObject with three Prometheus-based triggers for autoscaling |
| serving/kserve-keda-autoscaling/service-monitor.yaml | Sets up PodMonitor for metrics collection and PrometheusRule for recording rules |
| serving/kserve-keda-autoscaling/README.md | Provides comprehensive documentation with deployment instructions, examples, and troubleshooting |
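
A PodMonitor for scraping vLLM metrics from the predictor pods might look roughly like this (the label selector and port name are assumptions, not the PR's exact values):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: <inference-service-name>
  podMetricsEndpoints:
    - port: <metrics-port-name>   # container port exposing /metrics
      path: /metrics
```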


- Remove unused PrometheusRules (vLLM metrics use colons natively)
- Fix trailing whitespace in scaled-object.yaml
- Clarify that vLLM uses colons in metric names (unusual but correct)
- Add note about minReplicas/maxReplicas when using KEDA
- Add step to find predictor service name before load testing
- Remove prokube-specific reference in troubleshooting
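
The minReplicas/maxReplicas note above likely concerns the fact that replica bounds move to the ScaledObject once KEDA owns scaling; a sketch using KEDA's own fields (names are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-autoscaler
spec:
  scaleTargetRef:
    name: <predictor-deployment>
  minReplicaCount: 1   # lower bound, instead of minReplicas on the workload
  maxReplicaCount: 5   # upper bound, instead of maxReplicas on the workload
```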

hsteude commented Feb 16, 2026

Response to Copilot Review

Thanks for the review! I've addressed most of the feedback, but want to clarify one point where Copilot's suggestion was incorrect:

vLLM metric naming (colons vs underscores)

Copilot suggested that vLLM uses underscores in metric names (e.g., vllm_num_requests_running) and that the colons are from recording rules. This is incorrect.

vLLM actually uses colons in its raw metric names. I verified this directly from the running pod:

```bash
$ kubectl exec -n developer1 $POD -c kserve-container -- curl -s localhost:8080/metrics | grep "^vllm"
vllm:num_requests_running{model_name="facebook/opt-125m"} 0.0
vllm:num_requests_waiting{model_name="facebook/opt-125m"} 0.0
vllm:gpu_cache_usage_perc{model_name="facebook/opt-125m"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="facebook/opt-125m"} 0.0
...
```

This is unusual (the Prometheus convention is underscores, with colons reserved for recording rules), but it's how vLLM implements it. That's why the queries need the `{__name__="vllm:..."}` syntax.

As a result, I've:

  • Removed the PrometheusRules (they were trying to aggregate non-existent underscore metrics)
  • Kept the colon-based metric names in the queries
  • Added clarifying comments that this is unusual but correct
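
For example, a trigger query can select the colon-named metric through the `__name__` label (a sketch; the label values are illustrative):

```yaml
query: sum({__name__="vllm:num_requests_running", model_name="facebook/opt-125m"})
```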


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.



- Fix KEDA trigger description (evaluates all, uses highest replica count)
- Make Prometheus URL configurable (<your-prometheus-url> placeholder)
- Add pod selector to queries to avoid cross-InferenceService metric aggregation
- Update README with additional configuration steps
@tmvfb tmvfb self-requested a review February 18, 2026 08:41
@tmvfb tmvfb force-pushed the feature/kserve-keda-autoscaling branch from 84b79f0 to b69e04d on March 12, 2026 13:22
@tmvfb tmvfb requested a review from Copilot March 12, 2026 13:22
@tmvfb tmvfb previously approved these changes Mar 12, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



AverageValue divides total token throughput by replica count, which means
the per-replica value halves after a scale-up event. With stabilizationWindowSeconds: 0
this could cause flapping near the threshold. Setting it to 30s requires the
metric to stay above threshold for two consecutive polling intervals before
a scale-up is committed, while the existing 120s scaleDown window prevents
premature scale-down.
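
The suggestion above maps onto KEDA's `advanced` block, which is passed through to the underlying HPA (a sketch of just the relevant fields):

```yaml
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30    # metric must persist before a scale-up commits
        scaleDown:
          stabilizationWindowSeconds: 120   # existing window against premature scale-down
```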

@hsteude hsteude left a comment


Thanks Igor, left comments here and there :)

## Quick Start

```bash
export NAMESPACE="default"
```

@hsteude (author): From within a notebook, this section would also work without specifying the namespace, which would make it slightly easier to run. However, I'm not sure whether the Prometheus query can be adjusted accordingly (see below).

@tmvfb tmvfb Mar 13, 2026: adjusted and fixed for simplicity

```yaml
query: >-
  sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
  + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
metricType: AverageValue
```

@hsteude (author): Can we find a way to do this without specifying the namespace here? Ideally a new user doesn't have to edit the files at all.

@tmvfb tmvfb Mar 13, 2026: adjusted and fixed for simplicity. The caveat is that if multiple users run this example at the same time, we will get problems, so maybe we should recommend adjusting the queries, or think about some other approach.

UPD: Added a comment to the README.

```yaml
metricType: AverageValue
threshold: "0.75"
```

@hsteude (author): So what do I do next to see this in action? As a user I'd like to know (a) how to send requests, (b) how to send enough of them that it actually scales up, and (c) how to see that it actually did scale up :)

@tmvfb: added some instructions
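
A hedged sketch of such load-test instructions (the service and model names are placeholders; the actual predictor service name is cluster-specific):

```bash
# Find the predictor service created by KServe
kubectl get svc -n "$NAMESPACE" | grep predictor

# Port-forward and send a burst of requests to vLLM's OpenAI-compatible endpoint
kubectl port-forward -n "$NAMESPACE" svc/<predictor-service> 8080:80 &
for i in $(seq 1 200); do
  curl -s localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 64}' >/dev/null &
done

# Watch the scale-up: KEDA materializes the ScaledObject as an HPA
kubectl get hpa -n "$NAMESPACE" -w
kubectl get pods -n "$NAMESPACE" -w
```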

@tmvfb tmvfb force-pushed the feature/kserve-keda-autoscaling branch from 2a5f07c to 81e4b82 on March 12, 2026 17:21

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



@tmvfb tmvfb force-pushed the feature/kserve-keda-autoscaling branch from 6feebd2 to 35ef20d Compare March 13, 2026 17:35