Add KServe KEDA autoscaling example with custom metrics #41
Merged

Commits (17):
- bcf91f8 Add KServe KEDA autoscaling example with custom Prometheus metrics (hsteude)
- 03fc3cd Update KEDA autoscaling example to use vLLM with OPT-125M (hsteude)
- 471ba50 Update KEDA autoscaling example with TTFT scaling and documentation (hsteude)
- dcfea4f Remove prokube-specific Prometheus note from prerequisites (hsteude)
- 93ae429 Address Copilot review feedback (hsteude)
- b1a1034 Address additional Copilot review feedback (hsteude)
- b69e04d Update KEDA example with new insights (tmvfb)
- b3816d7 Add scaleUp stabilization window to mitigate metric oscillation (tmvfb)
- 81e4b82 Address reviewer's feedback (tmvfb)
- f2e6f96 Better readme and scaling watching instructions (tmvfb)
- 75c9d0a Fix service URL to use internal cluster address and simplify observe … (tmvfb)
- 392a22a Improve KEDA autoscaling documentation (tmvfb)
- 35ef20d Warn about KEDA availability and namespace metric collision (tmvfb)
- bd2a6e4 Improve load generation (tmvfb)
- ae78abb Update dashboards in readme (tmvfb)
- d0635cd Add a mermaid diagram to illustrate KEDA (tmvfb)
- 5fec793 Prettify dashboard name and readme nitpick (tmvfb)

@@ -0,0 +1,79 @@

# KServe Autoscaling with KEDA and Custom Prometheus Metrics

This example demonstrates autoscaling a KServe InferenceService using
[KEDA](https://keda.sh/) with custom Prometheus metrics from vLLM.
It scales based on total token throughput rather than simple request count,
which is better suited for LLM inference workloads.

For full documentation, see the
[prokube autoscaling docs](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/#keda-kubernetes-event-driven-autoscaling).

## Why Token Throughput?

LLM requests vary wildly in duration depending on prompt and output length.
Request-count metrics (concurrency, QPS) don't reflect actual GPU load.
Token throughput stays elevated as long as the model is under pressure,
making it a stable scaling signal.
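
You can sanity-check this signal by querying Prometheus directly. A minimal
sketch, assuming the in-cluster Prometheus address used in `scaled-object.yaml`
and the default `namespace`/`model_name` labels:

```bash
# Run a throwaway curl pod and ask Prometheus for the combined token rate
# that KEDA will scale on (adjust the URL and labels to your setup).
kubectl run promql-check --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sG 'http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus/api/v1/query' \
  --data-urlencode 'query=sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m])) + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))'
```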

## Prerequisites

- KEDA installed in the cluster (`helm install keda kedacore/keda -n keda --create-namespace`)
- Prometheus scraping vLLM metrics (prokube clusters include a cluster-wide PodMonitor)

A quick sanity check for both is sketched below.
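
This check assumes KEDA went into the `keda` namespace as in the Helm command
above, and that Prometheus comes from a Prometheus Operator installation
(resource names vary by setup):

```bash
# KEDA operator running and its CRDs registered?
kubectl get pods -n keda
kubectl get crd scaledobjects.keda.sh

# A PodMonitor that can discover the vLLM metrics endpoint?
kubectl get podmonitor -A
```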

## Files

| File | Description |
|------|-------------|
| `inference-service.yaml` | KServe InferenceService (OPT-125M, RawDeployment mode) |
| `scaled-object.yaml` | KEDA ScaledObject — scales on token throughput |

## Quick Start

```bash
export NAMESPACE="default"

# 1. Deploy the InferenceService
kubectl apply -n $NAMESPACE -f inference-service.yaml

# 2. Wait for it to become ready
kubectl get isvc opt-125m -n $NAMESPACE -w

# 3. Deploy the KEDA ScaledObject
kubectl apply -n $NAMESPACE -f scaled-object.yaml

# 4. Verify
kubectl get scaledobject -n $NAMESPACE
kubectl get hpa -n $NAMESPACE
```
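
To see scaling in action, generate sustained load and watch the HPA that KEDA
creates. A rough sketch; the `/openai/v1/completions` path and the
`opt-125m-predictor` Service name are assumptions based on KServe's Hugging
Face runtime and RawDeployment naming conventions, so check your deployment's
routes first:

```bash
# Terminal 1: forward the predictor's port (8080 is declared in
# inference-service.yaml).
kubectl port-forward -n $NAMESPACE svc/opt-125m-predictor 8080:8080

# Terminal 2: a crude load loop to push token throughput over the threshold.
while true; do
  curl -s http://localhost:8080/openai/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "opt-125m", "prompt": "Hello", "max_tokens": 128}' > /dev/null
done

# Terminal 3: watch KEDA's HPA drive the replica count up and back down.
kubectl get hpa -n $NAMESPACE -w
```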

## Customization

**Namespace and model name**: replace `default` and `opt-125m` in the
Prometheus queries inside `scaled-object.yaml`.
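
For example, a convenience one-liner (the `my-namespace` and `my-model` values
here are placeholders):

```bash
# Rewrite both labels in place before applying (GNU sed syntax).
sed -i 's/namespace="default"/namespace="my-namespace"/g; s/model_name="opt-125m"/model_name="my-model"/g' scaled-object.yaml
```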

**Threshold**: the `threshold: "5"` value means "scale up when each replica
handles more than 5 tokens/second on average" (`AverageValue` divides the
query result by replica count). Tune this based on load testing for your
model and hardware.
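
Concretely, with `AverageValue` the HPA computes
`desiredReplicas = ceil(metricValue / threshold)`: a query result of
12 tokens/s against `threshold: "5"` yields ceil(12 / 5) = 3 replicas,
capped by `maxReplicaCount`.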

**GPU deployments**: remove `--dtype=float32` and `--max-model-len=512`
from the InferenceService args, add GPU resource requests, and consider
adding a second trigger for GPU KV-cache utilization:

```yaml
# Add to scaled-object.yaml triggers list
- type: prometheus
  metricType: AverageValue
  metadata:
    serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
    query: >-
      avg(vllm:gpu_cache_usage_perc{namespace="my-namespace",model_name="my-model"})
    threshold: "0.75"
```
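
`vllm:gpu_cache_usage_perc` reports KV-cache usage as a fraction (1.0 means
100%), so `threshold: "0.75"` adds replicas once the average cache across
pods is roughly three-quarters full.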

## References

- [prokube autoscaling documentation](https://docs.prokube.cloud/user_docs/model_serving_autoscaling/)
- [KServe KEDA autoscaler docs](https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler)
- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
- [vLLM metrics reference](https://docs.vllm.ai/en/latest/serving/metrics.html)
@@ -0,0 +1,40 @@
# KServe InferenceService for OPT-125M with vLLM backend.
# Uses RawDeployment mode — required when scaling with KEDA.
#
# This example runs on CPU. For GPU, remove --dtype=float32 and
# --max-model-len, and adjust resources to request nvidia.com/gpu.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: opt-125m
  annotations:
    # RawDeployment mode — creates a plain Deployment instead of a Knative Revision.
    serving.kserve.io/deploymentMode: "RawDeployment"
    # Tell KServe not to create its own HPA (KEDA will manage scaling).
    serving.kserve.io/autoscalerClass: "external"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=opt-125m
        - --model_id=facebook/opt-125m
        - --backend=vllm
        - --dtype=float32
        - --max-model-len=512
      # Explicit port declaration is required in RawDeployment mode
      # for the cluster-wide PodMonitor to discover the metrics endpoint.
      ports:
        - name: user-port
          containerPort: 8080
          protocol: TCP
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 8Gi
@@ -0,0 +1,44 @@
# KEDA ScaledObject for KServe InferenceService with vLLM backend.
# Scales based on total token throughput (prompt + generation) from Prometheus.
#
# Prerequisites:
# - KEDA installed (https://keda.sh/docs/deploy/)
# - Prometheus scraping vLLM metrics (prokube includes a cluster-wide PodMonitor)
#
# Before deploying, replace:
# - "default" in the Prometheus queries with your namespace
# - "opt-125m" in model_name with your --model_name value
# - The serverAddress if your Prometheus uses a different URL
#
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opt-125m-scaledobject
spec:
  scaleTargetRef:
    # In RawDeployment mode KServe names the Deployment {isvc-name}-predictor.
    name: opt-125m-predictor
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 15 # how often KEDA checks the metric (seconds)
  cooldownPeriod: 120 # seconds after last trigger activation before scaling to minReplicaCount
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
        scaleDown:
          stabilizationWindowSeconds: 120
          policies:
            - type: Pods
              value: 1 # remove at most 1 replica per minute
              periodSeconds: 60
  triggers:
    - type: prometheus
      metricType: AverageValue
      metadata:
        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/prometheus
        query: >-
          sum(rate(vllm:prompt_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
          + sum(rate(vllm:generation_tokens_total{namespace="default",model_name="opt-125m"}[2m]))
        threshold: "5"