feat: Add concurrency saturation detector #2062

LukeAVanDrie · 2026-01-05T22:54:25Z

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR introduces the core logic for the Concurrency Saturation Detector, a new component designed to provide real-time saturation signals and traffic shaping for the Flow Control layer.

The Problem: Limitations of Heuristic Thresholds
The current Saturation Detector relies on heuristic thresholds (SD_QUEUE_DEPTH_THRESHOLD, SD_KV_CACHE_UTIL_THRESHOLD). These require careful tuning to achieve the desired average dispatch rate and maintain a healthy buffer. The optimal values can vary based on hardware, model characteristics, and traffic patterns.

Specific challenges include:

Tuning Complexity: Finding the right values is an iterative, trial-and-error process.
Potential Oscillations: Simple thresholds can lead to oscillations in the dispatch rate, where the system rapidly flips between reporting saturated and not saturated.
Model/Hardware Dependence: The current thresholds are not inherently aware of the specific capabilities of the models or hardware.
Indirection: These heuristics are a rough proxy for the desired average dispatch rate that we are ultimately trying to control.

The Solution: concurrencydetector
This component tracks request lifecycles (PreRequest / ResponseComplete) directly within the EPP to maintain an atomic, zero-latency view of in-flight requests. By normalizing capacity into a single control variable (MaxConcurrency), we remove the indirection of proxy metrics.
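To make the lifecycle tracking concrete, here is a minimal sketch of a per-pod in-flight counter. It is not the code from this PR: the type name `inflightTracker` and the `inc`/`dec` methods are hypothetical stand-ins for whatever the `PreRequest` and `ResponseComplete` hooks call internally, and only the `get` accessor name is taken from the diff excerpt in the review comments.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// inflightTracker is a hypothetical stand-in for the detector's internal
// tracker. It keeps one atomic counter per pod ID.
type inflightTracker struct {
	counts sync.Map // pod ID (string) -> *atomic.Int64
}

// counter returns the counter for podID, creating it on first use.
func (t *inflightTracker) counter(podID string) *atomic.Int64 {
	c, _ := t.counts.LoadOrStore(podID, new(atomic.Int64))
	return c.(*atomic.Int64)
}

// inc models what a PreRequest hook would do (assumed name).
func (t *inflightTracker) inc(podID string) int64 { return t.counter(podID).Add(1) }

// dec models what a ResponseComplete hook would do (assumed name).
func (t *inflightTracker) dec(podID string) int64 { return t.counter(podID).Add(-1) }

// get returns the current in-flight count for podID.
func (t *inflightTracker) get(podID string) int64 { return t.counter(podID).Load() }

func main() {
	t := &inflightTracker{}
	var wg sync.WaitGroup
	// 100 concurrent requests start on the same pod.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			t.inc("default/pod-a")
		}()
	}
	wg.Wait()
	t.dec("default/pod-a") // one request completes
	fmt.Println(t.get("default/pod-a"))
}
```

Because every update is a single atomic add, readers never wait on a metrics scrape; the count reflects each request as soon as its hook runs.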

Key Features:

Real-Time Circuit Breaker (IsSaturated): Signals the Flow Controller to stop admitting requests when the aggregate pool is full.
Traffic Shaping (Filter): Implements the Scheduler Filter interface. Unlike the legacy implementation, this explicitly removes overloaded pods from the scheduling view. This solves the "hot spot" issue where the scheduler might continuously route to a saturated pod (leaving others empty) because the saturation detector only looked for "at least one" available pod.
Configurable Headroom: Introduces a Headroom parameter (default 0.0), allowing the scheduler to burst slightly above the saturation limit to satisfy affinity rules without violating hard safety constraints.
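The three features above can be sketched together as follows. This is an illustration under stated assumptions, not the PR's implementation: the `config` struct, the function names, the pool-level rule (total in-flight compared against `MaxConcurrency` times the number of pods), and the per-pod limit formula `MaxConcurrency * (1 + Headroom)` are all guesses at one plausible reading of the description. Only the inclusive `<=` comparison comes from the diff excerpt in the review comments.

```go
package main

import "fmt"

// config holds the two knobs named in the description. The struct itself is
// hypothetical.
type config struct {
	MaxConcurrency int64
	Headroom       float64
}

// isSaturated reports whether the aggregate pool is full. ASSUMPTION: "full"
// means total in-flight >= MaxConcurrency * number of pods. An empty pool
// reports saturated, since 0 >= 0.
func isSaturated(cfg config, inflight map[string]int64) bool {
	var total int64
	for _, n := range inflight {
		total += n
	}
	return total >= cfg.MaxConcurrency*int64(len(inflight))
}

// filter drops pods whose in-flight count exceeds the per-pod limit.
// ASSUMPTION: limit = MaxConcurrency * (1 + Headroom), truncated to an integer.
func filter(cfg config, inflight map[string]int64, pods []string) []string {
	limit := int64(float64(cfg.MaxConcurrency) * (1 + cfg.Headroom))
	kept := make([]string, 0, len(pods))
	for _, p := range pods {
		if inflight[p] <= limit {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	cfg := config{MaxConcurrency: 4, Headroom: 0.25} // per-pod limit is 5
	inflight := map[string]int64{"a": 6, "b": 1, "c": 4}

	fmt.Println(isSaturated(cfg, inflight))                        // pool total 11 < 12
	fmt.Println(filter(cfg, inflight, []string{"a", "b", "c"})) // "a" is over its limit
}
```

In this reading the pool is not saturated, so the Flow Controller keeps admitting, while the filter steers new work away from the overloaded pod `a`.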

Implementation Notes & Scope:

Plugin Status: This package implements the standard requestcontrol (PreRequest, ResponseComplete) and scheduling (Filter) extension points. However, SaturationDetector itself is not yet a dynamic top-level extension point.
Deferred Work: This PR contains only the package implementation and unit tests. Configuration loading, command-line enablement, and wiring into the Director/Scheduler are deferred to the next PR.

Which issue(s) this PR fixes:
Relevant to #1405 and #1793

Does this PR introduce a user-facing change?:

[Experimental] Added the core logic for a new Concurrency Saturation Detector. This component is designed to replace metric-based polling with real-time concurrency tracking, offering easier configuration and explicit traffic shaping. (Note: Not yet enabled by default).

netlify · 2026-01-05T22:54:31Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`3f641ac`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/695c44523400ef000858f467
😎 Deploy Preview	https://deploy-preview-2062--gateway-api-inference-extension.netlify.app

k8s-ci-robot · 2026-01-05T22:54:33Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-01-05T22:54:35Z

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

This introduces the `concurrencydetector` package, a new saturation mechanism based on real-time in-flight request tracking. Unlike the legacy `utilizationdetector` which relies on polling proxy metrics (queue depth, KV cache) and suffers from scrape lag and complex tuning, this detector maintains atomic counters updated via request lifecycle hooks.

Key features:
- Real-time `IsSaturated` signal to drive Flow Control backpressure.
- `Filter` implementation to prevent scheduling to overloaded pods, solving "hot spot" issues where one pod is saturated while others idle.
- Configurable `Headroom` to allow controlled bursting for affinity.

Note: Wiring and configuration enablement are deferred to a follow-up PR.

LukeAVanDrie · 2026-01-05T23:09:50Z

pkg/epp/saturationdetector/framework/plugins/concurrencydetector/detector.go

+// This two-tier approach allows the Flow Controller to manage average pool load, while the Scheduler retains the
+// flexibility to burst slightly above ideal targets (the "Headroom") to satisfy affinity or scoring objectives.
+//
+// # Consistency & Drift Warning


FYI @kfswain

The next PR I am working on will ensure that we have some deferred HandleStreamClosed path in server.go that executes ResponseComplete plugins to ensure symmetry here. I need to check that other plugins do not depend on asymmetry in these failure modes first though.

See: #2064

This is technically a behavioral change.
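For readers unfamiliar with the symmetry concern, the sketch below shows one common Go pattern for guaranteeing that every increment is matched by exactly one decrement, even when a stream closes abnormally. It is not the planned change to server.go: `track`, `handleStream`, and `errClosed` are hypothetical names, and the real hook wiring may look quite different.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// inflight is a single global counter, kept simple for illustration.
var inflight atomic.Int64

// errClosed simulates a stream that ends before the response completes.
var errClosed = errors.New("stream closed")

// track increments the counter and returns a release function that
// decrements it at most once, no matter how many times it is called.
func track() (release func()) {
	inflight.Add(1)
	var once sync.Once
	return func() { once.Do(func() { inflight.Add(-1) }) }
}

// handleStream is a hypothetical request handler. The deferred release covers
// error and disconnect paths; the explicit call covers normal completion.
func handleStream(process func() error) error {
	release := track()
	defer release()
	if err := process(); err != nil {
		return err
	}
	release() // normal completion path
	return nil
}

func main() {
	_ = handleStream(func() error { return nil })       // completes normally
	_ = handleStream(func() error { return errClosed }) // aborted mid-stream
	fmt.Println(inflight.Load())
}
```

The `sync.Once` guard is what makes it safe to call release from both the normal path and the deferred path without double-decrementing.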

LukeAVanDrie · 2026-01-05T23:11:59Z

pkg/epp/saturationdetector/framework/plugins/concurrencydetector/detector.go

+
+	for _, pod := range pods {
+		podID := pod.GetPod().NamespacedName.String()
+		if d.tracker.get(podID) <= limit {


With Headroom = 0, the system effectively allows MaxConcurrency + 1 requests on a specific pod as long as the global pool isn't saturated. Let me know if you prefer the stricter inequality here instead.
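The off-by-one can be checked with a small simulation. This is a standalone sketch, not code from the PR: it assumes requests are admitted to a single pod one at a time for as long as the pod passes the filter check, with nothing completing in between, and the helper names `peak`, `inclusive`, and `strict` are made up for the example.

```go
package main

import "fmt"

// inclusive mirrors the comparison in the diff: the pod stays eligible while
// its in-flight count is less than or equal to the limit.
func inclusive(inflight, limit int64) bool { return inflight <= limit }

// strict is the alternative raised in the comment above.
func strict(inflight, limit int64) bool { return inflight < limit }

// peak returns the highest in-flight count a single pod reaches when requests
// keep arriving for as long as the pod passes the given check.
func peak(limit int64, passes func(inflight, limit int64) bool) int64 {
	var inflight int64
	for passes(inflight, limit) {
		inflight++
	}
	return inflight
}

func main() {
	const maxConcurrency = 4 // with Headroom = 0 the limit equals MaxConcurrency
	fmt.Println(peak(maxConcurrency, inclusive)) // 5: one above the limit
	fmt.Println(peak(maxConcurrency, strict))    // 4: exactly the limit
}
```

With the inclusive check, a pod sitting exactly at the limit is still eligible for one more request, which is where the extra slot comes from.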

kfswain · 2026-01-06T00:15:43Z

/ok-to-test

LukeAVanDrie · 2026-01-06T04:58:48Z

The test failure appears to be a flake:

+ kind create cluster --name inference-e2e
/usr/local/bin/kind: line 1: syntax error near unexpected token `<'
/usr/local/bin/kind: line 1: `<html><body><h1>504 Gateway Time-out</h1>'
make: *** [Makefile:157: test-e2e] Error 2

/retest

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 5, 2026

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 5, 2026

k8s-ci-robot requested review from kfswain and nirrozenbaum January 5, 2026 22:54

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 5, 2026

LukeAVanDrie force-pushed the feat/concurrency-detector branch from 50be0e9 to 3f641ac Compare January 5, 2026 23:08

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 6, 2026
