feat: Add concurrency saturation detector #2062
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
[APPROVALNOTIFIER] This PR is **NOT APPROVED**

This pull-request has been approved by: LukeAVanDrie

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Hi @LukeAVanDrie. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This introduces the `concurrencydetector` package, a new saturation mechanism based on real-time in-flight request tracking. Unlike the legacy `utilizationdetector`, which relies on polling proxy metrics (queue depth, KV cache) and suffers from scrape lag and complex tuning, this detector maintains atomic counters updated via request lifecycle hooks.

Key features:

- Real-time `IsSaturated` signal to drive Flow Control backpressure.
- `Filter` implementation to prevent scheduling to overloaded pods, solving "hot spot" issues where one pod is saturated while others idle.
- Configurable `Headroom` to allow controlled bursting for affinity.

Note: Wiring and configuration enablement are deferred to a follow-up PR.
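To make the hook-driven counting concrete, here is a minimal sketch of the idea, using a mutex-guarded map where the actual package uses atomic counters; the names (`tracker`, `inc`, `dec`, `isSaturated`) are illustrative, not the PR's API.

```go
package concurrency

import "sync"

// tracker keeps an in-flight request count per pod, updated by request
// lifecycle hooks instead of polled proxy metrics.
type tracker struct {
	mu       sync.Mutex
	inFlight map[string]int64
	total    int64
}

func newTracker() *tracker {
	return &tracker{inFlight: make(map[string]int64)}
}

// inc would be driven by a PreRequest-style hook.
func (t *tracker) inc(podID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight[podID]++
	t.total++
}

// dec would be driven by a ResponseComplete-style hook.
func (t *tracker) dec(podID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inFlight[podID] > 0 {
		t.inFlight[podID]--
		t.total--
	}
}

// isSaturated reports whether aggregate in-flight load has reached the
// configured ceiling; no metric scraping or threshold tuning is involved.
func (t *tracker) isSaturated(maxConcurrency int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.total >= maxConcurrency
}
```

Because the counters change at the same moment a request enters or leaves the system, the saturation signal has no scrape lag to compensate for.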
```go
// This two-tier approach allows the Flow Controller to manage average pool load, while the Scheduler retains the
// flexibility to burst slightly above ideal targets (the "Headroom") to satisfy affinity or scoring objectives.
//
// # Consistency & Drift Warning
```
FYI @kfswain

The next PR I am working on will add a deferred HandleStreamClosed path in server.go that executes ResponseComplete plugins, ensuring symmetry here. First, though, I need to verify that no other plugins depend on asymmetry in these failure modes.
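For readers following along, a minimal sketch of the symmetry being described, with hypothetical types standing in for the real server.go plumbing; only the `defer` pattern is the point:

```go
package main

import "fmt"

// Hypothetical stand-ins for the EPP's request plumbing.
type request struct{ id string }

type responseCompletePlugin interface {
	ResponseComplete(*request)
}

type countingDetector struct{ inFlight int }

func (d *countingDetector) PreRequest(*request)       { d.inFlight++ }
func (d *countingDetector) ResponseComplete(*request) { d.inFlight-- }

type server struct {
	detector *countingDetector
	plugins  []responseCompletePlugin
}

// handleStream runs ResponseComplete plugins on every exit path (success,
// error, or early return), keeping the lifecycle hooks symmetric.
func (s *server) handleStream(req *request) error {
	s.detector.PreRequest(req)
	defer func() {
		for _, p := range s.plugins {
			p.ResponseComplete(req)
		}
	}()
	return nil // request processing elided
}

func main() {
	d := &countingDetector{}
	s := &server{detector: d, plugins: []responseCompletePlugin{d}}
	_ = s.handleStream(&request{id: "r1"})
	fmt.Println("in-flight after close:", d.inFlight) // 0: counters stay balanced
}
```

Without this symmetry, a request that fails mid-stream would leave the in-flight counter permanently inflated.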
See: #2064
This is technically a behavioral change.
```go
for _, pod := range pods {
	podID := pod.GetPod().NamespacedName.String()
	if d.tracker.get(podID) <= limit {
```
With `Headroom = 0`, the system effectively allows `MaxConcurrency + 1` requests on a specific pod as long as the global pool isn't saturated. Let me know if you prefer the stricter inequality here instead.
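A toy illustration of that boundary, with hypothetical numbers:

```go
package main

import "fmt"

func main() {
	// With Headroom = 0 the per-pod limit equals the target, and the <=
	// comparison lets a pod already at the limit accept one more request.
	const limit = 10
	inFlight := 10 // pod is exactly at its limit

	fmt.Println("schedulable with <=:", inFlight <= limit) // true  (limit+1 possible)
	fmt.Println("schedulable with <: ", inFlight < limit)  // false (hard cap at limit)
}
```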
/ok-to-test
The test failure appears to be a flake.

/retest
**What type of PR is this?**

/kind feature
**What this PR does / why we need it:**
This PR introduces the core logic for the Concurrency Saturation Detector, a new component designed to provide real-time saturation signals and traffic shaping for the Flow Control layer.
**The Problem: Limitations of Heuristic Thresholds**

The current Saturation Detector relies on heuristic thresholds (`SD_QUEUE_DEPTH_THRESHOLD`, `SD_KV_CACHE_UTIL_THRESHOLD`). These require careful tuning to achieve the desired average dispatch rate and to maintain a healthy buffer, and the optimal values can vary based on hardware, model characteristics, and traffic patterns.

Specific challenges include:

- Scrape lag: the thresholds are evaluated against polled proxy metrics, so the signal trails actual load by up to a scrape interval.
- Tuning burden: values that work for one hardware profile, model, or traffic pattern must be re-derived for another.
**The Solution: `concurrencydetector`**

This component tracks request lifecycles (`PreRequest`/`ResponseComplete`) directly within the EPP to maintain an atomic, zero-latency view of in-flight requests. By normalizing capacity into a single control variable (`MaxConcurrency`), we remove the indirection of proxy metrics.
**Key Features:**

- Real-time saturation signal (`IsSaturated`): Signals the Flow Controller to stop admitting requests when the aggregate pool is full.
- Traffic-shaping filter (`Filter`): Implements the Scheduler `Filter` interface. Unlike the legacy implementation, this explicitly removes overloaded pods from the scheduling view (see the sketch after this list). This solves the "hot spot" issue where the scheduler might continuously route to a saturated pod (leaving others empty) because the saturation detector only looked for "at least one" available pod.
- Controlled bursting: A configurable `Headroom` parameter (default 0.0) allows the scheduler to burst slightly above the saturation limit to satisfy affinity rules without violating hard safety constraints.
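A minimal sketch of the filtering step referenced above, modeled on the loop shown in the review diff; everything other than the loop's comparison (`pod`, `tracker`, `filterPods`, the example values) is illustrative scaffolding:

```go
package main

import "fmt"

type pod struct{ name string }

func (p pod) ID() string { return p.name }

type tracker map[string]int64

func (t tracker) get(podID string) int64 { return t[podID] }

type detector struct{ tracker tracker }

// filterPods keeps pods at or below the per-pod limit and drops overloaded
// ones from the scheduler's view, which is what prevents "hot spots".
func (d *detector) filterPods(pods []pod, limit int64) []pod {
	var fit []pod
	for _, p := range pods {
		if d.tracker.get(p.ID()) <= limit {
			fit = append(fit, p)
		}
	}
	return fit
}

func main() {
	d := &detector{tracker: tracker{"pod-a": 12, "pod-b": 3}}
	fmt.Println(d.filterPods([]pod{{"pod-a"}, {"pod-b"}}, 10)) // [{pod-b}]
}
```

Because saturated pods are removed from the candidate set entirely, the scheduler cannot keep routing to a single hot pod while others sit idle.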
**Implementation Notes & Scope:**

- Integrates via the `requestcontrol` (`PreRequest`, `ResponseComplete`) and `scheduling` (`Filter`) extension points. However, `SaturationDetector` itself is not yet a dynamic top-level extension point.

**Which issue(s) this PR fixes:**
Relevant to #1405 and #1793
**Does this PR introduce a user-facing change?:**