@LukeAVanDrie
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR introduces the core logic for the Concurrency Saturation Detector, a new component designed to provide real-time saturation signals and traffic shaping for the Flow Control layer.

The Problem: Limitations of Heuristic Thresholds
The current Saturation Detector relies on heuristic thresholds (SD_QUEUE_DEPTH_THRESHOLD, SD_KV_CACHE_UTIL_THRESHOLD). These require careful tuning to achieve the desired average dispatch rate and maintain a healthy buffer. The optimal values can vary based on hardware, model characteristics, and traffic patterns.

Specific challenges include:

  • Tuning Complexity: Finding the right balance is an iterative, trial-and-error process.
  • Potential Oscillations: Simple thresholds can cause the dispatch rate to oscillate, with the system rapidly switching between the saturated and unsaturated states.
  • Model/Hardware Dependence: The current thresholds are not inherently aware of the specific capabilities of the models or hardware.
  • Indirection: These heuristics are a rough proxy for the desired average dispatch rate that we are ultimately trying to control.

The Solution: concurrencydetector
This component tracks request lifecycles (PreRequest / ResponseComplete) directly within the EPP to maintain an atomic, zero-latency view of in-flight requests. By normalizing capacity into a single control variable (MaxConcurrency), we remove the indirection of proxy metrics.
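
As a rough illustration of the lifecycle tracking described above, the sketch below keeps one atomic counter per pod and adjusts it from two hook-like methods. The type and method names (`inflightTracker`, `preRequest`, `responseComplete`) are hypothetical and chosen for this example; they are not taken from the PR's code, and the real package may structure its tracker differently.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// inflightTracker keeps one atomic in-flight counter per pod.
// Hypothetical type for illustration only; not the PR's implementation.
type inflightTracker struct {
	counts sync.Map // pod ID (string) -> *atomic.Int64
}

// counter returns the counter for podID, creating it on first use.
func (t *inflightTracker) counter(podID string) *atomic.Int64 {
	if c, ok := t.counts.Load(podID); ok {
		return c.(*atomic.Int64)
	}
	c, _ := t.counts.LoadOrStore(podID, new(atomic.Int64))
	return c.(*atomic.Int64)
}

// preRequest stands in for the PreRequest hook: a request was dispatched to
// podID. It returns the new in-flight count.
func (t *inflightTracker) preRequest(podID string) int64 {
	return t.counter(podID).Add(1)
}

// responseComplete stands in for the ResponseComplete hook: the request
// finished. It returns the new in-flight count.
func (t *inflightTracker) responseComplete(podID string) int64 {
	return t.counter(podID).Add(-1)
}

// get reports the current in-flight count for podID (0 if never seen).
func (t *inflightTracker) get(podID string) int64 {
	if c, ok := t.counts.Load(podID); ok {
		return c.(*atomic.Int64).Load()
	}
	return 0
}
```

Because every update is a single atomic add, a reader never waits on a metrics scrape to learn the current load. The count is only as accurate as the pairing of the two hooks, though: a missed `responseComplete` call leaves it permanently too high.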

Key Features:

  • Real-Time Circuit Breaker (IsSaturated): Signals the Flow Controller to stop admitting requests when the aggregate pool is full.
  • Traffic Shaping (Filter): Implements the Scheduler Filter interface. Unlike the legacy implementation, this explicitly removes overloaded pods from the scheduling view. This solves the "hot spot" issue where the scheduler might continuously route to a saturated pod (leaving others empty) because the saturation detector only looked for "at least one" available pod.
  • Configurable Headroom: Introduces a Headroom parameter (default 0.0), allowing the scheduler to burst slightly above the saturation limit to satisfy affinity rules without violating hard safety constraints.
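
The two tiers listed above can be sketched as follows. The formulas here are assumptions made for illustration: the pool is treated as saturated once total in-flight requests reach `len(pods) * MaxConcurrency`, and the per-pod filter limit is taken to be `MaxConcurrency * (1 + Headroom)`. Only the names `MaxConcurrency` and `Headroom` and the `<=` comparison appear in this PR's text; `config`, `podLimit`, `isSaturated`, and `filter` are hypothetical helpers.

```go
package main

// config holds the two knobs named in the PR description.
type config struct {
	MaxConcurrency int64   // target in-flight requests per pod
	Headroom       float64 // fractional burst allowance above the target
}

// podLimit is the assumed per-pod ceiling used by the filter:
// MaxConcurrency scaled by (1 + Headroom), truncated to an integer.
func podLimit(cfg config) int64 {
	return int64(float64(cfg.MaxConcurrency) * (1.0 + cfg.Headroom))
}

// isSaturated is the assumed pool-level circuit breaker: it reports true once
// the summed in-flight count reaches the summed target capacity.
func isSaturated(cfg config, pods []string, inflight map[string]int64) bool {
	var total int64
	for _, id := range pods {
		total += inflight[id]
	}
	return total >= int64(len(pods))*cfg.MaxConcurrency
}

// filter keeps only the pods whose in-flight count is within the per-pod
// limit, preserving input order. Overloaded pods are dropped from the
// scheduling view instead of merely being counted.
func filter(cfg config, pods []string, inflight map[string]int64) []string {
	limit := podLimit(cfg)
	kept := make([]string, 0, len(pods))
	for _, id := range pods {
		if inflight[id] <= limit {
			kept = append(kept, id)
		}
	}
	return kept
}
```

For example, with `MaxConcurrency: 4` and `Headroom: 0.5` the assumed per-pod limit is 6, so a pod holding 7 requests is filtered out while a pod holding 6 remains schedulable. The pool-level check still uses the unscaled target of 4 per pod.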

Implementation Notes & Scope:

  • Plugin Status: This package implements the standard requestcontrol (PreRequest, ResponseComplete) and scheduling (Filter) extension points. However, SaturationDetector itself is not yet a dynamic top-level extension point.
  • Deferred Work: This PR contains only the package implementation and unit tests. Configuration loading, command-line enablement, and wiring into the Director/Scheduler are deferred to the next PR.

Which issue(s) this PR fixes:
Relevant to #1405 and #1793

Does this PR introduce a user-facing change?:

[Experimental] Added the core logic for a new Concurrency Saturation Detector. This component is designed to replace metric-based polling with real-time concurrency tracking, offering easier configuration and explicit traffic shaping. (Note: Not yet enabled by default).

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 5, 2026
@netlify

netlify bot commented Jan 5, 2026

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 3f641ac
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/695c44523400ef000858f467
😎 Deploy Preview https://deploy-preview-2062--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 5, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 5, 2026
This introduces the `concurrencydetector` package, a new saturation
mechanism based on real-time in-flight request tracking.

Unlike the legacy `utilizationdetector` which relies on polling proxy
metrics (queue depth, KV cache) and suffers from scrape lag and complex
tuning, this detector maintains atomic counters updated via request
lifecycle hooks.

Key features:
- Real-time `IsSaturated` signal to drive Flow Control backpressure.
- `Filter` implementation to prevent scheduling to overloaded pods,
  solving "hot spot" issues where one pod is saturated while others
  idle.
- Configurable `Headroom` to allow controlled bursting for affinity.

Note: Wiring and configuration enablement are deferred to a follow-up
PR.
@LukeAVanDrie LukeAVanDrie force-pushed the feat/concurrency-detector branch from 50be0e9 to 3f641ac Compare January 5, 2026 23:08
// This two-tier approach allows the Flow Controller to manage average pool load, while the Scheduler retains the
// flexibility to burst slightly above ideal targets (the "Headroom") to satisfy affinity or scoring objectives.
//
// # Consistency & Drift Warning
Contributor Author

@LukeAVanDrie LukeAVanDrie Jan 5, 2026

FYI @kfswain

The next PR I am working on will ensure that we have some deferred HandleStreamClosed path in server.go that executes ResponseComplete plugins to ensure symmetry here. I need to check that other plugins do not depend on asymmetry in these failure modes first though.

Contributor Author

See: #2064

This is technically a behavioral change.
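
To illustrate why the symmetry discussed in this thread matters, here is a minimal sketch of pairing the increment with a deferred decrement so the counter is restored on every exit path, including errors such as a closed stream. `streamTracker`, `serve`, and `errStreamClosed` are hypothetical names for this example; this is not the `server.go` change referenced above, only the general Go pattern it would presumably rely on.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// errStreamClosed is a hypothetical sentinel for an abnormal stream teardown.
var errStreamClosed = errors.New("stream closed")

// streamTracker counts requests that have started but not yet completed.
type streamTracker struct {
	inflight atomic.Int64
}

// inFlight reports the current number of in-flight requests.
func (s *streamTracker) inFlight() int64 {
	return s.inflight.Load()
}

// serve wraps one request. The increment plays the role of PreRequest and the
// deferred decrement plays the role of ResponseComplete. Because the
// decrement is deferred, it runs whether process returns nil, returns an
// error, or panics, so the counter cannot drift upward on failure paths.
func (s *streamTracker) serve(process func() error) error {
	s.inflight.Add(1)
	defer s.inflight.Add(-1)
	return process()
}
```

Without the deferred call, any early return between the two hooks would leave the counter one too high, and the detector would eventually report saturation for a pod that is actually idle.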


for _, pod := range pods {
	podID := pod.GetPod().NamespacedName.String()
	if d.tracker.get(podID) <= limit {
Contributor Author

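With Headroom = 0, the `<=` comparison lets a pod that is already at MaxConcurrency pass the filter and accept one more request, so the system effectively allows MaxConcurrency + 1 requests on a specific pod as long as the global pool isn't saturated. Let me know if you prefer the strict inequality (`<`) here instead.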
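
The off-by-one can be checked with a small simulation. This is a sketch under simplifying assumptions: a single pod, sequential admissions, no completions, and each admission incrementing the count immediately after the filter check. `peakInflight` is a hypothetical helper, not code from the PR.

```go
package main

// peakInflight admits requests to a single pod until the filter comparison
// rejects the pod, then returns the highest in-flight count reached.
// strict selects "inflight < limit"; otherwise "inflight <= limit" is used,
// matching the comparison in the excerpt above.
func peakInflight(limit int64, strict bool) int64 {
	var inflight int64
	for {
		admit := inflight <= limit
		if strict {
			admit = inflight < limit
		}
		if !admit {
			return inflight
		}
		inflight++ // the admitted request is now in flight
	}
}
```

With a limit of 8, the non-strict comparison peaks at 9 in-flight requests and the strict one peaks at 8, which is the difference described in this comment.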

@kfswain
Collaborator

kfswain commented Jan 6, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 6, 2026
@LukeAVanDrie
Contributor Author

The test failure appears to be a flake:

+ kind create cluster --name inference-e2e
/usr/local/bin/kind: line 1: syntax error near unexpected token `<'
/usr/local/bin/kind: line 1: `<html><body><h1>504 Gateway Time-out</h1>'
make: *** [Makefile:157: test-e2e] Error 2

/retest

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

3 participants