feat: enhance U-shape idle prediction for scale-down scenarios #19562
Open
Fly-Style wants to merge 1 commit into
FrankChen021 (Member) reviewed on Jun 7, 2026 and left a comment:
I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 3 of 3 changed files.
This is an automated review by Codex GPT-5.5
Description
The cost-based supervisor autoscaler would not scale down a healthy, over-provisioned supervisor: one running above the ideal idle ratio with low lag stayed pinned at its current task count.
Root cause. The idle projection was linear:

    rawIdle = 1.0 - busyFraction / taskRatio; // taskRatio = proposed / current

This assumes busy time is fully conserved when work moves onto fewer tasks, so a reasonable consolidation projects negative idle (e.g. 1 − 0.6/0.5 = −0.2). That value clamps to 0, the worst point of the U-shaped idle cost, and turns the overrun into phantom virtual lag, which pins the task count even at ~0 real lag. In reality, busy time grows sublinearly: an observed 2× consolidation raised busy ~1.25×, not 2×.

Fix. Redistribute busy time sublinearly. IDLE_SUBLINEARITY_EXPONENT = 0.32 (≈ log₂(1.25)) is a tuned constant, chosen from testing and matching the observed ~1.25× busy growth under 2× consolidation. A healthy consolidation now lands near the ideal idle ratio instead of going negative, so the supervisor scales down. The exponent stays > 0, so extreme over-consolidation still diverges and is still blocked.
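To make the difference concrete, here is a minimal sketch comparing the two projections. It is not the PR's code: the sublinear form (busy scaled by `(current / proposed)^IDLE_SUBLINEARITY_EXPONENT`) is an assumption inferred from the stated 2× consolidation raising busy ~1.25×. The class name `IdleProjectionSketch` and the helpers `linearIdle`, `sublinearIdle`, and `clamp01` are hypothetical.

```java
/** Illustrative sketch only; names other than IDLE_SUBLINEARITY_EXPONENT are hypothetical. */
final class IdleProjectionSketch {
    // Value taken from the description: about log2(1.25).
    static final double IDLE_SUBLINEARITY_EXPONENT = 0.32;

    // Old projection: busy time is assumed fully conserved across tasks.
    // taskRatio = proposed / current.
    static double linearIdle(double busyFraction, double taskRatio) {
        return 1.0 - busyFraction / taskRatio;
    }

    // ASSUMED form of the fix: busy grows by (current / proposed)^exponent,
    // i.e. it is divided by taskRatio^exponent instead of taskRatio.
    static double sublinearIdle(double busyFraction, double taskRatio) {
        return 1.0 - busyFraction / Math.pow(taskRatio, IDLE_SUBLINEARITY_EXPONENT);
    }

    // Idle is a ratio, so a negative projection clamps to 0.
    static double clamp01(double value) {
        return Math.max(0.0, Math.min(1.0, value));
    }

    public static void main(String[] args) {
        double busy = 0.6;      // observed busy fraction
        double halve = 0.5;     // proposed / current for a 2x consolidation
        double extreme = 0.1;   // proposed / current for a 10x consolidation

        System.out.printf("linear    2x: raw=%.3f clamped=%.3f%n",
                linearIdle(busy, halve), clamp01(linearIdle(busy, halve)));
        System.out.printf("sublinear 2x: raw=%.3f clamped=%.3f%n",
                sublinearIdle(busy, halve), clamp01(sublinearIdle(busy, halve)));
        System.out.printf("sublinear 10x: raw=%.3f clamped=%.3f%n",
                sublinearIdle(busy, extreme), clamp01(sublinearIdle(busy, extreme)));
    }
}
```

Under these assumptions, halving the task count at 0.6 busy projects roughly 0.25 idle instead of −0.2, while a 10× consolidation still projects negative idle and clamps to 0.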
Validation (plots under hood)
Optimal task count vs. observed poll-idle ratio, across realistic configs (rate = total cluster throughput, split per task):
The old version stays pinned at 128 until idle reaches ~0.55, while the new version consolidates from ~0.32.
Safe under load: the new version consolidates earlier on the high-idle side, but at low idle both still jump to max, so lag-driven scale-up is unaffected.
The existing version is flat (pinned at max by the phantom overrun); the new version consolidates and holds more tasks as lag weight rises.
Release note
Fixed an issue where the cost-based supervisor autoscaler would not scale down an over-provisioned supervisor running above its ideal idle ratio with low lag.