feat: enhance U-shape idle prediction for scale-down scenarios by Fly-Style · Pull Request #19562 · apache/druid

Fly-Style · 2026-06-05T13:46:00Z

Description

The cost-based supervisor autoscaler wouldn't scale down a healthy, over-provisioned supervisor: one running above the ideal idle ratio with low lag stayed pinned at its current task count.

Root cause. The idle projection was linear:

rawIdle = 1.0 - busyFraction / taskRatio; // taskRatio = proposed / current

This assumes busy time is fully conserved when work moves onto fewer tasks, so a reasonable consolidation projects negative idle (e.g. 1 − 0.6/0.5 = −0.2). That value clamps to 0, the worst point of the U-shaped idle cost, and the overrun is turned into phantom virtual lag, which pins the task count even at ~0 real lag. In reality, busy time grows sublinearly: an observed 2× consolidation raised busy ~1.25×, not 2×.
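To make the arithmetic above concrete, here is a minimal sketch of the old linear projection. The class and method names are hypothetical and are not taken from the Druid code base; the lower clamp at 0 comes from the description, while the upper clamp at 1 is an assumption.

```java
/** Hypothetical helper (not the Druid class) reproducing the linear idle projection. */
final class LinearIdleProjection
{
  /** rawIdle = 1 - busyFraction / taskRatio, where taskRatio = proposed / current. */
  static double rawIdle(double busyFraction, int currentTaskCount, int proposedTaskCount)
  {
    // Cast first so the ratio is not truncated by integer division.
    double taskRatio = (double) proposedTaskCount / currentTaskCount;
    return 1.0 - busyFraction / taskRatio;
  }

  /** Clamp into [0, 1]. The lower bound is from the description; the upper bound is assumed. */
  static double clampedIdle(double rawIdle)
  {
    return Math.max(0.0, Math.min(1.0, rawIdle));
  }

  public static void main(String[] args)
  {
    // 60% busy, halving the task count (for example 4 -> 2 tasks).
    double raw = rawIdle(0.6, 4, 2);
    System.out.println("raw idle = " + raw + ", clamped idle = " + clampedIdle(raw));
  }
}
```

With a busy fraction of 0.6 and half the tasks, the raw value is about −0.2 and the clamped value is 0, which is the case that kept the task count pinned.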

Fix. Redistribute busy sublinearly:

projectedBusy = busyFraction * Math.pow((double) currentTaskCount / proposedTaskCount, IDLE_SUBLINEARITY_EXPONENT);  // exponent = 0.32
rawIdle = 1.0 - projectedBusy;

IDLE_SUBLINEARITY_EXPONENT = 0.32 (≈ log₂(1.25)) is a tuned constant: it is the exponent at which a 2× consolidation raises busy ~1.25× (2^0.32 ≈ 1.25), chosen from testing and the supporting math.
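As an illustration of the sublinear projection, the sketch below applies the formula to the same 0.6-busy, 2× consolidation case. The class and method names are hypothetical; only the formula and the 0.32 exponent come from the description.

```java
/** Hypothetical helper (not the Druid class) for the sublinear busy projection. */
final class SublinearIdleProjection
{
  static final double IDLE_SUBLINEARITY_EXPONENT = 0.32;

  /** projectedBusy = busyFraction * (current / proposed) ^ exponent. */
  static double projectedBusy(double busyFraction, int currentTaskCount, int proposedTaskCount)
  {
    // Cast first so the consolidation factor is not truncated by integer division.
    double consolidation = (double) currentTaskCount / proposedTaskCount;
    return busyFraction * Math.pow(consolidation, IDLE_SUBLINEARITY_EXPONENT);
  }

  /** rawIdle = 1 - projectedBusy. */
  static double rawIdle(double busyFraction, int currentTaskCount, int proposedTaskCount)
  {
    return 1.0 - projectedBusy(busyFraction, currentTaskCount, proposedTaskCount);
  }

  public static void main(String[] args)
  {
    // 60% busy, halving the task count (for example 4 -> 2 tasks).
    System.out.println("projected busy = " + projectedBusy(0.6, 4, 2));
    System.out.println("raw idle       = " + rawIdle(0.6, 4, 2));
  }
}
```

For this input the projected busy fraction is roughly 0.75 and the raw idle roughly 0.25, so the projection stays positive where the linear form gave −0.2.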

A healthy consolidation now lands near the ideal idle ratio instead of going negative, so the supervisor scales down. Because the exponent stays > 0, projected busy still rises as tasks are removed, so extreme over-consolidation still drives projected idle to 0 and is penalized.
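One way to see where over-consolidation starts to be penalized is to solve projectedBusy = 1 for the consolidation factor, which gives factor = (1 / busyFraction)^(1 / exponent). The sketch below is an illustration under that reading; the class and method names are hypothetical and the break-even calculation is not part of the PR.

```java
/** Hypothetical helper (not from the PR): where the sublinear projection reaches zero idle. */
final class ConsolidationLimit
{
  static final double IDLE_SUBLINEARITY_EXPONENT = 0.32;

  /** Consolidation factor (current / proposed) at which projected busy reaches 1.0. */
  static double breakEvenFactor(double busyFraction)
  {
    return Math.pow(1.0 / busyFraction, 1.0 / IDLE_SUBLINEARITY_EXPONENT);
  }

  /** rawIdle = 1 - busyFraction * factor ^ exponent, written in terms of the factor. */
  static double rawIdle(double busyFraction, double consolidationFactor)
  {
    return 1.0 - busyFraction * Math.pow(consolidationFactor, IDLE_SUBLINEARITY_EXPONENT);
  }

  public static void main(String[] args)
  {
    double factor = breakEvenFactor(0.6);
    System.out.println("break-even consolidation factor = " + factor);
    // Consolidating past the break-even point projects negative idle again.
    System.out.println("raw idle at twice that factor   = " + rawIdle(0.6, 2.0 * factor));
  }
}
```

For a busy fraction of 0.6 the break-even factor comes out a little under 5×, compared with about 1.67× under the linear projection, so moderate scale-downs are no longer rejected while aggressive ones still are.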

Validation (plots under hood)

Details

Optimal task count vs. observed poll-idle ratio, across realistic configs (rate = total cluster throughput, split per-task):

The old version stays pinned at 128 until idle reaches ~0.55, while the new version consolidates from ~0.32.

Safe under load: the new version consolidates earlier on the high-idle side, but at low idle both versions still jump to max, so lag-driven scale-up is unaffected.

The old version is flat (pinned at max by the phantom overrun); the new version consolidates and holds more tasks as lag weight rises.

Release note

Fixed an issue where the cost-based supervisor autoscaler would not scale down an over-provisioned supervisor running above its ideal idle ratio with low lag.

self-reviewed.
added comments explaining the "why".
added/updated unit tests.

FrankChen021

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 3 of 3 changed files.

This is an automated review by Codex GPT-5.5

feat: enhance U-shape idle prediction

8a1405f

Fly-Style self-assigned this Jun 5, 2026

Fly-Style requested a review from kfaraz June 5, 2026 13:46

github-actions Bot added the Area - Ingestion label Jun 5, 2026

FrankChen021 reviewed Jun 7, 2026

Fly-Style mentioned this pull request Jun 10, 2026

feat: introduce intermediate valid task counts for big partition counts #19549

Open

3 tasks

Fly-Style wants to merge 1 commit into apache:master from Fly-Style:cba-enhance-ushape

2 participants
