resourcegroup: replace blind sleep with signal-aware wait in acquireTokens by JmPotato · Pull Request #10252 · tikv/pd

JmPotato · 2026-02-25T12:46:52Z

What problem does this PR solve?

Issue Number: close #10251.

What is changed and how does it work?

Add a close-and-recreate channel on the `Limiter` that is closed on every `Reconfigure()` call.
The retry loop in `acquireTokens` now selects on this channel instead of blind-sleeping,
waking up immediately when new tokens arrive. The timer serves as a fallback to preserve original
behavior when no `Reconfigure` occurs.

Check List

Tests

Unit test
Integration test

Release note

None.

Summary by CodeRabbit

Bug Fixes
- Token acquisition retry now uses an interruptible wait so reconfiguration cancels unnecessary delays, improving responsiveness during limiter changes.
New Features
- Limiter now signals reconfiguration to promptly wake waiting operations.
Tests
- Added coverage for signal-aware wakeups, timer fallback behavior, and waking multiple waiters during reconfiguration.

…okens Add a close-and-recreate channel on the Limiter that is closed on every Reconfigure() call. The retry loop in acquireTokens now selects on this channel instead of blind-sleeping, waking up immediately when new tokens arrive. The timer serves as a fallback to preserve original behavior when no Reconfigure occurs. Signed-off-by: JmPotato <github@ipotato.me>

coderabbitai · 2026-02-25T12:48:03Z

📝 Walkthrough

Walkthrough

Replaces blind sleep in token-acquisition retries with a timer-based wait that listens for limiter reconfiguration signals and context cancellation; adds a reconfiguration channel on the limiter and tests for signal-aware wakeups and timer fallback behavior.

Changes

Cohort / File(s)	Summary
Limiter infra & tests `client/resource_group/controller/limiter.go`, `client/resource_group/controller/limiter_test.go`	Add `reconfiguredCh` to `Limiter`, initialize it in constructors, expose `GetReconfiguredCh()` and close/recreate the channel in `Reconfigure()`; tests validate channel lifecycle and that multiple waiters are woken.
Group controller logic & tests `client/resource_group/controller/group_controller.go`, `client/resource_group/controller/group_controller_test.go`	Replace blind `time.Sleep` in `acquireTokens` retry loop with a timer + `select` over `ctx.Done()`, limiter `reconfiguredCh`, and timer; update wait duration consistently. Add tests for signal-aware wake and timer fallback scenarios.

Sequence Diagram

sequenceDiagram
    participant Client
    participant AcquireTokens as acquireTokens
    participant Limiter
    participant Timer

    Client->>AcquireTokens: request tokens
    AcquireTokens->>Limiter: Reserve(tokens)
    Limiter-->>AcquireTokens: reservation failed

    AcquireTokens->>Limiter: GetReconfiguredCh()
    AcquireTokens->>Timer: start retry timer

    par waiting
        AcquireTokens->>AcquireTokens: select on timer, reconfiguredCh, ctx.Done()
    and reconfigure
        Client->>Limiter: Reconfigure(...) (async)
        Limiter->>Limiter: close & recreate reconfiguredCh
        Limiter-->>AcquireTokens: reconfiguredCh closes (wake)
    end

    AcquireTokens->>Timer: stop timer (if needed)
    AcquireTokens->>Limiter: Reserve(tokens) [retry]
    Limiter-->>AcquireTokens: success or fail
    AcquireTokens-->>Client: return result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

rleungx
bufferflies

Poem

🐰 I used to nap through blind, fixed sleeps,

But now I wake when reconfigure peeps.

Timers and channels dance in tune—

No more second-long afternoon! 🥕🐇

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: replacing blind sleep with signal-aware wait in acquireTokens, which is clearly reflected in the code changes.
Description check	✅ Passed	The description includes the required issue number, explains changes via commit message, and lists applicable tests. All mandatory sections are present.
Linked Issues check	✅ Passed	The changes directly address issue `#10251` by replacing blind sleep with signal-aware wait using Reconfigure channel, enabling immediate wakeup when tokens refresh instead of continuing blind sleeps.
Out of Scope Changes check	✅ Passed	All changes are directly related to the stated objective: adding reconfiguredCh to Limiter, implementing signal-aware wait in acquireTokens, and adding corresponding unit tests.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/resource_group/controller/group_controller_test.go`:
- Around line 266-268: The test currently performs a bare receive on notifyCh
which can hang; replace the `<-notifyCh` synchronization with a bounded select
that waits for either `<-notifyCh` or a timeout (e.g. time.After(1s)) and fails
the test (t.Fatalf or t.Fatal) on timeout. Update the test that uses notifyCh in
group_controller_test.go accordingly and add the time import if missing so the
test fails fast instead of blocking indefinitely.

In `@client/resource_group/controller/group_controller.go`:
- Around line 520-528: Call counter.limiter.GetReconfiguredCh() before starting
the retry wait and make the wait select context-aware: capture reconfiguredCh
early (reconfiguredCh := counter.limiter.GetReconfiguredCh()), create waitTimer
:= time.NewTimer(gc.mainCfg.WaitRetryInterval), then select on <-reconfiguredCh,
<-waitTimer.C and <-ctx.Done(); on the reconfiguredCh or ctx.Done() branches
stop the timer (and drain it if Stop() returns false) to avoid leaks, and still
add time.Since(waitStart) to *waitDuration; update references to reconfiguredCh,
waitTimer, waitStart, gc.mainCfg.WaitRetryInterval and ctx.Done() accordingly.

In `@client/resource_group/controller/limiter.go`:
- Around line 368-370: The close(lim.reconfiguredCh) can panic when
lim.reconfiguredCh is nil (zero-value Limiter), so guard the close and ensure
the channel exists before replacing it: in Limiter's Reconfigure (or the
function using reconfiguredCh) check if lim.reconfiguredCh != nil then
close(lim.reconfiguredCh), then always assign lim.reconfiguredCh = make(chan
struct{}); do this under the same synchronization (e.g., the Limiter mutex used
elsewhere) to avoid races; this preserves zero-value safety for Limiter and
prevents the nil-close panic.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33d3863 and 139a892.

📒 Files selected for processing (4)

client/resource_group/controller/group_controller.go
client/resource_group/controller/group_controller_test.go
client/resource_group/controller/limiter.go
client/resource_group/controller/limiter_test.go

- Guard reconfiguredCh against nil-close panic on zero-value Limiter by lazy-initializing in GetReconfiguredCh and nil-checking in Reconfigure before close. - Capture reconfiguredCh before Reserve to prevent missed wakeups from a just-fired Reconfigure signal. - Add ctx.Done() to the retry wait select for cancellation awareness. - Properly drain timer channel on early wakeup branches. - Replace bare <-notifyCh in test with bounded select to fail fast. Signed-off-by: JmPotato <github@ipotato.me>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/resource_group/controller/group_controller_test.go`:
- Around line 260-264: The test spawns a goroutine that calls acquireTokens with
context.Background(), which can leak if acquireTokens stalls; change to use a
cancellable/timed context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable timeout>); defer cancel())
when invoking acquireTokens in the goroutine and in the second occurrence noted
(lines ~308-310), pass that ctx into acquireTokens so the goroutine unblocks on
timeout and the test can cancel; ensure you still send the acquireResult into
resultCh and clean up cancel() to avoid goroutine leaks.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 139a892 and fbc9f18.

📒 Files selected for processing (3)

client/resource_group/controller/group_controller.go
client/resource_group/controller/group_controller_test.go
client/resource_group/controller/limiter.go

🚧 Files skipped from review as they are similar to previous changes (1)

client/resource_group/controller/group_controller.go

coderabbitai · 2026-02-25T13:22:59Z

+	go func() {
+		var waitDuration time.Duration
+		_, err := gc.acquireTokens(context.Background(), delta, &waitDuration, false)
+		resultCh <- acquireResult{err, waitDuration}
+	}()


⚠️ Potential issue | 🟡 Minor

Use cancellable contexts in tests to avoid lingering waits on failure paths.

Both calls use context.Background(). If a regression causes acquireTokens to stall, the goroutine/call can outlive the test’s timeout branches. Prefer context.WithTimeout + defer cancel().

🔧 Proposed test-hardening patch

func TestAcquireTokensSignalAwareWait(t *testing.T) { re := require.New(t) + ctx, cancel := context.WithTimeout(context.Background(), 2*cfg.WaitRetryInterval) + defer cancel() @@ go func() { var waitDuration time.Duration - _, err := gc.acquireTokens(context.Background(), delta, &waitDuration, false) + _, err := gc.acquireTokens(ctx, delta, &waitDuration, false) resultCh <- acquireResult{err, waitDuration} }() @@ func TestAcquireTokensFallbackToTimer(t *testing.T) { re := require.New(t) @@ - ctx := context.Background() + ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second) + defer cancel() var waitDuration time.Duration _, err := gc.acquireTokens(ctx, delta, &waitDuration, false)

Based on learnings: Applies to **/.go : Use context-aware timeouts and backoff for retries; Applies to **/.go : Prevent goroutine leaks: pair with cancellation; consider errgroup.

Also applies to: 308-310

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@client/resource_group/controller/group_controller_test.go` around lines 260 - 264, The test spawns a goroutine that calls acquireTokens with context.Background(), which can leak if acquireTokens stalls; change to use a cancellable/timed context (e.g., ctx, cancel := context.WithTimeout(context.Background(), <reasonable timeout>); defer cancel()) when invoking acquireTokens in the goroutine and in the second occurrence noted (lines ~308-310), pass that ctx into acquireTokens so the goroutine unblocks on timeout and the test can cancel; ensure you still send the acquireResult into resultCh and clean up cancel() to avoid goroutine leaks.

JmPotato · 2026-02-25T13:31:23Z

/retest

codecov · 2026-02-26T03:42:02Z

Codecov Report

❌ Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.81%. Comparing base (4d3b7ee) to head (fbc9f18).
⚠️ Report is 3 commits behind head on master.

❌ Your patch status has failed because the patch coverage (56.66%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10252      +/-   ##
==========================================
+ Coverage   78.78%   78.81%   +0.03%     
==========================================
  Files         523      523              
  Lines       70491    70522      +31     
==========================================
+ Hits        55536    55585      +49     
+ Misses      10961    10944      -17     
+ Partials     3994     3993       -1

Flag	Coverage Δ
unittests	`78.81% <56.66%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ti-chi-bot · 2026-02-26T04:34:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: disksing, rleungx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [disksing,rleungx]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-02-26T04:34:29Z

[LGTM Timeline notifier]

Timeline:

2026-02-26 03:41:48.001692939 +0000 UTC m=+326380.516487548: ☑️ agreed by rleungx.
2026-02-26 04:34:28.42472904 +0000 UTC m=+329540.939523650: ☑️ agreed by disksing.

JmPotato · 2026-03-27T10:59:08Z

/cherry-pick release-8.5-20260323-v8.5.5

close tikv#10251 Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot · 2026-03-27T10:59:53Z

@JmPotato: new pull request created to branch release-8.5-20260323-v8.5.5: #10505.
But this PR has conflicts, please resolve them!

Details

In response to this:

/cherry-pick release-8.5-20260323-v8.5.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

This reverts commit c6ce272. Signed-off-by: JmPotato <github@ipotato.me>

…okens (#10252) (#10505)

ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 25, 2026

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Feb 25, 2026

coderabbitai Bot reviewed Feb 25, 2026

View reviewed changes

Comment thread client/resource_group/controller/group_controller_test.go

Comment thread client/resource_group/controller/group_controller.go Outdated

Comment thread client/resource_group/controller/limiter.go

JmPotato force-pushed the fix_slot_deletion branch from 70e3103 to fbc9f18 Compare February 25, 2026 13:11

coderabbitai Bot reviewed Feb 25, 2026

View reviewed changes

rleungx approved these changes Feb 26, 2026

View reviewed changes

ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Feb 26, 2026

disksing approved these changes Feb 26, 2026

View reviewed changes

ti-chi-bot Bot added the lgtm label Feb 26, 2026

ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Feb 26, 2026

ti-chi-bot Bot merged commit 2eb52ae into tikv:master Feb 26, 2026
39 of 44 checks passed

JmPotato deleted the fix_slot_deletion branch February 26, 2026 05:08

coderabbitai Bot mentioned this pull request Mar 26, 2026

resource_group: add client-side observability metrics for Controller #10494

Open

2 tasks

ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Mar 27, 2026

This is an automated cherry-pick of tikv#10252

c6ce272

close tikv#10251 Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot mentioned this pull request Mar 27, 2026

resourcegroup: replace blind sleep with signal-aware wait in acquireTokens (#10252) #10505

Merged

JmPotato added a commit to ti-chi-bot/pd that referenced this pull request Mar 27, 2026

Revert "This is an automated cherry-pick of tikv#10252"

43667c5

This reverts commit c6ce272. Signed-off-by: JmPotato <github@ipotato.me>

JmPotato mentioned this pull request Mar 30, 2026

resourcegroup: replace blind sleep with signal-aware wait in acquireTokens (#10252) #10507

Open

JmPotato pushed a commit that referenced this pull request Mar 30, 2026

resourcegroup: replace blind sleep with signal-aware wait in acquireT…

ebf9af0

…okens (#10252) (#10505)

This was referenced Mar 30, 2026

chore: bump pd client to include resourcegroup signal-aware wait fix pingcap/tidb#67423

Closed

chore: bump pd client to include resourcegroup signal-aware wait fix pingcap/tidb#67428

Merged

coderabbitai Bot mentioned this pull request Apr 21, 2026

client/resource_group: add RC paging pre-charge with PredictedReadBytes hint #10611

Open

Conversation

JmPotato commented Feb 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

JmPotato commented Feb 25, 2026

Uh oh!

codecov Bot commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ti-chi-bot Bot commented Feb 26, 2026

Uh oh!

ti-chi-bot Bot commented Feb 26, 2026

[LGTM Timeline notifier]

Uh oh!

Uh oh!

JmPotato commented Mar 27, 2026

Uh oh!

ti-chi-bot commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JmPotato commented Feb 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 25, 2026 •

edited

Loading

codecov Bot commented Feb 26, 2026 •

edited

Loading