Skip to content

resourcegroup: replace blind sleep with signal-aware wait in acquireTokens#10252

Merged
ti-chi-bot[bot] merged 2 commits intotikv:masterfrom
JmPotato:fix_slot_deletion
Feb 26, 2026
Merged

resourcegroup: replace blind sleep with signal-aware wait in acquireTokens#10252
ti-chi-bot[bot] merged 2 commits intotikv:masterfrom
JmPotato:fix_slot_deletion

Conversation

@JmPotato
Copy link
Copy Markdown
Member

@JmPotato JmPotato commented Feb 25, 2026

What problem does this PR solve?

Issue Number: close #10251.

What is changed and how does it work?

Add a close-and-recreate channel on the `Limiter` that is closed on every `Reconfigure()` call.
The retry loop in `acquireTokens` now selects on this channel instead of blind-sleeping,
waking up immediately when new tokens arrive. The timer serves as a fallback to preserve original
behavior when no `Reconfigure` occurs.

Check List

Tests

  • Unit test
  • Integration test

Release note

None.

Summary by CodeRabbit

  • Bug Fixes

    • Token acquisition retry now uses an interruptible wait so reconfiguration cancels unnecessary delays, improving responsiveness during limiter changes.
  • New Features

    • Limiter now signals reconfiguration to promptly wake waiting operations.
  • Tests

    • Added coverage for signal-aware wakeups, timer fallback behavior, and waking multiple waiters during reconfiguration.

…okens

Add a close-and-recreate channel on the Limiter that is closed on every
Reconfigure() call. The retry loop in acquireTokens now selects on this
channel instead of blind-sleeping, waking up immediately when new tokens
arrive. The timer serves as a fallback to preserve original behavior when
no Reconfigure occurs.

Signed-off-by: JmPotato <github@ipotato.me>
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 25, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

Replaces blind sleep in token-acquisition retries with a timer-based wait that listens for limiter reconfiguration signals and context cancellation; adds a reconfiguration channel on the limiter and tests for signal-aware wakeups and timer fallback behavior.

Changes

Cohort / File(s) Summary
Limiter infra & tests
client/resource_group/controller/limiter.go, client/resource_group/controller/limiter_test.go
Add reconfiguredCh to Limiter, initialize it in constructors, expose GetReconfiguredCh() and close/recreate the channel in Reconfigure(); tests validate channel lifecycle and that multiple waiters are woken.
Group controller logic & tests
client/resource_group/controller/group_controller.go, client/resource_group/controller/group_controller_test.go
Replace blind time.Sleep in acquireTokens retry loop with a timer + select over ctx.Done(), limiter reconfiguredCh, and timer; update wait duration consistently. Add tests for signal-aware wake and timer fallback scenarios.

Sequence Diagram

sequenceDiagram
    participant Client
    participant AcquireTokens as acquireTokens
    participant Limiter
    participant Timer

    Client->>AcquireTokens: request tokens
    AcquireTokens->>Limiter: Reserve(tokens)
    Limiter-->>AcquireTokens: reservation failed

    AcquireTokens->>Limiter: GetReconfiguredCh()
    AcquireTokens->>Timer: start retry timer

    par waiting
        AcquireTokens->>AcquireTokens: select on timer, reconfiguredCh, ctx.Done()
    and reconfigure
        Client->>Limiter: Reconfigure(...) (async)
        Limiter->>Limiter: close & recreate reconfiguredCh
        Limiter-->>AcquireTokens: reconfiguredCh closes (wake)
    end

    AcquireTokens->>Timer: stop timer (if needed)
    AcquireTokens->>Limiter: Reserve(tokens) [retry]
    Limiter-->>AcquireTokens: success or fail
    AcquireTokens-->>Client: return result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • rleungx
  • bufferflies

Poem

🐰 I used to nap through blind, fixed sleeps,

But now I wake when reconfigure peeps.

Timers and channels dance in tune—

No more second-long afternoon! 🥕🐇

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: replacing blind sleep with signal-aware wait in acquireTokens, which is clearly reflected in the code changes.
Description check ✅ Passed The description includes the required issue number, explains changes via commit message, and lists applicable tests. All mandatory sections are present.
Linked Issues check ✅ Passed The changes directly address issue #10251 by replacing blind sleep with signal-aware wait using Reconfigure channel, enabling immediate wakeup when tokens refresh instead of continuing blind sleeps.
Out of Scope Changes check ✅ Passed All changes are directly related to the stated objective: adding reconfiguredCh to Limiter, implementing signal-aware wait in acquireTokens, and adding corresponding unit tests.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/resource_group/controller/group_controller_test.go`:
- Around line 266-268: The test currently performs a bare receive on notifyCh
which can hang; replace the `<-notifyCh` synchronization with a bounded select
that waits for either `<-notifyCh` or a timeout (e.g. time.After(1s)) and fails
the test (t.Fatalf or t.Fatal) on timeout. Update the test that uses notifyCh in
group_controller_test.go accordingly and add the time import if missing so the
test fails fast instead of blocking indefinitely.

In `@client/resource_group/controller/group_controller.go`:
- Around line 520-528: Call counter.limiter.GetReconfiguredCh() before starting
the retry wait and make the wait select context-aware: capture reconfiguredCh
early (reconfiguredCh := counter.limiter.GetReconfiguredCh()), create waitTimer
:= time.NewTimer(gc.mainCfg.WaitRetryInterval), then select on <-reconfiguredCh,
<-waitTimer.C and <-ctx.Done(); on the reconfiguredCh or ctx.Done() branches
stop the timer (and drain it if Stop() returns false) to avoid leaks, and still
add time.Since(waitStart) to *waitDuration; update references to reconfiguredCh,
waitTimer, waitStart, gc.mainCfg.WaitRetryInterval and ctx.Done() accordingly.

In `@client/resource_group/controller/limiter.go`:
- Around line 368-370: The close(lim.reconfiguredCh) can panic when
lim.reconfiguredCh is nil (zero-value Limiter), so guard the close and ensure
the channel exists before replacing it: in Limiter's Reconfigure (or the
function using reconfiguredCh) check if lim.reconfiguredCh != nil then
close(lim.reconfiguredCh), then always assign lim.reconfiguredCh = make(chan
struct{}); do this under the same synchronization (e.g., the Limiter mutex used
elsewhere) to avoid races; this preserves zero-value safety for Limiter and
prevents the nil-close panic.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33d3863 and 139a892.

📒 Files selected for processing (4)
  • client/resource_group/controller/group_controller.go
  • client/resource_group/controller/group_controller_test.go
  • client/resource_group/controller/limiter.go
  • client/resource_group/controller/limiter_test.go

Comment thread client/resource_group/controller/group_controller_test.go
Comment thread client/resource_group/controller/group_controller.go Outdated
Comment thread client/resource_group/controller/limiter.go
- Guard reconfiguredCh against nil-close panic on zero-value Limiter
  by lazy-initializing in GetReconfiguredCh and nil-checking in
  Reconfigure before close.
- Capture reconfiguredCh before Reserve to prevent missed wakeups
  from a just-fired Reconfigure signal.
- Add ctx.Done() to the retry wait select for cancellation awareness.
- Properly drain timer channel on early wakeup branches.
- Replace bare <-notifyCh in test with bounded select to fail fast.

Signed-off-by: JmPotato <github@ipotato.me>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/resource_group/controller/group_controller_test.go`:
- Around line 260-264: The test spawns a goroutine that calls acquireTokens with
context.Background(), which can leak if acquireTokens stalls; change to use a
cancellable/timed context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable timeout>); defer cancel())
when invoking acquireTokens in the goroutine and in the second occurrence noted
(lines ~308-310), pass that ctx into acquireTokens so the goroutine unblocks on
timeout and the test can cancel; ensure you still send the acquireResult into
resultCh and clean up cancel() to avoid goroutine leaks.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 139a892 and fbc9f18.

📒 Files selected for processing (3)
  • client/resource_group/controller/group_controller.go
  • client/resource_group/controller/group_controller_test.go
  • client/resource_group/controller/limiter.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/resource_group/controller/group_controller.go

Comment on lines +260 to +264
go func() {
var waitDuration time.Duration
_, err := gc.acquireTokens(context.Background(), delta, &waitDuration, false)
resultCh <- acquireResult{err, waitDuration}
}()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use cancellable contexts in tests to avoid lingering waits on failure paths.

Both calls use context.Background(). If a regression causes acquireTokens to stall, the goroutine/call can outlive the test’s timeout branches. Prefer context.WithTimeout + defer cancel().

🔧 Proposed test-hardening patch
 func TestAcquireTokensSignalAwareWait(t *testing.T) {
 	re := require.New(t)
+	ctx, cancel := context.WithTimeout(context.Background(), 2*cfg.WaitRetryInterval)
+	defer cancel()
@@
 	go func() {
 		var waitDuration time.Duration
-		_, err := gc.acquireTokens(context.Background(), delta, &waitDuration, false)
+		_, err := gc.acquireTokens(ctx, delta, &waitDuration, false)
 		resultCh <- acquireResult{err, waitDuration}
 	}()
@@
 func TestAcquireTokensFallbackToTimer(t *testing.T) {
 	re := require.New(t)
@@
-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
+	defer cancel()
 	var waitDuration time.Duration
 	_, err := gc.acquireTokens(ctx, delta, &waitDuration, false)

Based on learnings: Applies to **/.go : Use context-aware timeouts and backoff for retries; Applies to **/.go : Prevent goroutine leaks: pair with cancellation; consider errgroup.

Also applies to: 308-310

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/resource_group/controller/group_controller_test.go` around lines 260 -
264, The test spawns a goroutine that calls acquireTokens with
context.Background(), which can leak if acquireTokens stalls; change to use a
cancellable/timed context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable timeout>); defer cancel())
when invoking acquireTokens in the goroutine and in the second occurrence noted
(lines ~308-310), pass that ctx into acquireTokens so the goroutine unblocks on
timeout and the test can cancel; ensure you still send the acquireResult into
resultCh and clean up cancel() to avoid goroutine leaks.

@JmPotato
Copy link
Copy Markdown
Member Author

/retest

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Feb 26, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.81%. Comparing base (4d3b7ee) to head (fbc9f18).
⚠️ Report is 3 commits behind head on master.

❌ Your patch status has failed because the patch coverage (56.66%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10252      +/-   ##
==========================================
+ Coverage   78.78%   78.81%   +0.03%     
==========================================
  Files         523      523              
  Lines       70491    70522      +31     
==========================================
+ Hits        55536    55585      +49     
+ Misses      10961    10944      -17     
+ Partials     3994     3993       -1     
Flag Coverage Δ
unittests 78.81% <56.66%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ti-chi-bot ti-chi-bot Bot added the lgtm label Feb 26, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: disksing, rleungx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Feb 26, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Feb 26, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-02-26 03:41:48.001692939 +0000 UTC m=+326380.516487548: ☑️ agreed by rleungx.
  • 2026-02-26 04:34:28.42472904 +0000 UTC m=+329540.939523650: ☑️ agreed by disksing.

@ti-chi-bot ti-chi-bot Bot merged commit 2eb52ae into tikv:master Feb 26, 2026
39 of 44 checks passed
@JmPotato JmPotato deleted the fix_slot_deletion branch February 26, 2026 05:08
@JmPotato
Copy link
Copy Markdown
Member Author

/cherry-pick release-8.5-20260323-v8.5.5

ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Mar 27, 2026
close tikv#10251

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Copy Markdown
Member

@JmPotato: new pull request created to branch release-8.5-20260323-v8.5.5: #10505.
But this PR has conflicts, please resolve them!

Details

In response to this:

/cherry-pick release-8.5-20260323-v8.5.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Resource Group: 1s latency spike when FillRate far exceeds actual RU consumption

4 participants