resourcegroup: replace blind sleep with signal-aware wait in acquireTokens #10252
ti-chi-bot[bot] merged 2 commits into tikv:master
Conversation
Add a close-and-recreate channel on the Limiter that is closed on every Reconfigure() call. The retry loop in acquireTokens now selects on this channel instead of blind-sleeping, waking up immediately when new tokens arrive. The timer serves as a fallback to preserve the original behavior when no Reconfigure occurs.

Signed-off-by: JmPotato <github@ipotato.me>
📝 Walkthrough

Replaces the blind sleep in token-acquisition retries with a timer-based wait that listens for limiter reconfiguration signals and context cancellation; adds a reconfiguration channel on the limiter and tests for signal-aware wakeups and timer-fallback behavior.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant AcquireTokens as acquireTokens
    participant Limiter
    participant Timer
    Client->>AcquireTokens: request tokens
    AcquireTokens->>Limiter: Reserve(tokens)
    Limiter-->>AcquireTokens: reservation failed
    AcquireTokens->>Limiter: GetReconfiguredCh()
    AcquireTokens->>Timer: start retry timer
    par waiting
        AcquireTokens->>AcquireTokens: select on timer, reconfiguredCh, ctx.Done()
    and reconfigure
        Client->>Limiter: Reconfigure(...) (async)
        Limiter->>Limiter: close & recreate reconfiguredCh
        Limiter-->>AcquireTokens: reconfiguredCh closes (wake)
    end
    AcquireTokens->>Timer: stop timer (if needed)
    AcquireTokens->>Limiter: Reserve(tokens) [retry]
    Limiter-->>AcquireTokens: success or fail
    AcquireTokens-->>Client: return result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed (5 passed)
Actionable comments posted: 3
Inline comments:
In `@client/resource_group/controller/group_controller_test.go`:
- Around line 266-268: The test currently performs a bare receive on notifyCh
which can hang; replace the `<-notifyCh` synchronization with a bounded select
that waits for either `<-notifyCh` or a timeout (e.g. time.After(1s)) and fails
the test (t.Fatalf or t.Fatal) on timeout. Update the test that uses notifyCh in
group_controller_test.go accordingly and add the time import if missing so the
test fails fast instead of blocking indefinitely.
In `@client/resource_group/controller/group_controller.go`:
- Around line 520-528: Call counter.limiter.GetReconfiguredCh() before starting
the retry wait and make the wait select context-aware: capture reconfiguredCh
early (reconfiguredCh := counter.limiter.GetReconfiguredCh()), create waitTimer
:= time.NewTimer(gc.mainCfg.WaitRetryInterval), then select on <-reconfiguredCh,
<-waitTimer.C and <-ctx.Done(); on the reconfiguredCh or ctx.Done() branches
stop the timer (and drain it if Stop() returns false) to avoid leaks, and still
add time.Since(waitStart) to *waitDuration; update references to reconfiguredCh,
waitTimer, waitStart, gc.mainCfg.WaitRetryInterval and ctx.Done() accordingly.
In `@client/resource_group/controller/limiter.go`:
- Around line 368-370: The close(lim.reconfiguredCh) can panic when
lim.reconfiguredCh is nil (zero-value Limiter), so guard the close and ensure
the channel exists before replacing it: in Limiter's Reconfigure (or the
function using reconfiguredCh) check if lim.reconfiguredCh != nil then
close(lim.reconfiguredCh), then always assign lim.reconfiguredCh = make(chan
struct{}); do this under the same synchronization (e.g., the Limiter mutex used
elsewhere) to avoid races; this preserves zero-value safety for Limiter and
prevents the nil-close panic.
📒 Files selected for processing (4)
- client/resource_group/controller/group_controller.go
- client/resource_group/controller/group_controller_test.go
- client/resource_group/controller/limiter.go
- client/resource_group/controller/limiter_test.go
- Guard reconfiguredCh against nil-close panic on zero-value Limiter by lazy-initializing in GetReconfiguredCh and nil-checking in Reconfigure before close.
- Capture reconfiguredCh before Reserve to prevent missed wakeups from a just-fired Reconfigure signal.
- Add ctx.Done() to the retry wait select for cancellation awareness.
- Properly drain the timer channel on early-wakeup branches.
- Replace the bare <-notifyCh in the test with a bounded select to fail fast.

Signed-off-by: JmPotato <github@ipotato.me>
Force-pushed 70e3103 to fbc9f18
Actionable comments posted: 1
Inline comments:
In `@client/resource_group/controller/group_controller_test.go`:
- Around line 260-264: The test spawns a goroutine that calls acquireTokens with
context.Background(), which can leak if acquireTokens stalls; change to use a
cancellable/timed context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable timeout>); defer cancel())
when invoking acquireTokens in the goroutine and in the second occurrence noted
(lines ~308-310), pass that ctx into acquireTokens so the goroutine unblocks on
timeout and the test can cancel; ensure you still send the acquireResult into
resultCh and clean up cancel() to avoid goroutine leaks.
📒 Files selected for processing (3)
- client/resource_group/controller/group_controller.go
- client/resource_group/controller/group_controller_test.go
- client/resource_group/controller/limiter.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/resource_group/controller/group_controller.go
```go
go func() {
	var waitDuration time.Duration
	_, err := gc.acquireTokens(context.Background(), delta, &waitDuration, false)
	resultCh <- acquireResult{err, waitDuration}
}()
```
Use cancellable contexts in tests to avoid lingering waits on failure paths.
Both calls use context.Background(). If a regression causes acquireTokens to stall, the goroutine/call can outlive the test’s timeout branches. Prefer context.WithTimeout + defer cancel().
🔧 Proposed test-hardening patch

```diff
 func TestAcquireTokensSignalAwareWait(t *testing.T) {
 	re := require.New(t)
+	ctx, cancel := context.WithTimeout(context.Background(), 2*cfg.WaitRetryInterval)
+	defer cancel()
@@
 	go func() {
 		var waitDuration time.Duration
-		_, err := gc.acquireTokens(context.Background(), delta, &waitDuration, false)
+		_, err := gc.acquireTokens(ctx, delta, &waitDuration, false)
 		resultCh <- acquireResult{err, waitDuration}
 	}()
@@
 func TestAcquireTokensFallbackToTimer(t *testing.T) {
 	re := require.New(t)
@@
-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
+	defer cancel()
 	var waitDuration time.Duration
 	_, err := gc.acquireTokens(ctx, delta, &waitDuration, false)
```

Based on learnings: Applies to `**/*.go`: Use context-aware timeouts and backoff for retries. Applies to `**/*.go`: Prevent goroutine leaks: pair with cancellation; consider errgroup.
Also applies to: 308-310
/retest
Codecov Report

❌ Patch coverage is 56.66%. Your patch status has failed because the patch coverage (56.66%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master   #10252      +/-   ##
==========================================
+ Coverage   78.78%   78.81%   +0.03%
==========================================
  Files         523      523
  Lines       70491    70522      +31
==========================================
+ Hits        55536    55585      +49
+ Misses      10961    10944      -17
+ Partials     3994     3993       -1
```
Flags with carried forward coverage won't be shown.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: disksing, rleungx.
/cherry-pick release-8.5-20260323-v8.5.5
close tikv#10251

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@JmPotato: new pull request created to branch.
This reverts commit c6ce272. Signed-off-by: JmPotato <github@ipotato.me>
What problem does this PR solve?
Issue Number: close #10251.
What is changed and how does it work?
Check List
Tests
Release note
Summary by CodeRabbit
- Bug Fixes
- New Features
- Tests