fix(scheduler): clamp backoff exponent to prevent duration overflow #1677
Open
devteamaegis wants to merge 1 commit into
calculateBackoffDelay used time.Duration(math.Pow(2, retryCount)) which overflows int64 once retryCount reaches ~62. The wrapped value (0 or a negative duration) slips past the maxDelay clamp, so high retry counts silently get no backoff instead of the intended 5s cap. Retry counts up to maxScheduleRetryCount (120) are reachable, so this is hit in practice. Clamp the exponent before the power computation.
What's broken
calculateBackoffDelay computes retry backoff as time.Duration(math.Pow(2, retryCount)) * baseDelay and clamps it to a 5s maxDelay. For large retryCount, the time.Duration (int64 ns) overflows before the clamp, wrapping to 0 and then negative, so the clamp never fires and failing requests get zero/negative backoff instead of the intended 5s cap.
retryCount is reachable up to maxScheduleRetryCount = 120.
A negative/zero duration makes the backlog reschedule immediately, turning the backoff cap into a retry storm, the opposite of its purpose.
Why it happens
math.Pow(2, retryCount) converted to time.Duration and multiplied by 1s overflows int64 around retryCount >= 62; the overflowed value is <= maxDelay, so if delay > maxDelay is skipped.
Fix
Clamp the exponent before the power: if retryCount exceeds a safe bound, return maxDelay directly.
Test
TestCalculateBackoffDelay asserts retryCount 62/63/119 return the 5s cap (not 0/negative); fails before, passes after. Full scheduler package still green.
Summary by cubic
Clamps the retry backoff exponent in pkg/scheduler to prevent time.Duration overflow that caused zero/negative delays and retry storms. High retry counts now correctly return the 5s cap.
When retryCount > 32, return maxDelay (5s).
TestCalculateBackoffDelay covers large counts (62, 63, 119) and asserts non-negative, capped delays.
Written for commit 8f87ad0. Summary will update on new commits.