fix(scheduler): clamp backoff exponent to prevent duration overflow #1677

Open
devteamaegis wants to merge 1 commit into beam-cloud:main from devteamaegis:fix/backoff-overflow

@devteamaegis devteamaegis commented Jun 11, 2026

What's broken

calculateBackoffDelay computes retry backoff as time.Duration(math.Pow(2, retryCount)) * baseDelay and then clamps it to a 5s maxDelay. For large retryCount, the time.Duration (an int64 count of nanoseconds) overflows before the clamp and wraps to 0 or a negative value. The clamp then never fires, so failing requests get zero or negative backoff instead of the intended 5s cap. retryCount can reach maxScheduleRetryCount = 120, so these values are reachable:

retryCount=3   -> 5s    (correct, clamped)
retryCount=62  -> 0s    (bug: no backoff)
retryCount=63  -> -1s   (bug: negative)
retryCount=119 -> -1s   (bug)

A zero or negative duration makes the backlog reschedule immediately, which turns the backoff cap into a retry storm, the opposite of its purpose.

Why it happens

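Converting math.Pow(2, retryCount) to time.Duration and multiplying by the 1s base delay overflows int64. The product first exceeds the int64 range at retryCount = 34 (2^34 * 10^9 ns > 2^63 - 1), and at retryCount = 62 it wraps to exactly 0. For exponents of 63 and above, 2^retryCount itself no longer fits in int64, and Go leaves the result of an out-of-range float-to-integer conversion implementation-dependent, so the exact wrapped value can vary by platform. Whenever the wrapped value is <= maxDelay, the if delay > maxDelay check is skipped.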

Fix

Clamp the exponent before the power: if retryCount exceeds a safe bound, return maxDelay directly.

Test

TestCalculateBackoffDelay asserts retryCount 62/63/119 return the 5s cap (not 0/negative); fails before, passes after. Full scheduler package still green.


Summary by cubic

Clamps the retry backoff exponent in pkg/scheduler to prevent time.Duration overflow that caused zero/negative delays and retry storms. High retry counts now correctly return the 5s cap.

  • Bug Fixes
    • Clamp before power: if retryCount > 32, return maxDelay (5s).
    • Added TestCalculateBackoffDelay to cover large counts (62, 63, 119) and assert non-negative, capped delays.

Written for commit 8f87ad0. Summary will update on new commits.

calculateBackoffDelay used time.Duration(math.Pow(2, retryCount)) which
overflows int64 once retryCount reaches ~62. The wrapped value (0 or a
negative duration) slips past the maxDelay clamp, so high retry counts
silently get no backoff instead of the intended 5s cap. Retry counts up
to maxScheduleRetryCount (120) are reachable, so this is hit in practice.

Clamp the exponent before the power computation.