Summary
Azure::Messaging::EventHubs::_detail::RetryOperation in sdk/eventhubs/azure-messaging-eventhubs/src/retry_operation.cpp has two pre-existing bugs in its delay math, both of which were flagged by GitHub Copilot during review of #7131 but are out of scope for that change (which is narrowly about rethrowing the captured exception when retries are exhausted).
Bug 1: undefined behavior on first retry
RetryOperation::Execute initializes retryCount = 0 and passes it as attempt into ShouldRetry. If the first attempt fails (returns false or throws a retriable exception), ShouldRetry calls CalculateExponentialDelay(attempt=0, jitterFactor). Inside CalculateExponentialDelay:
auto exponentialRetryAfter = m_retryOptions.RetryDelay
* (((attempt - 1) <= beforeLastBit) ? (1 << (attempt - 1))
: (std::numeric_limits<int32_t>::max)());
With attempt = 0, this evaluates 1 << -1, which is undefined behavior per [expr.shift]. In practice on x86 it tends to mask the shift amount, but the result is implementation-defined at best and can change with optimizer settings or platforms.
Bug 2: extra backoff sleep after the final attempt
WasLastAttempt(attempt) is attempt >= MaxRetries. The loop checks ShouldRetry(..., retryCount, ...) against the pre-increment retryCount, then increments and sleep_fors. With MaxRetries = 3:
- attempt 1 fails:
WasLastAttempt(0) is false, so we sleep
- attempt 2 fails:
WasLastAttempt(1) is false, so we sleep
- attempt 3 fails:
WasLastAttempt(2) is false, so we sleep, then the loop condition retryCount < 3 becomes false and we exit
So we always sleep one extra time before reporting failure, which slows down failure surfacing for callers.
Suggested fix
Normalize the attempt index inside the delay calculation (treat the first retry as attempt=1 for the shift, and check WasLastAttempt against retryCount + 1), or restructure the loop so the sleep only happens before a subsequent attempt and never after the last one.
This is a behavior change for every existing consumer of RetryOperation (producer_client, consumer_client, partition_client, processor, processor_partition_client), so it deserves its own PR and review.
Context
Summary
Azure::Messaging::EventHubs::_detail::RetryOperationinsdk/eventhubs/azure-messaging-eventhubs/src/retry_operation.cpphas two pre-existing bugs in its delay math, both of which were flagged by GitHub Copilot during review of #7131 but are out of scope for that change (which is narrowly about rethrowing the captured exception when retries are exhausted).Bug 1: undefined behavior on first retry
RetryOperation::ExecuteinitializesretryCount = 0and passes it asattemptintoShouldRetry. If the first attempt fails (returnsfalseor throws a retriable exception),ShouldRetrycallsCalculateExponentialDelay(attempt=0, jitterFactor). InsideCalculateExponentialDelay:With
attempt = 0, this evaluates1 << -1, which is undefined behavior per [expr.shift]. In practice on x86 it tends to mask the shift amount, but the result is implementation-defined at best and can change with optimizer settings or platforms.Bug 2: extra backoff sleep after the final attempt
WasLastAttempt(attempt)isattempt >= MaxRetries. The loop checksShouldRetry(..., retryCount, ...)against the pre-incrementretryCount, then increments andsleep_fors. WithMaxRetries = 3:WasLastAttempt(0)is false, so we sleepWasLastAttempt(1)is false, so we sleepWasLastAttempt(2)is false, so we sleep, then the loop conditionretryCount < 3becomes false and we exitSo we always sleep one extra time before reporting failure, which slows down failure surfacing for callers.
Suggested fix
Normalize the attempt index inside the delay calculation (treat the first retry as
attempt=1for the shift, and checkWasLastAttemptagainstretryCount + 1), or restructure the loop so the sleep only happens before a subsequent attempt and never after the last one.This is a behavior change for every existing consumer of
RetryOperation(producer_client,consumer_client,partition_client,processor,processor_partition_client), so it deserves its own PR and review.Context
RetryOperationTest.ShouldRetryFalse1/ShouldRetryFalse2already exercise the shift-by-negative path.