Skip to content

eventhubs: RetryOperation::CalculateExponentialDelay invokes UB and over-sleeps on final attempt #7132

@j7nw4r

Description

@j7nw4r

Summary

Azure::Messaging::EventHubs::_detail::RetryOperation in sdk/eventhubs/azure-messaging-eventhubs/src/retry_operation.cpp has two pre-existing bugs in its delay math, both of which were flagged by GitHub Copilot during review of #7131 but are out of scope for that change (which is narrowly about rethrowing the captured exception when retries are exhausted).

Bug 1: undefined behavior on first retry

RetryOperation::Execute initializes retryCount = 0 and passes it as attempt into ShouldRetry. If the first attempt fails (returns false or throws a retriable exception), ShouldRetry calls CalculateExponentialDelay(attempt=0, jitterFactor). Inside CalculateExponentialDelay:

auto exponentialRetryAfter = m_retryOptions.RetryDelay
    * (((attempt - 1) <= beforeLastBit) ? (1 << (attempt - 1))
                                        : (std::numeric_limits<int32_t>::max)());

With attempt = 0, this evaluates 1 << -1, which is undefined behavior per [expr.shift]. In practice on x86 it tends to mask the shift amount, but the result is implementation-defined at best and can change with optimizer settings or platforms.

Bug 2: extra backoff sleep after the final attempt

WasLastAttempt(attempt) is attempt >= MaxRetries. The loop checks ShouldRetry(..., retryCount, ...) against the pre-increment retryCount, then increments and sleep_fors. With MaxRetries = 3:

  • attempt 1 fails: WasLastAttempt(0) is false, so we sleep
  • attempt 2 fails: WasLastAttempt(1) is false, so we sleep
  • attempt 3 fails: WasLastAttempt(2) is false, so we sleep, then the loop condition retryCount < 3 becomes false and we exit

So we always sleep one extra time before reporting failure, which slows down failure surfacing for callers.

Suggested fix

Normalize the attempt index inside the delay calculation (treat the first retry as attempt=1 for the shift, and check WasLastAttempt against retryCount + 1), or restructure the loop so the sleep only happens before a subsequent attempt and never after the last one.

This is a behavior change for every existing consumer of RetryOperation (producer_client, consumer_client, partition_client, processor, processor_partition_client), so it deserves its own PR and review.

Context

Metadata

Metadata

Assignees

No one assigned

    Labels

    needs-triageWorkflow: This is a new issue that needs to be triaged to the appropriate team.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions