Conversation
Pull request overview
This PR adds retry logic with exponential backoff to handle HTTP 429 (Too Many Requests) rate limiting errors when pulling Docker images in CI. The implementation prevents CI failures caused by Docker registry rate limits by automatically retrying failed pulls with increasing delays.
Changes:
- Added retry loop with configurable MAX_RETRIES (5 attempts) and exponential backoff strategy
- Implemented 429 error detection by checking for "toomanyrequests" or "429" in stderr output
- Added parsing of "retry-after" header from registry responses to respect server-suggested delays
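Taken together, the retry flow looks roughly like the sketch below. This is a simplified reconstruction from the diff excerpts quoted in this review, not the exact patch: MAX_RETRIES, BASE_DELAY, full_image_name, and the error-matching strings come from the diff, while the surrounding function and the re-raise behavior for non-429 errors are assumptions.

    import re
    import subprocess
    import time

    MAX_RETRIES = 5   # pull attempts before giving up (value from the diff)
    BASE_DELAY = 1.0  # seconds; doubled on each successive retry (value from the diff)

    def pull_with_backoff(full_image_name: str) -> None:
        for attempt in range(MAX_RETRIES):
            try:
                subprocess.run(
                    ["docker", "pull", full_image_name],
                    check=True,
                    capture_output=True,
                    text=True,
                )
                break  # Success, exit retry loop
            except subprocess.CalledProcessError as e:
                stderr_output = (e.stderr or "").lower()
                # Only retry on registry rate limiting (assumed handling for other errors)
                is_rate_limited = "toomanyrequests" in stderr_output or "429" in stderr_output
                if not is_rate_limited or attempt == MAX_RETRIES - 1:
                    raise
                # Parse retry-after hint (milliseconds) if the registry provided one
                retry_after = None
                match = re.search(r"retry-after:\s*([\d.]+)\s*ms", stderr_output, re.IGNORECASE)
                if match:
                    retry_after = float(match.group(1)) / 1000
                # Use retry-after if available, otherwise exponential backoff
                delay = retry_after if retry_after else BASE_DELAY * (2 ** attempt)
                time.sleep(delay)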
        )
        break  # Success, exit retry loop
    except subprocess.CalledProcessError as e:
        stderr_output = e.stderr.lower()
While capture_output=True with text=True typically guarantees stderr is a string, add defensive programming by using 'e.stderr or ""' before calling .lower() to handle any edge cases where stderr might be None. This prevents potential AttributeError exceptions.
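As a concrete one-liner, using the names from the diff above, the defensive version would look like:

    stderr_output = (e.stderr or "").lower()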
        # Parse retry-after hint if available
        retry_after = None
        match = re.search(
            r"retry-after:\s*([\d.]+)\s*ms", e.stderr, re.IGNORECASE
The regex pattern on line 43 searches against e.stderr (the original, un-lowercased string), while line 35 lowercases it into stderr_output. A header written as "Retry-After" (the standard HTTP capitalization) will still match thanks to the re.IGNORECASE flag, but the inconsistency between using e.stderr here and stderr_output elsewhere could lead to confusion. Consider using the same variable in both places, or applying the same transformation to both.
| r"retry-after:\s*([\d.]+)\s*ms", e.stderr, re.IGNORECASE | |
| r"retry-after:\s*([\d.]+)\s*ms", stderr_output, re.IGNORECASE |
        subprocess.run(
            ["docker", "pull", full_image_name],
            check=True,
            capture_output=True,
            text=True,
        )
The success message from docker pull is not printed, which makes the output silent on success. The original code allowed docker pull to print to stdout/stderr directly. With capture_output=True, successful pulls produce no output, making it harder to debug CI issues. Consider printing a success message or the stdout output on successful pulls to maintain visibility into what's happening during CI runs.
Suggested change:
-        subprocess.run(
-            ["docker", "pull", full_image_name],
-            check=True,
-            capture_output=True,
-            text=True,
-        )
+        result = subprocess.run(
+            ["docker", "pull", full_image_name],
+            check=True,
+            capture_output=True,
+            text=True,
+        )
+        if result.stdout:
+            print(result.stdout, end="")
+        print(f"Successfully pulled Docker image: {full_image_name}")
            retry_after = float(match.group(1)) / 1000
        # Use retry-after if available, otherwise exponential backoff
        delay = retry_after if retry_after else BASE_DELAY * (2**attempt)
The exponential backoff calculation uses 2**attempt: for attempt=0 that is 1 (2^0), for attempt=1 it is 2 (2^1), and so on. Since the delay is BASE_DELAY * (2**attempt), attempt=0 waits 1.0 * 1 = 1 second and attempt=4 (the last retry) waits 1.0 * 16 = 16 seconds. This is a reasonable backoff strategy but can result in long delays. Consider documenting the expected backoff sequence in a comment, or adjusting BASE_DELAY if needed.
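For reference, assuming BASE_DELAY = 1.0 and MAX_RETRIES = 5 as in the diff and no retry-after hint, the sequence of sleeps works out to:

    # attempt:  0    1    2    3    4
    # delay:    1s   2s   4s   8s   16s
    delays = [1.0 * (2 ** attempt) for attempt in range(5)]
    print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0] -- up to ~31 s of waiting if every attempt fails
                   # (less if the script skips the sleep after the final attempt)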
Change Summary
Add 429 backoff support to the docker image puller
Rationale
My CI jobs kept failing when image pulls hit Docker registry rate limits (HTTP 429)
Impact
Just CI