feat(transport): add HTTP retry with exponential backoff #1520
Conversation
Force-pushed b083a57 to a264f66
@sentry review
@cursor review
@sentry review
@cursor review
@cursor review
@cursor review
Force-pushed 243c880 to d6aa792
@cursor review
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```c
sentry__retry_trigger(sentry_retry_t *retry)
{
    sentry__bgworker_submit(retry->bgworker, retry_trigger_task, NULL, retry);
}
```
Trigger bypasses backoff and rapidly exhausts retry attempts
Medium Severity
retry_trigger_task passes UINT64_MAX as before, bypassing all backoff checks. On network failure, it immediately re-submits itself via sentry__retry_trigger. Since sentry__retry_send breaks after the first failure and bumps that file's count, and the trigger immediately re-queues, all retry attempts for every pending envelope can be exhausted in rapid succession (milliseconds) rather than over the intended hours of exponential backoff. A single mistimed call to sentry_transport_retry while offline permanently burns through all retries, causing data loss.
```c
#endif

#ifdef SENTRY_TRANSPORT_COMPRESSION
static bool
```
can_retry returns true even when retry disabled
Low Severity
sentry__transport_can_retry checks for the presence of retry_func, which is unconditionally set for all HTTP transports in sentry__http_transport_new. This means it returns true even when http_retry is false. In sentry_core.c, this causes sentry__cleanup_cache to run unnecessarily for every HTTP transport regardless of the retry option. While currently harmless (cleanup is a no-op with default cache limits), the function's semantics are misleading and could cause subtle issues if future code relies on it to indicate actual retry capability.
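A minimal sketch of the distinction the bot is pointing at; the struct and field names below (`retry_func`, `http_retry_enabled`) are illustrative stand-ins, not the PR's actual definitions:

```c
#include <stdbool.h>
#include <stddef.h>

// Illustrative transport shape; field names are assumptions.
typedef struct {
    void (*retry_func)(void *state); // installed for every HTTP transport
    bool http_retry_enabled;         // mirrors the `http_retry` option
} transport_t;

// Checking only the callback reports "can retry" for every HTTP transport,
// because the callback is unconditionally installed:
static bool
can_retry_by_callback(const transport_t *t)
{
    return t->retry_func != NULL;
}

// Also consulting the option makes the answer match the function's name:
static bool
can_retry_by_option(const transport_t *t)
{
    return t->retry_func != NULL && t->http_retry_enabled;
}
```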
The deferred startup retry scan (100ms delay) could pick up files written by the current session. Filter by startup_time so only previous-session files are processed. Also ensure the cache directory exists when cache_keep is enabled, since sentry__process_old_runs only creates it conditionally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Monotonic time is process-relative and doesn't work across restarts. Retry envelope timestamps need to persist across sessions, so use time() (seconds since epoch) for file timestamps, startup_time, and backoff comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
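A small illustration of why wall-clock seconds are needed here; the 15-minute threshold is only an example value:

```c
#include <stdio.h>
#include <time.h>

int
main(void)
{
    // time() counts seconds since the Unix epoch, so a value written into a
    // file name by one process is still meaningful after a restart, unlike a
    // monotonic clock reading.
    time_t written_at = time(NULL);

    // ... process exits, restarts later ...

    time_t now = time(NULL);
    long waited = (long)difftime(now, written_at);

    // Example backoff check: eligible once at least 15 minutes have passed.
    const long backoff_s = 15 * 60;
    printf("eligible: %s\n", waited >= backoff_s ? "yes" : "no");
    return 0;
}
```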
Rename SENTRY_RETRY_BACKOFF_BASE_MS to SENTRY_RETRY_BACKOFF_BASE_S and sentry__retry_backoff_ms to sentry__retry_backoff, since file timestamps are now in seconds. The bgworker delay sites multiply by 1000 to convert to the milliseconds it expects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
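A sketch of a seconds-based backoff computation with the millisecond conversion at the submit site, as this commit describes. Only the constant's name comes from the commit; the doubling shown here is illustrative, and the PR's published schedule (15min, 30min, 1h, 2h, 8h) deviates from pure doubling on the last step:

```c
#include <stdint.h>
#include <stdio.h>

// 15-minute base from the PR description; name from the commit message.
#define SENTRY_RETRY_BACKOFF_BASE_S (15 * 60)

// Backoff before the n-th retry, in seconds: base doubled per prior attempt.
static uint64_t
retry_backoff(uint32_t attempt)
{
    return (uint64_t)SENTRY_RETRY_BACKOFF_BASE_S << attempt;
}

int
main(void)
{
    for (uint32_t n = 0; n < 5; n++) {
        uint64_t delay_s = retry_backoff(n);
        // A worker API that expects milliseconds would receive delay_s * 1000
        // at the call site, as the commit message notes.
        printf("attempt %u: %llu s (%llu ms)\n", n,
            (unsigned long long)delay_s,
            (unsigned long long)(delay_s * 1000));
    }
    return 0;
}
```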
Make sentry__retry_flush block until the flush task completes by adding a bgworker_flush call, and subtract the elapsed time from the shutdown timeout. This ensures retries are actually sent before the worker stops. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Break out of the send loop on the first network error to avoid wasting time on a dead connection. Remaining envelopes stay untouched for the next retry poll. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
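A self-contained sketch of the break-on-first-failure loop this commit describes; the types and the stubbed send function are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical pending-envelope record.
typedef struct { const char *path; } pending_envelope_t;

// Stub standing in for the real network send; returns false on network error.
static bool
send_one(const pending_envelope_t *e)
{
    (void)e;
    return false;
}

// Stop at the first network error: the connection is likely dead, and the
// untouched remainder stays on disk for the next retry poll.
static size_t
send_pending(const pending_envelope_t *items, size_t len)
{
    size_t sent = 0;
    for (size_t i = 0; i < len; i++) {
        if (!send_one(&items[i])) {
            break; // leave items[i..] for the next poll
        }
        sent++;
    }
    return sent;
}

int
main(void)
{
    pending_envelope_t items[] = { { "a" }, { "b" }, { "c" } };
    printf("sent %zu of 3\n", send_pending(items, 3));
    return 0;
}
```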
When bgworker shutdown times out, persist any remaining queued envelopes to the retry directory so they are not lost. The retry module provides sentry__retry_dump_queue to keep retry internals out of the transport. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After shutdown timeout, the bgworker thread is detached but may still be executing an http_send_task. Since dump_queue already saves that task's envelope to the retry dir, the worker's subsequent call to retry_enqueue would create a duplicate file. Seal the retry module after dumping so that any late enqueue calls are silently skipped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
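A minimal sketch of the "seal, then silently skip late enqueues" idea; the flag and function names are illustrative, and the PR keeps the equivalent state inside its retry module:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

// Illustrative module-level flag.
static atomic_int g_sealed = 0;

// Called by the main thread after it has dumped the queue to disk.
static void
retry_seal(void)
{
    atomic_store(&g_sealed, 1);
}

// Called by the (possibly detached) worker thread; once sealed, the request
// is silently dropped instead of writing a duplicate retry file.
static bool
retry_enqueue(const char *envelope_name)
{
    if (atomic_load(&g_sealed)) {
        return false;
    }
    printf("would persist %s\n", envelope_name);
    return true;
}

int
main(void)
{
    retry_enqueue("a.envelope"); // persisted
    retry_seal();
    retry_enqueue("b.envelope"); // skipped: module is sealed
    return 0;
}
```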
… logic Remove count_eligible_files helper that duplicated filtering logic from sentry__retry_send. The retry_backoff test now exercises the actual send path for both backoff and startup modes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store parsed fields (ts, count, uuid) alongside the path during the filter phase so handle_result and future debug logging can use them without re-parsing. Also improves sort performance by comparing numeric fields before falling back to string comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
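A sketch of the parse-once record and numeric-first comparator this commit describes; the struct layout is an assumption for illustration:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Parsed once during the filter phase, reused by handle_result and logging.
typedef struct {
    uint64_t ts;     // epoch seconds from the file name
    uint32_t count;  // retry count from the file name
    char uuid[37];   // envelope UUID from the file name
    char *path;      // full path, for sending / moving the file
} retry_file_t;

// Sort oldest-first; compare the parsed numbers before falling back to the
// UUID string so the common case avoids string comparison entirely.
static int
retry_file_cmp(const void *a, const void *b)
{
    const retry_file_t *fa = a;
    const retry_file_t *fb = b;
    if (fa->ts != fb->ts) {
        return fa->ts < fb->ts ? -1 : 1;
    }
    if (fa->count != fb->count) {
        return fa->count < fb->count ? -1 : 1;
    }
    return strcmp(fa->uuid, fb->uuid);
}

// usage: qsort(files, n, sizeof(retry_file_t), retry_file_cmp);
```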
Log retry attempts at DEBUG level and max-retries-reached at WARN level to make retry behavior observable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…writes Three places independently constructed <database>/cache and wrote envelopes there. Add cache_path to sentry_run_t and introduce sentry__run_write_cache() and sentry__run_move_cache() to centralize the cache directory creation and file operations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CURLOPT_TIMEOUT_MS is a total transfer timeout that could cut off large envelopes. Use CURLOPT_CONNECTTIMEOUT_MS instead so only connection establishment is bounded. For winhttp, limit resolve and connect to 15s but leave send/receive at their defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
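For the libcurl side, a sketch of the distinction between the two options; the exact millisecond value used by the PR changed across commits (15s for winhttp here, 30000ms for curl in a later commit), so the number below is only illustrative:

```c
#include <curl/curl.h>

// Bound only connection establishment, not the whole transfer, so large
// envelopes on slow links are not cut off mid-upload.
static void
configure_timeouts(CURL *curl)
{
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT_MS, 30000L);
    // Deliberately NOT setting CURLOPT_TIMEOUT_MS: that is a total transfer
    // timeout and would abort long-running uploads.
}
```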
Without this, sentry__retry_send overcounts remaining files, causing an unnecessary extra poll cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure handle_result so "max retries reached" warnings only fire on actual network failures, not on successful delivery at the last attempt. Separate the warning logic from the cache/discard actions and put the re-enqueue branch first for clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the `can_retry` bool on the transport with a `retry_func` callback, and expose `sentry_transport_retry()` as an experimental public API for explicitly retrying all pending envelopes, e.g. when coming back online. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
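A hypothetical usage sketch for the experimental API this commit exposes; `sentry_transport_retry()` is named in the commit, but its exact signature is assumed to be parameterless here, and the connectivity callback is not part of the SDK:

```c
#include <sentry.h>

// App-level hook fired by some platform connectivity monitor (hypothetical).
static void
on_network_restored(void)
{
    // Ask the transport to retry all pending envelopes now instead of
    // waiting for the next backoff window. Check the header added by the
    // PR for the exact signature.
    sentry_transport_retry();
}
```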
Move retry envelopes from a separate retry/ directory into cache/ so that sentry__cleanup_cache() enforces disk limits for both file formats out of the box. The two formats are distinguishable by length: retry files use <ts>-<count>-<uuid>.envelope (49+ chars) while cache files use <uuid>.envelope (45 chars). Default http_retries to 0 (opt-in). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
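A sketch of the length-based classification this commit relies on, assuming the name already ends in `.envelope`; helper names are illustrative:

```c
#include <stdbool.h>
#include <string.h>

// Cache files are exactly `<uuid>.envelope`: 36-char UUID + ".envelope" = 45.
// Retry files are `<ts>-<n>-<uuid>.envelope`, at least 49 chars, so plain
// length is enough to tell the two formats apart.
static bool
is_retry_file(const char *filename)
{
    return strlen(filename) >= 49;
}

static bool
is_cache_file(const char *filename)
{
    return strlen(filename) == 45;
}
```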
When bgworker is detached during shutdown timeout, retry_poll_task can access retry->run->cache_path after sentry_options_free frees the run. Clone the path so it outlives options and is freed with the bgworker. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The bgworker_flush in sentry__retry_flush would delay its flush_task by min(delayed_task_time, timeout) when a 15-minute delayed retry_poll_task existed. This consumed the entire shutdown timeout, leaving 0ms for bgworker_shutdown, which then detached the worker thread. On Windows, winhttp_client_shutdown would close handles still in use by the detached thread, causing a crash. The flush is unnecessary because retry_flush_task is an immediate task and bgworker_shutdown already processes all immediate tasks before the shutdown_task runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit removed bgworker_flush from retry_flush, which caused a race between WinHTTP connect timeout (~2s) and bgworker shutdown (2s). Restore the flush and pass the full timeout to both flush and shutdown — after flush drains in-flight work, shutdown completes near-instantly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make retry count an internal constant (SENTRY_RETRY_ATTEMPTS = 5) and expose only a boolean toggle. Enabled by default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0 means infinite, not default. Pass 30000ms to match WinHTTP defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a 'scheduled' flag with atomic compare-and-swap to ensure at most one retry_poll_task is queued at a time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
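A minimal sketch of the compare-and-swap guard this commit describes, using C11 atomics for illustration rather than the SDK's own atomic helpers:

```c
#include <stdatomic.h>
#include <stdbool.h>

// Illustrative flag; the PR keeps an equivalent inside its retry state.
static atomic_int g_poll_scheduled = 0;

// Returns true only for the caller that wins the race and should actually
// submit the delayed poll task; everyone else sees the flag set and skips.
static bool
try_schedule_poll(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&g_poll_scheduled, &expected, 1);
}

// The poll task clears the flag when it runs, allowing the next attempt.
static void
poll_task_done(void)
{
    atomic_store(&g_poll_scheduled, 0);
}
```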
Move `sealed = 1` before `foreach_matching` in `retry_dump_queue` to prevent the detached worker from writing duplicate envelopes via `retry_enqueue` while the main thread is dumping the queue. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop the delayed retry_poll_task before bgworker_flush to prevent it from delaying the flush_task by min(retry_interval, timeout). Subtract elapsed flush time from the shutdown timeout so the total is bounded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the bgworker is detached after shutdown timeout, retry_dump_queue writes retry files and sets sealed=1. The detached thread could then run retry_flush_task and re-send those files, causing duplicates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The retry system writes cache files directly via its own paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed d6aa792 to fbbffb2


Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.
Failed envelopes are stored as `<db>/cache/<ts>-<n>-<uuid>.envelope` and retried on startup after a 100ms throttle, and then with exponential backoff (15min, 30min, 1h, 2h, 8h). When retries are exhausted and offline caching is enabled, envelopes are stored as `<db>/cache/<uuid>.envelope` instead of being discarded.

```mermaid
flowchart TD
    startup --> R{retry?}
    R -->|yes| throttle
    R -->|no| C{cache?}
    throttle -. 100ms .-> resend
    resend -->|success| C
    resend -->|fail| C2[<db>/cache/<br/><ts>-<n>-<uuid>.envelope]
    C2 --> backoff
    backoff -. 2ⁿ×15min .-> resend
    C -->|yes| CACHE[<db>/cache/<br/><uuid>.envelope]
    C -->|no| discard
```

See also: https://develop.sentry.dev/sdk/expected-features/#buffer-to-disk
Depends on:
See also: