
feat(transport): add HTTP retry with exponential backoff #1520

Draft
jpnurmi wants to merge 61 commits into master from jpnurmi/feat/http-retry

Conversation

@jpnurmi
Collaborator

@jpnurmi jpnurmi commented Feb 13, 2026

Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.

Failed envelopes are stored as `<db>/cache/<ts>-<n>-<uuid>.envelope` and retried on startup after a 100ms throttle, then with exponential backoff (15min, 30min, 1h, 2h, 8h). When retries are exhausted and offline caching is enabled, envelopes are stored as `<db>/cache/<uuid>.envelope` instead of being discarded.
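
A minimal sketch of the eligibility check this implies, using illustrative constant and parameter names rather than the PR's actual identifiers:

```c
#include <stdbool.h>
#include <stdint.h>

#define RETRY_BACKOFF_BASE_S (15 * 60) /* 15 minutes, illustrative */
#define RETRY_MAX_ATTEMPTS 5           /* illustrative */

/* ts: epoch seconds parsed from the <ts>-<n>-<uuid>.envelope file name,
 * n: previous attempt count, now: current epoch seconds */
static bool
retry_is_due(uint64_t ts, unsigned n, uint64_t now)
{
    if (n >= RETRY_MAX_ATTEMPTS) {
        return false; /* exhausted: cache or discard instead of retrying */
    }
    uint64_t backoff = (uint64_t)RETRY_BACKOFF_BASE_S << n; /* 2^n * 15min */
    return now >= ts + backoff;
}
```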

flowchart TD
    startup --> R{retry?}
    R -->|yes| throttle
    R -->|no| C{cache?}
    throttle -. 100ms .-> resend
    resend -->|success| C
    resend -->|fail| C2[&lt;db&gt;/cache/<br/>&lt;ts&gt;-&lt;n&gt;-&lt;uuid&gt;.envelope]
    C2 --> backoff
    backoff -. 2ⁿ×15min .-> resend
    C -->|yes| CACHE[&lt;db&gt;/cache/<br/>&lt;uuid&gt;.envelope]
    C -->|no| discard

See also: https://develop.sentry.dev/sdk/expected-features/#buffer-to-disk

Depends on:

See also:

@github-actions

github-actions bot commented Feb 13, 2026

Messages
📖 Do not forget to update Sentry-docs with your feature once the pull request gets approved.

Generated by 🚫 dangerJS against fbbffb2

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from b083a57 to a264f66 on February 13, 2026 at 17:47
@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@sentry review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@sentry review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from 243c880 to d6aa792 on February 16, 2026 at 19:41
@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

sentry__retry_trigger(sentry_retry_t *retry)
{
    sentry__bgworker_submit(retry->bgworker, retry_trigger_task, NULL, retry);
}

Trigger bypasses backoff and rapidly exhausts retry attempts

Medium Severity

retry_trigger_task passes UINT64_MAX as before, bypassing all backoff checks. On network failure, it immediately re-submits itself via sentry__retry_trigger. Since sentry__retry_send breaks after the first failure and bumps that file's count, and the trigger immediately re-queues, all retry attempts for every pending envelope can be exhausted in rapid succession (milliseconds) rather than over the intended hours of exponential backoff. A single mistimed call to sentry_transport_retry while offline permanently burns through all retries, causing data loss.
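
To make the reported pattern concrete, a condensed sketch (function names and signatures are paraphrased from the review, not copied from the diff):

```c
/* Sketch of the hazard Bugbot describes, not the PR's actual code. */
static void
retry_trigger_task(void *data, void *state)
{
    (void)state;
    sentry_retry_t *retry = data;
    /* Passing UINT64_MAX as the cutoff makes every pending file eligible,
     * skipping the exponential-backoff check entirely ... */
    if (!sentry__retry_send(retry, /* before */ UINT64_MAX)) {
        /* ... and an immediate re-trigger on failure can burn through all
         * remaining attempts in milliseconds instead of hours. */
        sentry__retry_trigger(retry);
    }
}
```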


#endif

#ifdef SENTRY_TRANSPORT_COMPRESSION
static bool

can_retry returns true even when retry disabled

Low Severity

sentry__transport_can_retry checks for the presence of retry_func, which is unconditionally set for all HTTP transports in sentry__http_transport_new. This means it returns true even when http_retry is false. In sentry_core.c, this causes sentry__cleanup_cache to run unnecessarily for every HTTP transport regardless of the retry option. While currently harmless (cleanup is a no-op with default cache limits), the function's semantics are misleading and could cause subtle issues if future code relies on it to indicate actual retry capability.

Additional Locations (2)
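
One possible tightening, sketched with hypothetical struct members (the transport's internal layout here is an assumption, not the PR's code): derive the capability from the retry option rather than from the callback's mere presence.

```c
#include <stdbool.h>

/* Hypothetical sketch only. */
bool
sentry__transport_can_retry(const sentry_transport_t *transport)
{
    /* retry_func is set for every HTTP transport, so also check the
     * option that actually enables retries. */
    return transport && transport->retry_func && transport->http_retry;
}
```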


jpnurmi and others added 10 commits February 16, 2026 21:16
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The deferred startup retry scan (100ms delay) could pick up files
written by the current session. Filter by startup_time so only
previous-session files are processed. Also ensure the cache directory
exists when cache_keep is enabled, since sentry__process_old_runs
only creates it conditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
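
A minimal sketch of the described filter, with illustrative names: the deferred startup scan only picks up files whose timestamp predates this session's startup time.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative: decide whether a <ts>-<n>-<uuid>.envelope file in the
 * cache directory was written by a previous session. The exact parsing
 * in the PR may differ. */
static bool
is_previous_session_file(const char *filename, uint64_t startup_time)
{
    char *end = NULL;
    uint64_t file_ts = strtoull(filename, &end, 10);
    if (*end != '-') {
        return false; /* not a retry file (e.g. a plain <uuid>.envelope) */
    }
    return file_ts < startup_time;
}
```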
Monotonic time is process-relative and doesn't work across restarts.
Retry envelope timestamps need to persist across sessions, so use
time() (seconds since epoch) for file timestamps, startup_time, and
backoff comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename SENTRY_RETRY_BACKOFF_BASE_MS to SENTRY_RETRY_BACKOFF_BASE_S
and sentry__retry_backoff_ms to sentry__retry_backoff, since file
timestamps are now in seconds. The bgworker delay sites multiply
by 1000 to convert to the milliseconds it expects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi and others added 29 commits February 16, 2026 21:16
Make sentry__retry_flush block until the flush task completes by adding
a bgworker_flush call, and subtract the elapsed time from the shutdown
timeout. This ensures retries are actually sent before the worker stops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
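
A sketch of the flush-then-shutdown budgeting this describes; helper names follow the commit message, but exact signatures and the time source are assumptions.

```c
#include <stdint.h>

/* Assumed helper: current monotonic time in milliseconds. */
extern uint64_t now_ms(void);

/* Illustrative: flush first, then give shutdown whatever time is left,
 * so the total stays bounded by the caller's timeout. */
static int
transport_shutdown(sentry_bgworker_t *bgworker, uint64_t timeout_ms)
{
    uint64_t start = now_ms();
    sentry__bgworker_flush(bgworker, timeout_ms);
    uint64_t elapsed = now_ms() - start;
    uint64_t remaining = elapsed < timeout_ms ? timeout_ms - elapsed : 0;
    return sentry__bgworker_shutdown(bgworker, remaining);
}
```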
Break out of the send loop on the first network error to avoid wasting
time on a dead connection. Remaining envelopes stay untouched for the
next retry poll.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
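
A condensed sketch of the loop shape this describes (types and helpers are illustrative, not the PR's code):

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    char path[512];
    int attempts;
} retry_file_t; /* illustrative stand-in for the PR's per-file record */

/* Assumed helper: returns 0 on success, nonzero on a network error. */
extern int send_envelope_file(const char *path);

static void
retry_send_all(retry_file_t *files, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (send_envelope_file(files[i].path) != 0) {
            files[i].attempts++; /* bump only the file that just failed */
            break; /* connection looks dead: leave the rest for the next poll */
        }
        remove(files[i].path); /* delivered: drop the on-disk copy */
    }
}
```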
When bgworker shutdown times out, persist any remaining queued envelopes
to the retry directory so they are not lost. The retry module provides
sentry__retry_dump_queue to keep retry internals out of the transport.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After shutdown timeout, the bgworker thread is detached but may still
be executing an http_send_task. Since dump_queue already saves that
task's envelope to the retry dir, the worker's subsequent call to
retry_enqueue would create a duplicate file. Seal the retry module
after dumping so that any late enqueue calls are silently skipped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
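
A reduced sketch of the sealing idea (standard C11 atomics here are illustrative; the PR may use sentry's own atomic helpers): once the main thread has dumped the queue, late enqueue calls from the detached worker become no-ops.

```c
#include <stdatomic.h>

static atomic_int g_retry_sealed; /* illustrative module-level flag */

static void
retry_enqueue(const char *envelope_path)
{
    if (atomic_load(&g_retry_sealed)) {
        return; /* detached worker arrived after dump_queue: skip silently */
    }
    /* ... write <ts>-<n>-<uuid>.envelope into the cache directory ... */
    (void)envelope_path;
}

static void
retry_dump_queue(void)
{
    atomic_store(&g_retry_sealed, 1); /* seal *before* persisting the queue */
    /* ... persist any still-queued envelopes to the retry directory ... */
}
```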
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… logic

Remove count_eligible_files helper that duplicated filtering logic
from sentry__retry_send. The retry_backoff test now exercises the
actual send path for both backoff and startup modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store parsed fields (ts, count, uuid) alongside the path during the
filter phase so handle_result and future debug logging can use them
without re-parsing. Also improves sort performance by comparing
numeric fields before falling back to string comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
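
A sketch of the per-file record and ordering this describes, with illustrative names: parse once, keep the numeric fields next to the path, and compare them before falling back to a string comparison.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for the PR's per-file record. */
typedef struct {
    char *path;
    uint64_t ts;    /* epoch seconds from the file name */
    uint32_t count; /* previous attempt count */
    char uuid[37];  /* 36-char uuid + NUL */
} retry_entry_t;

/* qsort-compatible comparator: oldest timestamp first. */
static int
retry_entry_cmp(const void *lhs, const void *rhs)
{
    const retry_entry_t *a = lhs, *b = rhs;
    if (a->ts != b->ts) {
        return a->ts < b->ts ? -1 : 1;
    }
    if (a->count != b->count) {
        return a->count < b->count ? -1 : 1;
    }
    return strcmp(a->uuid, b->uuid); /* stable tie-break */
}
```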
Log retry attempts at DEBUG level and max-retries-reached at WARN
level to make retry behavior observable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…writes

Three places independently constructed <database>/cache and wrote
envelopes there. Add cache_path to sentry_run_t and introduce
sentry__run_write_cache() and sentry__run_move_cache() to centralize
the cache directory creation and file operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CURLOPT_TIMEOUT_MS is a total transfer timeout that could cut off large
envelopes. Use CURLOPT_CONNECTTIMEOUT_MS instead so only connection
establishment is bounded. For winhttp, limit resolve and connect to 15s
but leave send/receive at their defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
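
On the curl side this boils down to roughly the following; the 30s value is illustrative (a later commit aligns it with WinHTTP's defaults):

```c
#include <curl/curl.h>

static void
configure_timeouts(CURL *curl)
{
    /* CURLOPT_TIMEOUT_MS would bound the whole transfer and could cut off
     * large envelopes; bound only connection establishment instead. */
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT_MS, 30000L);
}
```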
Without this, sentry__retry_send overcounts remaining files, causing an
unnecessary extra poll cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure handle_result so "max retries reached" warnings only fire
on actual network failures, not on successful delivery at the last
attempt. Separate the warning logic from the cache/discard actions and
put the re-enqueue branch first for clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the `can_retry` bool on the transport with a `retry_func`
callback, and expose `sentry_transport_retry()` as an experimental
public API for explicitly retrying all pending envelopes, e.g. when
coming back online.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
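
Assuming the no-argument form the commit message implies (the final public signature may differ), application-side usage might look like:

```c
#include <sentry.h>

/* Hypothetical hook invoked by the app when connectivity returns. */
static void
on_network_online(void)
{
    /* Experimental API from this PR: retry all pending envelopes now
     * instead of waiting for the next backoff window. */
    sentry_transport_retry();
}
```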
Move retry envelopes from a separate retry/ directory into cache/ so
that sentry__cleanup_cache() enforces disk limits for both file formats
out of the box. The two formats are distinguishable by length: retry
files use <ts>-<count>-<uuid>.envelope (49+ chars) while cache files
use <uuid>.envelope (45 chars). Default http_retries to 0 (opt-in).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
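
A sketch of the length-based distinction (a 36-character UUID plus the 9-character ".envelope" suffix gives 45; the helper name is illustrative):

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative: classify a cache-directory entry by its file name.
 *   <uuid>.envelope          -> 36 + 9 = 45 chars (offline cache file)
 *   <ts>-<n>-<uuid>.envelope -> at least 49 chars (retry file)        */
static bool
is_retry_file(const char *filename)
{
    size_t len = strlen(filename);
    return len >= 49 && strcmp(filename + len - 9, ".envelope") == 0;
}
```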
When bgworker is detached during shutdown timeout, retry_poll_task can
access retry->run->cache_path after sentry_options_free frees the run.
Clone the path so it outlives options and is freed with the bgworker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The bgworker_flush in sentry__retry_flush would delay its flush_task by
min(delayed_task_time, timeout) when a 15-minute delayed retry_poll_task
existed. This consumed the entire shutdown timeout, leaving 0ms for
bgworker_shutdown, which then detached the worker thread. On Windows,
winhttp_client_shutdown would close handles still in use by the detached
thread, causing a crash.

The flush is unnecessary because retry_flush_task is an immediate task
and bgworker_shutdown already processes all immediate tasks before the
shutdown_task runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit removed bgworker_flush from retry_flush, which
caused a race between WinHTTP connect timeout (~2s) and bgworker
shutdown (2s). Restore the flush and pass the full timeout to both
flush and shutdown — after flush drains in-flight work, shutdown
completes near-instantly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make retry count an internal constant (SENTRY_RETRY_ATTEMPTS = 5) and
expose only a boolean toggle. Enabled by default.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0 means infinite, not default. Pass 30000ms to match WinHTTP defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a 'scheduled' flag with atomic compare-and-swap to ensure at most
one retry_poll_task is queued at a time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
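
A reduced sketch with standard C11 atomics (the PR likely uses sentry's own atomic helpers): the compare-and-swap ensures at most one poll task is queued at a time.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int g_poll_scheduled; /* illustrative module-level flag */

/* Returns true only for the caller that flips 0 -> 1; that caller then
 * submits the delayed retry_poll_task. */
static bool
try_schedule_poll(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&g_poll_scheduled, &expected, 1);
}

/* Called when the poll task finishes, allowing the next schedule. */
static void
poll_finished(void)
{
    atomic_store(&g_poll_scheduled, 0);
}
```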
Move `sealed = 1` before `foreach_matching` in `retry_dump_queue` to
prevent the detached worker from writing duplicate envelopes via
`retry_enqueue` while the main thread is dumping the queue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop the delayed retry_poll_task before bgworker_flush to prevent it
from delaying the flush_task by min(retry_interval, timeout). Subtract
elapsed flush time from the shutdown timeout so the total is bounded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the bgworker is detached after shutdown timeout, retry_dump_queue
writes retry files and sets sealed=1. The detached thread could then
run retry_flush_task and re-send those files, causing duplicates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The retry system writes cache files directly via its own paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch from d6aa792 to fbbffb2 on February 16, 2026 at 20:16