feat: add retry backoff for connection errors in retryOperation#775
feat: add retry backoff for connection errors in retryOperation#775leshniak wants to merge 8 commits intoExpensify:mainfrom
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Errors like lost IDB connections, closing databases, and backing store failures now wait with exponential backoff (100ms * 2^attempt +/- 25% jitter) before retrying, giving the DB connection time to recover. Capacity errors (QuotaExceeded, disk full) keep immediate retry with eviction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Existing retry tests now use fake timers to handle backoff delays. New tests verify: exponential delay progression, connection error logging, and capacity errors remaining immediate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Recovery and exhaustion logs needed to measure whether backoff actually helps connection errors recover. Without these, we can only see retry attempts but not outcomes — making the experiment unmeasurable. - Log recovery success with attempt number when connection error resolves after backoff - Classify exhaustion logs: connection errors get distinct message from generic failures - Extract isConnectionError earlier so both exhaustion and retry paths can use it - Add 2 new tests for recovery and exhaustion logging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wrap onyxMethod return with Promise.resolve() to handle undefined returns (fixes perf-test TypeError: Cannot read properties of undefined) - Use early-return guard for non-connection errors (fixes prefer-early-return lint) - Rebuild API-INTERNAL.md (fixes verify check) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The directive was on the wrong line after restructuring — TS saw it as unused since the actual type error is inside the .then() callback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc0b381ddb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| Promise.resolve(onyxMethod(defaultParams, nextRetryAttempt)).then(() => { | ||
| if (!isConnectionError) { | ||
| return; | ||
| } | ||
| Logger.logInfo(`Connection error recovered after backoff on attempt ${nextRetryAttempt}/${MAX_STORAGE_OPERATION_RETRY_ATTEMPTS}. onyxMethod: ${onyxMethod.name}.`); |
There was a problem hiding this comment.
Avoid logging recovery when connection retries still exhaust
The new recovery log is emitted whenever onyxMethod(defaultParams, nextRetryAttempt) resolves, but retryOperation() itself resolves even after exhausting all retries, so a permanently failing connection error will still produce Connection error recovered... before/alongside the exhaustion alert. In practice this creates false-positive recovery telemetry for the experiment and can mislead rollout decisions, because exhausted retry chains are counted as successful recoveries.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This sounds valid in the case where we fail multiple times rather than just once?
Reviewer Checklist
Screenshots/VideosAndroid: HybridAppAndroid: mWeb ChromeiOS: HybridAppiOS: mWeb SafariMacOS: Chrome / Safari |
Julesssss
left a comment
There was a problem hiding this comment.
Mostly looking good. One bot comment to resolve.
- Could you re-add the author checklist, i think it's missing some fields
- Also not sure what is going on here:
Duration deviation of 200.86 ms (1535365.98%
Details
Add exponential backoff with jitter to
OnyxUtils.retryOperationfor non-capacity storage errors, framed as an instrumented experiment to determine the right mitigation strategy.Context: Connection-class errors — Chromium backing store failures (26.3%), WebKit connection drops (19.0%), and closing-database errors (6.4%) — account for 51.7% of all storage failures (investigation). Analysis of 7-day production logs via VictoriaLogs shows:
We have no data on whether introducing a delay (100ms-1600ms) would allow recovery. This PR adds backoff as a low-risk experiment to collect that data.
Changes:
lib/OnyxUtils.ts: AddedCONNECTION_ERRORSconstants (IDB + SQLite), backoff config (RETRY_BASE_DELAY_MS=100,RETRY_JITTER_FACTOR=0.25),wait()/getRetryDelay()helpers, wired backoff into non-capacity error branch ofretryOperation100ms → 200ms → 400ms → 800ms → 1600ms(±25% jitter, ~3.1s total max)Connection error detected, retrying with backoff)Connection error recovered after backoff on attempt N/5)Connection error exhausted all retries with backoff)Measurement plan (discussion):
Next steps based on production data:
Related Issues
Expensify/App#87782
Automated Tests
Updated 2 existing retry tests to use fake timers (backoff delays require timer advancement). Added 5 new tests:
should apply exponential backoff delay for non-capacity errors— verifies delay count and exponential growth patternshould log connection error with backoff delay info— verifies connection-specific log messageshould log recovery when connection error succeeds after backoff— verifies recovery log fires on successful retryshould log connection-specific exhaustion message when all retries fail— verifies distinct exhaustion log for connection errorsshould NOT apply backoff delay for capacity errors (immediate retry with eviction)— verifies capacity errors remain immediateAll 439 tests pass.
Manual Tests
npm run typecheckpassesnpm run lintpassesnpm testpasses (439/439)Author Checklist
### Related Issuessection aboveTestssectiontoggleReportand notonIconClick)myBool && <MyComponent />.STYLE.md) were followedAvatar, I verified the components usingAvatarare working as expected)mainbranch was merged into this PR after a review, I tested again and verified the outcome was still expected according to theTeststeps.Screenshots/Videos
Android: Native
N/A — library-level change, no UI
Android: mWeb Chrome
N/A — library-level change, no UI
iOS: Native
N/A — library-level change, no UI
iOS: mWeb Safari
N/A — library-level change, no UI
MacOS: Chrome / Safari
N/A — library-level change, no UI