Skip to content

[No QA] Fix lost/duplicated requests in PersistedRequests (Issues 2, 3, 4, 5)#84622

Merged
mjasikowski merged 17 commits intoExpensify:mainfrom
callstack-internal:callstack-internal/szymonzalarski/investigate-and-fix-sequential-queue-issues-issue-4
Mar 18, 2026
Merged

[No QA] Fix lost/duplicated requests in PersistedRequests (Issues 2, 3, 4, 5)#84622
mjasikowski merged 17 commits intoExpensify:mainfrom
callstack-internal:callstack-internal/szymonzalarski/investigate-and-fix-sequential-queue-issues-issue-4

Conversation

@szymonzalarski98
Copy link
Contributor

@szymonzalarski98 szymonzalarski98 commented Mar 9, 2026

Explanation of Change

Sequential Queue Issues — Status Report

Fixed: Issues 2, 3, 4, 5

Issue 3: processNextRequest() and rollbackOngoingRequest() don't persist to disk

File: src/libs/actions/PersistedRequests.ts

Problem: When a request moves from the queue to ongoingRequest, neither the updated queue nor the ongoing request is written to disk. If the app crashes mid-flight, the ongoing request is lost (3a) and the queue on disk is stale (3c).

Fix: Both processNextRequest() and rollbackOngoingRequest() now use Onyx.multiSet() to atomically persist the updated queue and the ongoing request to disk.

Issue 4: Connect callback overwrites in-memory state with stale disk data

File: src/libs/actions/PersistedRequests.ts

Problem: The connectWithoutView callback blindly overwrites the in-memory persistedRequests array with whatever Onyx returns from disk. When multiple Onyx.set() calls resolve out of order, the last callback wins — potentially reverting the in-memory state to a stale snapshot and permanently losing requests.

Fix: After initialization, the callback is a no-op for non-null values: if (isInitialized && val != null) { return; }. In-memory state is authoritative. Onyx.clear() (which sends null) is still allowed through so the module resets properly for tests and app restarts.

Issue 5: Conflict resolver sees stale queue

Fixed automatically by Issue 4. Since in-memory state is no longer overwritten by stale disk callbacks, the conflict resolver in prepareRequest() and push() always sees the complete, up-to-date queue.

Issue 2: Nothing in the write pipeline is awaited

Files: src/libs/actions/PersistedRequests.ts, src/libs/Network/SequentialQueue.ts, src/libs/API/index.ts

Problem: API.write() returned Promise.resolve() immediately. PersistedRequests.save() was void (didn't return the Onyx.set() promise). The caller had zero guarantee that the request was persisted to disk.

Fix:

  • save() now returns Promise<void> (the Onyx.set() promise)
  • push() returns a persistencePromise from save() (but calls flush() synchronously to preserve write-before-read ordering)
  • processRequest() is now async and awaits pushToSequentialQueue()
  • await API.write() now guarantees the request is persisted to disk

Not Fixed: Issue 1 — Optimistic data is applied before the request is persisted

The Problem

In API.write(), prepareRequest() applies Onyx.update(optimisticData) BEFORE the request is persisted to disk. If the app crashes after the optimistic UI update but before Onyx.set(PERSISTED_REQUESTS) completes, the user sees stale optimistic UI with no backing request to reconcile it.

Why We Can't Fix It Now

The fix requires moving Onyx.update(optimisticData) to AFTER push() in write(). However, conflict resolvers have Onyx side effects inside checkAndFixConflictingRequest.

Specifically, resolveCommentDeletionConflicts in src/libs/actions/RequestConflictUtils.ts:151 calls Onyx.update(rollbackData) — which sets a report action to nullinside the conflict resolver callback.

Current flow (works):

Step What happens Onyx batch
1 prepareRequest() → conflict resolver → Onyx.update(rollback) — sets action to null batch A
2 prepareRequest()Onyx.update(optimisticData) — sets pendingAction: "delete" batch A
3 push() → conflict resolver again → Onyx.update(rollback) batch B

Steps 1+2 are in the same synchronous tick → Onyx batches them together → null wins → action is deleted. Tests pass.

Flow after moving optimistic data after push (breaks):

Step What happens Onyx batch
1 prepareRequest() → conflict resolver → Onyx.update(rollback) — sets action to null batch A
2 push()save()Onyx.set(PERSISTED_REQUESTS) batch B
3 push() → conflict resolver again → Onyx.update(rollback) batch B
4 write()Onyx.update(optimisticData) — sets pendingAction: "delete" batch C

Steps 1 and 4 are now in different Onyx batches (separated by push operations). The null from step 1 is consumed in batch A. The optimistic data in step 4 creates the action anew with {pendingAction: "delete"} instead of it staying deleted. 4 ReportTest tests fail.

Plan for Future Fix

The root cause is that resolveCommentDeletionConflicts (and potentially other conflict resolvers) call Onyx.update() as a side effect inside the callback. This couples Onyx state mutations to the conflict resolution logic, making it impossible to reorder the pipeline.

Required refactoring:

  1. Change the ConflictActionData type to include an optional rollbackData field
  2. Modify conflict resolvers to return rollback data instead of calling Onyx.update() directly
  3. Have write() apply rollback data and optimistic data together in a controlled order, after persist

This changes the conflict resolver interface and affects all resolvers — it should be done in a separate PR.

Fixed Issues

$ #80759
PROPOSAL:

Tests

  1. Run npx jest tests/unit/PersistedRequests.ts --no-coverage — all pass, including:
    • Issue 3a: verifies PERSISTED_ONGOING_REQUESTS is written to disk after processNextRequest() (via OnyxUtils.get())
    • Issue 3c: verifies PERSISTED_REQUESTS on disk matches in-memory after processNextRequest()
    • Issue 4: verifies rapid concurrent saves don't lose requests due to out-of-order Onyx.set() resolution
  2. Run npx jest tests/unit/APITest.ts --no-coverage — all pass, including:
    • Issue 5: verifies conflict resolver sees complete queue state after out-of-order writes
  3. Run npx jest tests/unit/MiddlewareTest.ts tests/unit/SequentialQueueTest.ts --no-coverage — all pass (no regressions)
  • Verify that no errors appear in the JS console

Offline tests

  1. Open the app and go to any chat
  2. Go offline
  3. Send a message — it appears optimistically, request is queued
  4. Go back online — the queue starts processing
  5. Force-close the app during processing
  6. Reopen — verify the message was sent exactly once (no duplicates, no lost messages)

QA Steps

PR Author Checklist

  • I linked the correct issue in the ### Fixed Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I added steps for the expected offline behavior in the Offline steps section
    • I added steps for Staging and/or Production testing in the QA steps section
    • I added steps to cover failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
    • I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android: Native
    • Android: mWeb Chrome
    • iOS: Native
    • iOS: mWeb Safari
    • MacOS: Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
      • If any non-english text was added/modified, I used JaimeGPT to get English > Spanish translation. I then posted it in #expensify-open-source and it was approved by an internal Expensify engineer. Link to Slack message:
    • I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
    • I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.ts or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If a new CSS style is added I verified that:
    • A similar style doesn't already exist
    • The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))
  • If new assets were added or existing ones were modified, I verified that:
    • The assets are optimized and compressed (for SVG files, run npm run compress-svg)
    • The assets load correctly across all supported platforms.
  • If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
  • If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
  • If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
    • I verified that all the inputs inside a form are aligned with each other.
    • I added Design label and/or tagged @Expensify/design so the design team can review the changes.
  • If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
  • I added unit tests for any new feature or bug fix in this PR to help automatically prevent regressions in this user flow.
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.

Screenshots/Videos

Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari

Log.info('[PersistedRequests] hit Onyx connect callback', false, {isValNullish: val == null});
Log.info('[PersistedRequests] hit Onyx connect callback', false, {isValNullish: val == null, isInitialized});

if (isInitialized) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is interesting, makes sense to me tbh

@szymonzalarski98 szymonzalarski98 changed the title Ensure in-memory state is authoritative after initialization in Persi… [No QA] Ensure in-memory state is authoritative after initialization in Persi… Mar 11, 2026
@szymonzalarski98 szymonzalarski98 force-pushed the callstack-internal/szymonzalarski/investigate-and-fix-sequential-queue-issues-issue-4 branch from 3a608e0 to 8cfba53 Compare March 11, 2026 08:15
@szymonzalarski98 szymonzalarski98 changed the title [No QA] Ensure in-memory state is authoritative after initialization in Persi… [No QA] Fix lost/duplicated requests in PersistedRequests (Issues 3, 4, 5) Mar 13, 2026
@szymonzalarski98
Copy link
Contributor Author

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b867738cf

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@szymonzalarski98 szymonzalarski98 marked this pull request as ready for review March 13, 2026 16:41
@szymonzalarski98 szymonzalarski98 requested review from a team as code owners March 13, 2026 16:41
@melvin-bot melvin-bot bot requested review from ChavdaSachin and removed request for a team March 13, 2026 16:41
@melvin-bot
Copy link

melvin-bot bot commented Mar 13, 2026

@ChavdaSachin Please copy/paste the Reviewer Checklist from here into a new comment on this PR and complete it. If you have the K2 extension, you can simply click: [this button]

@melvin-bot melvin-bot bot requested review from heyjennahay and removed request for a team March 13, 2026 16:41
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b867738cf

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +31 to +35
if (isInitialized) {
// After initialization, in-memory is authoritative.
// Ignore disk callbacks to prevent out-of-order Onyx.set() from
// overwriting the correct in-memory state (Bug #80759 Issue 4).
return;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Continue syncing queue updates after initialization

Returning early once isInitialized is true makes this connection ignore every later PERSISTED_REQUESTS update, so leader tabs stop seeing writes enqueued by other tabs. Fresh evidence in this commit is that request lifecycle paths now persist queue changes via Onyx.multiSet, while queue execution is still leader-only in src/libs/Network/SequentialQueue.ts (isClientTheLeader() gate), which means non-leader writes can remain unprocessed until reload or leader change because the leader’s in-memory persistedRequests is never updated.

Useful? React with 👍 / 👎.

@ChavdaSachin
Copy link
Contributor

I could see the previous PR was reviewed by @mountiny, so do I need to review this?

@mountiny
Copy link
Contributor

@ChavdaSachin yes please, can you also do the testing?

@mjasikowski mjasikowski merged commit 46fc401 into Expensify:main Mar 18, 2026
33 of 38 checks passed
@github-actions
Copy link
Contributor

🚧 @mjasikowski has triggered a test Expensify/App build. You can view the workflow run here.

@OSBotify
Copy link
Contributor

✋ This PR was not deployed to staging yet because QA is ongoing. It will be automatically deployed to staging after the next production release.

@OSBotify
Copy link
Contributor

🚀 Deployed to staging by https://github.com/mjasikowski in version: 9.3.40-0 🚀

platform result
🕸 web 🕸 success ✅
🤖 android 🤖 failure ❌
🍎 iOS 🍎 success ✅

@OSBotify
Copy link
Contributor

🚀 Deployed to staging by https://github.com/mjasikowski in version: 9.3.40-0 🚀

platform result
🕸 web 🕸 success ✅
🤖 android 🤖 success ✅
🍎 iOS 🍎 success ✅

adhorodyski added a commit to callstack-internal/Expensify-App that referenced this pull request Mar 18, 2026
The Promise.resolve().then(() => flush(false)) workaround was needed
because stale Onyx.set() callbacks could overwrite in-memory
persistedRequests during process() (Issues 3c/4 in Expensify#80759).

PR Expensify#84622 fixed this at the source: PersistedRequests now treats
in-memory state as authoritative after initialization, making the
deferral unnecessary. Call flush(false) directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jponikarchuk
Copy link

Deploy Blocker #85669 was identified to be related to this PR.

@jponikarchuk
Copy link

Deploy Blocker #85673 was identified to be related to this PR.

@lanitochka17
Copy link

Deploy Blocker #85671 was identified to be related to this PR

@szymonzalarski98
Copy link
Contributor Author

@jponikarchuk @lanitochka17 on it

@szymonzalarski98
Copy link
Contributor Author

These 3 deploy blockers are all about file upload/attachment failures, not about the sequential queue persistence pipeline.

The root cause is PR #70740 ("refactor: Improve and simplify attachment and file validation"), merged today, which modifies AttachmentUtils.ts, validateAttachmentFile.ts, FileUtils.ts, and attachment modal components.

Issue #85671 directly links PR #70740 in its "regression testing" field.

Our PR #84622 only touches PersistedRequests.ts, SequentialQueue.ts, and API/index.ts — none of which handle file validation, attachment rendering, or upload UI. Our changes affect when a request is persisted to disk and how Onyx callbacks are handled, not the file handling pipeline.

cc: @jponikarchuk @lanitochka17 @adhorodyski @mountiny

@szymonzalarski98 szymonzalarski98 changed the title [No QA] Fix lost/duplicated requests in PersistedRequests (Issues 3, 4, 5) [No QA] Fix lost/duplicated requests in PersistedRequests (Issues 2, 3, 4, 5) Mar 18, 2026
@mountiny
Copy link
Contributor

@szymonzalarski98 QA found these regressions using the binary search method. This is the first PR based on the chronological order of being merged that has this regression

@mountiny
Copy link
Contributor

so either there is some issue with the builds cc @LukasMod or order, or not sure what is wrong here 😄

@lanitochka17
Copy link

Deploy Blocker ##85708 was identified to be related to this PR.

@ChavdaSachin
Copy link
Contributor

I reverted the current PR locally and could confirm #85708 is caused by the current PR, haven't tested the other blockers but they are likely to be related to current PR as well based on the similarity(file upload related issues).

My copilot came up with the changes which fixed the issue #85708, but I haven't reviewed the code (gotta go offline now 💤).

@szymonzalarski98
Copy link
Contributor Author

@ChavdaSachin thank you for letting me know, I'm working on a fix for those issues

@jponikarchuk
Copy link

Deploy Blocker #85755 was identified to be related to this PR.

@szymonzalarski98
Copy link
Contributor Author

Hey, here is a PR with fix #85920

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants