Skip to content

Fix state not showing up fast enough with TestDelayedEvents (de-flake v2)#869

Open
MadLittleMods wants to merge 7 commits intomainfrom
madlittlemods/more-robust-delayed-event-state-change2
Open

Fix state not showing up fast enough with TestDelayedEvents (de-flake v2)#869
MadLittleMods wants to merge 7 commits intomainfrom
madlittlemods/more-robust-delayed-event-state-change2

Conversation

@MadLittleMods
Copy link
Copy Markdown
Collaborator

@MadLittleMods MadLittleMods commented Apr 30, 2026

Fix state not showing up fast enough with TestDelayedEvents (de-flake tests)

Re-introducing the changes from #865 which were reverted in #867

Follow-up to #830

As experienced when running this test against the worker-based Synapse setup we use alongside the Synapse Pro Rust apps, https://github.com/element-hq/synapse-rust-apps/actions/runs/24910122124/job/72949760158?pr=360 (https://github.com/element-hq/synapse-rust-apps/pull/360)

❌ TestDelayedEvents/delayed_state_events_are_kept_on_server_restart (10.12s)
      delayed_event_test.go:425: StopServer hs1
      delayed_event_test.go:429: StartServer hs1
      delayed_event_test.go:443: CSAPI.MustDo GET http://127.0.0.1:32978/_matrix/client/v3/rooms/%21MbDncghrqxTzEmQhCP:hs1/state/com.example.test/1 returned non-2xx code: 404 Not Found - body: {"errcode":"M_NOT_FOUND","error":"Event not found."}

Also updating the rest of the TestDelayedEvents tests to better account for state changes that might not show up immediately because processing/worker/replication delay

Why does this flake happen?

Discussed in #865 (comment)

Dev notes

MSC4140 Synapse implementation added in element-hq/synapse#17326

Testing with Synapse:

COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestDelayedEvents

Pull Request Checklist

Follow-up to #830

As experienced in https://github.com/element-hq/synapse-rust-apps/actions/runs/24910122124/job/72949760158?pr=360 (element-hq/synapse-rust-apps#360)

```
❌ TestDelayedEvents/delayed_state_events_are_kept_on_server_restart (10.12s)
      delayed_event_test.go:425: StopServer hs1
      delayed_event_test.go:429: StartServer hs1
      delayed_event_test.go:443: CSAPI.MustDo GET http://127.0.0.1:32978/_matrix/client/v3/rooms/%21MbDncghrqxTzEmQhCP:hs1/state/com.example.test/1 returned non-2xx code: 404 Not Found - body: {"errcode":"M_NOT_FOUND","error":"Event not found."}
```

This is most likely caused because the main process handles processing delayed events
(and serving `/state`) but the state will be persisted on one of event_persister workers so the main process
might be serving stale data until the invalidation comes through.
Traditional `/sync` expects you to use `state` as the base
and layer state from the `timeline` on top (flawed because of state res).
So we have to check in the `timeline` instead of `state` for the update.

We can also consider this good here because we don't expect any state
to be rejected because of state res issues.
Use a sync filter so `timeline` doesn't mess with us

Don't use incremental sync because no need
Comment thread client/sync.go
Comment on lines +218 to 223
// Check that the `state` section for `roomID` has an event which passes the check function.
//
// Note that the `state` section of a sync response only contains the change in state up
// to the start of the `timeline` and will not contain the entire state of the room for
// incremental or `lazy_load_members` syncs.
func SyncStateHas(roomID string, check func(gjson.Result) bool) SyncCheckOpt {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the reason the previous PR didn't work, #865

Just re-formatted this a bit 🤷 (can revert if not useful)


// And includes the correct content
//
// FIXME: This assertion seems superfluous to this test and should be it's own test
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some FIXME for the messiness I'd like to see cleaned up but not doing in this PR so the purpose is more focused and understandable.


stateKey := "to_send_on_timeout"

// Schedule a delayed event
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments to understand these tests better

Comment on lines +208 to +209
// No more delayed events
matchDelayedEvents(t, user, delayedEventsNumberEqual(0))
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this check until after we've actually waited for the state to show up and we know the delayed event was processed.

Comment on lines +196 to +200
// Check for the state change from the delayed state event (using `MustSyncUntil` to
// account for any processing or worker replication delays)
user.MustSyncUntil(t, client.SyncReq{Filter: NoTimelineSyncFilter}, client.SyncStateHas(roomID, func(ev gjson.Result) bool {
return ev.Get("type").Str == eventType && ev.Get("state_key").Str == stateKey
}))
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main fix to all of these tests is using MustSyncUntil to wait and account for any processing or worker/replication delays.

Comment thread client/sync.go
@@ -215,10 +215,11 @@ func SyncTimelineHasEventID(roomID string, eventID string) SyncCheckOpt {
})
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reivilibre assigning you since you have the context from #865

Same sort of fix but applied to all of tests here

@MadLittleMods MadLittleMods marked this pull request as ready for review May 5, 2026 13:54
@MadLittleMods MadLittleMods requested review from a team as code owners May 5, 2026 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant