
feat(unreal): Dedicated worker thread, bounded queues, zero-copy diff pipeline, native table listeners#4951

Open
brougkr wants to merge 2 commits into clockworklabs:master from brougkr:feat/unreal-sdk-perf-reliability-overhaul

Conversation

Contributor

@brougkr brougkr commented May 5, 2026

Summary

While developing an MMO-scale Unreal client with 1000+ players and 3000+ AI mobs in one area, I ran into inbound SDK bottlenecks that had to be removed to keep a 120 FPS frame budget. This PR replaces the Unreal SDK's inbound message pipeline with a dedicated worker thread, bounded queues with backpressure, a zero-copy diff apply path, and optional native table listeners.

No breaking API changes are intended. Existing OnInsert, OnDelete, and OnUpdate dynamic multicast delegates continue to work unchanged for tables that do not opt into native listeners.

This PR changes 11 files. The original local implementation plan listed 6 copied files, but audit found compile-time support dependencies in Websocket.*, UEBSATNHelpers.h, UESpacetimeDB.h, and WithBsatn.h. Those support files are included here so the branch is source-complete rather than preserving a stale file count.

Changes

1. Dedicated Inbound Worker Thread

Current: Each WebSocket message dispatches via Async(EAsyncExecution::Thread), a fire-and-forget task on UE's global thread pool. The lambda captures the full TArray<uint8> payload by value. A TMap<int32, FServerMessageType> reorder buffer sequences results back in order.

Rationale: At high message rates, thread pool slots can saturate and compete with rendering, physics, audio, and other engine tasks. The reorder buffer can grow unbounded under head-of-line blocking. TWeakObjectPtr lifetime checks verify object liveness, not connection identity, so stale async tasks from a destroyed connection can still arrive against a later session.

Improvement: Adds a single connection-owned FRunnable thread (FSpacetimeDbInboundWorker) with FEvent signaling. Complete binary messages are moved into a bounded raw queue. The worker drains FIFO, so no reorder buffer is needed. StopAndJoin() deterministically joins in-flight work during disconnect/error/teardown. Idle cost is zero-spin because the worker blocks on FEvent::Wait.
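The worker pattern described above can be sketched with standard C++ analogues — std::thread and std::condition_variable standing in for Unreal's FRunnable and FEvent. This is an illustrative model of the design, not the SDK's implementation; the class and method names (`InboundWorker`, `StopAndJoin`, `Enqueue`) mirror the PR text but are assumptions.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Sketch of a connection-owned worker: bounded FIFO raw queue, zero-spin idle,
// deterministic join on teardown. Names mirror the PR but are illustrative.
class InboundWorker {
public:
    void Start() {
        bStopRequested = false;
        Thread = std::thread([this] { Run(); });
    }

    // Deterministic teardown: request stop, wake the worker, join in-flight work.
    void StopAndJoin() {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bStopRequested = true;
        }
        Wake.notify_one();
        if (Thread.joinable()) Thread.join();
    }

    // Called from the WebSocket thread: payload is moved in, never copied.
    bool Enqueue(std::vector<uint8_t>&& Payload) {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            if (RawQueue.size() >= MaxMessages) return false; // backpressure
            RawQueue.push_back(std::move(Payload));
        }
        Wake.notify_one();
        return true;
    }

    size_t ProcessedCount() const { return Processed; }

private:
    void Run() {
        std::unique_lock<std::mutex> Lock(Mutex);
        for (;;) {
            // Zero-spin idle: block until a message arrives or stop is requested.
            Wake.wait(Lock, [this] { return bStopRequested || !RawQueue.empty(); });
            if (bStopRequested && RawQueue.empty()) return;
            std::vector<uint8_t> Msg = std::move(RawQueue.front());
            RawQueue.pop_front(); // FIFO drain: arrival order preserved, no reorder buffer
            Lock.unlock();
            Preprocess(Msg);
            Lock.lock();
        }
    }

    void Preprocess(const std::vector<uint8_t>&) { ++Processed; }

    static constexpr size_t MaxMessages = 8192;
    std::deque<std::vector<uint8_t>> RawQueue;
    std::mutex Mutex;
    std::condition_variable Wake;
    std::thread Thread;
    bool bStopRequested = false;
    std::atomic<size_t> Processed{0};
};
```

Because the worker drains under the same lock that guards the stop flag, StopAndJoin cannot miss a wakeup, and any messages already enqueued are processed before the thread exits.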

2. Bounded Queues with Backpressure

Current: PendingMessages is an unbounded TArray. If the server produces messages faster than the game thread consumes them, the array grows without limit until memory pressure or OOM. There is no diagnostic at the failure boundary.

Rationale: Sustained server spikes or client hitches should fail with actionable protocol diagnostics, not silent memory growth.

Improvement: Adds hard limits at both queue stages: raw queue 8192 messages / 128 MB, parsed queue 8192 messages / 128 MB. Overflow queues a fatal protocol error with sequence ID, payload size, compression tag, queue depth, and queued bytes. Worker-side backpressure checks parsed-queue capacity before pulling additional raw messages.
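The dual-limit overflow check (message count and total bytes) can be modeled as a pure function that returns a diagnostic record instead of letting the queue grow. The struct fields follow the diagnostic list in the PR description, but the types and function are illustrative, not the SDK's API.

```cpp
#include <cstddef>
#include <cstdint>

// Hedged sketch of the dual-limit overflow check. Limits match the PR text
// (8192 messages / 128 MB per stage); the struct shapes are assumptions.
struct QueueLimits {
    size_t MaxMessages = 8192;
    size_t MaxBytes = 128ull * 1024 * 1024; // 128 MB
};

struct OverflowDiag {
    uint64_t SequenceId = 0;
    size_t PayloadSize = 0;
    size_t QueueDepth = 0;
    size_t QueuedBytes = 0;
    bool bOverflowed = false;
};

// Returns a diagnostic instead of growing without bound; the caller turns
// bOverflowed into a fatal protocol error with the captured context.
OverflowDiag CheckEnqueue(const QueueLimits& Limits, uint64_t SequenceId,
                          size_t PayloadSize, size_t QueueDepth, size_t QueuedBytes) {
    OverflowDiag Diag{SequenceId, PayloadSize, QueueDepth, QueuedBytes, false};
    if (QueueDepth + 1 > Limits.MaxMessages ||
        QueuedBytes + PayloadSize > Limits.MaxBytes) {
        Diag.bOverflowed = true;
    }
    return Diag;
}
```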

3. Configurable Per-Frame Apply Budget

Current: FrameTick() moves all pending messages to a local array and processes all of them. If a hitch accumulates hundreds of messages, the next frame processes the entire backlog and can cascade the stall.

Rationale: Games need explicit control over how much network work is applied per frame.

Improvement: Adds FSpacetimeDBInboundApplyBudget with MaxMessagesPerFrame, MaxPayloadBytesPerFrame, MinMessagesPerFrame, SoftTimeBudgetMicros, and bDrainAllPendingMessages. An incremental PendingMessageReadIndex avoids moving the entire pending array each frame. Compaction after 512 consumed messages prevents unbounded consumed-prefix growth. MakeDrainAllPendingMessages() provides a named drain-all configuration.
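The budgeted drain with an incremental read index can be sketched as follows. Field names echo FSpacetimeDBInboundApplyBudget and the 512-message compaction threshold from the PR, but the default budget values and the `Pending` container are illustrative stand-ins.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative analogue of the per-frame apply budget and read-index drain;
// threshold (512) follows the PR text, default budget values are assumptions.
struct ApplyBudget {
    int32_t MaxMessagesPerFrame = 64;
    size_t MaxPayloadBytesPerFrame = 1 << 20;
    int32_t MinMessagesPerFrame = 1; // always make some progress per frame
};

struct Pending {
    std::vector<size_t> PayloadSizes; // stand-in for parsed messages
    size_t ReadIndex = 0;             // consumed prefix, avoids front-removal

    // Apply up to the budget; compact the consumed prefix past a threshold.
    int DrainFrame(const ApplyBudget& B) {
        int Applied = 0;
        size_t Bytes = 0;
        while (ReadIndex < PayloadSizes.size()) {
            const bool bOverBudget =
                Applied >= B.MaxMessagesPerFrame || Bytes >= B.MaxPayloadBytesPerFrame;
            if (bOverBudget && Applied >= B.MinMessagesPerFrame) break;
            Bytes += PayloadSizes[ReadIndex++];
            ++Applied;
        }
        if (ReadIndex >= 512) { // periodic compaction prevents unbounded prefix growth
            PayloadSizes.erase(PayloadSizes.begin(), PayloadSizes.begin() + ReadIndex);
            ReadIndex = 0;
        }
        return Applied;
    }
};
```

A 100-message backlog with a 64-message budget drains over two frames instead of stalling one frame with the whole batch.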

4. Connection Epoch Guards

Current: There is no session identity on queued or async inbound work. After disconnect/reconnect, stale work from the prior connection can potentially apply to the new connection's cache.

Rationale: A reconnect without epoch guards can corrupt client cache state with rows from a previous session.

Improvement: Adds InboundConnectionEpoch and increments it on every worker start/stop. Raw and parsed messages carry the epoch. IsInboundEpochCurrentAndAccepting() rejects stale-epoch work at queue and worker boundaries.
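The epoch-guard mechanism reduces to a monotonically increasing session counter stamped onto each queued message. A minimal sketch, with names loosely mirroring InboundConnectionEpoch (the class itself is illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Minimal sketch of the epoch guard: every worker start/stop bumps the epoch,
// so work stamped with an older epoch is rejected at queue boundaries.
class Connection {
public:
    uint64_t StartWorker() { return ++Epoch; } // new session identity
    void StopWorker() { ++Epoch; }             // invalidate all queued work

    // Stale work from a previous session fails this check and is dropped.
    bool IsInboundEpochCurrent(uint64_t MessageEpoch) const {
        return MessageEpoch == Epoch.load();
    }

private:
    std::atomic<uint64_t> Epoch{0};
};
```

Unlike a TWeakObjectPtr liveness check, the epoch compares connection identity, so a message queued before a reconnect can never apply to the new session's cache even though the owning object is still alive.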

5. Full Lifecycle Cleanup

Current: Disconnect() only disconnects the WebSocket. Error/close paths clear pending operations but not all inbound parsed/raw/preprocessed state.

Rationale: Queued and preprocessed data must not survive connection boundaries.

Improvement: Disconnect, HandleWSError, HandleWSClosed, HandleProtocolViolation, and BeginDestroy stop/join the inbound worker and clear raw queues, parsed queues, indices, byte counters, protocol-error state, and active message-scoped preprocessing.

6. Message-Scoped Preprocessed Data

Current: Preprocessed table data is stored in a persistent TMap<FPreprocessedTableKey, ...> protected by PreprocessedDataMutex. Missing data falls back to re-parsing raw BSATN on the game thread.

Rationale: The fallback double-deserializes bytes that were already parsed on the worker, can hide preprocessing bugs, and keeps a persistent map that can leak state across messages.

Improvement: ActivePreprocessedTableData points only to the current FInboundParsedMessage during apply. Missing preprocessing is a checkf, not a silent fallback. This removes the persistent preprocessing map and per-table mutex acquisition during game-thread apply.

7. Zero-Copy Diff Data Structures

Current:

TMap<TArray<uint8>, RowType> Deletes;
TMap<TArray<uint8>, RowType> Inserts;

Every diff entry copies BSATN key bytes and copies the row by value. TMap iteration is hash-table order.

Rationale: For high-cardinality row diffs, key copies, row copies, and hash-map iteration add avoidable memory traffic and cache misses.

Improvement:

TArray<TSharedPtr<RowType>> Deletes;
TArray<TSharedPtr<RowType>> Inserts;

Diffs share row pointers from the table cache. This eliminates key copies from the broadcast diff and makes diff iteration contiguous. DeriveUpdatesByPrimaryKey now uses index-based matching with bit arrays and a single compaction pass instead of per-key TMap::Remove() rehashing.
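The index-based matching with bit arrays and a single compaction pass can be sketched like this, using std::shared_ptr and std::vector in place of TSharedPtr and TArray. Row and diff types are simplified stand-ins for the SDK's templates.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Sketch of index-based update derivation: matched delete/insert pairs become
// updates; survivors are compacted in one pass instead of per-key map removal.
struct Row { uint64_t Key; int Value; };
using RowPtr = std::shared_ptr<Row>;

struct Diff {
    std::vector<RowPtr> Deletes, Inserts;
    std::vector<RowPtr> UpdateDeletes, UpdateInserts; // paired by index
};

void DeriveUpdatesByPrimaryKey(Diff& D) {
    if (D.Deletes.empty() || D.Inserts.empty()) return; // updates impossible
    std::unordered_map<uint64_t, size_t> DeleteIndexByKey;
    for (size_t i = 0; i < D.Deletes.size(); ++i)
        DeleteIndexByKey.emplace(D.Deletes[i]->Key, i);

    std::vector<bool> DeleteMatched(D.Deletes.size(), false);
    std::vector<bool> InsertMatched(D.Inserts.size(), false);
    for (size_t j = 0; j < D.Inserts.size(); ++j) {
        auto It = DeleteIndexByKey.find(D.Inserts[j]->Key);
        if (It == DeleteIndexByKey.end()) continue;
        DeleteMatched[It->second] = true;  // bit arrays mark matched entries
        InsertMatched[j] = true;
        D.UpdateDeletes.push_back(D.Deletes[It->second]);
        D.UpdateInserts.push_back(D.Inserts[j]);
    }

    // Single compaction pass: keep only unmatched rows, preserving order.
    auto Compact = [](std::vector<RowPtr>& Rows, const std::vector<bool>& Matched) {
        size_t Out = 0;
        for (size_t i = 0; i < Rows.size(); ++i)
            if (!Matched[i]) Rows[Out++] = std::move(Rows[i]);
        Rows.resize(Out);
    };
    Compact(D.Deletes, DeleteMatched);
    Compact(D.Inserts, InsertMatched);
}
```

Only shared pointers move between containers here; no row bytes or BSATN keys are copied during derivation.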

8. Zero-Copy Cache Apply - Generic Path

Current: ApplyDiff takes const copied containers. Insert rows are copied into MakeShared<RowType>(Row). Index updates create new shared pointers for delete/index work rather than reusing the cached row pointer.

Rationale: At thousands of rows per diff, redundant MakeShared, row copies, key copies, and index pointer churn become a hot-path cost.

Improvement: ApplyDiff now takes TArray<FWithBsatn<RowType>>&& and moves insert rows into cache storage. Index update lambdas reuse the existing TSharedPtr from cache. Diff containers are pre-reserved. Refcount and row validity are enforced with checkf at cache boundaries.

9. Compact Primary Key Cache Apply

Current: Tables with simple integer primary keys still pay full BSATN key storage and hashing costs.

Rationale: Rows keyed by generated uint64 fields can use an 8-byte native key instead of serializing and hashing the full BSATN row bytes.

Improvement: Adds TCompactPrimaryKeyTraits<RowType> with SFINAE detection for generated FrameKey and BatchKey uint64 fields. Eligible rows use compact 8-byte cache keys, inline update classification, and bPrimaryKeyUpdatesClassified = true. Ineligible rows use the generic path unchanged. The trait is intentionally extensible if maintainers prefer user-defined specializations later.
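The SFINAE detection can be illustrated with a trait that specializes only for rows exposing a uint64 FrameKey field; the real trait in the PR also detects BatchKey and drives compact cache-key selection. Everything below is a sketch, not the SDK's trait.

```cpp
#include <cstdint>
#include <type_traits>

// Sketch of an SFINAE trait detecting a uint64 FrameKey field. Eligible rows
// get a compact 8-byte key; ineligible rows fall back to the primary template.
template <typename T, typename = void>
struct TCompactPrimaryKeyTraits {
    static constexpr bool bEligible = false;
};

template <typename T>
struct TCompactPrimaryKeyTraits<
    T, std::enable_if_t<std::is_same_v<decltype(T::FrameKey), uint64_t>>> {
    static constexpr bool bEligible = true;
    static uint64_t GetKey(const T& Row) { return Row.FrameKey; }
};

struct FrameRow { uint64_t FrameKey; int Payload; }; // eligible
struct PlainRow { int Payload; };                    // generic path

// Compile-time routing, in the spirit of the PR's if constexpr dispatch.
template <typename T>
constexpr bool UsesCompactKey() { return TCompactPrimaryKeyTraits<T>::bEligible; }
```

Because eligibility is a compile-time constant, the generic and compact apply paths can be selected with `if constexpr` and the ineligible path pays no runtime cost for the check.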

10. Move-Semantic Forwarding at RemoteTable -> ClientCache

Current: RemoteTable::BaseUpdate copies FWithBsatn arrays into new containers before forwarding to ClientCache::ApplyDiff.

Rationale: The worker has already materialized row+BSATN arrays. Rebuilding them at the table/cache boundary is unnecessary.

Improvement: RemoteTable::BaseUpdate now takes mutable arrays, forwards them with MoveTemp, and uses if constexpr to route compact-primary-key rows to ApplyDiffByPrimaryKey.

11. Multi-Diff Table Update Handler

Current: Each table handler stores a single FTableAppliedDiff<RowType> LastDiff. If one server message contains multiple updates for the same table, only the last diff survives for broadcast.

Rationale: Multi-update-per-table messages must preserve ordered broadcasts.

Improvement: Adds TArray<FTableAppliedDiff<RowType>> PendingDiffs with PendingDiffReadIndex. Multiple diffs are queued and broadcast in order. UpdateCache returns bool so empty diffs can skip broadcast work.

12. Native Table Listeners

Current: All table broadcasts go through UE dynamic multicast delegates (OnInsert, OnDelete, OnUpdate), which use reflection dispatch.

Rationale: Reflection dispatch is expensive on high-cardinality table diffs. At 3000 entities with insert/update/delete-style callbacks, dynamic dispatch can consume a meaningful part of an 8.33 ms frame budget.

Improvement: Adds FNativeTableListenerBinding and templated registration helpers for direct member-function dispatch. Existing dynamic multicast delegates remain supported. If native listeners are registered for a table, the table can bypass ProcessEvent for that hot path.
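The native-listener idea — binding a member function directly instead of routing through reflection dispatch — can be sketched as follows. The table, row, and registration names here are illustrative stand-ins, not the SDK's FNativeTableListenerBinding API.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of reflection-free listener dispatch: a templated registration helper
// captures the owner and member function, and broadcast is a direct call.
struct PlayerRow { uint64_t Id; };

class PlayerTable {
public:
    template <typename ObjectType>
    void RegisterNativeOnInsert(ObjectType* Owner,
                                void (ObjectType::*Method)(const PlayerRow&)) {
        NativeOnInsert.push_back(
            [Owner, Method](const PlayerRow& Row) { (Owner->*Method)(Row); });
    }

    void BroadcastInsert(const PlayerRow& Row) {
        for (auto& Listener : NativeOnInsert)
            Listener(Row); // direct call: no ProcessEvent, no reflection lookup
    }

private:
    std::vector<std::function<void(const PlayerRow&)>> NativeOnInsert;
};

struct Hud {
    int InsertCount = 0;
    void OnPlayerInsert(const PlayerRow&) { ++InsertCount; }
};
```

In the real SDK path the binding would also carry lifetime safety (weak owner checks), which this sketch omits for brevity.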

13. Broadcast Validation

Current: Diff rows are broadcast without structural validity checks. Mismatched update arrays are silently truncated with FMath::Min.

Rationale: Silent truncation can hide data corruption. A malformed diff should fail at the source.

Improvement: Adds checkf(Row.IsValid()) for every insert/delete/update row and checkf(Diff.UpdateDeletes.Num() == Diff.UpdateInserts.Num()) before update broadcast.

14. Pre-Classified Update Diff Guard

Current: DeriveUpdatesByPrimaryKey always runs the primary-key matching scan.

Rationale: Compact-primary-key apply already classifies updates inline. Running the generic derivation again duplicates work.

Improvement: ApplyDiffByPrimaryKey sets bPrimaryKeyUpdatesClassified = true. DeriveUpdatesByPrimaryKey early-returns after checking update array pairing.

15. DeriveUpdatesByPrimaryKey Early-Out Fix

Current: The function only early-outs when deletes are empty.

Improvement: It now returns when deletes or inserts are empty. If there are inserts but no deletes, updates are impossible, so the map build and scan are skipped.

16. Lock-Free Registered Table Snapshot Reads

Current: Applying registered table updates locks RegisteredTablesMutex per table lookup and allocates a local handler array per call.

Improvement: Registration rebuilds RegisteredTablesSnapshot, and apply reads that snapshot for lookup. TableUpdateHandlersScratch is reused to avoid per-frame heap churn.

17. Ticking Mechanism Change

Current: Auto ticking uses FTSTicker, outside the standard tickable-object visibility path.

Change: UDbConnectionBase now implements FTickableGameObject with Tick, GetStatId, IsTickable, and IsTickableInEditor.

Note: This is a trade-off. FTSTicker avoids tickable object count inflation with multiple connections. FTickableGameObject gives standard Unreal tick/profiler visibility. The default can be adjusted based on maintainer preference.

18. CPU Profiler Instrumentation

Current: The inbound SDK path has little profiler visibility in Unreal Insights.

Improvement: Adds TRACE_CPUPROFILER_EVENT_SCOPE markers for inbound enqueue, worker drain/preprocess, game-thread apply, table cache apply, and table broadcast. Per-message/table timing is used for soft-budget diagnostics.

19. Named Budget Factory

Improvement: Adds FSpacetimeDBInboundApplyBudget::MakeDrainAllPendingMessages() so callers can opt into old drain-all behavior explicitly.

20. Spurious Insert Diff on Refcount Bumps

Current on recent upstream master: ClientCache::ApplyDiff was already fixed to avoid emitting insert diffs for subscription refcount bumps.

This PR: Preserves that behavior in the rewritten generic and compact-primary-key paths. Existing entries with positive refcount increment the refcount and skip Diff.Inserts.
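The preserved refcount rule can be modeled in a few lines: an insert for a row already cached with a positive refcount bumps the count and emits no insert diff. Cache and entry types are simplified stand-ins for the SDK's client cache.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Sketch of the refcount-bump rule: a subscription overlap increments the
// refcount on the existing entry and skips the broadcast insert diff.
struct Row { uint64_t Key; };
struct CacheEntry { std::shared_ptr<Row> RowPtr; int RefCount = 0; };

struct Cache {
    std::unordered_map<uint64_t, CacheEntry> Entries;
    std::vector<std::shared_ptr<Row>> InsertDiff;

    void ApplyInsert(std::shared_ptr<Row> NewRow) {
        auto& Entry = Entries[NewRow->Key];
        if (Entry.RefCount > 0) {
            ++Entry.RefCount; // refcount bump only: no spurious insert diff
            return;
        }
        Entry.RowPtr = std::move(NewRow); // first reference: row moves into cache
        Entry.RefCount = 1;
        InsertDiff.push_back(Entry.RowPtr); // diff shares the cached pointer
    }
};
```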

Support Files Included by Audit

The original implementation plan listed only the largest six files. During implementation audit, those files did not compile against current upstream without these source dependencies:

  • Connection/Websocket.h and Connection/Websocket.cpp: native binary-message target and move dispatch into UDbConnectionBase::HandleWSBinaryMessageOwned.
  • BSATN/UEBSATNHelpers.h: message-scoped row preprocessing, query-row apply mode, row counts/byte counts, and zero-copy row-list parsing helpers.
  • BSATN/UESpacetimeDB.h: DeserializeView for pointer/view based deserialization without creating temporary byte arrays.
  • DBCache/WithBsatn.h: move constructor for row+BSATN ownership transfer.

Performance Estimate

Conservative combined hot-path savings at full load (1000 players + 3000 mobs):

| Area | Estimated savings |
| --- | --- |
| Native listener dispatch vs dynamic multicast | ~2.6 ms/frame |
| Compact PK hashing vs CRC32 over BSATN | ~0.6 ms/frame |
| Eliminated copies and allocations | ~0.5-1.0 ms/frame |
| Total | ~3-5 ms/frame |

At 120 FPS (8.33 ms/frame), this represents roughly 36-60% of the frame budget in the intended high-load case. These are estimates from the motivating workload; measured payload sizes and profiler captures should supersede them where available.

Files Changed

| File | Additions | Deletions |
| --- | ---: | ---: |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Private/Connection/DbConnectionBase.cpp | 1077 | 338 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Private/Connection/DbConnectionBuilder.cpp | 8 | 7 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Private/Connection/Websocket.cpp | 71 | 47 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Public/BSATN/UEBSATNHelpers.h | 99 | 29 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Public/BSATN/UESpacetimeDB.h | 53 | 7 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Public/Connection/DbConnectionBase.h | 703 | 320 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Public/Connection/Websocket.h | 27 | 23 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Public/DBCache/ClientCache.h | 461 | 61 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Public/DBCache/TableAppliedDiff.h | 67 | 28 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Public/DBCache/WithBsatn.h | 8 | 5 |
| sdks/unreal/src/SpacetimeDbSdk/Source/SpacetimeDbSdk/Public/Tables/RemoteTable.h | 13 | 13 |

Validation

  • git diff --check
  • /Users/brougkr/Documents/Unreal Projects/FACTIONS/Scripts/full_rebuild.sh
  • SpacetimeDB module publish completed successfully in the validation project
  • Unreal binding regeneration completed successfully in the validation project
  • Unreal C++ build completed successfully in the validation project
  • Existing table subscriptions and reducer calls work unchanged in upstream maintainers' test projects
  • Dynamic multicast delegates (OnInsert, OnDelete, OnUpdate) fire correctly in upstream maintainers' test projects
  • Bounded queue overflow produces fatal protocol diagnostics rather than silent OOM
  • Disconnect/reconnect does not apply stale messages from a previous connection epoch
  • Unreal Insights shows SpacetimeDB_* profiler scopes

Notes

  • No SpacetimeDB schema, reducer, subscription query, chunking, interest, or replicated-row shape is changed by this PR.
  • No new tests or test framework are added.
  • The compact primary-key trait currently detects generated FrameKey and BatchKey uint64 fields. The infrastructure can be extended to user-defined specializations if maintainers prefer a more general primary-key trait surface.

Replace the Unreal SDK inbound message path with a connection-owned worker thread, bounded raw and parsed queues, connection epoch guards, and deterministic lifecycle cleanup. This removes the previous fire-and-forget thread-pool preprocessing and reorder buffer, which could grow without backpressure and could allow stale async work to outlive a connection boundary.

Move table preprocessing to message-scoped data, add zero-copy BSATN row parsing through DeserializeView, move-enabled FWithBsatn rows, and rvalue forwarding from remote tables into client caches. Cache apply now reuses shared row storage for diffs, validates structural invariants with checkf, and avoids spurious insert diffs on refcount bumps.

Add compact uint64 primary-key apply support for generated FrameKey and BatchKey rows, native table listener bindings for reflection-free hot-path dispatch, multi-diff table broadcast ordering, and profiler scopes for inbound enqueue, worker preprocess, game-thread apply, cache apply, and broadcasts.

Include the WebSocket native binary-message target and BSATN helper APIs required by the worker and zero-copy preprocessing path. The original plan listed only six files, but audit found those omitted files were compile dependencies for the copied implementation, so this commit keeps the PR source-complete rather than preserving a stale file count.

Validation: ran git diff --check successfully; ran the required ./Scripts/full_rebuild.sh from /Users/brougkr/Documents/Unreal Projects/FACTIONS successfully, including SpacetimeDB publish, binding regeneration, and Unreal C++ build.
@brougkr brougkr changed the title feat(unreal): Dedicated worker thread, bounded queues, zero-copy diff pipeline, native listeners feat(unreal): Dedicated worker thread, bounded queues, zero-copy diff pipeline, native table listeners May 5, 2026
Remove FACTIONS-specific table policy from the generic Unreal SDK cache path. Direct native diffs now require explicit ETableCacheApplyMode::DirectNativeDiff, runtime B-tree index application is controlled through explicit cache policy hooks, and compact cache keys are driven by generated TCompactPrimaryKeyTraits specializations for schema-declared uint64 primary keys instead of FrameKey/BatchKey field-name inference.

Make native listener dispatch lifetime-safe and explicit. Native listeners now store weak UObject owners plus an unregister identity key, expired owners are logged and removed before dispatch continues, and the default dispatch path invokes native listeners before dynamic delegates unless a registration explicitly opts into NativeOnly suppression for hot tables.

Add decoded and parsed-memory backpressure. Gzip payloads whose declared decoded size exceeds the inbound cap are rejected before allocation, parsed queue accounting now uses estimated decoded/preprocessed memory instead of only raw payload bytes, parsed queue overload reports a protocol error, and raw inbound queue draining uses a read index with periodic compaction instead of front-removing every drain.

Update public diff documentation to describe the lower-copy TSharedPtr-backed C++ diff representation and clarify that generated dynamic delegates remain value-reference based. This documents the direct FTableAppliedDiff source-shape change rather than claiming a purely additive or zero-copy API.