fix(distributed): broadcast file-staging progress across replicas #10440
Merged
File-staging progress lived only in the SmartRouter's in-memory
StagingTracker on the replica performing the transfer. In a multi-replica
deployment behind a round-robin load balancer, a /api/operations poll
that lands on any other replica saw no staging row, so the progress
("processing file ... Total ... Current ...") flickered in and out as
polls rotated between frontends.
Mirror the pattern already used for gallery-install progress: the origin
replica broadcasts staging ticks over NATS (SubjectStagingProgress, a
new staging.<model>.progress subject), and peers merge them via
ApplyRemote (SubscribeBroadcasts on the wildcard). Byte-level ticks are
leading-edge debounced (~1/s); Start/FileComplete/Complete always
publish. A locally-owned op stays authoritative so the origin's own echo
and stray peer events can't clobber it, and mirrored remote ops expire
after a TTL so a missed Done event can't leave a phantom row. The UI read
path (StagingTracker.GetAll) is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Problem
File-staging progress lived only in the SmartRouter's in-memory StagingTracker on the replica performing the transfer. In a multi-replica deployment behind a round-robin load balancer, a /api/operations poll that lands on any other replica saw no staging row, so the progress line (processing file ... Total ... Current ...) flickered in and out as polls rotated between frontends.

This is the same cross-replica class as gallery-install progress (already solved via NATS broadcast + merge), but staging never got the equivalent treatment.
Fix
Mirror the gallery-install pattern:
- The origin replica broadcasts staging ticks over NATS on a new staging.<model>.progress subject (SubjectStagingProgress).
- Peers subscribe on the wildcard (SubscribeBroadcasts) and merge via ApplyRemote.
- Byte-level ticks are leading-edge debounced (~1/s); Start/FileComplete/Complete always publish so peers never miss a transition.
- A locally-owned op stays authoritative, so the origin's own echo and stray peer events can't clobber it.
- Mirrored remote ops expire after a TTL, so a missed Done event (NATS is fire-and-forget) can't leave a phantom row.

The UI read path (StagingTracker.GetAll, consumed by /api/operations) is unchanged.
Test
staging_progress_broadcast_test.go: a peer tracker surfaces an op it did not originate after merging broadcasts; the op is removed on completion; a locally-owned op is not clobbered by peer events; standalone mode (no publisher) does not broadcast. Full core/services/nodes suite passes; golangci-lint --new-from-merge-base=origin/master reports 0 issues.
Related
Companion to #10438 (staging context detach). Both came out of the same multi-replica deployment investigation; this one is the cosmetic flicker, #10438 is the model-load outage.