fix(distributed): detach cold-load staging from the request context by localai-bot · Pull Request #10438 · mudler/LocalAI

localai-bot · 2026-06-21T23:26:43Z

Problem

In distributed mode, a model not yet loaded on a worker is staged lazily on the inference request path (SmartRouter.Route → scheduleAndLoad → stageModelFiles). The staging upload is bound to the triggering request's context. Staging a multi-GB model takes minutes - far longer than any client keeps its HTTP connection open - so a browser refresh, an ingress/LB idle-timeout, or a round-robined retry landing on another frontend replica cancels the request context and aborts the upload with context canceled mid-transfer. Large models then never finish staging, so they never load.

Observed in a 2-replica deployment: both frontends repeatedly failed to stage a 15.7 GB GGUF, each attempt dying at a different offset:

ERROR File upload failed  file=...Q4_K_M.gguf size=15.7 GB offset=12409749364
      error=Put ".../...gguf": context canceled
ERROR Failed to load model modelID=... error=... staging model file: ... context canceled

(context canceled, not deadline exceeded → the caller's context was cancelled, i.e. the request was abandoned.)

Fix

Bind the cold load (staging + LoadModel + the per-model advisory lock) to context.WithoutCancel(ctx): it preserves the request's values (prefix-cache chain, etc.) but drops its cancellation/deadline. Each long step keeps its own bound - the file stager's resume budget and LoadModel's 5-minute timeout - so nothing leaks, and the per-model Postgres advisory lock still de-dupes concurrent loaders across replicas.

Test

router_staging_context_test.go reproduces the outage: the fake stager cancels the request context the moment staging begins and asserts the context the stager receives is not cancelled. Red before the fix (context canceled), green after. Full core/services/nodes suite passes; golangci-lint --new-from-merge-base=origin/master reports 0 issues.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

A model not yet loaded on a worker is staged lazily on the inference request path. Staging a multi-GB model takes minutes - far longer than any client keeps its HTTP request open - so a browser refresh, an ingress/LB idle-timeout, or a round-robined retry landing on another frontend replica cancels the request context and aborts the upload with "context canceled" mid-transfer. Large models then never finish staging, so they never load (observed in a 2-replica deployment: both frontends repeatedly failed to stage a 15.7 GB GGUF, each attempt dying at a different offset). Bind the cold load (staging + LoadModel + the per-model advisory lock) to context.WithoutCancel(ctx): it keeps the request's values (prefix chain) but drops cancellation/deadline. Each long step keeps its own bound (the file stager's resume budget, LoadModel's 5m timeout), and the advisory lock still de-dupes concurrent loaders across replicas. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

localai-bot mentioned this pull request Jun 22, 2026

fix(distributed): broadcast file-staging progress across replicas #10440

Merged

mudler merged commit 682fb27 into master Jun 22, 2026
59 checks passed

mudler deleted the fix/distributed-staging-detach-context branch June 22, 2026 07:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(distributed): detach cold-load staging from the request context#10438

fix(distributed): detach cold-load staging from the request context#10438
mudler merged 1 commit into
masterfrom
fix/distributed-staging-detach-context

localai-bot commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

localai-bot commented Jun 21, 2026

Problem

Fix

Test

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants