Skip to content

fix(distributed): detach cold-load staging from the request context#10438

Merged
mudler merged 1 commit into
masterfrom
fix/distributed-staging-detach-context
Jun 22, 2026
Merged

fix(distributed): detach cold-load staging from the request context#10438
mudler merged 1 commit into
masterfrom
fix/distributed-staging-detach-context

Conversation

@localai-bot

Copy link
Copy Markdown
Collaborator

Problem

In distributed mode, a model not yet loaded on a worker is staged lazily on the inference request path (SmartRouter.RoutescheduleAndLoadstageModelFiles). The staging upload is bound to the triggering request's context. Staging a multi-GB model takes minutes - far longer than any client keeps its HTTP connection open - so a browser refresh, an ingress/LB idle-timeout, or a round-robined retry landing on another frontend replica cancels the request context and aborts the upload with context canceled mid-transfer. Large models then never finish staging, so they never load.

Observed in a 2-replica deployment: both frontends repeatedly failed to stage a 15.7 GB GGUF, each attempt dying at a different offset:

ERROR File upload failed  file=...Q4_K_M.gguf size=15.7 GB offset=12409749364
      error=Put ".../...gguf": context canceled
ERROR Failed to load model modelID=... error=... staging model file: ... context canceled

(context canceled, not deadline exceeded → the caller's context was cancelled, i.e. the request was abandoned.)

Fix

Bind the cold load (staging + LoadModel + the per-model advisory lock) to context.WithoutCancel(ctx): it preserves the request's values (prefix-cache chain, etc.) but drops its cancellation/deadline. Each long step keeps its own bound - the file stager's resume budget and LoadModel's 5-minute timeout - so nothing leaks, and the per-model Postgres advisory lock still de-dupes concurrent loaders across replicas.

Test

router_staging_context_test.go reproduces the outage: the fake stager cancels the request context the moment staging begins and asserts the context the stager receives is not cancelled. Red before the fix (context canceled), green after. Full core/services/nodes suite passes; golangci-lint --new-from-merge-base=origin/master reports 0 issues.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

A model not yet loaded on a worker is staged lazily on the inference
request path. Staging a multi-GB model takes minutes - far longer than
any client keeps its HTTP request open - so a browser refresh, an
ingress/LB idle-timeout, or a round-robined retry landing on another
frontend replica cancels the request context and aborts the upload with
"context canceled" mid-transfer. Large models then never finish staging,
so they never load (observed in a 2-replica deployment: both frontends
repeatedly failed to stage a 15.7 GB GGUF, each attempt dying at a
different offset).

Bind the cold load (staging + LoadModel + the per-model advisory lock) to
context.WithoutCancel(ctx): it keeps the request's values (prefix chain)
but drops cancellation/deadline. Each long step keeps its own bound (the
file stager's resume budget, LoadModel's 5m timeout), and the advisory
lock still de-dupes concurrent loaders across replicas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler mudler merged commit 682fb27 into master Jun 22, 2026
59 checks passed
@mudler mudler deleted the fix/distributed-staging-detach-context branch June 22, 2026 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants