Ollama returns empty-body responses under sustained load in generate_from_raw #599

@planetf1

Description

Under sustained load on a 32GB M1 Mac with granite4:micro, generate_from_raw occasionally returns ModelOutputThunk(value="") for one or more of its parallel requests — not due to a caught exception, but as a legitimate empty response from Ollama itself.

Evidence: In a 20-run soak test, 18/20 runs had at least one empty result. The FancyLogger.warning added in #598 did not fire during these runs, confirming the empty string was Ollama's actual response, not a swallowed exception.

Current behaviour

generate_from_raw uses asyncio.gather(return_exceptions=True), so non-exception empty responses pass through silently. Tests now use assert all(r.value for r in results) to surface this clearly (added in #598).
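To illustrate why an empty response is invisible to exception handling, here is a minimal sketch of the gather pattern described above. `FakeThunk`, `fake_request`, and `gather_and_classify` are hypothetical stand-ins written for this example; they are not part of the library, and the real `generate_from_raw` may differ in detail.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeThunk:
    """Hypothetical stand-in for ModelOutputThunk; only `value` is modelled."""
    value: str


async def fake_request(i: int) -> FakeThunk:
    """Simulated backend call: request 1 raises, request 2 returns an empty body."""
    await asyncio.sleep(0)
    if i == 1:
        raise RuntimeError("simulated transport error")
    return FakeThunk(value="" if i == 2 else f"response {i}")


async def gather_and_classify(n: int) -> dict:
    """Gather n requests and sort their indices into ok / empty / failed."""
    results = await asyncio.gather(
        *(fake_request(i) for i in range(n)), return_exceptions=True
    )
    buckets = {"ok": [], "empty": [], "failed": []}
    for i, r in enumerate(results):
        if isinstance(r, BaseException):
            # return_exceptions=True hands the exception back as a value.
            buckets["failed"].append(i)
        elif not r.value:
            # No exception was raised, so only an explicit check catches this.
            buckets["empty"].append(i)
        else:
            buckets["ok"].append(i)
    return buckets


buckets = asyncio.run(gather_and_classify(4))
print(buckets)
```

Separating "empty" from "failed" in this way is one possible means of reporting which parallel requests came back empty, rather than only asserting that none did.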

Suspected causes

  1. Machine exhaustion: the 32GB M1 was running other workloads, with Ollama's OLLAMA_NUM_PARALLEL ≥ 4, a 4.6GB model, and context auto-capped at 32K. May not reproduce on an idle/cold machine.
  2. Ollama bug: the server returns an empty body for some requests at high concurrency, without a real OOM or timeout behind it.

Next steps

  • Reproduce on idle/cold machine to isolate machine-exhaustion vs Ollama bug
  • Check if CONTEXT_WINDOW: 2048 (added in #598, "test: fix flaky ollama tests, remove stale xfails, add diagnostic logging") reduces or eliminates the issue
  • Consider exposing a configurable timeout on OllamaModelBackend.__init__ so tests can set a ceiling
  • If confirmed Ollama bug, open upstream issue with repro
  • If machine exhaustion, document in test infrastructure notes
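
As a rough sketch of the configurable-timeout idea in the list above, the snippet below wraps each request in `asyncio.wait_for`. `SketchBackend`, its `timeout_s` parameter, and `_call` are hypothetical names for illustration only; the real `OllamaModelBackend.__init__` does not necessarily accept such an argument, and its request path is not shown here.

```python
import asyncio
from typing import Optional


class SketchBackend:
    """Hypothetical stand-in; not the real OllamaModelBackend."""

    def __init__(self, timeout_s: Optional[float] = None):
        # None means "no ceiling", which mirrors the current behaviour.
        self.timeout_s = timeout_s

    async def _call(self, delay: float) -> str:
        """Simulated request that takes `delay` seconds to answer."""
        await asyncio.sleep(delay)
        return "ok"

    async def generate(self, delay: float) -> str:
        """Run one request, raising asyncio.TimeoutError past the ceiling."""
        return await asyncio.wait_for(self._call(delay), timeout=self.timeout_s)


async def demo():
    fast = await SketchBackend(timeout_s=1.0).generate(0.0)
    try:
        await SketchBackend(timeout_s=0.01).generate(0.5)
        slow = "completed"
    except asyncio.TimeoutError:
        slow = "timed out"
    return fast, slow


fast, slow = asyncio.run(demo())
print(fast, slow)
```

With a ceiling in place, a stalled request would surface as a timeout rather than hanging the test, which would also help tell a slow server apart from one that answers promptly with an empty body.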

Related

/label bug, ollama

Metadata

Labels

bug: Something isn't working
