Super Ollama Load Balancer & Orchestration Layer
Intelligent routing, observability, and distributed inference for Ollama clusters.
SOLLOL sits between your applications and a collection of Ollama nodes. It discovers them, monitors their health, scores them by GPU/CPU capacity and current load, then routes each request to the best available node. If a node dies, it fails over automatically.
Think of it as a drop-in replacement for talking to Ollama directly — same API, but with intelligent routing, observability, and cluster management layered on top.
```bash
pip install sollol
```

```python
from sollol import OllamaPool

# Auto-discover all Ollama nodes on the network
pool = OllamaPool.auto_configure()

# Make a request — SOLLOL routes it to the best node
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response['message']['content'])
```

That's it. No async/await, no config files. It finds nodes, picks the best one, routes the request.
Run SOLLOL as a gateway that replaces Ollama on port 11434:
```bash
sollol up
```

Applications talking to `localhost:11434` now go through SOLLOL's routing engine instead of hitting a single node directly.
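Because the gateway speaks the same API on the same port, existing clients need no code changes. As a quick sketch, the official `ollama` Python client (which defaults to `http://localhost:11434`) should work unmodified, assuming the gateway proxies Ollama's standard endpoints as described above:

```python
import ollama

# This request now goes through SOLLOL's routing engine, not a single node
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])
```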
SOLLOL scores every node on multiple factors before routing:
| Factor | What It Checks |
|---|---|
| Success rate | Historical reliability of each node |
| Latency | Response time from recent requests |
| GPU availability | Whether the node has a GPU (via gpustat + Redis) |
| Current load | Queue depth and active requests |
| Task type | Generation, embedding, or classification workloads |
| Priority | Request priority (CRITICAL → BATCH) |
| Specialization | Nodes that historically perform well for specific models |
Scoring formula:
```
Score = 100 (baseline)
        × success_rate
        ÷ (1 + latency_penalty)
        × gpu_bonus (1.5x if GPU available & needed)
        ÷ (1 + load_penalty)
        × priority_alignment
        × task_specialization
```
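Read as code, the formula looks roughly like the sketch below. The parameter names (`latency_penalty`, `load_penalty`, and so on) are illustrative stand-ins, not SOLLOL's internal API:

```python
def score_node(success_rate, latency_penalty, has_gpu, needs_gpu,
               load_penalty, priority_alignment, task_specialization):
    score = 100.0                     # baseline
    score *= success_rate             # historical reliability
    score /= 1 + latency_penalty      # recent response times
    if has_gpu and needs_gpu:
        score *= 1.5                  # gpu_bonus
    score /= 1 + load_penalty         # queue depth / active requests
    score *= priority_alignment
    score *= task_specialization
    return score
```

The router then picks the healthy node with the highest score.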
Scans the local network for Ollama instances. No need to configure node addresses manually.
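The idea in sketch form: probe each host on the local subnet for an open Ollama port. This is illustrative only; the subnet, timeout, and verification logic here are assumptions, not SOLLOL's actual scanner:

```python
import socket

def find_ollama_nodes(prefix="192.168.1", port=11434, timeout=0.2):
    """Return host:port strings that accept TCP connections on `port`."""
    nodes = []
    for i in range(1, 255):
        host = f"{prefix}.{i}"
        try:
            with socket.create_connection((host, port), timeout=timeout):
                nodes.append(f"{host}:{port}")
        except OSError:
            pass  # closed port, unreachable host, or timeout
    return nodes
```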
Install the GPU reporter on each node:
```bash
sollol install-gpu-reporter --redis-host 192.168.1.10
```

Publishes real-time VRAM stats to Redis every 5 seconds. SOLLOL uses this to avoid routing heavy models to nodes that don't have the memory.
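For a sense of the consuming side, here is a sketch using `redis-py`. The key layout (`sollol:gpu:<host>`) and JSON fields are assumptions for illustration, not SOLLOL's documented schema:

```python
import json
import redis

r = redis.Redis(host="192.168.1.10", port=6379)

def free_vram_mb(host):
    """Return free VRAM for a node, or None if no reporter is publishing."""
    raw = r.get(f"sollol:gpu:{host}")  # hypothetical key name
    if raw is None:
        return None
    return json.loads(raw).get("vram_free_mb")  # hypothetical field
```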
```bash
python3 -m sollol.dashboard_service &
```

Web UI at `http://localhost:8080` showing:
- Node health and status
- P50/P95/P99 latency metrics
- Active applications using the cluster
- GPU memory usage (if reporter installed)
- Live request/activity logs
Ray — Parallel request execution via Ray actors for multi-agent workloads.
Dask — Batch processing for embeddings and bulk inference with work stealing.
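To make the multi-agent pattern concrete, here is a rough Ray sketch: one actor per agent, all issuing chat requests through the gateway in parallel. This is a conceptual illustration, not SOLLOL's internal actor code; the model name and URL are placeholders:

```python
import ray
import requests

ray.init(ignore_reinit_error=True)

@ray.remote
class Agent:
    """One actor per agent; each sends its requests through the gateway."""

    def __init__(self, gateway_url):
        self.gateway_url = gateway_url

    def chat(self, prompt):
        resp = requests.post(
            f"{self.gateway_url}/api/chat",
            json={
                "model": "llama3.2",
                "messages": [{"role": "user", "content": prompt}],
                "stream": False,
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

# Four agents run in parallel; SOLLOL spreads their requests across nodes
agents = [Agent.remote("http://localhost:11434") for _ in range(4)]
print(ray.get([a.chat.remote(f"Summarize topic {i}") for i, a in enumerate(agents)]))
```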
No async/await needed:
```python
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority

pool = OllamaPool.auto_configure()

response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    priority=Priority.HIGH,
    timeout=60
)
```

Duplicate slow requests to a second node and take whichever response comes back first. The slowness threshold is configurable.
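This is the classic "hedged request" pattern. A minimal sketch, assuming a fixed delay threshold (function and parameter names are illustrative, not SOLLOL's API):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_request(send, primary, backup, hedge_after=2.0):
    """Call send(node); if the primary is slow, race a backup against it."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(send, primary)}
        done, _ = wait(futures, timeout=hedge_after, return_when=FIRST_COMPLETED)
        if not done:  # primary exceeded the threshold: hedge to the backup
            futures.add(pool.submit(send, backup))
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Note: the executor still waits for the straggler on exit;
        # fine for a sketch, a real router would cancel or detach it.
        return done.pop().result()
```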
Nodes that start failing get temporarily removed from the pool until they recover.
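A minimal sketch of the circuit-breaker idea, assuming a consecutive-failure threshold with a cooldown; SOLLOL's actual policy and thresholds may differ:

```python
import time

class CircuitBreaker:
    """Trip after max_failures consecutive errors; retry after cooldown."""

    def __init__(self, max_failures=5, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def available(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: let one request try again
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```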
```
┌─────────────────────┐
│  Your Application   │
└──────────┬──────────┘
           │
           ▼
┌──────────────────────────────────┐
│      SOLLOL Gateway (:11434)     │
│  ┌────────────────────────────┐  │
│  │ Intelligent Routing Engine │  │
│  │ Scores all nodes, picks    │  │
│  │ the best, routes request   │  │
│  └────────────┬───────────────┘  │
│  ┌────────────┴───────────────┐  │
│  │ Priority Queue + Hedging   │  │
│  │ + Circuit Breakers         │  │
│  └────────────┬───────────────┘  │
└───────────────┼──────────────────┘
                │
    ┌───────────┼───────────┐
    ▼           ▼           ▼
┌────────┐  ┌────────┐  ┌────────┐
│ Node 1 │  │ Node 2 │  │ Node 3 │
│ GPU 24 │  │ GPU 16 │  │ CPU    │
└────────┘  └────────┘  └────────┘
```
```bash
# From PyPI
pip install sollol

# From source
git clone https://github.com/B-A-M-N/SOLLOL.git
cd SOLLOL
pip install -e .
```

| Command | Description |
|---|---|
| `sollol up` | Start the SOLLOL gateway on port 11434 |
| `sollol install-gpu-reporter --redis-host <ip>` | Set up GPU monitoring on a node |
All settings work via environment variables:
| Variable | Default | Description |
|---|---|---|
| `SOLLOL_PORT` | `11434` | Gateway port |
| `SOLLOL_RAY_WORKERS` | `4` | Ray actor count |
| `SOLLOL_DASK_WORKERS` | `2` | Dask worker count |
| `OLLAMA_NODES` | auto-discover | Comma-separated node addresses |
| `RPC_BACKENDS` | none | Comma-separated llama.cpp RPC backends |
| `SOLLOL_BATCH_PROCESSING` | `true` | Enable Dask batch mode |
Route to the right node, fail over fast, and never lose a request to a dead endpoint.
- Dallan Loomis — for the interactions and guidance that kept this project on track
- My parents — for the support that made all of this possible
- My son — the reason I build