The First LLM Inference Engine Designed Specifically for Deterministic Edge Computing
Ollama is like a sports car:
- Super fast on the highway (136 tok/s)
- But sometimes it randomly stops for gas (garbage collection)
- Needs a big garage ($800+ computer)
- Uses lots of fuel (91MB model, 8GB+ RAM)
EdgeLLM is like a reliable scooter:
- Slower on the highway (8 tok/s currently)
- But NEVER randomly stops - completely predictable
- Fits in a tiny shed ($15 Raspberry Pi)
- Sips fuel (40MB model, 256MB RAM)
Imagine you're asking a question:
Ollama Response Time:
Ask #1: 0.2 seconds ✓ Fast!
Ask #2: 8.8 seconds 😐 Okay...
Ask #3: 19.7 seconds 😤 Why so slow?!
EdgeLLM Response Time:
Ask #1: 4.6 seconds ✓
Ask #2: 4.7 seconds ✓
Ask #3: 4.6 seconds ✓ Always the same!
Why this matters:
- A robot arm needs to know EXACTLY when a response comes
- A voice assistant can't randomly take 20 seconds
- Factory machines need consistent timing
Analogy: Would you rather have a bus that arrives "sometime between 1 and 20 minutes" or one that ALWAYS arrives in 5 minutes?
Same AI brain, different sizes:
Original Model: [████████████████████████████] 256 MB
Ollama Version: [███████████] 91 MB
EdgeLLM Version: [████] 40 MB
How? We use a clever trick called BitNet:
- Normal AI: Each "thought" stored as a complex number
- Our AI: Each "thought" is just -1, 0, or +1 (three options)
Analogy: Instead of writing "2.7834729", we just write "+" or "-" or "0". Way less space, still works!
| To Run Ollama | To Run EdgeLLM |
|---|---|
| Gaming PC: $800+ | Raspberry Pi Zero: $15 |
| 8GB RAM minimum | 256MB RAM enough |
| Big power supply | USB phone charger |
Smart Doorbell:
- Ollama: Needs a computer inside your wall 💸
- EdgeLLM: Runs on a $15 chip in the doorbell ✓
Farm Sensor:
- Ollama: Needs internet to work 🌐
- EdgeLLM: Works offline in the middle of nowhere ✓
Robot Arm in Factory:
- Ollama: Response in 0.2 to 20 seconds (unpredictable)
⚠️ - EdgeLLM: Response in 4.5 seconds (always) ✓
EdgeLLM trades raw speed for PREDICTABILITY and the ability to run on devices the size of a credit card.
It's not about being the fastest car. It's about being the only car that fits in your pocket and never breaks down.
Traditional LLM inference engines like Ollama, llama.cpp, and vLLM were designed for cloud and desktop environments where:
- Resources are abundant
- Latency variability is acceptable
- Hardware costs are not a concern
But edge computing has fundamentally different requirements:
| Requirement | Cloud AI | Edge AI |
|---|---|---|
| Latency | Variable OK | Deterministic Required |
| Hardware | $800+ servers | $15-100 devices |
| Memory | 16-256 GB | 256 MB - 4 GB |
| Power | Unlimited | 5-15W max |
| Connectivity | Always online | Offline capable |
| Model Size | 91 MB+ | 40 MB or less |
| Metric | Ollama | EdgeLLM | EdgeLLM Advantage |
|---|---|---|---|
| Latency Jitter | 5,799 ms | 373 ms | 15.5x lower |
| P99 Latency | 19.7 sec | 5.5 sec | 3.6x faster |
| Model Size (SmolLM-135M) | 91 MB | 40 MB | 2.3x smaller |
| Minimum RAM | 8+ GB | 256 MB | 32x less |
| Minimum Hardware Cost | $800+ | $15 | 53x cheaper |
| Garbage Collection Pauses | Yes | None | Deterministic |
| Offline Capable | Limited | Full | No network required |
Ollama Latency: ████████████████████████████████████████████████ 5,799 ms jitter
EdgeLLM Latency: ███ 373 ms jitter
(15.5x improvement)
Real-World Impact:
| Use Case | Jitter Requirement | Ollama | EdgeLLM |
|---|---|---|---|
| Robotic Control | < 100 ms | FAIL | Target* |
| Voice Assistants | < 500 ms | FAIL | PASS |
| Industrial IoT | < 1 sec | FAIL | PASS |
| Smart Home | < 2 sec | FAIL | PASS |
| Batch Processing | Any | PASS | PASS |
*EdgeLLM's 373ms jitter is close to robotics threshold; optimization ongoing.
Instead of using 16-bit or even 4-bit weights, EdgeLLM uses ternary weights (-1, 0, +1):
Traditional (FP16): 256.6 MB
Ollama (Q4_0): 91.0 MB (2.8x compression)
EdgeLLM (BitNet 1.58): 39.7 MB (6.5x compression)
Why This Works:
- Research shows ternary weights retain model quality at small scales
- Eliminates floating-point multiplication entirely
- Perfect for resource-constrained devices
EdgeLLM replaces expensive multiply-accumulate operations with table lookups:
Traditional Inference:
y = Σ(weight × activation) ← Requires multiplication
T-MAC Inference:
y = LUT[packed_weights] ← Just table lookup!
Performance Impact:
- 4-bit activations index into 16-entry lookup tables
- ARM NEON
tbland x86 AVX2pshufbinstructions - Deterministic execution time
EdgeLLM is built with Mojo, which provides:
# Python-like syntax
fn inference(model: Model, prompt: String) -> String:
var tokens = tokenize(prompt)
for i in range(max_tokens):
var logits = forward(model, tokens)
var next_token = sample(logits)
tokens.append(next_token)
return decode(tokens)But with:
- Zero garbage collection - no unpredictable pauses
- Ownership model - memory safety without runtime overhead
- Native SIMD - vectorized operations
- C interop - use existing optimized kernels
| Device | Price | RAM | Model | Expected Speed | Use Case |
|---|---|---|---|---|---|
| Raspberry Pi Zero 2 W | $15 | 512 MB | SmolLM-135M | 2-5 tok/s | Smart sensors |
| Raspberry Pi 4 | $35 | 4 GB | SmolLM-360M | 5-10 tok/s | Home automation |
| Raspberry Pi 5 | $80 | 8 GB | Qwen2-0.5B | 10-20 tok/s | Voice assistants |
| NVIDIA Jetson Nano | $99 | 4 GB | Llama-1B | 15-25 tok/s | Robotics |
| BeagleBone AI-64 | $149 | 4 GB | Phi-3-mini | 12-20 tok/s | Industrial IoT |
| Model | Parameters | EdgeLLM Size | Min RAM | Use Case |
|---|---|---|---|---|
| SmolLM-135M | 135M | 40 MB | 256 MB | IoT sensors, simple tasks |
| SmolLM-360M | 360M | 90 MB | 512 MB | Home automation |
| Qwen2-0.5B | 500M | 125 MB | 1 GB | Voice commands |
| Llama-3.2-1B | 1B | 200 MB | 2 GB | Complex reasoning |
User: "Turn off the bedroom lights"
EdgeLLM Response Time: 1.2 seconds (deterministic)
Ollama Response Time: 0.2 - 19.7 seconds (variable)
Why EdgeLLM wins: Users expect instant response. A 20-second delay breaks the experience.
Sensor: "Pressure reading 847 PSI, vibration 0.3mm"
EdgeLLM Analysis: "Warning: Pressure elevated but within safety margin.
Vibration normal. Continue monitoring."
Response time: 2.1 seconds (consistent)
Why EdgeLLM wins: Safety-critical systems need predictable response times.
Voice Input: "What's on my calendar today?"
EdgeLLM: Processes locally, no cloud connection
Ollama: May require cloud for larger models
Why EdgeLLM wins: All processing stays on-device. No data leaves your network.
Camera Input → Object Detection → EdgeLLM Reasoning → Motor Control
EdgeLLM: Consistent 147ms per-token latency
Ollama: 6.7ms - 11.5ms per-token (high variance)
Why EdgeLLM wins: Robots need predictable response for smooth motion.
Environment: Remote location, no internet
EdgeLLM: Fully functional
Ollama: Limited functionality
Why EdgeLLM wins: Edge AI must work without connectivity.
# Clone the repository
git clone https://github.com/yourusername/edgellm.git
cd edgellm/mojo-gateway
# Build the Docker image
docker build -f Dockerfile.mojo -t edgellm-inference .
# Run inference
docker run --rm -v $(pwd)/models:/workspace/models \
edgellm-inference \
/workspace/bin/edgellm /workspace/models/smollm-135m.tm2.bin -n 20 -t 0.7# Install Mojo (ARM64)
curl -fsSL https://pixi.sh/install.sh | sh
pixi init -c https://conda.modular.com/max-nightly/ -c conda-forge
pixi add mojo
# Build EdgeLLM
cd mojo-gateway
pixi run mojo build -O3 src/bitnet_tmac_lut.mojo -o bin/edgellm
# Run inference
./bin/edgellm models/smollm-135m.tm2.bin -n 32 -t 0.7# Step 1: Quantize to BitNet format
python scripts/quantize/quantize_bitnet.py \
--input HuggingFaceTB/SmolLM-135M \
--output models/smollm-135m.tmac.bin
# Step 2: Convert to TM2 format
python scripts/convert_tmac_to_tm2.py \
models/smollm-135m.tmac.bin \
models/smollm-135m.tm2.bin# Run 100 inference benchmarks
docker run --rm -v $(pwd)/models:/workspace/models \
edgellm-inference \
python3 benchmarks/edgellm_benchmark.py \
--backend edgellm \
--model /workspace/models/smollm-135m.tm2.bin \
--runs 100 \
--output /workspace/results/benchmark.json
# View results
cat results/benchmark.json | python3 -m json.toolKey Metrics to Look For:
- Jitter: Lower is better (target < 500ms)
- P99 Latency: Worst-case latency for 99% of requests
- Throughput: Tokens per second
- Memory: Peak RAM usage
| Feature | Ollama | EdgeLLM |
|---|---|---|
| Focus | Ease of use | Edge performance |
| Quantization | Q4_0/Q5_K | BitNet 1.58-bit |
| Minimum Hardware | Desktop PC | Raspberry Pi Zero |
| GC Pauses | Yes (Go runtime) | None (Mojo) |
| Model Management | Built-in | Manual |
| API Compatibility | OpenAI-like | Custom |
When to use Ollama: Desktop development, model experimentation When to use EdgeLLM: Production edge deployment, real-time systems
| Feature | llama.cpp | EdgeLLM |
|---|---|---|
| Language | C/C++ | Mojo + C FFI |
| Focus | CPU inference | Deterministic latency |
| Quantization | Various (Q4-Q8) | BitNet 1.58-bit |
| Memory Predictability | Good | Excellent |
| Latency Jitter | Medium | Very Low |
When to use llama.cpp: Maximum throughput on desktop When to use EdgeLLM: Minimum latency variance on edge
| Feature | vLLM | EdgeLLM |
|---|---|---|
| Focus | High throughput serving | Edge determinism |
| Hardware | GPU clusters | CPU-only edge |
| Batching | PagedAttention | Single request |
| Target Latency | Variable | Deterministic |
When to use vLLM: Cloud inference at scale When to use EdgeLLM: Resource-constrained edge devices
- Native ARM64 build without Docker
- Raspberry Pi 5 optimized NEON kernel
- WebSocket streaming API
- CUDA kernel development begins (Jetson Nano/Orin)
- CUDA T-MAC kernels for NVIDIA GPUs (80-120 tok/s target)
- Jetson Nano/Orin full support
- Metal kernels for Apple Silicon GPUs
- Multimodal (vision + language)
- Apple Neural Engine (ANE) support via Core ML
- M1/M2/M3/M4 optimized inference (150-250 tok/s target)
- Custom fine-tuning pipeline
- Model distillation tools
- Qualcomm Hexagon DSP support
- Intel NPU support (Meteor Lake)
- Federated learning support
- Enterprise edge deployment
| Platform | Current (CPU) | With GPU | vs Ollama |
|---|---|---|---|
| Jetson Nano | 15-25 tok/s | 80-120 tok/s | 1.5-2x faster |
| Jetson Orin | 30-50 tok/s | 200-400 tok/s | 2-3x faster |
| Apple M4 | 30-50 tok/s | 150-250 tok/s | 1.5-2x faster |
| Raspberry Pi 5 | 10-20 tok/s | N/A (no GPU) | Jitter advantage |
The Mojo + C/CUDA combo will deliver both speed AND determinism.
We welcome contributions! Key areas:
- Kernel optimization - ARM NEON, RISC-V
- Model support - New quantization formats
- Documentation - Examples and tutorials
- Testing - Hardware validation
See CONTRIBUTING.md for guidelines.
-
T-MAC: Table Lookup for LLM Inference - EuroSys 2025
-
BitNet: 1.58-bit LLMs - Microsoft Research
-
NoMAD-Attention - NeurIPS 2024
-
Mojo Language Documentation
MIT License - See LICENSE for details.
- GitHub Issues: Bug reports and feature requests
- Discussions: Questions and ideas
- Discord: Real-time community chat (coming soon)
EdgeLLM: Bringing AI to the Edge, Predictably.
Built for the next billion AI devices.