This deployment is optimized for CPU-only inference on macOS with a focus on:
- Minimal memory footprint
- Efficient CPU utilization
- Stable performance without thermal throttling
- Fast startup and model loading
Apple Silicon Macs are well suited to this workload:
- Unified memory architecture (fast CPU-GPU memory)
- High memory bandwidth
- Energy-efficient performance cores
- 8+ CPU cores on most models
```bash
# .env configuration for Apple Silicon
OMP_THREADS=4        # Use 4-6 threads (50-75% of P-cores)
CPU_LIMIT=6.0        # Use most cores
MEMORY_LIMIT=12G     # M1/M2 typically have 16GB+

# For larger models on M1 Max/Ultra with 32GB+
MODEL_NAME=HuggingFaceTB/SmolLM2-1.7B-Instruct
MAX_MODEL_LEN=4096
MEMORY_LIMIT=16G
```

- M1/M2 have performance (P) and efficiency (E) cores
- Keep `OMP_THREADS` at 50-75% of P-cores; this prevents E-core scheduling, which hurts inference latency
- Example: M1 Pro (8 cores = 6P + 2E) → use `OMP_THREADS=4`
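The 50-75% rule above can be sketched as a quick helper (illustrative only; the function name and the choice of the 75% upper bound are mine):

```python
import math

def recommended_omp_threads(p_cores: int) -> int:
    """Suggest OMP_THREADS at ~75% of performance cores (the upper end
    of the 50-75% rule), never below 1."""
    return max(1, math.floor(p_cores * 0.75))

# M1 Pro: 6 P-cores -> 4 threads, matching OMP_THREADS=4 above
print(recommended_omp_threads(6))  # → 4
```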
Intel Macs, by contrast, have:
- Separate CPU and system memory
- Tighter thermal constraints
- Typically fewer cores (4-8)
- Higher power consumption
```bash
# .env configuration for Intel Macs
OMP_THREADS=2        # Conservative threading
CPU_LIMIT=4.0        # Moderate CPU usage
MEMORY_LIMIT=8G      # Standard allocation

# Use smaller models
MODEL_NAME=HuggingFaceTB/SmolLM2-360M-Instruct
MAX_MODEL_LEN=2048
```

- Intel CPUs benefit less from high thread counts in inference
- Keep `OMP_THREADS` at 2-4 to avoid thermal throttling
- Monitor CPU temperature with Activity Monitor
- Lower limits extend laptop battery life
```
Total memory = Model size + KV cache + Working memory + Overhead

SmolLM2-135M: ~500MB model + 500MB KV + 1GB working = ~2GB total
SmolLM2-360M: ~1.3GB model + 1GB KV + 2GB working = ~4GB total
SmolLM2-1.7B: ~6.5GB model + 2GB KV + 4GB working = ~12GB total
```
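The sizing formula can be checked with a small estimator. It assumes fp32 weights (4 bytes per parameter), which is consistent with the figures above; the function name is my own:

```python
def estimate_memory_gb(params_millions: float, kv_gb: float,
                       working_gb: float, overhead_gb: float = 0.0,
                       bytes_per_param: int = 4) -> float:
    """Rough total memory: fp32 weights + KV cache + working memory + overhead."""
    model_gb = params_millions * bytes_per_param / 1000
    return model_gb + kv_gb + working_gb + overhead_gb

# SmolLM2-360M: ~1.44GB model + 1GB KV + 2GB working ≈ the ~4GB above
print(round(estimate_memory_gb(360, 1, 2), 2))  # → 4.44
```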
```bash
KVCACHE_SPACE=0.5    # From default 1GB
MAX_MODEL_LEN=1024   # Shorter context window
```

Impact: less memory, but can't handle long conversations/documents.

```bash
MAX_NUM_SEQS=4       # From default 8
```

Impact: less memory, but fewer parallel requests.

```bash
MODEL_NAME=HuggingFaceTB/SmolLM2-135M-Instruct   # From 360M
```

Impact: less memory, but lower-quality outputs.

```bash
# In docker-compose.yml, add to command:
--disable-log-requests
--disable-log-stats
```

Impact: minimal memory savings, but reduces logging I/O.
- Open Docker Desktop → Settings → Resources
- Set Memory limit:
  - Minimum: 4GB for SmolLM2-360M
  - Recommended: 8GB for SmolLM2-360M
  - Optimal: 12GB+ for SmolLM2-1.7B
- Set CPU limit to match your `.env` configuration
The key environment variables:

```bash
OMP_NUM_THREADS=2        # OpenMP parallelism (main knob)
OPENBLAS_NUM_THREADS=1   # BLAS operations (keep at 1)
MKL_NUM_THREADS=1        # Intel MKL (keep at 1)
```

To find the best thread count:

1. Start with 2 threads:
   ```bash
   OMP_THREADS=2 docker compose up -d
   python test_vllm.py   # Note the performance
   ```
2. Try 4 threads:
   ```bash
   OMP_THREADS=4 docker compose restart
   python test_vllm.py   # Compare performance
   ```
3. Try 6 threads (Apple Silicon only):
   ```bash
   OMP_THREADS=6 docker compose restart
   python test_vllm.py   # Compare performance
   ```
4. Use the value with the best latency (not necessarily the highest throughput)
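The selection rule in the last step (lowest latency wins) is simple to encode; a sketch with hypothetical sweep results:

```python
def best_thread_count(latency_ms_by_threads: dict) -> int:
    """Return the thread count with the lowest measured latency."""
    return min(latency_ms_by_threads, key=latency_ms_by_threads.get)

# Hypothetical measurements: 4 threads gives the best latency here
print(best_thread_count({2: 450.0, 4: 310.0, 6: 330.0}))  # → 4
```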
- Multiple BLAS threads compete with OMP threads
- Causes CPU contention and cache thrashing
- Single-threaded BLAS + multi-threaded OMP is more efficient for inference
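If you ever run inference directly in Python (outside the container), note that these variables must be set before the numeric libraries are imported, or they are silently ignored. A minimal sketch:

```python
import os

# OpenMP and BLAS read these at library load time, so set them
# before importing numpy, torch, or vllm in the same process.
for var, val in {
    "OMP_NUM_THREADS": "2",
    "OPENBLAS_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
}.items():
    os.environ[var] = val
```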
For even better performance, pin vLLM to specific cores:
```yaml
# docker-compose.yml
services:
  vllm-cpu:
    cpuset: "0-3"   # Use first 4 cores only
```

This prevents the OS from moving the process between cores.
Models are cached in a Docker volume:
```bash
# See cache location
docker volume inspect vllm-cpu_hf-cache

# Pre-download models
docker compose run --rm vllm-cpu \
  python -c "from transformers import AutoModelForCausalLM; \
AutoModelForCausalLM.from_pretrained('HuggingFaceTB/SmolLM2-360M-Instruct')"
```

If you have models already downloaded:
```yaml
# docker-compose.yml
volumes:
  - ./models:/workspace/models:ro
```

```bash
# .env
MODEL_NAME=/workspace/models/SmolLM2-360M-Instruct
```

To keep the API private to your machine:

```yaml
# docker-compose.yml
ports:
  - "127.0.0.1:8000:8000"   # Only accessible from localhost
```

To expose it on your local network instead:

```yaml
# docker-compose.yml
ports:
  - "0.0.0.0:8000:8000"     # Accessible from local network
```

Then access from other devices at `http://<your-mac-ip>:8000`.
Create a smaller final image:
```dockerfile
# Dockerfile.optimized
FROM openeuler/vllm-cpu:0.9.1-oe2403lts AS base

# Patch cpu_worker.py so it handles numa_size == 0 without dividing by zero
RUN sed -i 's/cpu_count_per_numa = cpu_count \/\/ numa_size/\
cpu_count_per_numa = cpu_count \/\/ numa_size if numa_size > 0 else cpu_count/g' \
    /workspace/vllm/vllm/worker/cpu_worker.py

# Remove unnecessary files
RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ENV VLLM_TARGET_DEVICE=cpu \
    VLLM_CPU_KVCACHE_SPACE=1 \
    OMP_NUM_THREADS=2 \
    OPENBLAS_NUM_THREADS=1 \
    MKL_NUM_THREADS=1
```

Build with cache for faster rebuilds:
```bash
# Build with cache
docker compose build

# Build without cache (clean rebuild)
docker compose build --no-cache
```

```bash
# Run performance test
python test_vllm.py

# Or use a quick benchmark
time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "HuggingFaceTB/SmolLM2-360M-Instruct",
       "prompt": "Test prompt",
       "max_tokens": 100}' > /dev/null
```

```bash
# Container stats
docker stats vllm-smollm2 --no-stream

# Detailed metrics
docker compose exec vllm-cpu top
```

On M1 Pro (8-core):
- SmolLM2-135M: ~30-50 tokens/sec
- SmolLM2-360M: ~20-30 tokens/sec
- SmolLM2-1.7B: ~5-10 tokens/sec
On Intel i7 (4-core):
- SmolLM2-135M: ~15-25 tokens/sec
- SmolLM2-360M: ~10-15 tokens/sec
- SmolLM2-1.7B: ~3-5 tokens/sec
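To reproduce these tokens/sec figures yourself, a small stdlib-only measurement script against the OpenAI-compatible endpoint. The `usage.completion_tokens` field is part of the OpenAI-style response that vLLM serves; the URL and model default are taken from this guide, and the function names are mine:

```python
import json
import time
import urllib.request

def tokens_per_sec(n_tokens: int, seconds: float) -> float:
    return n_tokens / seconds

def measure(base_url: str = "http://localhost:8000",
            model: str = "HuggingFaceTB/SmolLM2-360M-Instruct") -> float:
    """Time one 100-token completion and return generation speed."""
    payload = {"model": model, "prompt": "Test prompt", "max_tokens": 100}
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - start
    return tokens_per_sec(body["usage"]["completion_tokens"], elapsed)
```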
```bash
# Reduce threads
OMP_THREADS=2
CPU_LIMIT=2.0
```

```bash
# Reduce cache and sequences
KVCACHE_SPACE=0.5
MAX_NUM_SEQS=4
MAX_MODEL_LEN=1024
```

- Normal: model loading takes time
- Solution: keep the container running
- Or use the health check endpoint to warm up
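The warm-up suggestion above can be scripted: check the health endpoint, then request a single token so the first real user doesn't pay the model-load cost. A stdlib-only sketch (the endpoint paths follow vLLM's OpenAI-compatible server; the function names and one-token trick are my own):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # match your port mapping

def warmup_payload(model: str) -> dict:
    """One generated token is enough to force full model warm-up."""
    return {"model": model, "prompt": "Hi", "max_tokens": 1}

def warm_up(model: str = "HuggingFaceTB/SmolLM2-360M-Instruct") -> bool:
    # 1. Confirm the server reports healthy
    with urllib.request.urlopen(f"{BASE_URL}/health", timeout=60) as resp:
        if resp.status != 200:
            return False
    # 2. Generate one token to load the model and populate caches
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return resp.status == 200
```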
- Monitor: Activity Monitor → CPU History
- Reduce `OMP_THREADS` and `CPU_LIMIT`
- Ensure good laptop ventilation
- Consider using a smaller model
For production deployments on macOS:
```bash
# Stable, production settings
MODEL_NAME=HuggingFaceTB/SmolLM2-360M-Instruct
OMP_THREADS=2
CPU_LIMIT=4.0
CPU_RESERVATION=2.0
MEMORY_LIMIT=8G
MEMORY_RESERVATION=4G
MAX_MODEL_LEN=2048
MAX_NUM_SEQS=8
KVCACHE_SPACE=1
```

These settings provide:
- Predictable performance
- No thermal issues
- Reasonable quality
- Good concurrency
- Memory safety
For expert users, you can pass additional vLLM flags:
```yaml
# docker-compose.yml
command: >
  vllm serve ${MODEL_NAME}
  --host 0.0.0.0
  --port 8000
  --dtype auto
  --max-model-len ${MAX_MODEL_LEN}
  --max-num-seqs ${MAX_NUM_SEQS}
  --gpu-memory-utilization 0.0
  --swap-space 0
  --enforce-eager
  --disable-custom-all-reduce
```

See the vLLM docs for all options.