This project provides tools and configurations for setting up your own AI HomeLab using NVIDIA DGX Spark technology, featuring multiple LLM inference servers and AI-powered development tools.
This repository contains scripts and Docker configurations for running multiple LLM inference servers on NVIDIA's GB10 DGX Spark GPU. The setup supports llama.cpp, vLLM, SGLang, and MAX inference servers, along with Claude Code Router and Vibe Kanban for AI-assisted development.
- Docker-based deployments with NVIDIA GPU support
- Multiple LLM inference servers: llama.cpp, vLLM, SGLang, MAX
- Optimized for NVIDIA GB10 DGX Spark
- Claude Code Router for intelligent model routing
- Vibe Kanban for AI-enhanced project management
- Support for long-context models (up to 1M tokens)
- Multiple quantization formats: FP8, MXFP4, Q4_K_M, Q8_0
# Create spark network
docker network create spark-network
# Create .env file with required variables (see Environment Variables below)
# Start all services via Docker Compose
docker compose up -d
# View logs
docker compose logs -f

./run-claude-code-router.sh

This will:
- Start the configured inference server (llama.cpp, vLLM, SGLang, or MAX)
- Configure Claude Code Router with local and remote model providers
- Start the router via screen
./run-vibe-kanban.sh

This will:
- Configure Vibe Kanban to use Claude Code Router
- Start Vibe Kanban via screen
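
Re-running the quick start stops at the network step, because docker network create returns an error when spark-network already exists. A minimal idempotent sketch is shown here; the function name ensure_network and the DOCKER_BIN override are hypothetical, introduced only for illustration and not part of this repository's scripts.

```shell
#!/usr/bin/env bash
# Create a Docker network only if it does not already exist.
# DOCKER_BIN is a hypothetical override (defaults to "docker") so the
# function can be exercised with a stub instead of a real Docker daemon.
ensure_network() {
  local name="$1"
  local docker_bin="${DOCKER_BIN:-docker}"
  if "$docker_bin" network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    "$docker_bin" network create "$name" >/dev/null && echo "network $name created"
  fi
}
```

Calling ensure_network spark-network before docker compose up -d should make the setup safe to repeat.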
| Service | Port | Description |
|---|---|---|
| llama.cpp | 8090 | llama.cpp server with INI-based model config |
| vllm | 30001 | vLLM with fastsafetensors support |
| sglang | 30000 | SGLang inference server |
| max-inference | 8100 | MAX inference server |
Models are configured in llama.cpp.models.ini:
- mradermacher/Nemotron-Cascade-2-30B - 60-70t/s generation, 2900t/s context parsing
- unsloth/Qwen3.5-35B - 49-51t/s generation, 1800-2500t/s context parsing
- noctrex/Qwen3.5-35B (MXFP4) - 39-50t/s generation, 1500-2000t/s context parsing
- Exil01/Qwen3.5-35B-1M - 1M context support with YaRN scaling
- noctrex/Qwen3-Next-80B - 1M context support
- unsloth/GLM-4.7-Flash-30B - Fast inference model
- unsloth/Qwen3.5-27B - 10t/s generation, 500t/s context parsing
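
The throughput figures above translate into rough wait times by dividing token counts by tokens per second. The sketch below uses the lower-bound Nemotron-Cascade-2-30B numbers from the list; the 100,000-token prompt and 1,000-token response are illustrative assumptions, and estimate_seconds is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough latency estimate: tokens / (tokens per second), rounded up to whole seconds.
estimate_seconds() {
  local tokens="$1" rate="$2"
  echo $(( (tokens + rate - 1) / rate ))
}

# Assumed workload: 100,000-token prompt at ~2900 t/s context parsing,
# followed by 1,000 generated tokens at ~60 t/s.
prompt_s=$(estimate_seconds 100000 2900)
gen_s=$(estimate_seconds 1000 60)
echo "prompt: ~${prompt_s}s, generation: ~${gen_s}s"
# prints: prompt: ~35s, generation: ~17s
```

These are back-of-the-envelope numbers only; real latency also depends on batch size, context length, and cache state.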
- seer-marketing (8889) - MX Seer marketing tools
- aider-desk (24337) - Aider IDE integration
- b4 - Network tunneling tool
Create a .env file in the project root:
# Inference server selection
AGENT_INFERENCE_SERVER=llama.cpp # llama.cpp, vllm, sglang, max-inference, openrouter, z-ai
AGENT_MAIN_MODEL=noctrex/Qwen3.5-35B
AGENT_BACKGROUND_MODEL=noctrex/Qwen3.5-35B
# Model paths
VLLM_MODEL=Qwen/Qwen3.5-35B-A3B-FP8
SGLANG_MODEL=Qwen/Qwen3.5-27B-FP8
# API keys
OPENROUTER_API_KEY=your_openrouter_api_key
DEEPSEEK_API_KEY=your_deepseek_api_key
ZAI_API_KEY=your_zai_api_key
HF_TOKEN=your_huggingface_token
# GitHub (for Vibe Kanban)
GITHUB_USER=your_github_username
GITHUB_EMAIL=your_github_email
# Port overrides
VIBE_KANBAN_PORT=8888
SEER_PORT=8889
AIDER_DESK_PORT=24337

The router supports multiple providers with intelligent routing:
- Local models (llama.cpp, vLLM, SGLang, MAX) as primary
- OpenRouter for fallback and diverse model access
- DeepSeek for specific tasks
- Z.AI for GLM models
Router config is generated in ~/.claude-code-router/config.json.
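
Because the router depends on values from the .env file, it can help to check that file before starting anything. The following is a minimal sketch: the function names are hypothetical, and the mapping from inference server to required variables is an assumption made for illustration, not taken from this repository's scripts.

```shell
#!/usr/bin/env bash
# Variables assumed to be needed for each AGENT_INFERENCE_SERVER value.
# This mapping is an illustrative assumption; adjust it to the real scripts.
required_vars_for() {
  case "$1" in
    openrouter) echo "OPENROUTER_API_KEY" ;;
    z-ai)       echo "ZAI_API_KEY" ;;
    vllm)       echo "VLLM_MODEL" ;;
    sglang)     echo "SGLANG_MODEL" ;;
    *)          echo "" ;;
  esac
}

# Print "ok (<server>)" or "missing: VAR ..." and return non-zero if any
# required variable is absent or empty in the given .env file.
check_env_file() {
  local env_file="$1" server var missing=""
  # Read the server name, dropping any trailing inline comment
  server=$(sed -n 's/^AGENT_INFERENCE_SERVER=\([^ #]*\).*/\1/p' "$env_file" | head -n 1)
  for var in AGENT_MAIN_MODEL AGENT_BACKGROUND_MODEL $(required_vars_for "$server"); do
    # A variable counts as set when its line is VAR=<at least one character>
    grep -Eq "^${var}=.+" "$env_file" || missing="$missing $var"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok ($server)"
}
```

Running check_env_file .env from the project root before run-claude-code-router.sh would surface a missing key early rather than as a failed request later.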
- NVIDIA GB10 DGX Spark (80GB VRAM)
- Docker and Docker Compose with NVIDIA Container Toolkit
- Bun JavaScript runtime (installed automatically)
- At least 64GB RAM recommended (128GB+ for large models)
| Service | URL |
|---|---|
| llama.cpp | http://127.0.0.1:8090 |
| vLLM | http://127.0.0.1:30001 |
| SGLang | http://127.0.0.1:30000 |
| MAX | http://127.0.0.1:8100 |
| Vibe Kanban | http://127.0.0.1:8888 |
| Seer | http://127.0.0.1:8889 |
| Aider Desk | http://127.0.0.1:24337 |
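
To see which of the endpoints above are actually accepting connections, a plain TCP probe is enough and avoids assuming any particular health-check path. This sketch relies on bash's /dev/tcp redirection; service_url and is_listening are hypothetical helper names, and the short service keys are chosen here for illustration.

```shell
#!/usr/bin/env bash
# Map a service key to its local URL, using the ports from the table above.
service_url() {
  case "$1" in
    llama.cpp)   echo "http://127.0.0.1:8090" ;;
    vllm)        echo "http://127.0.0.1:30001" ;;
    sglang)      echo "http://127.0.0.1:30000" ;;
    max)         echo "http://127.0.0.1:8100" ;;
    vibe-kanban) echo "http://127.0.0.1:8888" ;;
    seer)        echo "http://127.0.0.1:8889" ;;
    aider-desk)  echo "http://127.0.0.1:24337" ;;
    *)           return 1 ;;
  esac
}

# Return 0 if something accepts TCP connections on the URL's host and port.
# The connection is opened in a subshell so the descriptor closes on exit.
is_listening() {
  local hostport="${1#http://}"
  local host="${hostport%%:*}" port="${hostport##*:}"
  (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null
}

for svc in llama.cpp vllm sglang max vibe-kanban seer aider-desk; do
  url=$(service_url "$svc")
  if is_listening "$url"; then
    echo "$svc up at $url"
  else
    echo "$svc not reachable at $url"
  fi
done
```

A successful probe only shows that the port is open; it does not confirm that a model has finished loading.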
- run-claude-code-router.sh - Start Claude Code Router
- run-vibe-kanban.sh - Start Vibe Kanban
- download-hf-models.sh - Download models from Hugging Face
- build-vllm-cuda13-docker.sh - Build vLLM Docker image
- build-llama.cpp-cuda13-docker.sh - Build llama.cpp Docker image
- install-code-agents.sh - Install code agent tools
MIT License