Unofficial llama.cpp build: run community-quantized models before they hit upstream
Uno.cpp is a ready-to-use desktop application that lets you run GGUF models locally – especially models with new architectures that haven't been merged into the official llama.cpp yet.
The problem: When a new model architecture is quantized to GGUF, the architecture support PR can take weeks to get reviewed and merged into llama.cpp. During this time, users have no easy way to run these models β they'd need to manually clone a fork, set up a C++ build environment, compile from source, and use the command line.
The solution: Uno.cpp packages a custom-built llama-server (with the new architecture support baked in) into a simple Windows installer with a GUI launcher. Download, install, pick your model file, chat. That's it.
| Model | Architecture | Parameters | HuggingFace | Upstream PR |
|---|---|---|---|---|
| Sarvam-30B | `sarvam_moe` | ~30B (MoE) | GGUF Downloads | #20275 |
Sarvam-30B is an open-source Mixture-of-Experts model by Sarvam AI with 128 routed experts (top-6 routing) plus one shared expert, a 262K-token vocabulary, and strong multilingual capabilities across Indian languages.
More models will be added as new architectures are quantized.
Download the latest `Unocpp-Setup-v*.exe` from Releases and run the installer.
Grab a GGUF model file from HuggingFace (a command-line download sketch follows the table):
| Quant | Size | VRAM Needed | Best For |
|---|---|---|---|
| Q4_K_M | ~19 GB | ~20 GB | Most users with 24GB+ VRAM |
| Q6_K | ~26 GB | ~27 GB | Higher quality, 32GB VRAM |
| Q8_0 | ~34 GB | ~35 GB | Best quantized quality |
| BF16 | ~64 GB | ~65 GB | Full precision, multi-GPU |
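If you prefer the command line, the file can also be fetched with the Hugging Face CLI. A minimal sketch: the repo id and filename below are placeholders, so substitute the ones linked in the model table above.

```bash
# Requires: pip install -U huggingface_hub
# NOTE: repo id and filename are placeholders; use the actual GGUF repo
huggingface-cli download your-org/Sarvam-30B-GGUF \
  sarvam-30b-Q4_K_M.gguf --local-dir ./models
```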
Open Uno.cpp from your desktop → select the `.gguf` file → your browser opens with a chat UI. Done.
- No terminal required – GUI launcher with model file picker and settings
- Remembers your config – model path, GPU layers, context size saved between sessions
- Built-in chat UI – powered by llama.cpp's web interface, opens in your browser
- GPU accelerated – CUDA support for NVIDIA GPUs, configurable layer offloading
- Adjustable settings – GPU layers, context size (2K–32K), and port configuration
- Fully local & private – no cloud, no API keys, everything runs on your machine
- OpenAI-compatible API – `llama-server` exposes `/v1/chat/completions`, so you can connect other tools (see the example below)
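As a sketch of what that enables: once the server is running, any HTTP client can talk to it. For example, with curl from a POSIX-style shell (Git Bash works on Windows), assuming the default port 8080; the `model` value is a placeholder, since llama-server answers with whatever GGUF is currently loaded.

```bash
# Chat completion against the local server; no API key needed
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Say hello in Hindi."}]
  }'
```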
| Component | Minimum | Recommended |
|---|---|---|
| OS | Windows 10 64-bit | Windows 11 |
| GPU | NVIDIA with 20+ GB VRAM | RTX 3090 / 4090 / 5090 |
| CUDA | CUDA 12.x drivers | Latest NVIDIA drivers |
| RAM | 16 GB | 32 GB |
| Disk | Space for model files | SSD for faster loading |
No NVIDIA GPU? Set GPU Layers to `0` in the launcher to run on CPU only (significantly slower, but it works).
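The launcher settings correspond to standard llama-server flags, so the same configuration can be reproduced from a terminal if needed. A sketch with a placeholder model path:

```bash
# CPU-only inference: -ngl 0 offloads zero layers to the GPU
# (model path is a placeholder; point -m at your actual GGUF file)
llama-server.exe -m models/sarvam-30b-Q4_K_M.gguf -ngl 0 -c 4096 --port 8080
```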
If you prefer to build from source instead of using the installer:
- Visual Studio 2022 with C++ workload
- CMake 3.21+
- NVIDIA CUDA Toolkit 12.x (for GPU support)
- Git
```bash
# Clone the repo
git clone https://github.com/sumitchatterjee13/uno.cpp.git
cd uno.cpp

# Create build directory
mkdir build && cd build

# Configure with CUDA
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release

# Build the server
cmake --build . --config Release --target llama-server
```

Tip: Use the Developer Command Prompt for VS 2022 (not regular cmd) so CMake and the compiler are in PATH.
The binary will be at `build/bin/Release/llama-server.exe`. Run it directly:

```bash
llama-server.exe -m path/to/model.gguf -ngl 99 -c 4096
```

Then open http://127.0.0.1:8080 in your browser.
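Before opening the browser you can verify that the model has finished loading; llama-server exposes a `/health` endpoint that is handy as a quick smoke test:

```bash
# Returns an error while the model is still loading, and an OK status once ready
curl http://127.0.0.1:8080/health
```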
Uno.cpp is a fork of llama.cpp with added architecture support for models that haven't been merged upstream yet. The app consists of:
- `llama-server.exe` – custom-built llama.cpp inference server with new architecture support
- A GUI launcher – pick your model, configure GPU layers and context size, hit launch
- Built-in chat UI – opens in your browser at http://127.0.0.1:8080
The server also exposes an OpenAI-compatible API, so you can connect tools like Open WebUI, SillyTavern, or any OpenAI-compatible client.
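For most OpenAI-SDK-based tools, connecting is just a matter of pointing the client at the local base URL. A sketch, assuming the client honors the standard `OPENAI_BASE_URL` / `OPENAI_API_KEY` environment variables:

```bash
# Redirect OpenAI-compatible clients to the local server
export OPENAI_BASE_URL="http://127.0.0.1:8080/v1"
export OPENAI_API_KEY="sk-local"   # dummy value; many clients just require it to be non-empty

# Sanity check: list what the server is serving
curl http://127.0.0.1:8080/v1/models
```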
Q: Is this a fork of llama.cpp? Yes. Uno.cpp is llama.cpp with additional architecture support patches. Once those patches are merged upstream, Uno.cpp will sync with the latest llama.cpp and focus on the next set of unsupported models.
Q: Will my existing models work with this? If your model uses a standard llama.cpp architecture (LLaMA, Mistral, Qwen, Gemma, Phi, etc.), it will work out of the box. Uno.cpp adds support on top β it doesn't remove anything.
Q: Is it safe to install? The code is fully open source. The installer bundles only `llama-server.exe`, its runtime DLLs, and launcher scripts. You can verify everything by building from source.
Q: macOS / Linux support? Currently Windows only. macOS and Linux users can build from source β the llama.cpp codebase supports all platforms.
- llama.cpp by Georgi Gerganov and contributors – the foundation this project is built on
- Sarvam AI – creators of the Sarvam-30B model
This project is licensed under the MIT License, same as llama.cpp.
Built by Sumit Chatterjee – because nobody should have to wait for a PR merge to run a model.
