Fast LLM inference on Google Colab Tesla T4 GPUs. CUDA 12 binaries bundled. One-step installation, instant import.
llcuda v2.0.6 is a production-ready CUDA inference backend designed exclusively for Tesla T4 GPUs (the standard Google Colab GPU). It provides:
- ✅ Bundled CUDA 12 Binaries (~270 MB) - no runtime downloads
- ✅ Native Tensor API - PyTorch-style GPU operations with custom CUDA kernels
- ✅ Tensor Core Optimization - SM 7.5 targeting for T4 maximum performance
- ✅ FlashAttention Support - 2-3x faster attention for long contexts
- ✅ GGUF Model Support - Compatible with llama.cpp models
- ✅ Unsloth Integration - Direct loading of NF4-quantized fine-tuned models
One-step install:

```bash
pip install git+https://github.com/waqasm86/llcuda.git
```

What happens:
- Installs Python package from GitHub
- CUDA binaries (266 MB) auto-download from GitHub Releases on first import
- One-time setup, cached for future use
Requirements:
- Python 3.11+
- Google Colab with Tesla T4 GPU (SM 7.5)
- CUDA 12.x runtime (pre-installed in Colab)
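Before installing, a quick environment check using only the standard library can confirm the requirements above. This is a sketch and assumes `nvidia-smi` is on the PATH (it is in Colab GPU runtimes):

```python
import subprocess
import sys

# llcuda requires Python 3.11+
assert sys.version_info >= (3, 11), f"Python 3.11+ required, found {sys.version.split()[0]}"

# Ask the NVIDIA driver which GPU is attached; Colab GPU runtimes report "Tesla T4"
gpu_name = subprocess.run(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()

print(f"Python {sys.version.split()[0]}, GPU: {gpu_name}")
if "T4" not in gpu_name:
    print("Warning: llcuda v2.0.6 targets Tesla T4 only.")
```

Once installed, the import below confirms that llcuda itself sees the GPU: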
```python
import llcuda
from llcuda.core import get_device_properties

props = get_device_properties(0)
print(f"GPU: {props.name}")
print(f"Compute: SM {props.compute_capability_major}.{props.compute_capability_minor}")
```
Run inference on a GGUF model:

```python
import llcuda

# HTTP Server API with GGUF models
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)

result = engine.infer("What is artificial intelligence?", max_tokens=100)
print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tokens/sec")
```
The native Tensor API provides PyTorch-style operations directly on the GPU:

```python
from llcuda.core import Tensor, DType

# Create tensors on GPU
A = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
B = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
# Matrix multiplication (uses Tensor Cores on T4)
C = A @ B
print(f"Result shape: {C.shape}")| Model | Quantization | Speed (tok/s) | VRAM | Latency | Status |
|---|---|---|---|---|---|
| Gemma 3-1B | Q4_K_M | 134 | 1.2 GB | ~690 ms | ✅ Verified |
| Llama 3.2-3B | Q4_K_M | ~30 | 2.0 GB | - | Estimated |
| Qwen 2.5-7B | Q4_K_M | ~18 | 5.0 GB | - | Estimated |
| Llama 3.1-8B | Q4_K_M | ~15 | 5.5 GB | - | Estimated |
✅ Verified Performance: Gemma 3-1B achieves 134 tok/s on Tesla T4 with Q4_K_M quantization (see executed notebook).
Note: FlashAttention provides 2-3x speedup for contexts > 2048 tokens.
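To check the long-context behaviour on your own workload, you can time a single long prompt. The sketch below is illustrative and relies only on the `infer()` result fields already shown above:

```python
import time
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)

# Build a prompt well past the ~2048-token range where FlashAttention matters most
long_prompt = "Summarize the following notes:\n" + ("The quick brown fox jumps over the lazy dog. " * 400)

start = time.perf_counter()
result = engine.infer(long_prompt, max_tokens=128)
elapsed = time.perf_counter() - start

print(f"Generation speed: {result.tokens_per_sec:.1f} tok/s")
print(f"Wall-clock time:  {elapsed:.2f} s")
```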
- Version: 2.0.6
- Release Date: January 8, 2026
- Target GPU: Tesla T4 ONLY (SM 7.5)
- CUDA Version: 12.x
- Python: 3.11+
- Platform: Google Colab (primary), compatible Linux with T4
llcuda v2.0.6 works exclusively on Tesla T4 GPU.
✅ Supported:
- Google Colab Tesla T4
- On-premise Tesla T4 with CUDA 12.x
❌ Not Supported:
- A100, H100, L4, RTX GPUs
- Older Tesla GPUs (K80, P100)
- CPU-only systems
For other GPUs, use llcuda v1.2.2 (less optimized).
All binaries are included in the PyPI package - no runtime downloads:
- llama-server - Inference server for GGUF models
- llama-cli - Command-line interface
- libllama.so - Llama core library with CUDA support
- libggml-cuda.so - GGML CUDA kernels with FlashAttention
- libggml-base.so - GGML base library
- libggml-cpu.so - GGML CPU fallback
- libmtmd.so - Multimodal support library (llama.cpp mtmd)
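To confirm the binaries listed above are present on a given machine, a small check like the following can be used. The ~/.cache/llcuda location matches the manual-install instructions later in this README and is an assumption about your setup; a wheel install may keep the files inside the installed package instead:

```python
from pathlib import Path

EXPECTED = [
    "llama-server", "llama-cli",
    "libllama.so", "libggml-cuda.so", "libggml-base.so",
    "libggml-cpu.so", "libmtmd.so",
]

# Assumed cache location; adjust if your binaries live elsewhere
cache_dir = Path.home() / ".cache" / "llcuda"
found = {p.name for p in cache_dir.rglob("*")} if cache_dir.exists() else set()

for name in EXPECTED:
    print(f"{name:20s} {'ok' if name in found else 'missing'}")
```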
- ✅ FlashAttention (GGML_CUDA_FA=ON)
- ✅ CUDA Graphs (GGML_CUDA_GRAPHS=ON)
- ✅ All quantization types (INT4, INT8, FP16)
- ✅ SM 7.5 code generation (Tesla T4 optimized)
- ✅ Tensor Cores support
```python
from llcuda.core import get_device_count, get_device_properties

# Get GPU count
num_gpus = get_device_count()

# Get device info
props = get_device_properties(0)
print(f"Device: {props.name}")
print(f"Compute: SM {props.compute_capability_major}.{props.compute_capability_minor}")
```
Tensor operations (Tensor Cores on T4):

```python
from llcuda.core import Tensor, DType

# Create tensors on GPU
A = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
B = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
# Matrix multiplication with Tensor Cores
C = A @ B
```

Installation options:

Install from GitHub:

```bash
pip install git+https://github.com/waqasm86/llcuda.git
```

Or install the prebuilt wheel from GitHub Releases:

```bash
pip install https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-2.0.6-py3-none-any.whl
```

Or install from source:

```bash
git clone https://github.com/waqasm86/llcuda.git
cd llcuda
pip install -e .
```

📖 Full installation guide: GITHUB_INSTALL_GUIDE.md
- GitHub Repository: https://github.com/waqasm86/llcuda
- Releases: https://github.com/waqasm86/llcuda/releases
- Installation Guide: GITHUB_INSTALL_GUIDE.md
- Issues: https://github.com/waqasm86/llcuda/issues
- llama.cpp - Core inference engine
- Unsloth - Efficient fine-tuning
- FlashAttention - Optimized attention kernels
MIT License - see LICENSE file
If you use llcuda in your research, please cite:
```bibtex
@software{llcuda2024,
  author = {Waqas Muhammad},
  title = {llcuda: CUDA Inference Backend for Unsloth},
  year = {2024},
  url = {https://github.com/waqasm86/llcuda}
}
```

*llcuda v2.0.6 | Tesla T4 Optimized | CUDA 12 | Google Colab Ready*
- cuBLAS matmul
- Bootstrap refactor for T4-only
- GGUF parser implementation
- Model loader for GGUF → Tensor
- Custom FA2 CUDA kernels
- Long context optimization
- NF4 quantization kernels
- Direct Unsloth model loading
- model.save_pretrained_llcuda() export
```
❌ INCOMPATIBLE GPU DETECTED
Your GPU is not Tesla T4
Required: Tesla T4 (SM 7.5)
```
llcuda v2.0 requires a Tesla T4 GPU.
Solution: use Google Colab, where Tesla T4 is the standard GPU runtime.
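If you are not sure whether your runtime qualifies, the device-properties API shown earlier works as a pre-flight check. This is a sketch built only from the functions and attributes used in the examples above:

```python
from llcuda.core import get_device_count, get_device_properties

if get_device_count() == 0:
    raise SystemExit("No CUDA device found - llcuda v2.0.6 requires a Tesla T4.")

props = get_device_properties(0)
sm = (props.compute_capability_major, props.compute_capability_minor)

if "T4" in props.name and sm == (7, 5):
    print(f"OK: {props.name} (SM {sm[0]}.{sm[1]}) is supported.")
else:
    print(f"Unsupported GPU: {props.name} (SM {sm[0]}.{sm[1]}).")
    print("Use Google Colab with a Tesla T4, or fall back to llcuda v1.2.2 for other GPUs.")
```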
```bash
# Download T4 binaries manually
wget https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-binaries-cuda12-t4.tar.gz
mkdir -p ~/.cache/llcuda
tar -xzf llcuda-binaries-cuda12-t4.tar.gz -C ~/.cache/llcuda/
```
- Gemma 3-1B + Unsloth Tutorial - Complete guide for llcuda v2.0.6
- ✅ GitHub installation and binary auto-download
- ✅ Loading Gemma 3-1B-IT GGUF from Unsloth
- ✅ Inference examples and batch processing
- ✅ Performance metrics and optimization
- ✅ 134 tok/s on Tesla T4 (verified)
- Gemma 3-1B Executed Example - Live execution output
- ✅ Real Tesla T4 GPU results from Google Colab
- ✅ Complete output with all metrics
- ✅ Demonstrates ~3x the expected throughput (134 tok/s measured vs 45 tok/s expected)
- ✅ Proof of working binary download and model loading
- Build llcuda Binaries - Build CUDA binaries on T4
- Compile llama.cpp with FlashAttention
- Create binary packages for release
- Installation Guide: GITHUB_INSTALL_GUIDE.md
- Release Guide: GITHUB_RELEASE_COMPLETE_GUIDE.md
- GitHub Issues: https://github.com/waqasm86/llcuda/issues
MIT License - see LICENSE file for details.
- Built on llama.cpp
- FlashAttention from Dao et al.
- Designed for Unsloth integration
- PyPI: https://pypi.org/project/llcuda/
- GitHub: https://github.com/waqasm86/llcuda
- Unsloth: https://github.com/unslothai/unsloth
Version: 2.0.6 | Target GPU: Tesla T4 ONLY (SM 7.5) | Platform: Google Colab | License: MIT