CUDA inference backend for Unsloth - Tesla T4 optimized with FlashAttention, Tensor Cores, and native Python API

llcuda v2.0.6: CUDA Inference Backend for Unsloth on Tesla T4


Fast LLM inference on Google Colab Tesla T4 GPUs. CUDA 12 binaries bundled. One-step installation, instant import.


📦 What is llcuda v2.0.6?

llcuda v2.0.6 is a production-ready CUDA inference backend designed exclusively for Tesla T4 GPUs (the standard Google Colab GPU). It provides:

  • Bundled CUDA 12 Binaries (~270 MB) - fetched once from GitHub Releases on first import, then cached
  • Native Tensor API - PyTorch-style GPU operations with custom CUDA kernels
  • Tensor Core Optimization - SM 7.5 targeting for T4 maximum performance
  • FlashAttention Support - 2-3x faster attention for long contexts
  • GGUF Model Support - Compatible with llama.cpp models
  • Unsloth Integration - Direct loading of NF4-quantized fine-tuned models

🚀 Quick Start on Google Colab

1. Install

pip install git+https://github.com/waqasm86/llcuda.git

What happens:

  • Installs Python package from GitHub
  • CUDA binaries (266 MB) auto-download from GitHub Releases on first import
  • One-time setup, cached for future use
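
To confirm the one-time download succeeded, you can list the cached binaries. This is a minimal sketch, assuming the cache lives under ~/.cache/llcuda (the location used in the Troubleshooting section below):

import pathlib

# Binaries are cached here after the first import (see Troubleshooting below)
cache = pathlib.Path.home() / ".cache" / "llcuda"
for f in sorted(cache.rglob("*")):
    if f.suffix == ".so" or f.name.startswith("llama-"):
        print(f.name)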

Requirements:

  • Python 3.11+
  • Google Colab with Tesla T4 GPU (SM 7.5)
  • CUDA 12.x runtime (pre-installed in Colab)

2. Verify GPU

import llcuda
from llcuda.core import get_device_properties

props = get_device_properties(0)
print(f"GPU: {props.name}")
print(f"Compute: SM {props.compute_capability_major}.{props.compute_capability_minor}")

3. Run Inference

import llcuda

# HTTP Server API with GGUF models
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)
result = engine.infer("What is artificial intelligence?", max_tokens=100)
print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tokens/sec")

4. Custom Tensor Operations

from llcuda.core import Tensor, DType

# Create tensors on GPU
A = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
B = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)

# Matrix multiplication (uses Tensor Cores on T4)
C = A @ B
print(f"Result shape: {C.shape}")

🎮 Performance Benchmarks (Tesla T4, CUDA 12)

Model          Quantization   Speed (tok/s)   VRAM     Latency   Status
Gemma 3-1B     Q4_K_M         134             1.2 GB   ~690 ms   ✅ Verified
Llama 3.2-3B   Q4_K_M         ~30             2.0 GB   -         Estimated
Qwen 2.5-7B    Q4_K_M         ~18             5.0 GB   -         Estimated
Llama 3.1-8B   Q4_K_M         ~15             5.5 GB   -         Estimated

✅ Verified Performance: Gemma 3-1B achieves 134 tok/s on Tesla T4 with Q4_K_M quantization (see executed notebook).

Note: FlashAttention provides 2-3x speedup for contexts > 2048 tokens.
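
A minimal sketch to reproduce a throughput number yourself, using the InferenceEngine API from the Quick Start (the prompt and run count are arbitrary choices, not part of the official benchmark):

import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", silent=True)

# Average a few runs to smooth out warm-up effects
speeds = []
for _ in range(3):
    result = engine.infer("Summarize the history of GPUs.", max_tokens=100)
    speeds.append(result.tokens_per_sec)
print(f"Mean throughput: {sum(speeds) / len(speeds):.1f} tok/s")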


📋 Version Info

  • Version: 2.0.6
  • Release Date: January 8, 2026
  • Target GPU: Tesla T4 ONLY (SM 7.5)
  • CUDA Version: 12.x
  • Python: 3.11+
  • Platform: Google Colab (primary), compatible Linux with T4

⚠️ Supported GPU ONLY

llcuda v2.0.6 works exclusively on Tesla T4 GPU.

Supported:

  • Google Colab Tesla T4
  • On-premise Tesla T4 with CUDA 12.x

Not Supported:

  • A100, H100, L4, RTX GPUs
  • Older Tesla GPUs (K80, P100)
  • CPU-only systems

For other GPUs, use llcuda v1.2.2 (less optimized).


📦 What's Included in v2.0.6

Bundled CUDA 12 Binaries (~270 MB)

All binaries ship with the release and are fetched once from GitHub Releases on first import - no repeated runtime downloads:

  • llama-server - Inference server for GGUF models
  • llama-cli - Command-line interface
  • libllama.so - Llama core library with CUDA support
  • libggml-cuda.so - GGML CUDA kernels with FlashAttention
  • libggml-base.so - GGML base library
  • libggml-cpu.so - GGML CPU fallback
  • libmtmd.so - Multimodal support library (llama.cpp mtmd)

Features

  • ✅ FlashAttention (GGML_CUDA_FA=ON)
  • ✅ CUDA Graphs (GGML_CUDA_GRAPHS=ON)
  • ✅ All quantization types (INT4, INT8, FP16)
  • ✅ SM 7.5 code generation (Tesla T4 optimized)
  • ✅ Tensor Cores support
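
As a back-of-the-envelope check on the VRAM column in the benchmarks above: Q4_K_M stores roughly 4.5 bits per weight, so weights alone for a 1B-parameter model take about 0.56 GB. A hedged sketch (the bits-per-weight and overhead figures are approximations, not values documented by llcuda):

def estimate_vram_gb(params_billions, bits_per_weight=4.5, overhead_gb=0.5):
    """Weights at ~4.5 bits/weight for Q4_K_M, plus KV cache/runtime overhead."""
    return params_billions * bits_per_weight / 8 + overhead_gb

for name, n in [("Gemma 3-1B", 1.0), ("Llama 3.2-3B", 3.0), ("Llama 3.1-8B", 8.0)]:
    print(f"{name}: ~{estimate_vram_gb(n):.1f} GB")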


🛠️ V2.0 Native API Reference

Device Management

from llcuda.core import get_device_count, get_device_properties

# Get GPU count
num_gpus = get_device_count()

# Get device info
props = get_device_properties(0)
print(f"Device: {props.name}")
print(f"Compute: SM {props.compute_capability_major}.{props.compute_capability_minor}")

Tensor Operations

from llcuda.core import Tensor, DType

# Create tensors on GPU
A = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)
B = Tensor.zeros([2048, 2048], dtype=DType.Float16, device=0)

# Matrix multiplication with Tensor Cores
C = A @ B

📦 Installation Options

Method 1: Direct from GitHub (Recommended)

pip install git+https://github.com/waqasm86/llcuda.git

Method 2: From Release Wheel

pip install https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-2.0.6-py3-none-any.whl

Method 3: Development Install

git clone https://github.com/waqasm86/llcuda.git
cd llcuda
pip install -e .

📖 Full installation guide: GITHUB_INSTALL_GUIDE.md


📄 License

MIT License - see LICENSE file


📝 Citation

If you use llcuda in your research, please cite:

@software{llcuda2024,
  author = {Waqas Muhammad},
  title = {llcuda: CUDA Inference Backend for Unsloth},
  year = {2024},
  url = {https://github.com/waqasm86/llcuda}
}


🗺️ Development Roadmap

Phase 1: Native Tensor API ✅ (Complete)

  • cuBLAS matmul

Phase 2: GGUF Integration 🚧 (In Progress)

  • Bootstrap refactor for T4-only
  • GGUF parser implementation
  • Model loader for GGUF → Tensor

Phase 3: Flash Attention 📅 (Planned)

  • Custom FA2 CUDA kernels
  • Long context optimization

Phase 4: Unsloth Integration 📅 (Planned)

  • NF4 quantization kernels
  • Direct Unsloth model loading
  • model.save_pretrained_llcuda() export

🔧 Troubleshooting

GPU Not Compatible

❌ INCOMPATIBLE GPU DETECTED

Your GPU is not Tesla T4
Required: Tesla T4 (SM 7.5)

llcuda v2.0 requires Tesla T4 GPU.
Compatible environment: Google Colab

Solution: Use Google Colab with Tesla T4

Binary Download Failed

# Download T4 binaries manually
wget https://github.com/waqasm86/llcuda/releases/download/v2.0.6/llcuda-binaries-cuda12-t4.tar.gz
mkdir -p ~/.cache/llcuda
tar -xzf llcuda-binaries-cuda12-t4.tar.gz -C ~/.cache/llcuda/

📚 Tutorials & Notebooks

Google Colab Notebooks

  1. Gemma 3-1B + Unsloth Tutorial - Complete guide for llcuda v2.0.6

    • ✅ GitHub installation and binary auto-download
    • ✅ Loading Gemma 3-1B-IT GGUF from Unsloth
    • ✅ Inference examples and batch processing
    • ✅ Performance metrics and optimization
    • 134 tok/s on Tesla T4 (verified)
  2. Gemma 3-1B Executed Example - Live execution output

    • ✅ Real Tesla T4 GPU results from Google Colab
    • ✅ Complete output with all metrics
    • ✅ Demonstrates ~3x the expected throughput (134 vs 45 tok/s)
    • ✅ Proof of working binary download and model loading
  3. Build llcuda Binaries - Build CUDA binaries on T4

    • Compile llama.cpp with FlashAttention
    • Create binary packages for release






Version: 2.0.6 | Target GPU: Tesla T4 ONLY (SM 7.5) | Platform: Google Colab | License: MIT