This project explores how to accelerate LLaMA-3.2-3B-Instruct inference on a single NVIDIA T4 GPU while keeping perplexity at or below 11.5.
The optimization pipeline includes:
- LoRA fine-tuning on the 3B teacher model
- Knowledge distillation from 3B to 1B using GKD
- LoRA fine-tuning on the distilled 1B student model
- Quantization using HQQ with hybrid 4/8-bit precision
- Inference backend optimization with `torch.compile` and `gemlite` (a sketch of the last two steps appears after this list)
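
As a concrete illustration of the last two steps, the snippet below is a minimal sketch of hybrid 4/8-bit HQQ quantization followed by a gemlite-backed, `torch.compile`d decode path. It is not the repository's `inference.py`: the checkpoint path, the layer-to-bit-width split, and the use of transformers' `HqqConfig` with `dynamic_config` plus hqq's `prepare_for_inference` patching utility are assumptions made for the example.

```python
# Hypothetical sketch of hybrid 4/8-bit HQQ + gemlite + torch.compile (not the repo's inference.py).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "path/to/distilled-1b-lora-merged"  # placeholder: the merged 1B student checkpoint

# Hybrid precision: keep the more sensitive projections at 8-bit, the rest at 4-bit.
# The exact layer split below is illustrative, not the project's actual configuration.
q8 = {"nbits": 8, "group_size": 64}
q4 = {"nbits": 4, "group_size": 64}
quant_config = HqqConfig(dynamic_config={
    "self_attn.o_proj": q8,
    "mlp.down_proj": q8,
    "self_attn.q_proj": q4,
    "self_attn.k_proj": q4,
    "self_attn.v_proj": q4,
    "mlp.gate_proj": q4,
    "mlp.up_proj": q4,
})

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)

# Patch the quantized linear layers to gemlite kernels, then compile the forward pass.
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="gemlite")
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Keeping the output and down-projection layers at 8-bit is a common way to limit the perplexity hit of 4-bit weights; the split used in this project may differ.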
First, clone the repository:
```bash
git clone https://github.com/bonginn/llm-acceleration.git
cd llm-acceleration
```

We strongly recommend using a Conda environment to avoid dependency conflicts and ensure reproducibility:

```bash
conda create -n llama-acc python=3.10 -y
conda activate llama-acc
```

Then install the required packages:
```bash
pip install torch==2.7.0
pip install -r requirements.txt
```
⚠️ Note: `torch` must be installed before `gemlite`, since `gemlite` imports `torch` during installation.
After installing all dependencies, make sure you are logged in to Hugging Face to access the LLaMA models:
```bash
huggingface-cli login
```

You will be prompted for an access token; you can generate one at Hugging Face.
To reproduce the results, run inference:

```bash
python3 inference.py
```

If you want to run distillation and/or LoRA fine-tuning, use the following commands:

```bash
python3 distillation.py
python3 lora.py
```

Distillation takes ~10 hours and LoRA fine-tuning ~1.5 hours on a single A100 40GB.
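
For orientation, the sketch below shows what a GKD setup can look like with TRL's `GKDTrainer`. It is not the repository's `distillation.py`: the dataset name, hyperparameters, and batch sizes are placeholders, and it assumes a recent TRL release plus a conversational dataset with a `messages` column.

```python
# Hypothetical 3B -> 1B GKD sketch (placeholders throughout; not the repo's distillation.py).
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import GKDConfig, GKDTrainer

teacher_id = "meta-llama/Llama-3.2-3B-Instruct"  # teacher (LoRA weights merged beforehand)
student_id = "meta-llama/Llama-3.2-1B-Instruct"  # student to distill into

tokenizer = AutoTokenizer.from_pretrained(student_id)

# Any conversational dataset with a "messages" column works here (placeholder name).
train_dataset = load_dataset("your-org/your-chat-dataset", split="train")

args = GKDConfig(
    output_dir="distilled-1b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    lmbda=0.5,         # fraction of on-policy (student-generated) batches
    beta=0.5,          # interpolation coefficient of the generalized JSD loss
    temperature=1.0,
    max_new_tokens=128,
)

trainer = GKDTrainer(
    model=student_id,
    teacher_model=teacher_id,
    args=args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("distilled-1b")
```

The LoRA fine-tuning steps can be implemented with a standard PEFT `LoraConfig` / `get_peft_model` setup; the actual scripts in this repository may differ in detail.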
| Model Configuration | Perplexity | Throughput (tokens/s) |
|---|---|---|
| 3B Base | 11.05 | 26.2 |
| 3B + LoRA | 9.67 | 26.2 |
| Distilled 1B | 11.76 | 67.5 |
| Distilled 1B + LoRA | 11.42 | 67.5 |
| Distilled 1B + LoRA + HQQ | 11.49 | 125.0 |
Note: All perplexity and throughput measurements were taken on a single NVIDIA T4 GPU. Throughput is measured end-to-end (e2e), including both the prefill and decoding phases.