This project explores how to accelerate LLaMA-3.2-3B-Instruct inference on a single NVIDIA T4 GPU while keeping perplexity at or below 11.5.
The optimization pipeline includes:
- LoRA fine-tuning on the 3B teacher model
- Knowledge distillation from 3B to 1B using GKD
- LoRA fine-tuning on the distilled 1B student model
- Quantization using HQQ with hybrid 4/8-bit precision
- Inference backend optimization with `torch.compile` and `gemlite` (a sketch of the last two steps appears after this list)
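
As a concrete illustration of the last two steps, the snippet below is a minimal sketch of hybrid 4/8-bit HQQ quantization followed by a gemlite-backed, `torch.compile`d decode path. It is not the repository's `inference.py`: the checkpoint path, the layer-to-bit-width split, and the use of transformers' `HqqConfig` with `dynamic_config` plus hqq's `prepare_for_inference` patching utility are assumptions made for the example.

```python
# Hypothetical sketch of hybrid 4/8-bit HQQ + gemlite + torch.compile (not the repo's inference.py).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "path/to/distilled-1b-lora-merged"  # placeholder: the merged 1B student checkpoint

# Hybrid precision: keep the more sensitive projections at 8-bit, the rest at 4-bit.
# The exact layer split below is illustrative, not the project's actual configuration.
q8 = {"nbits": 8, "group_size": 64}
q4 = {"nbits": 4, "group_size": 64}
quant_config = HqqConfig(dynamic_config={
    "self_attn.o_proj": q8,
    "mlp.down_proj": q8,
    "self_attn.q_proj": q4,
    "self_attn.k_proj": q4,
    "self_attn.v_proj": q4,
    "mlp.gate_proj": q4,
    "mlp.up_proj": q4,
})

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)

# Patch the quantized linear layers to gemlite kernels, then compile the forward pass.
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="gemlite")
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Keeping the output and down-projection layers at 8-bit is a common way to limit the perplexity hit of 4-bit weights; the split used in this project may differ.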
First, clone the repository:
```bash
git clone https://github.com/bonginn/llm-acceleration.git
cd llm-acceleration
```

We strongly recommend using a Conda environment to avoid dependency conflicts and ensure reproducibility:

```bash
conda create -n llama-acc python=3.10 -y
conda activate llama-acc
```

Then install the required packages:
```bash
pip install torch==2.7.0
pip install -r requirements.txt
```
⚠️ Note: `torch` must be installed before `gemlite`, since `gemlite` imports `torch` during installation.
After installing all dependencies, make sure you are logged in to Hugging Face to access the LLaMA models:
```bash
huggingface-cli login
```

You will be prompted for an access token; you can generate one at Hugging Face.
To reproduce the results, run inference:

```bash
python3 inference.py
```

If you want to run distillation and/or LoRA fine-tuning, use the following commands:

```bash
python3 distillation.py
python3 lora.py
```

Distillation takes ~10 hours and LoRA fine-tuning ~1.5 hours on a single A100 40GB.
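
For orientation, the sketch below shows what a GKD setup can look like with TRL's `GKDTrainer`. It is not the repository's `distillation.py`: the dataset name, hyperparameters, and batch sizes are placeholders, and it assumes a recent TRL release plus a conversational dataset with a `messages` column.

```python
# Hypothetical 3B -> 1B GKD sketch (placeholders throughout; not the repo's distillation.py).
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import GKDConfig, GKDTrainer

teacher_id = "meta-llama/Llama-3.2-3B-Instruct"  # teacher (LoRA weights merged beforehand)
student_id = "meta-llama/Llama-3.2-1B-Instruct"  # student to distill into

tokenizer = AutoTokenizer.from_pretrained(student_id)

# Any conversational dataset with a "messages" column works here (placeholder name).
train_dataset = load_dataset("your-org/your-chat-dataset", split="train")

args = GKDConfig(
    output_dir="distilled-1b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    lmbda=0.5,         # fraction of on-policy (student-generated) batches
    beta=0.5,          # interpolation coefficient of the generalized JSD loss
    temperature=1.0,
    max_new_tokens=128,
)

trainer = GKDTrainer(
    model=student_id,
    teacher_model=teacher_id,
    args=args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("distilled-1b")
```

The LoRA fine-tuning steps can be implemented with a standard PEFT `LoraConfig` / `get_peft_model` setup; the actual scripts in this repository may differ in detail.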
| Model Configuration | Perplexity | Throughput (tokens/s) |
|---|---|---|
| 3B Base | 11.05 | 26.2 |
| 3B + LoRA | 9.67 | 26.2 |
| Distilled 1B | 11.76 | 67.5 |
| Distilled 1B + LoRA | 11.42 | 67.5 |
| Distilled 1B + LoRA + HQQ | 11.49 | 125.0 |
Note: All perplexity and throughput measurements were taken on a single NVIDIA T4 GPU. Throughput is measured end-to-end (e2e), including both the prefill and decoding phases.