vLLM Distillation Playground

This repo contains:

  • Backend: FastAPI + SQLAlchemy + Postgres, OSS teacher + OSS student vLLM clients
  • Frontend: React + Vite + Tailwind, auth, datasets, and a distillation playground
  • Infra: Docker Compose for Postgres and vLLM, plus simple Makefile helpers

What is knowledge distillation?

  • Goal: Train a smaller, cheaper student model to mimic a larger, more capable teacher model.
  • Why:
    • Run models with lower latency and cost in production.
    • Deploy on smaller GPUs or CPUs while retaining most of the teacher’s quality.
  • How (high level):
    • Send prompts to the teacher model and capture its responses.
    • Optionally compare teacher vs. student responses for the same prompt.
    • Use the collected (prompt, teacher_output) pairs, optionally including student_output for comparison, as a supervised training dataset for the student model.

In this repo, the playground focuses on the data collection side of distillation: creating prompt datasets and logging teacher/student responses in a structured way that can be exported for training.
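
For intuition, one collected record could look like the sketch below (Python-style, with hypothetical field names; the actual export schema may differ):

# Illustrative shape of one collected distillation record.
# Field names here are hypothetical, not the repo's exact export schema.
record = {
    "prompt": "Explain knowledge distillation in one paragraph.",
    "teacher_output": "Knowledge distillation trains a smaller student model to...",
    "student_output": "Distillation copies a big model into a small one by...",  # optional
}

# A supervised training set is then just a list of such records, with
# (prompt, teacher_output) serving as the input/target pairs.
dataset = [record]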


How this playground does distillation

At a high level:

  1. Create a project (dataset) from the UI.
  2. Enter prompts in the playground.
  3. For each prompt, the backend:
     • Calls the teacher OSS model (e.g., a larger Mistral/LLaMA variant).
     • Calls the student OSS model served by vLLM via an OpenAI-compatible API (see the sketch after this list).
  4. The backend stores the prompt + both responses in Postgres.
  5. You can replay prompts, iterate on them, and export all data for offline training.
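
Both model calls follow the same OpenAI-compatible chat-completions pattern. A minimal sketch, assuming the openai Python package, the vLLM server from docker-compose.yml on host port 8001, and a placeholder URL and model name for the teacher endpoint (this is not the repo's actual client code):

# Minimal sketch of the teacher/student calls; not the repo's actual client code.
from openai import OpenAI

# Student: the vLLM OpenAI-compatible server from docker-compose.yml (host port 8001).
student = OpenAI(base_url="http://localhost:8001/v1", api_key="unused")

# Teacher: any OpenAI-compatible endpoint serving a larger OSS model (placeholder URL).
teacher = OpenAI(base_url="http://localhost:8002/v1", api_key="unused")

def run_prompt(prompt: str) -> tuple[str, str]:
    """Return (teacher_output, student_output) for one prompt."""
    messages = [{"role": "user", "content": prompt}]
    t = teacher.chat.completions.create(model="your-teacher-model", messages=messages)
    s = student.chat.completions.create(model="mistralai/Mistral-7B-Instruct-v0.2", messages=messages)
    return t.choices[0].message.content, s.choices[0].message.content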

This gives you a repeatable loop:

  1. Design prompts → collect teacher/student data.
  2. Export dataset → train or fine-tune the student.
  3. Update the student model behind vLLM → repeat and compare.

System architecture overview

At a component level:

  • Frontend (React + Vite + Tailwind):
    • Auth flows (/register, /login).
    • Dataset (project) list and management.
    • Distillation playground UI for running teacher vs. student side by side.
  • Backend (FastAPI + SQLAlchemy + Postgres):
    • Auth endpoints (/auth/*).
    • Project and prompt management.
    • Integrations with:
      • An OSS teacher model client.
      • An OSS student model client talking to a vLLM OpenAI-compatible server.
  • Database (Postgres):
    • Persists users, projects, prompts, teacher outputs, and student outputs.
  • Model serving (vLLM):
    • Serves the student OSS model via an OpenAI-compatible HTTP API.

High-level architecture diagram

flowchart LR
    subgraph User
        B[Browser<br/>React + Vite UI]
    end

    subgraph Backend[FastAPI Backend]
        A1[Auth & Users]
        A2[Projects & Datasets]
        A3[Playground API<br/>Prompts & Runs]
        A4[Teacher Client]
        A5[Student Client<br/>vLLM/OpenAI]
    end

    subgraph DB[Postgres]
        D1[(Users)]
        D2[(Projects)]
        D3[(Prompts & Runs)]
    end

    subgraph Models
        T[Teacher OSS Model]
        S[vLLM Server<br/>Student OSS Model]
    end

    B <--> A1
    B <--> A2
    B <--> A3

    A1 <--> D1
    A2 <--> D2
    A3 <--> D3

    A4 --> T
    A5 --> S

    A3 --> A4
    A3 --> A5

Distillation data flow

sequenceDiagram
    participant U as User (Browser)
    participant FE as Frontend
    participant BE as Backend (FastAPI)
    participant T as Teacher Model
    participant S as vLLM Student
    participant DB as Postgres

    U->>FE: Enter prompt in playground
    FE->>BE: POST /projects/{id}/prompts/run
    BE->>T: Generate teacher response
    T-->>BE: Teacher output
    BE->>S: OpenAI-compatible /chat/completions
    S-->>BE: Student output
    BE->>DB: Store {project, prompt, teacher, student}
    BE-->>FE: Return both responses
    FE-->>U: Render side-by-side outputs

You can then export the data from the backend and feed it into your own training pipelines (PyTorch, Hugging Face, etc.).
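
As a sketch, an export could be turned into a Hugging Face dataset for supervised fine-tuning roughly like this (the "prompt" and "teacher" field names are assumptions about the export schema; adjust them to match the real file):

# Sketch: load an exported project file into Hugging Face datasets for SFT.
# The field names below are assumptions; adjust them to the actual export schema.
import json
from datasets import Dataset

with open("project_export.json") as f:
    records = json.load(f)

ds = Dataset.from_list(
    [{"prompt": r["prompt"], "completion": r["teacher"]} for r in records]
)
ds = ds.train_test_split(test_size=0.1)
print(ds)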

Directory layout

  • backend: FastAPI app (backend.main:app), models, schemas, and API routes
  • frontend: React/Vite SPA in frontend/src
  • docker-compose.yml: Local Postgres and vLLM services
  • Makefile: Convenience commands for running services and apps

Prerequisites

  • Python: 3.10+ (for the backend)
  • Node.js: 18+ and npm (for the frontend)
  • Docker + Docker Compose (for Postgres and vLLM)
  • GPU + recent NVIDIA drivers (recommended for the vLLM container)

Backend setup (FastAPI)

1. Create a virtualenv and install dependencies

cd backend
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -e .

This uses pyproject.toml to install FastAPI, SQLAlchemy, psycopg, etc.

2. Configure environment

Copy the example env file and adjust values as needed:

cd backend
cp .env.example .env

Key fields:

  • DATABASE_URL: Defaults to postgresql+psycopg://postgres:postgres@localhost:5432/oss_distiller
  • TEACHER_OSS_MODEL_NAME: Larger OSS model used as the teacher
  • OSS_MODEL_NAME and VLLM_BASE_URL: Student OSS model and OpenAI-compatible server URL
  • JWT_SECRET_KEY: Change this in non-dev environments
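
For reference, a filled-in backend/.env might look like the following. The DATABASE_URL default and VLLM_BASE_URL come from this README, and the student model name matches the docker-compose vLLM command; the teacher model name is a placeholder you should replace:

# Example backend/.env (teacher model name is a placeholder)
DATABASE_URL=postgresql+psycopg://postgres:postgres@localhost:5432/oss_distiller
TEACHER_OSS_MODEL_NAME=your-teacher-model
OSS_MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2
VLLM_BASE_URL=http://localhost:8001
JWT_SECRET_KEY=change-me-outside-dev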

3. Run the backend

From the repo root:

cd backend
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

The backend will:

  • Connect to Postgres using DATABASE_URL
  • Initialize tables on startup
  • Expose OpenAPI docs at http://localhost:8000/docs

Frontend setup (React + Vite)

1. Configure API base URL

From the frontend directory:

cd frontend
cp .env.example .env

The default points to the local backend:

  • VITE_API_BASE_URL: http://localhost:8000

2. Install dependencies and run dev server

cd frontend
npm install
npm run dev

The app will start at http://localhost:5173 and talk to the backend at http://localhost:8000.


Running Postgres and vLLM via Docker

1. Start services

From the repo root:

make services-up

This is equivalent to:

docker compose up -d db vllm

Services:

  • db:
    • Image: postgres:16
    • Port: 5432 exposed on the host
    • Default credentials match DATABASE_URL in backend/.env.example
  • vllm:
    • Image: vllm/vllm-openai:latest
    • Command: --model mistralai/Mistral-7B-Instruct-v0.2
    • Port: 8001 on host (mapped to container 8000)
    • Backend expects VLLM_BASE_URL="http://localhost:8001"

Note: The vLLM container expects a GPU and recent NVIDIA drivers. If you do not have a GPU, you can:

  • Comment out the vllm service in docker-compose.yml, and
  • Point VLLM_BASE_URL in backend/.env to any other OpenAI-compatible endpoint you control.
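
For reference, the vllm service in docker-compose.yml corresponds roughly to the sketch below, reconstructed from the values above (the GPU reservation block is standard Compose syntax, not necessarily a verbatim copy of the file):

# Sketch of the vllm service entry (reconstructed, not a verbatim copy)
vllm:
  image: vllm/vllm-openai:latest
  command: --model mistralai/Mistral-7B-Instruct-v0.2
  ports:
    - "8001:8000"  # host 8001 -> container 8000, hence VLLM_BASE_URL=http://localhost:8001
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]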

2. Stop services

make services-down

or directly:

docker compose down

One-liner workflow

From the repo root, in three terminals:

  1. Start infra (Postgres + vLLM):
 make services-up
  2. Run the backend:
 make backend
  3. Run the frontend:
 make frontend

Then open http://localhost:5173 in your browser.


Authentication & playground flow

  • Register and log in via the frontend (/register and /login), which hits:
    • POST /auth/register
    • POST /auth/login
    • GET /auth/me
  • Create and use datasets (projects) from the UI:
    • / lists datasets (projects) and lets you:
      • Select them for training
      • Open the playground
      • Download an export as JSON
  • /projects/:projectId is the distillation playground:
    • Enter a prompt, then run an OSS teacher model and an OSS vLLM student model side by side.
    • Each run is persisted as a prompt plus two model responses.
    • You can re-run previous prompts and download the full dataset for that project.

Backend dataset exports are served from:

  • GET /prompts/export/project/{project_id}

The frontend’s dataset download buttons call this endpoint and save a nicely formatted JSON file for training.
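
If you want to pull an export outside the UI, here is a minimal sketch with Python requests (assuming login returns a Bearer JWT in an access_token field; both details are assumptions about the auth flow):

# Sketch: download a project's dataset export via the backend API.
# Assumes POST /auth/login returns {"access_token": ...}; adjust to the real response.
import requests

BASE = "http://localhost:8000"

login = requests.post(
    f"{BASE}/auth/login",
    json={"email": "you@example.com", "password": "secret"},  # placeholder credentials
)
token = login.json()["access_token"]

resp = requests.get(
    f"{BASE}/prompts/export/project/1",  # example project_id
    headers={"Authorization": f"Bearer {token}"},
)
with open("project_export.json", "wb") as f:
    f.write(resp.content)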
