YOLOv11-Pose Pre-training via DINOv3 Distillation 🚀

This project provides a complete pipeline to pre-train the backbone of a custom YOLOv11 pose estimation model using knowledge distillation from a powerful DINOv3 vision foundation model. The entire process is handled by a single, configurable Python script that leverages the lightly-train and ultralytics libraries.

The goal is to transfer the rich, general-purpose visual understanding from the massive DINOv3 "teacher" model to a lightweight, efficient YOLOv11 "student" model. This pre-training step, performed on unlabeled images, gives the YOLO model a significant head start, leading to better performance, faster convergence, and improved data efficiency when you later fine-tune it on your specific (and often limited) labeled dataset.

(Figure: FPS vs. mAP50-95 for the exported models.)

| Model | Format | Processor | Quantization | Box(P) | Box(R) | Box(mAP50) | Box(mAP50-95) | Pose(P) | Pose(R) | Pose(mAP50) | Pose(mAP50-95) | Size (MB) | mAP50-95(P) | Inference (ms/im) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLO11l-pose_distilled | OpenVINO | CPU | INT8 | 0.958 | 0.961 | 0.98 | 0.958 | 0.961 | 0.964 | 0.985 | 0.953 | 26.3 | 0.9528 | 46.3 | 21.6 |
| YOLO11n-pose_distilled | OpenVINO | CPU | INT8 | 0.966 | 0.958 | 0.981 | 0.951 | 0.966 | 0.958 | 0.981 | 0.909 | 3.6 | 0.9091 | 8.05 | 124.25 |
| YOLO11n-pose_distilled | ONNX | GPU | FP16 | 0.972 | 0.95 | 0.981 | 0.952 | 0.972 | 0.95 | 0.981 | 0.91 | 5.5 | 0.9104 | 9.39 | 106.44 |
| YOLO11l-pose_distilled | ONNX | GPU | FP16 | 0.967 | 0.948 | 0.981 | 0.962 | 0.967 | 0.948 | 0.981 | 0.95 | 49.6 | 0.9497 | 18.38 | 54.4 |
| YOLO11l-pose_distilled | TensorRT | GPU | INT8 | 0.967 | 0.945 | 0.98 | 0.953 | 0.967 | 0.945 | 0.98 | 0.946 | 33.7 | 0.9459 | 2.36 | 422.96 |
| YOLO11n-pose_distilled | TensorRT | GPU | INT8 | 0.959 | 0.965 | 0.98 | 0.945 | 0.959 | 0.965 | 0.98 | 0.901 | 9.6 | 0.9013 | 1.64 | 607.92 |
| YOLO11n-pose_distilled | TensorRT | GPU | FP16 | 0.971 | 0.951 | 0.981 | 0.955 | 0.971 | 0.951 | 0.981 | 0.91 | 11.1 | 0.9102 | 1.48 | 675.3 |
| YOLO11l-pose_distilled | TensorRT | GPU | FP16 | 0.967 | 0.949 | 0.981 | 0.961 | 0.967 | 0.949 | 0.981 | 0.949 | 56.2 | 0.9495 | 2.84 | 351.43 |

  • Model performance measured on an RTX 4080 Super (16 GB VRAM) and an Intel i9-14900F. All metrics were obtained after fine-tuning on a custom dataset. OpenVINO exports run on the CPU; TensorRT exports run on the GPU.

Key Concepts Explained 🧠

To understand what this script does, let's break down the core ideas.

What is Knowledge Distillation?

Knowledge Distillation is a machine learning technique where we train a smaller, more efficient "student" model by transferring knowledge from a larger, more powerful "teacher" model. Instead of training the student directly on ground-truth labels (which we don't have for pre-training), we train it to mimic the outputs or internal representations of the teacher.

Think of it like an apprenticeship. A master artisan (the teacher) doesn't just show the apprentice (the student) the finished product. Instead, the master demonstrates the process and guides the apprentice's technique. In our case, the DINOv3 teacher shows the YOLOv11 student how to "see" and interpret an image by forcing the student's internal feature maps to match the teacher's. The training loss is calculated based on the difference between the student's and teacher's representations.
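The feature-matching idea can be sketched in a few lines. This is an illustrative toy example, not the loss used by lightly-train: it assumes NumPy, random stand-in feature maps, and a simple linear projection head to bridge the channel mismatch between student and teacher.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature maps: say the student backbone emits 256 channels while the
# DINOv3 ViT-B/16 teacher emits 768. Spatial positions are flattened to N.
N = 49  # e.g. a 7x7 feature grid
student_feats = rng.normal(size=(N, 256))
teacher_feats = rng.normal(size=(N, 768))

# A projection head maps student features into the teacher's space.
# Here it is a random matrix; during distillation it would be learned.
projection = rng.normal(size=(256, 768)) * 0.05

def distillation_loss(student, teacher, proj):
    """Mean-squared error between projected student and teacher features."""
    projected = student @ proj  # (N, 768)
    return float(np.mean((projected - teacher) ** 2))

loss = distillation_loss(student_feats, teacher_feats, projection)
print(loss)  # driving this toward zero pulls the student's features toward the teacher's
```

Minimizing such a loss over many unlabeled images is what "imprints" the teacher's representations onto the student backbone.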

Who are the Teacher and Student?

  • Teacher: DINOv3 (dinov3/vitb16)
    • DINOv3 is a state-of-the-art vision foundation model developed by Meta AI. It was trained using self-supervised learning on a massive, diverse dataset of images.
    • Because it wasn't trained for a single, narrow task, it has developed a profound and generalizable understanding of visual patterns, textures, shapes, and object parts. Its internal representations (feature maps) are incredibly rich and semantically meaningful.
  • Student: YOLOv11-Pose (defined by a native YOLO YAML model file, e.g. IntegraPose11x-pose.yaml)
    • YOLO (You Only Look Once) is a family of models famous for being extremely fast and efficient, making them ideal for real-time applications like pose estimation.
    • Our student is a custom-defined YOLOv11 architecture for pose estimation. By itself, its randomly initialized backbone knows nothing about the visual world.

Why Does This Work?

By distilling knowledge from DINOv3, we are essentially "imprinting" the teacher's sophisticated visual understanding onto the student's smaller, more efficient architecture. This provides several key advantages:

  1. Leverages Unlabeled Data: We can use a massive corpus of cheap, unlabeled images to give our model a robust starting point.
  2. Better Initialization: The student model starts the final fine-tuning process not with random weights, but with a backbone that already understands visual concepts. This is far more effective than traditional pre-training on datasets like ImageNet.
  3. Improved Performance & Data Efficiency: A well-pre-trained model requires less labeled data and fewer epochs during the final fine-tuning stage to achieve high performance on the target task (pose estimation).

Features ✨

  • Distillation Pre-training: Uses lightly-train to distill knowledge from a local DINOv3 teacher into a YOLOv11 student backbone.
  • Custom YOLO YAML: Supports any custom YOLOv8/v9/v10-style pose estimation YAML file.
  • Automatic YAML Patching: Intelligently patches common YAML configuration errors (e.g., residual channel mismatches) to prevent training failures.
  • Flexible Resumption: Robust policies for resuming interrupted runs (resume), starting new runs with old weights (warm_start), or starting fresh.
  • Automatic Export: The final pre-trained backbone is automatically exported to a .pt file, ready for use with the ultralytics framework.
  • Optional Fine-tuning: A built-in option to immediately proceed to fine-tuning the pre-trained model on a labeled dataset.
  • Optional Embedding Extraction: A utility to generate image embeddings from your final pre-trained model.

Prerequisites 📋

  • Hardware: A modern NVIDIA GPU with at least 8GB of VRAM is recommended.
  • Software: Python 3.8+ and pip.
  • Environment: A CUDA-enabled environment to run PyTorch on the GPU.

Step-by-Step Instructions 🛠️

1. Clone the Repository

First, get the project files on your local machine.

git clone https://github.com/farhanaugustine/DINOv3_Distillation_YOLO-pose.git
cd DINOv3_Distillation_YOLO-pose

2. Install Dependencies

The script requires lightly-train with the ultralytics extra, as well as ultralytics itself.

pip install "lightly-train[ultralytics]" ultralytics

3. Download the DINOv3 Teacher Weights

You need the pre-trained weights for the DINOv3 teacher model. This script is configured for the ViT-B/16 version.

  • Download Link: You can find download links and instructions on the official DINOv3 GitHub repository. The script is configured for dinov3_vitb16_pretrain.pth.
  • Save Location: Place the downloaded .pth file in a known location. You will provide the path to this file in the script's configuration.

4. Prepare Your Data and Models

  • Unlabeled Images: Gather all the images you want to use for pre-training. Place them in a single folder. The script will recursively scan this folder for images. This can be thousands or millions of images.
    • Example: C:\data\unlabeled_images\
  • YOLOv11 YAML File: You need a YOLOv11-Pose model definition file (e.g., IntegraPose11x-pose.yaml). This file must be placed in the same folder as the Python script.
  • (Optional) Labeled Dataset: If you plan to use the automatic fine-tuning feature (DO_FINETUNE = True), prepare your labeled dataset in the Ultralytics format and have the dataset.yaml file ready.
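For reference, a pose dataset.yaml in the Ultralytics format typically looks like the following. All paths, class names, and the keypoint count here are placeholders; substitute your own dataset's values.

```yaml
# Example Ultralytics pose dataset.yaml (placeholder values)
path: C:/data/labeled_pose_dataset   # dataset root
train: images/train                  # train images, relative to path
val: images/val                      # validation images, relative to path

# Pose-specific: [number of keypoints, dims per keypoint (x, y, visibility)]
kpt_shape: [17, 3]

# Class names
names:
  0: subject
```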

5. Configure the Script

Open the pre-train_distill_yolo11.py file and edit the variables in the CONFIG section at the top. This is the most important step.

# Unlabeled images root (Lightly scans recursively)
UNLABELED_DATA_DIR = r"C:\path\to\your\unlabeled_images"

# Your YAML is in the SAME FOLDER as this script. Put its filename here:
YOLO_MODEL = r"IntegraPose11x-pose.yaml"

# Lightly run output directory
OUT_DIR = r"C:\path\to\save\training\output"

# DINOv3 teacher weights you downloaded
TEACHER_WEIGHTS = r"C:\path\to\your\dinov3_vitb16_pretrain.pth"

# Training knobs (start low, increase if your GPU can handle it)
EPOCHS = 50
BATCH_SIZE = 2
PRECISION = "16-mixed"

# --- Optional Steps ---
# Set to True to run fine-tuning after pre-training
DO_FINETUNE = False
ULTRALYTICS_DATA_YAML = r"C:\path\to\your\labeled_dataset.yaml"

# Set to True to extract embeddings after pre-training
DO_EMBED = False

6. Run the Training

Once configured, execute the script from your terminal.

python pre-train_distill_yolo11.py

The script will log its progress, including the configuration, model setup, and training epochs.


Understanding the Output 📂

All artifacts from the training run will be saved in the directory you specified in OUT_DIR.

C:\path\to\save\training\output\
├── checkpoints\
│   ├── last.ckpt         # Checkpoint for resuming
│   └── epoch=X-step=Y.ckpt
├── exported_models\
│   └── exported_last.pt  # <<< YOUR FINAL, USABLE MODEL FOR ULTRALYTICS FINE-TUNING
└── ... (other Lightly log files)

The most important file is exported_last.pt. This is your pre-trained YOLOv11 model, ready to be used for fine-tuning or inference with Ultralytics.
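As a sketch of that next step, the exported weights load like any Ultralytics checkpoint. The paths below are placeholders, and the training arguments are ordinary Ultralytics options chosen for illustration, not values prescribed by this repo.

```python
from ultralytics import YOLO

# Load the distillation-pretrained model exported by the script.
model = YOLO(r"C:\path\to\save\training\output\exported_models\exported_last.pt")

# Fine-tune on your labeled pose dataset (a standard Ultralytics training call).
results = model.train(
    data=r"C:\path\to\your\labeled_dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
)
```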


Code Breakdown 🔬

  • CONFIG Block: All user-configurable parameters are centralized at the top of the script for easy access.
  • _get_model_for_lightly(): This function resolves the YOLO_MODEL variable. It smartly handles local YAML files and includes a fallback to an official alias if the local file fails to build.
  • _autopatch_yaml(): A crucial utility that prevents common training errors by creating a patched copy of your YAML on the fly. It disables shortcuts (residuals) that might have mismatched channel counts due to model scaling and ensures a scale parameter is present.
  • _make_resume_kwargs(): Implements the logic for the RESUME_POLICY, ensuring that the training can be resumed correctly without conflicting arguments.
  • pretrain_distill(): The main function that sets up and launches the lightly_train.train process. It configures the dataloader, optimizer, and distillation method arguments. After training, it ensures the model is exported to the Ultralytics .pt format.
  • finetune_ultralytics() / embed_from_checkpoint(): Wrapper functions that handle the optional post-training steps.
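At its core, pretrain_distill() comes down to a single lightly_train.train call. A hedged sketch follows: the top-level argument names match the lightly-train API, but the exact values and how the script wires in the DINOv3 teacher weights (via method arguments) are assumptions, not verbatim from this repo.

```python
import lightly_train

lightly_train.train(
    out=r"C:\path\to\save\training\output",    # OUT_DIR: run artifacts land here
    data=r"C:\path\to\your\unlabeled_images",  # UNLABELED_DATA_DIR, scanned recursively
    model=r"IntegraPose11x-pose.yaml",         # YOLO_MODEL (the script trains a patched copy)
    method="distillation",                     # knowledge distillation pre-training
    epochs=50,
    batch_size=2,
)
```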

Acknowledgements 🙏

This project stands on the shoulders of giants. Special thanks to:

  • Meta AI for developing and open-sourcing the powerful DINOv3 models.
  • Lightly AI for their excellent lightly-train library, which makes complex self-supervised learning and distillation techniques accessible.
  • Ultralytics for the versatile and easy-to-use YOLO framework.
