Official PyTorch implementation of VLPL, a vision-language pseudo-labeling approach for multi-label image classification under the single-positive label setting.
Authors: Xin Xing, Zhexiao Xiong, Abby Stylianou, Srikumar Sastry, Liyu Gong, and Nathan Jacobs
Corresponding author: Xin Xing (xxing@unomaha.edu)
Multi-label image classification requires annotating all classes present in an image, which is expensive and error-prone. The single-positive label setting reduces this cost by annotating only one positive class per image, even when multiple classes are present. Existing methods for this setting rely primarily on novel loss functions; pseudo-label approaches have historically underperformed.
VLPL uses a pretrained vision-language model (CLIP) to generate strong positive and negative pseudo-labels, supplementing the sparse single-positive supervision signal.
Key contributions:
- Vision-language pseudo-labeling strategy that proposes reliable positive and negative pseudo-labels from CLIP similarity scores
- Demonstrated on four benchmarks: Pascal VOC, MS-COCO, NUS-WIDE, and CUB-Birds
- Outperforms prior SOTA by +5.4% (VOC), +15.6% (COCO), +15.2% (NUS-WIDE), and +11.3% (CUB) when using a stronger backbone
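
The pseudo-labeling idea above can be illustrated with a small NumPy sketch. It scores each class by the cosine similarity between an image embedding and per-class text embeddings, applies a temperature-scaled softmax, marks classes above a threshold as positive pseudo-labels, and marks the lowest-scoring fraction as negative pseudo-labels. The function name, the softmax scoring, and the way the temperature, threshold, and negative fraction are combined are assumptions for illustration (the defaults mirror the documented -t, -th, and -p values). This is not the repository's actual implementation.

```python
import numpy as np

def clip_pseudo_labels(image_feat, text_feats, temperature=0.01,
                       threshold=0.3, neg_fraction=0.0):
    """Hypothetical helper: propose pseudo-labels from CLIP-style embeddings.

    image_feat: (D,) image embedding; text_feats: (C, D), one embedding per class.
    Returns a (C,) int array: 1 = positive pseudo-label, -1 = negative, 0 = unknown.
    """
    # Cosine similarity between the image and every class embedding.
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = text_feats @ image_feat

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    labels = np.zeros(len(probs), dtype=int)
    labels[probs > threshold] = 1

    # Assumed use of the negative fraction: the lowest-scoring classes become negatives.
    num_neg = int(neg_fraction * len(probs))
    if num_neg > 0:
        lowest = np.argsort(probs)[:num_neg]
        labels[lowest[labels[lowest] == 0]] = -1
    return labels

# Toy example: four classes with orthogonal "text" embeddings (made-up data).
text_feats = np.eye(4)
image_feat = np.array([0.1, 0.9, 0.1, 0.0])
labels = clip_pseudo_labels(image_feat, text_feats, neg_fraction=0.5)
```

In this toy example the image embedding is closest to class 1, so that class receives the positive pseudo-label, while two of the remaining classes are marked negative.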
Requires Python 3.8 and a CUDA-capable GPU.
conda create --name vlpl python=3.8.8
conda activate vlpl
pip install -r requirements.txt

PASCAL VOC
cd data/pascal
curl http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar --output pascal_raw.tar
tar -xf pascal_raw.tar
rm pascal_raw.tar

MS-COCO
cd data/coco
curl http://images.cocodataset.org/annotations/annotations_trainval2014.zip --output coco_annotations.zip
curl http://images.cocodataset.org/zips/train2014.zip --output coco_train_raw.zip
curl http://images.cocodataset.org/zips/val2014.zip --output coco_val_raw.zip
unzip -q coco_annotations.zip
unzip -q coco_train_raw.zip
unzip -q coco_val_raw.zip
rm coco_annotations.zip coco_train_raw.zip coco_val_raw.zip

NUS-WIDE
- Download Flickr.zip from the NUS-WIDE website.
- Move and extract:
mv /path/to/Flickr.zip data/nuswide/
cd data/nuswide && unzip -q Flickr.zip && rm Flickr.zip

CUB-Birds
- Download CUB_200_2011.tgz from Caltech.
- Move and extract:
mv /path/to/CUB_200_2011.tgz data/cub/
cd data/cub && tar -xf CUB_200_2011.tgz && rm CUB_200_2011.tgz

For PASCAL VOC, MS-COCO, and CUB, run from the repository root:
python preproc/format_pascal.py
python preproc/format_coco.py
python preproc/format_cub.py

For NUS-WIDE, download the pre-formatted files from Google Drive and place them in data/nuswide/:
- formatted_train_images.npy
- formatted_train_labels.npy
- formatted_val_images.npy
- formatted_val_labels.npy

Generate the observed single-positive labels for each dataset:
python preproc/generate_observed_labels.py --dataset <DATASET>
where <DATASET> ∈ {pascal, coco, nuswide, cub}.
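
To make the single-positive setting concrete, here is a minimal sketch of how one positive label per image might be sampled from a fully annotated label matrix. The function name, the array layout, and the uniform random choice are assumptions for illustration; the actual behavior is defined in preproc/generate_observed_labels.py.

```python
import numpy as np

def sample_single_positive(full_labels, seed=0):
    """Hypothetical helper: keep one randomly chosen positive per image.

    full_labels: (N, C) binary matrix of complete annotations.
    Returns an (N, C) matrix with at most one 1 per row; 0 means unobserved.
    """
    rng = np.random.default_rng(seed)
    observed = np.zeros_like(full_labels)
    for i, row in enumerate(full_labels):
        positives = np.flatnonzero(row)
        if positives.size > 0:
            # Uniformly pick one of the true positives to reveal.
            observed[i, rng.choice(positives)] = 1
    return observed

# Made-up annotations for three images and four classes.
full = np.array([[1, 0, 1, 0],
                 [0, 1, 0, 0],
                 [1, 1, 1, 1]])
observed = sample_single_positive(full)
```

Each row of observed keeps exactly one of the positives from full, and every other entry is treated as unknown rather than as a confirmed negative.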
Run main_clip.py to train and evaluate a model:
python main_clip.py -d <DATASET> -l <LOSS> -g <GPU> -m <MODEL> -t <TEMP> -th <THRESHOLD> -p <PARTIAL> -s <SEED>

| Argument | Flag | Default | Options |
|---|---|---|---|
| Dataset | -d | pascal | pascal, coco, nuswide, cub |
| Loss | -l | EM_PL | bce, iun, an, EM, EM_APL, EM_PL |
| GPU index | -g | 0 | 0, 1, 2, 3 |
| Backbone | -m | resnet50 | resnet50, clip_vision, convnext_xlarge_22k, convnext_xlarge_1k |
| Temperature | -t | 0.01 | float |
| Pseudo-label threshold | -th | 0.3 | float |
| Negative pseudo-label fraction | -p | 0.0 | float |
| PyTorch seed | -s | 0 | int |
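
The flags in the table could be declared with argparse roughly as follows. This sketch is reconstructed from the table alone: the destination names (dataset, loss, and so on) and the choice to store the GPU index as an int are assumptions, and main_clip.py may define its arguments differently.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Sketch of the main_clip.py command-line interface")
    parser.add_argument("-d", dest="dataset", default="pascal",
                        choices=["pascal", "coco", "nuswide", "cub"])
    parser.add_argument("-l", dest="loss", default="EM_PL",
                        choices=["bce", "iun", "an", "EM", "EM_APL", "EM_PL"])
    parser.add_argument("-g", dest="gpu", type=int, default=0,
                        choices=[0, 1, 2, 3])
    parser.add_argument("-m", dest="model", default="resnet50",
                        choices=["resnet50", "clip_vision",
                                 "convnext_xlarge_22k", "convnext_xlarge_1k"])
    parser.add_argument("-t", dest="temp", type=float, default=0.01)
    parser.add_argument("-th", dest="threshold", type=float, default=0.3)
    parser.add_argument("-p", dest="partial", type=float, default=0.0)
    parser.add_argument("-s", dest="seed", type=int, default=0)
    return parser

# Equivalent to: python main_clip.py -d coco -m clip_vision
args = build_parser().parse_args(["-d", "coco", "-m", "clip_vision"])
```

Any flag left off the command line falls back to the default listed in the table.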
Example – train VLPL with EM_PL loss on PASCAL VOC (default settings):
python main_clip.py -d pascal -l EM_PL

Example – train with CLIP vision backbone on MS-COCO:
python main_clip.py -d coco -l EM_PL -m clip_vision

Precomputed CLIP text features for each dataset are provided in the repository root as .npy files and are loaded automatically by the training script.
VLPL/
├── main_clip.py # Main training and evaluation script
├── models.py # Model definitions (ResNet50, CLIP-ViT, ConvNeXt)
├── losses.py # Loss functions (BCE, AN, EM, EM_PL, EM_APL, ...)
├── datasets.py # Dataset loading utilities
├── metrics.py # Evaluation metrics (mAP, etc.)
├── instrumentation.py # Training logger
├── requirements.txt # Python dependencies
├── preproc/
│ ├── format_pascal.py # Format PASCAL VOC annotations
│ ├── format_coco.py # Format MS-COCO annotations
│ ├── format_cub.py # Format CUB-Birds annotations
│ └── generate_observed_labels.py # Sample single-positive labels
├── data/
│ ├── pascal/ # PASCAL VOC data directory
│ ├── coco/ # MS-COCO data directory
│ ├── nuswide/ # NUS-WIDE data directory
│ └── cub/ # CUB-Birds data directory
├── *text_feature.npy # Precomputed CLIP text features per dataset
└── images/ # Architecture figures
Main results in the single-positive label setting (1 P. & 0 N., i.e., one observed positive and no observed negatives per image), measured by mAP (%). Input image size: 448×448.
| Method | VOC | COCO | NUS | CUB |
|---|---|---|---|---|
| AN Loss | 85.89 | 64.92 | 42.27 | 18.31 |
| EM | 89.09 | 70.70 | 47.15 | 20.85 |
| EM+APL | 89.19 | 70.87 | 47.59 | 21.84 |
| LL-R | 89.2 | 71.0 | 47.4 | 19.5 |
| DualCoOp | 83.6 | 69.2 | 42.8 | — |
| VLPL (Ours) | 89.10 | 71.45 | 49.55 | 24.02 |
See the paper for the full comparison table including all baselines.
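
For reference, mAP averages the per-class average precision (AP) over all classes. The sketch below computes it with plain NumPy under the usual ranking-based definition of AP. The function names and the toy data are illustrative assumptions, and the repository's metrics.py may differ in details such as tie handling.

```python
import numpy as np

def average_precision(targets, scores):
    """AP for one class: mean precision at the rank of each true positive."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = targets[order] == 1
    if hits.sum() == 0:
        return float("nan")  # AP is undefined without positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

def mean_average_precision(targets, scores):
    """mAP in percent over the classes (columns) of (N, C) arrays."""
    aps = [average_precision(targets[:, c], scores[:, c])
           for c in range(targets.shape[1])]
    return 100.0 * float(np.nanmean(aps))

# Made-up ground truth and scores for four images and two classes.
targets = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
scores = np.array([[0.9, 0.5], [0.1, 0.8], [0.4, 0.3], [0.3, 0.1]])
map_percent = mean_average_precision(targets, scores)
```

Classes without any positive example are skipped when averaging, which is one common convention.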
@inproceedings{xing2024vlpl,
title={VLPL: Vision Language Pseudo Labels for Multi-label Learning with Single Positive Labels},
author={Xing, Xin and Xiong, Zhexiao and Stylianou, Abby and Sastry, Srikumar and Gong, Liyu and Jacobs, Nathan},
booktitle={CVPR 2024 Workshop on Learning with Limited Labelled Data (LIMIT)},
year={2024}
}

This codebase builds on single-positive-multi-label and SPML-AckTheUnknown.
- Add pretrained model checkpoints
