This repository is the official Pytorch implementation of "Leveraging Pretrained Unimodal Models for Efficient Vision-Language Pretraining". See here for details.
Most of the data, including all GLUE tasks and COCO, are downloaded automatically before training starts. Some however, need to be downloaded using separate scripts or actions.
To generate CC3M and CC12M, one must scrape the images from the web.
To download CC3M data, please first download the index from
https://ai.google.com/research/ConceptualCaptions/download.
Under the header Downloads, click on the button Training split. This should download the file Train-GCC-training.tsv
(other splits are not supported).
Then run:
python src/datasets_/download/download_cc3m_images.py --data_path <path_to_store_all_data> --path_to_index <path_to_Train-GCC-training.tsv>This will create a directory conceptual_captions_3m in <path_to_store_all_data>, which will be used during pretraining.
To download CC12M data, it is sufficient to run:
python src/datasets_/download/download_cc12m_images.py --data_path <path_to_store_all_data>The index will be downloaded automatically.
Note that not all images in both indices will be downloaded successfully, as they are scraped from various websites and could be taken down at any time by the respective owners.
Flickr30K is not a public dataset, and you have to fill out a form to request access. The form can be found here.
Upon receiving access via email, download the archive flickr30k-images.tar.gz and unpack it to get the folder flickr30k-images with the images.
Then, under <path_to_store_all_data>, create a folder flickr30k (this is the dataset folder,
like conceptual_captions_3m for CC3M), and copy the image folder flickr30k-images
into this directory. Everything else will be handled by the code!
The ImageNet-1K dataset is downloaded from HuggingFace. This requires a HuggingFace account!
First, run:
huggingface-cli loginTo authenticate, use an access token created on the HuggingFace website.
To download the data, run:
HF_HUB_ENABLE_HF_TRANSFER=1 python src/datasets_/download/download_imagenet.py --data_path <path_to_store_all_data>HF_HUB_ENABLE_HF_TRANSFER=1 will allow for a faster download if using high bandwidth. Do not be confused if the download is
complete and the script appears to hang, this is because the downloaded tar.gz files need to be extracted, which can take quite long.
Again, the rest is handled by the code.
Note: Due to the amount of downstream tasks and running costs we do not provide pretrained weights yet.
To pretrain S-SMKE, first download the weights for Data2Vec2Image by clicking here,
and the weights of BEiTv2 by clicking here.
Next, create a folder modelsin the directory <path_to_store_all_data> and copy both downloaded state dicts into the models folder.
The downloaded files should be named base_imagenet.pt and beitv2_base_patch16_224_pt1k.pth for Data2Vec2Image and BEiTv2, respectively.
Then run the following to start training:
python src/train.py --config-path configs/training --config-name s_smke.yaml base_dir=<path_to_store_all_data>The trained model will be stored under <path_to_store_all_data>/models/S-SMKE/model-{step}-{loss}-train.ckpt.
| Model | ImageNet-1K | COCO R@1 | Flickr30K R@1 | #params | weights |
|---|---|---|---|---|---|
| S-SMKE | 37.1 | 44.6 | 61.8 | 117M | - |
For finetuning S-SMKE on ImageNet-1K, run:
python src/train.py \
--config-path configs/fine_tuning \
--config-name imagenet.yaml base_dir=<path_to_store_all_data> \
model.model_path=<path_to_store_all_data>/models/S-SMKE/model-{step}-{loss}-train.ckpt \
model.model_name=S-SMKE \
model.linear_classifier=FalseFor linear evaluation, set model.linear_classifier=True.
| Model | Resolution | finetune acc@1 | lin eval acc@1 | #params | weights |
|---|---|---|---|---|---|
| ImageClassification | 224x224 | 75.5 | 65.0 | 43.9M | - |
To finetune S-SMKE on all GLUE tasks, navigate to src/sweep_glue.sh, and set the
variables model and model_name to:
#!/bin/bash
model="<path_to_store_all_data>/models/S-SMKE/model-{step}-{loss}-train.ckpt"
model_name="S-SMKE"
...Then run:
cd src
python sweep_glue.sh| Model | Task | Metric | Score | #params | weights |
|---|---|---|---|---|---|
| TextClassification | MNLI | Accuracy | 73.88 | 67M | - |
| TextClassification | QNLI | Accuracy | 78.71 | 67M | - |
| TextClassification | RTE | Accuracy | 51.6 | 67M | - |
| TextClassification | MRPC | F1 | 79.9 | 67M | - |
| TextClassification | QQP | F1 | 81.2 | 67M | - |
| TextClassification | STS-B | Spearman | 57.5 | 67M | - |
| TextClassification | CoLA | MCC | 14.2 | 67M | - |
| TextClassification | SST | Accuracy | 83.5 | 67M | - |
| TextClassification | WNLI | Accuracy | 45.0 | 67M | - |
The finetune S-SMKE on Flickr30K and COCO image-text retrieval, navigate to src/sweep_retrieval.sh, and set the
variables model and model_name to:
#!/bin/bash
model="<path_to_store_all_data>/models/S-SMKE/model-{step}-{loss}-train.ckpt"
model_name="S-SMKE"
...Then run:
cd src
python sweep_retrieval.shTo generate the image-text retrieval scores published in research papers like BEiT-3 run the following (COCO):
python src/run_image_text_retrieval.py \
--config-path configs \
--config-name coco_flickr_retrieval.yaml \
model_path=<path_to_store_all_data>/models/retrieval_finetune_coco/model-{step}-{loss}-val.ckpt \
model_name=S-SMKEAnd for Flickr30K replace model_path with model_path=<path_to_store_all_data>/models/retrieval_finetune_flickr30k/model-{step}-{loss}-val.ckpt.
| Model | Resolution | IR@1 | TR@1 | #params | weights |
|---|---|---|---|---|---|
| RetrievalFinetune | 224x224 | 64.6 | 82.0 | 117M | - |
| Model | Resolution | IR@1 | TR@1 | #params | weights |
|---|---|---|---|---|---|
| RetrievalFinetune | 224x224 | 39.8 | 56.2 | 117M | - |
- Not applicable to visual reasoning, visual question answering, image captioning
- Low performance on unimodal downstream tasks, especially CoLA (GLUE)