[CVPR2026] RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection
Jihwan Park1, Chanhyeong Yang2, Jinyoung Park1, Taehoon Song1, Hyunwoo J. Kim1
1KAIST 2LG Energy Solution
Create and activate the conda environment:
conda create -n regformer python=3.9
conda activate regformer
Install PyTorch:
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
Install the remaining dependencies:
pip install -r requirements.txt
Install pocket, which is required by the detection code inherited from UPT:
git clone https://github.com/fredzzhang/pocket.git ../pocket
pip install -e ../pocket
If pocket is already available locally, install that checkout instead:
pip install -e /path/to/pocket
Download HICO-DET by following the data preparation instructions in the UPT repository. Place the annotations and images under hicodet/:
hicodet/
    instances_train2015.json
    instances_test2015.json
    hico_20160224_det/
        images/
            train2015/
            test2015/
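Before moving on, it can help to confirm that the dataset landed where the code expects it. The following is a minimal sketch, not part of the repository: `check_hicodet_layout` is a hypothetical helper, and it assumes that `images/` sits inside `hico_20160224_det/` with `train2015/` and `test2015/` beneath it.

```shell
# Hypothetical helper (not shipped with the repository).
# Assumes images/ is nested under hico_20160224_det/.
check_hicodet_layout() {
    root="${1:-hicodet}"
    missing=0
    for path in \
        "$root/instances_train2015.json" \
        "$root/instances_test2015.json" \
        "$root/hico_20160224_det/images/train2015" \
        "$root/hico_20160224_det/images/test2015"
    do
        # Report every missing entry instead of stopping at the first one.
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_hicodet_layout hicodet` from the repository root prints each missing path to stderr and returns a non-zero status if anything is absent.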
Detection file extraction expects the detector checkpoint at:
params/detr-r50-e632da11.pth
Download the DETR R50 checkpoint from the official DETR model zoo:
mkdir -p params
wget https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth -O params/detr-r50-e632da11.pth
Train the weak HOI model in full mode with:
bash scripts/weak_run.sh 768 facebook/dinov2-with-registers-small openai/clip-vit-base-patch16 518
For zero-shot settings, pass the training split mode as the fifth argument:
# RF-UC
bash scripts/weak_run.sh 768 facebook/dinov2-with-registers-small openai/clip-vit-base-patch16 518 rare_first
# NF-UC
bash scripts/weak_run.sh 768 facebook/dinov2-with-registers-small openai/clip-vit-base-patch16 518 non_rare_first
Arguments:
<embed_dim>: attention/pooling embedding dimension
<vision_encoder>: Hugging Face vision encoder name
<text_encoder>: Hugging Face text encoder name
<input_resolution>: image input resolution
[zs_type]: optional zero-shot split; use rare_first for RF-UC, non_rare_first for NF-UC
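Because the arguments are positional, a misplaced or misspelled value is easy to miss. The sketch below is a hypothetical wrapper, not part of the repository, that checks the argument count and the `zs_type` value and then prints the resulting command. It accepts only the two `zs_type` values documented here; the actual script may accept others.

```shell
# Hypothetical wrapper (not shipped with the repository).
# Validates the positional arguments, then prints the command to run.
weak_run_cmd() {
    if [ "$#" -lt 4 ] || [ "$#" -gt 5 ]; then
        echo "usage: weak_run_cmd <embed_dim> <vision_encoder> <text_encoder> <input_resolution> [zs_type]" >&2
        return 2
    fi
    # Assumption: only the two documented zero-shot splits are valid.
    case "${5:-}" in
        ""|rare_first|non_rare_first) ;;
        *) echo "unknown zs_type: $5" >&2; return 2 ;;
    esac
    echo "bash scripts/weak_run.sh $*"
}
```

To run the command rather than print it, the output can be passed to the shell, for example `weak_run_cmd 768 facebook/dinov2-with-registers-small openai/clip-vit-base-patch16 518 rare_first | sh`.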
Checkpoints and logs are written under output/weak_hoi/.
Before running detection evaluation, extract detector boxes for HICO-DET:
bash scripts/det_extract/hico_r50.sh
This uses params/detr-r50-e632da11.pth from the official DETR repository by default and writes:
data/hicodet_pkl_files/hicodet_test_bbox_R50_detr-r50-e632da11.p
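A truncated download of the detector checkpoint tends to surface only as a load error during extraction, so an integrity check beforehand may save time. The sketch below is a hypothetical helper, not part of the repository. It assumes the checkpoint name follows the PyTorch Hub pattern `<name>-<sha256 prefix>.<ext>`, so that `e632da11` is the start of the file's SHA-256 digest, and that `sha256sum` is available.

```shell
# Hypothetical helper (not shipped with the repository).
# Assumes the filename ends in -<sha256 prefix>.<ext>, as in PyTorch Hub checkpoints.
verify_hub_checkpoint() {
    ckpt="$1"
    [ -f "$ckpt" ] || { echo "not found: $ckpt" >&2; return 1; }
    base=$(basename "$ckpt")      # e.g. detr-r50-e632da11.pth
    stem="${base%.*}"             # e.g. detr-r50-e632da11
    expected="${stem##*-}"        # e.g. e632da11
    # Compare only as many digest characters as the filename provides.
    actual=$(sha256sum "$ckpt" | cut -c1-"${#expected}")
    if [ "$actual" != "$expected" ]; then
        echo "hash mismatch: expected prefix $expected, got $actual" >&2
        return 1
    fi
}
```

For example, `verify_hub_checkpoint params/detr-r50-e632da11.pth` returns a non-zero status if the file is missing or its digest does not start with the expected prefix.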
After training, apply the detector using the weak output directory:
bash scripts/apply_detection.sh {weak_output_dir}/final_model.pth
Detection outputs are saved under {weak_output_dir}/detection/ by default.
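The detection step depends on two earlier outputs: the trained weights and the extracted detector boxes. A pre-flight check along the following lines can catch a skipped step early. This is a minimal sketch: `preflight_detection` is a hypothetical helper, not part of the repository, and it assumes the box file sits at the default path given above.

```shell
# Hypothetical helper (not shipped with the repository).
# Checks that both inputs to the detection step exist.
preflight_detection() {
    weak_output_dir="${1%/}"   # strip a trailing slash, if any
    # Assumption: the box file is at the default location unless overridden.
    bbox_file="${2:-data/hicodet_pkl_files/hicodet_test_bbox_R50_detr-r50-e632da11.p}"
    status=0
    for required in "$weak_output_dir/final_model.pth" "$bbox_file"; do
        if [ ! -f "$required" ]; then
            echo "missing: $required" >&2
            status=1
        fi
    done
    return "$status"
}
```

With `RUN_DIR` set to the weak output directory, the check can gate the detection step: `preflight_detection "$RUN_DIR" && bash scripts/apply_detection.sh "$RUN_DIR/final_model.pth"`.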
This repository builds on components from ADA-CM and UPT. We thank the authors for releasing their code.
If you find this work useful, please cite:
@inproceedings{park2026regformer,
title = {RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection},
author = {Park, Jihwan and Yang, Chanhyeong and Park, Jinyoung and Song, Taehoon and Kim, Hyunwoo J.},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}