We propose VideoAuto-R1, a video understanding framework that adopts a "reason-when-necessary" strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning.
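As a rough sketch of this confidence-gated inference, the snippet below shows the control flow in Python. The function and method names (`generate_answer`, `generate_reasoning_and_answer`) and the 0.9 threshold are hypothetical placeholders for illustration, not the released implementation.

```python
# Minimal sketch of "reason-when-necessary" inference (illustrative only).
# `generate_answer`, `generate_reasoning_and_answer`, and the 0.9 threshold
# are placeholders, not the actual VideoAuto-R1 code.

def video_auto_r1_infer(model, video, question, conf_threshold=0.9):
    # Step 1: produce a direct answer plus a confidence score
    # (e.g., the probability assigned to the answer tokens).
    answer, confidence = model.generate_answer(video, question)

    # Step 2: if the model is already confident, skip explicit reasoning.
    if confidence >= conf_threshold:
        return answer

    # Step 3: otherwise, generate a reasoning trace and return the
    # reviewed answer produced after it.
    reviewed_answer = model.generate_reasoning_and_answer(
        video, question, initial_answer=answer
    )
    return reviewed_answer
```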
- [2025-01-09]: Try our online demo at HuggingFace Spaces!
- [2025-01-08]: We have released the training code and data for VideoAuto-R1!
git clone git@github.com:IVUL-KAUST/VideoAuto-R1.git
cd VideoAuto-R1
conda create -n videoauto-r1 python=3.12
conda activate videoauto-r1
pip install -r requirements.txt
conda install "ffmpeg<8"
pip install flash-attn==2.8.0.post2 --no-build-isolation

The code is tested with Python 3.12, PyTorch 2.8, and CUDA 12.4 on Linux; it may also work with other versions.
Please download the data from HuggingFace and put them under the data/ folder.
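If you prefer scripting the download, the sketch below uses `huggingface_hub.snapshot_download`. The dataset `repo_id` shown here is a placeholder assumption; substitute the dataset ID published on the project's HuggingFace page.

```python
# Download the training data into data/ (sketch).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="IVUL-KAUST/VideoAuto-R1-data",  # placeholder; use the released dataset ID
    repo_type="dataset",
    local_dir="data",
)
```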
For training, please run the following scripts:
# for Qwen2.5-VL
bash scripts/train/grpo_autothink/train_qwen2.5vl_grpo_auto_text_image_video.sh
# for Qwen3-VL
bash scripts/train/grpo_autothink/train_qwen3vl_grpo_auto_text_image_video.sh

Our models are trained on 32 H100 GPUs. You may need to adjust the batch size and accumulation steps according to your hardware settings.
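When scaling down from 32 GPUs, a common rule of thumb is to keep the effective batch size roughly constant by raising the gradient accumulation steps. The numbers below are illustrative examples, not the exact configuration used in the paper.

```python
# Illustrative effective-batch-size arithmetic (example numbers only).
num_gpus = 8                   # e.g., scaling down from 32 to 8 GPUs
per_device_batch_size = 1
grad_accumulation_steps = 16   # raised 4x to compensate for 4x fewer GPUs

effective_batch_size = num_gpus * per_device_batch_size * grad_accumulation_steps
print(effective_batch_size)    # 128
```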
We use the lmms-eval framework to evaluate our models.
For evaluating the baseline Qwen models, please run the following scripts:
# for Qwen2.5-VL
bash scripts/eval/benchmark_qwen/eval_qwen2_5_vl_16k.sh
# for Qwen3-VL
bash scripts/eval/benchmark_qwen/eval_qwen3_vl_128k.sh

For evaluating our VideoAuto-R1 models, please run the following scripts:
# for Qwen2.5-VL
bash scripts/eval/grpo_autothink/eval_qwen2_5_vl_auto_16k.sh
# for Qwen3-VL
bash scripts/eval/grpo_autothink/eval_qwen3_vl_auto_128k.sh

Our models are evaluated on 8 H100 GPUs. You may need to adjust the configuration according to your hardware.
Expected Results:
| Benchmarks | Qwen2.5-VL-7B | VideoAuto-R1-7B | Qwen3-VL-8B | VideoAuto-R1-8B |
|---|---|---|---|---|
| VideoMME | 66.0 | 67.3 | 72.5 | 71.7 |
| MVBench | 67.1 | 71.0 | 69.4 | 72.0 |
| LongVideoBench | 60.9 | 60.5 | 67.6 | 67.4 |
| MMVU | 66.2 | 69.7 | 69.9 | 71.1 |
| VideoMMMU | 54.7 | 58.6 | 61.0 | 65.0 |
| MVP | 36.5 | 39.4 | 40.5 | 43.0 |
| Charades-STA | 52.9 | 60.0 | 44.6 | 63.7 |
| ActivityNet-QA | 26.9 | 47.6 | 36.1 | 51.9 |
| Next-GQA | 20.2 | 36.7 | 37.1 | 44.2 |
Due to differences in environment or library versions, performance may vary slightly (±0.5%) from the results reported in the paper.
This project builds upon the following excellent works: Qwen-VL, TRL, lmms-eval, etc. We thank all researchers and developers who contributed to these foundational projects.
If you use VideoAuto-R1 in your research, please cite:
@article{liu2026videoautor1,
  title={VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice},
  author={Liu, Shuming and Zhuge, Mingchen and Zhao, Changsheng and Chen, Jun and Wu, Lemeng and Liu, Zechun and Zhu, Chenchen and Cai, Zhipeng and Zhou, Chong and Liu, Haozhe and Chang, Ernie and Suri, Saksham and Xu, Hongyu and Qian, Qi and Wen, Wei and Varadarajan, Balakrishnan and Liu, Zhuang and Xu, Hu and Bordes, Florian and Krishnamoorthi, Raghuraman and Ghanem, Bernard and Chandra, Vikas and Xiong, Yunyang},
  journal={arXiv preprint arXiv:2601.05175},
  year={2026}
}

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you have any questions, please contact: shuming.liu@kaust.edu.sa.
