We propose VideoAuto-R1, a video understanding framework that adopts a "reason-when-necessary" strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning.
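As a rough sketch of this confidence-gated inference, the snippet below shows the control flow in Python. The function and method names (`generate_answer`, `generate_reasoning_and_answer`) and the 0.9 threshold are hypothetical placeholders for illustration, not the released implementation.

```python
# Minimal sketch of "reason-when-necessary" inference (illustrative only).
# `generate_answer`, `generate_reasoning_and_answer`, and the 0.9 threshold
# are placeholders, not the actual VideoAuto-R1 code.

def video_auto_r1_infer(model, video, question, conf_threshold=0.9):
    # Step 1: produce a direct answer plus a confidence score
    # (e.g., the probability assigned to the answer tokens).
    answer, confidence = model.generate_answer(video, question)

    # Step 2: if the model is already confident, skip explicit reasoning.
    if confidence >= conf_threshold:
        return answer

    # Step 3: otherwise, generate a reasoning trace and return the
    # reviewed answer produced after it.
    reviewed_answer = model.generate_reasoning_and_answer(
        video, question, initial_answer=answer
    )
    return reviewed_answer
```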
- [2025-01-09]: Try our online demo at HuggingFace Spaces!
- [2025-01-08]: We have released the training code and data for VideoAuto-R1!
git clone git@github.com:IVUL-KAUST/VideoAuto-R1.git
cd VideoAuto-R1
conda create -n videoauto-r1 python=3.12
conda activate videoauto-r1
pip install -r requirements.txt
conda install "ffmpeg<8"
pip install flash-attn==2.8.0.post2 --no-build-isolation

The code is tested with Python 3.12, PyTorch 2.8, and CUDA 12.4 on Linux; it may also work with other versions.
Please download the data from HuggingFace and put them under the data/ folder.
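If you prefer scripting the download, the sketch below uses `huggingface_hub.snapshot_download`. The dataset `repo_id` shown here is a placeholder assumption; substitute the dataset ID published on the project's HuggingFace page.

```python
# Download the training data into data/ (sketch).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="IVUL-KAUST/VideoAuto-R1-data",  # placeholder; use the released dataset ID
    repo_type="dataset",
    local_dir="data",
)
```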
For training, please run the following scripts:
# for Qwen2.5-VL
bash scripts/train/grpo_autothink/train_qwen2.5vl_grpo_auto_text_image_video.sh
# for Qwen3-VL
bash scripts/train/grpo_autothink/train_qwen3vl_grpo_auto_text_image_video.sh

Our models are trained on 32 H100 GPUs. You may need to adjust the batch size and accumulation steps according to your hardware settings.
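When scaling down from 32 GPUs, a common rule of thumb is to keep the effective batch size roughly constant by raising the gradient accumulation steps. The numbers below are illustrative examples, not the exact configuration used in the paper.

```python
# Illustrative effective-batch-size arithmetic (example numbers only).
num_gpus = 8                   # e.g., scaling down from 32 to 8 GPUs
per_device_batch_size = 1
grad_accumulation_steps = 16   # raised 4x to compensate for 4x fewer GPUs

effective_batch_size = num_gpus * per_device_batch_size * grad_accumulation_steps
print(effective_batch_size)    # 128
```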
We use the lmms-eval framework to evaluate our models.
For evaluating the baseline Qwen models, please run the following scripts:
# for Qwen2.5-VL
bash scripts/eval/benchmark_qwen/eval_qwen2_5_vl_16k.sh
# for Qwen3-VL
bash scripts/eval/benchmark_qwen/eval_qwen3_vl_128k.sh

For evaluating our VideoAuto-R1 models, please run the following scripts:
# for Qwen2.5-VL
bash scripts/eval/grpo_autothink/eval_qwen2_5_vl_auto_16k.sh
# for Qwen3-VL
bash scripts/eval/grpo_autothink/eval_qwen3_vl_auto_128k.sh

Our models are evaluated on 8 H100 GPUs. You may need to adjust the configuration according to your hardware.
Expected Results:
| Benchmarks | Qwen2.5-VL-7B | VideoAuto-R1-7B | Qwen3-VL-8B | VideoAuto-R1-8B |
|---|---|---|---|---|
| VideoMME | 66.0 | 67.3 | 72.5 | 71.7 |
| MVBench | 67.1 | 71.0 | 69.4 | 72.0 |
| LongVideoBench | 60.9 | 60.5 | 67.6 | 67.4 |
| MMVU | 66.2 | 69.7 | 69.9 | 71.1 |
| VideoMMMU | 54.7 | 58.6 | 61.0 | 65.0 |
| MVP | 36.5 | 39.4 | 40.5 | 43.0 |
| Charades-STA | 52.9 | 60.0 | 44.6 | 63.7 |
| ActivityNet-QA | 26.9 | 47.6 | 36.1 | 51.9 |
| Next-GQA | 20.2 | 36.7 | 37.1 | 44.2 |
Due to differences in environment or library versions, performance may vary slightly (±0.5%) from the results reported in the paper.
This project builds upon the following excellent works: Qwen-VL, TRL, lmms-eval, etc. We thank all researchers and developers who contributed to these foundational projects.
If you use VideoAuto-R1 in your research, please cite:
@article{liu2026videoautor1,
  title={VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice},
  author={Liu, Shuming and Zhuge, Mingchen and Zhao, Changsheng and Chen, Jun and Wu, Lemeng and Liu, Zechun and Zhu, Chenchen and Cai, Zhipeng and Zhou, Chong and Liu, Haozhe and Chang, Ernie and Suri, Saksham and Xu, Hongyu and Qian, Qi and Wen, Wei and Varadarajan, Balakrishnan and Liu, Zhuang and Xu, Hu and Bordes, Florian and Krishnamoorthi, Raghuraman and Ghanem, Bernard and Chandra, Vikas and Xiong, Yunyang},
  journal={arXiv preprint arXiv:2601.05175},
  year={2026}
}

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you have any questions, please contact: shuming.liu@kaust.edu.sa.
