This is a re-implementation of our work "OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions" using public datasets and models re-trained on public code. In this work, we present a data construction pipeline that creates paired training data, together with a diffusion Transformer for subject-driven video customization under different control conditions. We will continue to complete this repo. If you find it useful, please give it a star ⭐ and consider citing our paper. Thank you :)
The overall framework of our OmniVCus
- 2025.12.26 : Training and testing code, training data, and pre-trained models have been released. Feel free to check them out and give them a try. 🚀
- 2025.12.03 : The data construction code has been uploaded. We will continue to refine and build out this repo. Stay tuned. 💫
- 2025.09.19 : Our paper has been accepted by NeurIPS 2025. 🎉 🎊
- 2025.06.30 : Our paper is now on arXiv. 🚀
- 2025.06.28 : Our project page is now online. Feel free to check the video generation results there.
We implement our data construction pipeline in the folder VideoCus-Factory, which constructs multimodal control conditions including subjects, depth, masks, and motion. We also provide the code in the folder Video-Depth-Anything for constructing higher-quality video depth conditions. Please enter the corresponding subfolders for environment installation and data preparation. The following is an example of the conditions constructed from a raw video; a minimal extraction sketch (with assumed tool entry points) follows the example.
| Generated Prompt: a woman and a child playing with a toy train. | | |
|---|---|---|
| Original Video | Segmented Subject | Augmented Subject |
| Depth Video | Mask Video | Motion Video |
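For orientation, below is a minimal per-video sketch of how such conditions could be extracted with the acknowledged tools (SAM2 for the subject mask, Video-Depth-Anything for depth, CoTracker3 for motion). The file paths, config names, and some call signatures are assumptions taken from those repos' READMEs; the released VideoCus-Factory scripts are the reference implementation.

```python
# Hypothetical one-video condition-extraction sketch (not the released pipeline script).
# Checkpoint paths, config names, and some call signatures are assumptions.
import numpy as np
import torch


def extract_conditions(frames: np.ndarray, click_xy: tuple) -> dict:
    """frames: (T, H, W, 3) uint8 RGB frames of the raw video."""
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # 1) Subject mask video: SAM2 video predictor, prompted with one click on the subject.
    from sam2.build_sam import build_sam2_video_predictor  # from the SAM2 subrepo
    predictor = build_sam2_video_predictor(
        "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt", device=device
    )
    state = predictor.init_state(video_path="raw_video.mp4")  # SAM2 loads the frames itself
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([click_xy], dtype=np.float32), labels=np.array([1]),
    )
    masks = {f: (logits[0] > 0).cpu().numpy() for f, _, logits in predictor.propagate_in_video(state)}

    # 2) Segmented subject image: keep only the masked pixels of the first frame.
    # (The "Augmented Subject" variant additionally applies augmentations to this crop.)
    subject = frames[0] * masks[0][0][..., None]

    # 3) Depth video: Video-Depth-Anything (constructor/inference call assumed from its README).
    from video_depth_anything.video_depth import VideoDepthAnything
    vda = VideoDepthAnything(encoder="vitl", features=256, out_channels=[256, 512, 1024, 1024])
    vda.load_state_dict(torch.load("checkpoints/video_depth_anything_vitl.pth", map_location="cpu"))
    depths, _ = vda.to(device).eval().infer_video_depth(frames, target_fps=16, device=device)

    # 4) Motion video: CoTracker3 point tracks on a regular grid.
    cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)
    video = torch.from_numpy(frames).permute(0, 3, 1, 2)[None].float().to(device)  # (1, T, 3, H, W)
    tracks, visibility = cotracker(video, grid_size=10)

    return {"subject": subject, "mask": masks, "depth": depths, "motion": (tracks, visibility)}
```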
For research convenience, we provide our training and testing datasets on our Hugging Face pages: [Train Set], [Test Set]
In addition, we provide part of the original training and testing samples on Google Drive, together with webpages that make it easy to browse the data samples.
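A minimal download sketch with huggingface_hub is shown below; the dataset repo IDs are placeholders for the Train Set and Test Set links above, not real paths.

```python
from huggingface_hub import snapshot_download

# Replace the placeholder repo IDs with the Train Set / Test Set links above.
snapshot_download(repo_id="<train-set-repo-id>", repo_type="dataset", local_dir="data/OmniVCus_train")
snapshot_download(repo_id="<test-set-repo-id>", repo_type="dataset", local_dir="data/OmniVCus_test")
```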
We re-implement our method in the folder DiffSynth-Studio based on the Wan2.1-1.3B, Wan2.1-14B, Wan2.2-14B, and VACE models. We provide our trained models on Hugging Face, along with tensor-parallel training and testing code, summarized in the table below; a minimal inference sketch follows the table.
· Model Overview
| Model ID | Inference | Training |
|---|---|---|
| Wan2.1-OmniVCus-1.3B | code | code |
| Wan2.1-OmniVCus-14B | code | code |
| Wan2.2-OmniVCus-14B-high | code | code |
| Wan2.2-OmniVCus-14B-low | code | code |
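For quick orientation, here is a minimal single-GPU inference sketch assuming DiffSynth-Studio's Wan pipeline API. The checkpoint file names and the control-condition handling are assumptions; the released tensor-parallel training and testing scripts linked above are the reference.

```python
import torch
from diffsynth import ModelManager, WanVideoPipeline, save_video

# Load the OmniVCus DiT weights together with the base Wan2.1 text encoder and VAE
# (file names below are assumptions; adjust to the released checkpoints).
model_manager = ModelManager(device="cpu")
model_manager.load_models(
    [
        "models/Wan2.1-OmniVCus-1.3B/diffusion_pytorch_model.safetensors",
        "models/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
        "models/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
    ],
    torch_dtype=torch.bfloat16,
)
pipe = WanVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device="cuda")

video = pipe(
    prompt="a woman and a child playing with a toy train",
    negative_prompt="low quality, distorted",
    num_inference_steps=50,
    seed=0,
    tiled=True,
    # Subject images and depth/mask/motion condition videos are fed through the
    # OmniVCus-modified pipeline; see the test scripts for the exact argument names.
)
save_video(video, "omnivcus_demo.mp4", fps=15, quality=5)
```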
We compare our OmniVCus with the state-of-the-art method VACE as follows:
· (a) Wan2.1-1.3B model
| (a1) a woman rolling up a fitted sheet | |
|---|---|
| Reference Image | Depth Video |
| VACE-2.1-1.3B | OmniVCus-2.1-1.3B (Ours) |

| (a2) a church in the winter | |
|---|---|
| Reference Image | Mask Video |
| VACE-2.1-1.3B | OmniVCus-2.1-1.3B (Ours) |
· (b) Wan2.1-14B model
· (c) Wan2.2-14B model
Please enter the subfolder DiffSynth-Studio for detailed instructions on training and testing the models.
@inproceedings{omnivcus,
title={OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions},
author={Yuanhao Cai and He Zhang and Xi Chen and Jinbo Xing and Kai Zhang and Yiwei Hu and Yuqian Zhou and Zhifei Zhang and Soo Ye Kim and Tianyu Wang and Yulun Zhang and Xiaokang Yang and Zhe Lin and Alan Yuille},
booktitle={NeurIPS},
year={2025}
}
Acknowledgments: Our code is built upon and inspired by Wan2.1, Wan2.2, VACE, DiffSynth-Studio, SAM2, Depth-Anything-V2, Video-Depth-Anything, and CoTracker3. We thank the authors for their solid open-source work.