This is a re-implementation of our work "OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions" using public datasets and models re-trained on public code. In this work, we present a data construction pipeline that creates paired training data, together with a diffusion Transformer for subject-driven video customization under different control conditions. We will continue to complete this repo. If you find it useful, please give it a star ⭐ and consider citing our paper. Thank you :)
The overall framework of our OmniVCus
- 2025.12.26 : Training and testing code, training data, and pre-trained models have been released. Feel free to check them out and give them a try. 🚀
- 2025.12.03 : The data construction code has been uploaded. We will continue to refine and build out this repo. Stay tuned. 💫
- 2025.09.19 : Our paper has been accepted by NeurIPS 2025. 🎉 🎊
- 2025.06.30 : Our paper is now on arXiv. 🚀
- 2025.06.28 : Our project page is now online. Feel free to check the video generation results there.
We implement our data construction pipeline in the folder VideoCus-Factory, which constructs multimodal control conditions including subjects, depth, masks, and motion. We also provide the code in the folder Video-Depth-Anything for constructing higher-quality video depth conditions. Please enter the corresponding subfolders for environment installation and data preparation. The following is an example of the conditions constructed from a raw video; a minimal extraction sketch (with assumed tool entry points) follows the example.
| Generated Prompt: a woman and a child playing with a toy train. | | |
|---|---|---|
| Original Video | Segmented Subject | Augmented Subject |
| Depth Video | Mask Video | Motion Video |
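For orientation, below is a minimal per-video sketch of how such conditions could be extracted with the acknowledged tools (SAM2 for the subject mask, Video-Depth-Anything for depth, CoTracker3 for motion). The file paths, config names, and some call signatures are assumptions taken from those repos' READMEs; the released VideoCus-Factory scripts are the reference implementation.

```python
# Hypothetical one-video condition-extraction sketch (not the released pipeline script).
# Checkpoint paths, config names, and some call signatures are assumptions.
import numpy as np
import torch


def extract_conditions(frames: np.ndarray, click_xy: tuple) -> dict:
    """frames: (T, H, W, 3) uint8 RGB frames of the raw video."""
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # 1) Subject mask video: SAM2 video predictor, prompted with one click on the subject.
    from sam2.build_sam import build_sam2_video_predictor  # from the SAM2 subrepo
    predictor = build_sam2_video_predictor(
        "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt", device=device
    )
    state = predictor.init_state(video_path="raw_video.mp4")  # SAM2 loads the frames itself
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([click_xy], dtype=np.float32), labels=np.array([1]),
    )
    masks = {f: (logits[0] > 0).cpu().numpy() for f, _, logits in predictor.propagate_in_video(state)}

    # 2) Segmented subject image: keep only the masked pixels of the first frame.
    # (The "Augmented Subject" variant additionally applies augmentations to this crop.)
    subject = frames[0] * masks[0][0][..., None]

    # 3) Depth video: Video-Depth-Anything (constructor/inference call assumed from its README).
    from video_depth_anything.video_depth import VideoDepthAnything
    vda = VideoDepthAnything(encoder="vitl", features=256, out_channels=[256, 512, 1024, 1024])
    vda.load_state_dict(torch.load("checkpoints/video_depth_anything_vitl.pth", map_location="cpu"))
    depths, _ = vda.to(device).eval().infer_video_depth(frames, target_fps=16, device=device)

    # 4) Motion video: CoTracker3 point tracks on a regular grid.
    cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)
    video = torch.from_numpy(frames).permute(0, 3, 1, 2)[None].float().to(device)  # (1, T, 3, H, W)
    tracks, visibility = cotracker(video, grid_size=10)

    return {"subject": subject, "mask": masks, "depth": depths, "motion": (tracks, visibility)}
```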
For research convenience, we provide our training and testing datasets on our Hugging Face pages: [Train Set], [Test Set]
In addition, we provide part of the original training and testing samples on Google Drive, together with webpages that make it easy to browse the data samples.
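A minimal download sketch with huggingface_hub is shown below; the dataset repo IDs are placeholders for the Train Set and Test Set links above, not real paths.

```python
from huggingface_hub import snapshot_download

# Replace the placeholder repo IDs with the Train Set / Test Set links above.
snapshot_download(repo_id="<train-set-repo-id>", repo_type="dataset", local_dir="data/OmniVCus_train")
snapshot_download(repo_id="<test-set-repo-id>", repo_type="dataset", local_dir="data/OmniVCus_test")
```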
We re-implement our method in the folder DiffSynth-Studio based on the Wan2.1-1.3B, Wan2.1-14B, Wan2.2-14B, and VACE models. We provide our trained models on Hugging Face, along with tensor-parallel training and testing code, summarized in the table below; a minimal inference sketch follows the table.
· Model Overview
| Model ID | Inference | Training |
|---|---|---|
| Wan2.1-OmniVCus-1.3B | code | code |
| Wan2.1-OmniVCus-14B | code | code |
| Wan2.2-OmniVCus-14B-high | code | code |
| Wan2.2-OmniVCus-14B-low | code | code |
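For quick orientation, here is a minimal single-GPU inference sketch assuming DiffSynth-Studio's Wan pipeline API. The checkpoint file names and the control-condition handling are assumptions; the released tensor-parallel training and testing scripts linked above are the reference.

```python
import torch
from diffsynth import ModelManager, WanVideoPipeline, save_video

# Load the OmniVCus DiT weights together with the base Wan2.1 text encoder and VAE
# (file names below are assumptions; adjust to the released checkpoints).
model_manager = ModelManager(device="cpu")
model_manager.load_models(
    [
        "models/Wan2.1-OmniVCus-1.3B/diffusion_pytorch_model.safetensors",
        "models/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
        "models/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
    ],
    torch_dtype=torch.bfloat16,
)
pipe = WanVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device="cuda")

video = pipe(
    prompt="a woman and a child playing with a toy train",
    negative_prompt="low quality, distorted",
    num_inference_steps=50,
    seed=0,
    tiled=True,
    # Subject images and depth/mask/motion condition videos are fed through the
    # OmniVCus-modified pipeline; see the test scripts for the exact argument names.
)
save_video(video, "omnivcus_demo.mp4", fps=15, quality=5)
```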
We compare our OmniVCus with the state-of-the-art method VACE as follows:
· (a) Wan2.1-1.3B model
| (a1) a woman rolling up a fitted sheet | |
|---|---|
| Reference Image | Depth Video |
| VACE-2.1-1.3B | OmniVCus-2.1-1.3B (Ours) |

| (a2) a church in the winter | |
|---|---|
| Reference Image | Mask Video |
| VACE-2.1-1.3B | OmniVCus-2.1-1.3B (Ours) |
· (b) Wan2.1-14B model
· (c) Wan2.2-14B model
Please enter the subfolder DiffSynth-Studio for detailed instructions on training and testing the models.
@inproceedings{omnivcus,
title={OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions},
author={Yuanhao Cai and He Zhang and Xi Chen and Jinbo Xing and Kai Zhang and Yiwei Hu and Yuqian Zhou and Zhifei Zhang and Soo Ye Kim and Tianyu Wang and Yulun Zhang and Xiaokang Yang and Zhe Lin and Alan Yuille},
booktitle={NeurIPS},
year={2025}
}
Acknowledgments: Our code is built upon and inspired by Wan2.1, Wan2.2, VACE, DiffSynth-Studio, SAM2, Depth-Anything-V2, Video-Depth-Anything, and CoTracker3. We thank the authors for their solid open-source work.