[arXiv] [project page]

[NeurIPS 25] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions


 

Introduction

This is a re-implementation of our work "OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions" using public datasets and models re-trained on public code. In this work, we present a data construction pipeline that creates paired training data, together with a diffusion Transformer for subject-driven video customization under different control conditions. I will continue to build out this repo. If you find our repo useful, please give it a star ⭐ and consider citing our paper. Thank you :)

Figure: The overall framework of our OmniVCus.

 

News

  • 2025.12.26 : Training and testing codes, training data, and pre-trained models have been released. Please feel free to check and try. 🚀
  • 2025.12.03 : The data construction code has been uploaded. I will continue to refine and expand this repo. Stay tuned. 💫
  • 2025.09.19 : Our paper has been accepted by NeurIPS 2025. 🎉 🎊
  • 2025.06.30 : Our paper is on arXiv now. 🚀
  • 2025.06.28 : Our project page has been launched. Feel free to check the video generation results on the project page.

 

1. Data Construction

We implement our data construction pipeline in the folder VideoCus-Factory, which constructs multimodal control conditions including subjects, depth, masks, and motion. We also provide code in the folder Video-Depth-Anything for constructing the video depth condition. Please enter the corresponding subfolders for environment installation and data preparation. The following is an example of constructing conditions from a raw video, followed by a sketch of the overall flow.

Generated Prompt: a woman and a child playing with a toy train.

(Video grid: Original Video · Segmented Subject · Augmented Subject / Depth Video · Mask Video · Motion Video)
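For orientation, here is a minimal sketch of how the per-video flow could be wired up. It is NOT the actual VideoCus-Factory API: the extractor callables are hypothetical stand-ins for the tools this repo uses (SAM2 for masks, Video-Depth-Anything for depth, CoTracker3 for motion tracks), and the real scripts live in the subfolders.

```python
# Minimal sketch of the condition-construction flow, not the actual
# VideoCus-Factory API. Extractors are injected callables so any
# segmenter (e.g. SAM2), depth model (e.g. Video-Depth-Anything),
# or point tracker (e.g. CoTracker3) can be plugged in.
from typing import Callable, Dict, List

import cv2
import numpy as np

Frames = List[np.ndarray]

def read_frames(video_path: str) -> Frames:
    """Decode a video into a list of RGB frames with OpenCV."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        ok, frame = cap.read()
    cap.release()
    return frames

def build_conditions(
    video_path: str,
    segment: Callable[[Frames], Frames],           # per-frame HxW subject masks
    estimate_depth: Callable[[Frames], Frames],    # per-frame depth maps
    track_motion: Callable[[Frames], np.ndarray],  # point tracks across frames
) -> Dict[str, object]:
    """Run each condition extractor once over the decoded frames."""
    frames = read_frames(video_path)
    masks = segment(frames)
    # Segmented subject = original frames restricted to the mask region
    # (assumes masks are HxW arrays; broadcast over the channel axis).
    subject = [f * (m[..., None] > 0) for f, m in zip(frames, masks)]
    return {
        "mask": masks,
        "subject": subject,
        "depth": estimate_depth(frames),
        "motion": track_motion(frames),
    }
```

The actual pipeline additionally captions each clip to obtain the text prompt and augments the segmented subject; those steps are handled by the scripts in the corresponding subfolders.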

For your research convenience, we provide our training and testing datasets on our Hugging Face pages: [Train Set], [Test Set]
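If you prefer pulling the data programmatically, a snapshot download via the huggingface_hub library works; note that the repo IDs below are placeholders, so substitute the actual dataset names from the [Train Set] and [Test Set] links.

```python
from huggingface_hub import snapshot_download

# NOTE: placeholder repo IDs -- substitute the actual dataset names
# from the [Train Set] / [Test Set] links above.
snapshot_download(repo_id="<user>/OmniVCus-TrainSet", repo_type="dataset",
                  local_dir="data/train")
snapshot_download(repo_id="<user>/OmniVCus-TestSet", repo_type="dataset",
                  local_dir="data/test")
```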

In addition, we provide parts of the original training and testing dataset samples on Google Drive. We have built webpages in the Google Drive folders so you can conveniently browse the data samples.

 

2. Training and Inference

We re-implement our method in the folder DiffSynth-Studio based on the Wan2.1-1.3B, Wan2.1-14B, Wan2.2-14B, and VACE models. We provide our trained models on Hugging Face. Tensor-parallel testing and training code is listed below; a minimal inference sketch follows the table.

· Model Overview

| Model ID | Inference | Training |
| --- | --- | --- |
| Wan2.1-OmniVCus-1.3B | code | code |
| Wan2.1-OmniVCus-14B | code | code |
| Wan2.2-OmniVCus-14B-high | code | code |
| Wan2.2-OmniVCus-14B-low | code | code |
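For a quick sanity check, the snippet below sketches single-GPU inference assuming the stock DiffSynth-Studio Wan-video API. The checkpoint paths are placeholders, and the OmniVCus-specific conditioning inputs (subject images, depth/mask videos) are wired up by the repo's own inference scripts linked as "code" above; treat this as an assumed-API sketch, not the official entry point.

```python
# Minimal single-GPU inference sketch assuming the stock DiffSynth-Studio
# Wan-video API. Checkpoint paths are placeholders; the OmniVCus-specific
# conditioning is handled by this repo's own scripts, not shown here.
import torch
from diffsynth import ModelManager, WanVideoPipeline, save_video

model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda")
model_manager.load_models([
    "models/Wan2.1-OmniVCus-1.3B/diffusion_pytorch_model.safetensors",  # placeholder path
    "models/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",           # T5 text encoder
    "models/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",                            # video VAE
])
pipe = WanVideoPipeline.from_model_manager(model_manager, device="cuda")

video = pipe(
    prompt="a woman rolling up a fitted sheet",
    num_inference_steps=50,
    seed=0,
)
save_video(video, "omnivcus_demo.mp4", fps=15, quality=5)
```

For multi-GPU tensor-parallel runs, use the launch commands documented in the DiffSynth-Studio subfolder rather than adapting this sketch.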

We compare our OmniVCus with the state-of-the-art method VACE as follows:

· (a) 2.1-1.3B model

(a1) a woman rolling up a fitted sheet
(Video grid: Reference Image · Depth Video / VACE-2.1-1.3B · OmniVCus-2.1-1.3B (Ours))
(a2) a church in the winter
(Video grid: Reference Image · Mask Video / VACE-2.1-1.3B · OmniVCus-2.1-1.3B (Ours))

· (b) 2.1-14B model

(b1) a man holding a piece of paper in his hands
(Video grid: Reference Image · Depth Video / VACE-2.1-14B · OmniVCus-2.1-14B (Ours))
(b2) a boy in a medical gown and hairnet in a hospital room
(Video grid: Reference Image · Mask Video / VACE-2.1-14B · OmniVCus-2.1-14B (Ours))

· (c) 2.2-14B model

(c1) a boy looking into an open refrigerator, with tomatoes and a bottle of water on the floor
(Video grid: Reference Image · Depth Video / VACE-2.2-14B · OmniVCus-2.2-14B (Ours))
(c2) a woman standing in a room
(Video grid: Reference Image · Mask Video / VACE-2.2-14B · OmniVCus-2.2-14B (Ours))

Please enter the subfolder DiffSynth-Studio for detailed instructions on training and testing the models.

 

3. Citation

@inproceedings{omnivcus,
  title={OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions},
  author={Yuanhao Cai and He Zhang and Xi Chen and Jinbo Xing and Kai Zhang and Yiwei Hu and Yuqian Zhou and Zhifei Zhang and Soo Ye Kim and Tianyu Wang and Yulun Zhang and Xiaokang Yang and Zhe Lin and Alan Yuille},
  booktitle={NeurIPS},
  year={2025}
}

 

Acknowledgments: Our code is built upon and inspired by Wan2.1, Wan2.2, VACE, DiffSynth-Studio, SAM2, Depth-Anything-V2, Video-Depth-Anything, and CoTracker3. We thank the authors for their solid open-source work.
