The default training commands for the different versions are as follows:
CogVideoX-Fun can be trained with or without DeepSpeed; enabling DeepSpeed saves a significant amount of GPU memory.
Some parameters in the sh file can be confusing; they are explained below:
- `enable_bucket` enables bucket training. When enabled, the model does not center-crop the images and videos; instead, it groups them into buckets by resolution and trains on the entire images and videos.
- `random_frame_crop` randomly crops video frames to simulate videos with different frame counts.
- `random_hw_adapt` enables automatic height and width scaling for images and videos. When `random_hw_adapt` is enabled, the height and width of both training images and training videos are set with `image_sample_size` as the maximum and `min(video_sample_size, 512)` as the minimum.
  - For example, when `random_hw_adapt` is enabled, with `video_sample_n_frames=49`, `video_sample_size=1024`, and `image_sample_size=1024`, the resolution of image inputs for training is `512x512` to `1024x1024`, and the resolution of video inputs for training is `512x512x49` to `1024x1024x49`.
  - For example, when `random_hw_adapt` is enabled, with `video_sample_n_frames=49`, `video_sample_size=256`, and `image_sample_size=1024`, the resolution of image inputs for training is `256x256` to `1024x1024`, and the resolution of video inputs for training is `256x256x49`.
- `training_with_video_token_length` trains the model according to token length. For training images and videos, the height and width are set with `image_sample_size` as the maximum and `video_sample_size` as the minimum.
  - For example, when `training_with_video_token_length` is enabled, with `video_sample_n_frames=49`, `token_sample_size=1024`, `video_sample_size=256`, and `image_sample_size=1024`, the resolution of image inputs for training is `256x256` to `1024x1024`, and the resolution of video inputs for training is `256x256x49` to `1024x1024x49`.
  - For example, when `training_with_video_token_length` is enabled, with `video_sample_n_frames=49`, `token_sample_size=512`, `video_sample_size=256`, and `image_sample_size=1024`, the resolution of image inputs for training is `256x256` to `1024x1024`, and the resolution of video inputs for training is `256x256x49` to `1024x1024x9`.
    - The token length for a video with dimensions 512x512 and 49 frames is 13,312. To train with this token budget, set `token_sample_size = 512`.
      - At 512x512 resolution, the number of video frames is 49 (= 512 * 512 * 49 / 512 / 512).
      - At 768x768 resolution, the number of video frames is 21 (≈ 512 * 512 * 49 / 768 / 768).
      - At 1024x1024 resolution, the number of video frames is 9 (≈ 512 * 512 * 49 / 1024 / 1024).
      - Training on these resolutions with their corresponding frame counts allows the model to generate videos of different sizes.
- `train_mode` specifies the training mode, which can be either `normal` or `inpaint`. Since CogVideoX-Fun uses the inpaint model to achieve image-to-video generation, the scripts set it to `inpaint`. If you only need text-to-video generation, remove this line and training defaults to text-to-video mode.
- `resume_from_checkpoint` specifies whether training should resume from a previous checkpoint. Use a path, or `"latest"` to automatically select the last available checkpoint.
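
As a rough illustration, the sizing rules in the list above can be reproduced with shell arithmetic. This is a minimal sketch, not the training script's implementation: the function names (`min_side`, `token_length`, `frames_for`) are hypothetical, and the VAE compression factors (8x spatial, 4x temporal), the patch size of 2, the cap at `video_sample_n_frames`, and the rounding of frame counts down to the form 4k + 1 are assumptions chosen so that the numbers agree with the examples above.

```shell
#!/bin/sh
# Assumed compression factors; these are not stated in this README.
VAE_SPATIAL=8    # assumed spatial downsampling of the VAE
VAE_TEMPORAL=4   # assumed temporal downsampling of the VAE
PATCH=2          # assumed transformer patch size

# min_side VIDEO_SAMPLE_SIZE
# Lower bound used by random_hw_adapt: min(video_sample_size, 512).
min_side() {
    if [ "$1" -lt 512 ]; then echo "$1"; else echo 512; fi
}

# token_length SIDE FRAMES
# Token count of a SIDE x SIDE video with FRAMES frames.
token_length() {
    echo $(( (($2 - 1) / VAE_TEMPORAL + 1) * ($1 / VAE_SPATIAL / PATCH) * ($1 / VAE_SPATIAL / PATCH) ))
}

# frames_for SIDE TOKEN_SAMPLE_SIZE N_FRAMES
# Frame count at SIDE x SIDE that keeps the pixel volume within
# TOKEN_SAMPLE_SIZE^2 * N_FRAMES, capped at N_FRAMES and rounded down
# to the form 4k + 1 (an assumption, see the lead-in).
frames_for() {
    raw=$(( $2 * $2 * $3 / ($1 * $1) ))
    if [ "$raw" -gt "$3" ]; then raw=$3; fi
    echo $(( (raw - 1) / VAE_TEMPORAL * VAE_TEMPORAL + 1 ))
}

echo "random_hw_adapt side range: $(min_side 256) to 1024"
echo "tokens at 512x512x49: $(token_length 512 49)"
for side in 512 768 1024; do
    echo "${side}x${side}: $(frames_for "$side" 512 49) frames"
done
```

With `token_sample_size=512` and `video_sample_n_frames=49`, the loop reports 49, 21, and 9 frames for 512, 768, and 1024, which matches the list above.
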
CogVideoX-Fun without DeepSpeed:
export MODEL_NAME="models/Diffusion_Transformer/CogVideoX-Fun-2b-InP"
export DATASET_NAME="datasets/internal_datasets/"
export DATASET_META_NAME="datasets/internal_datasets/metadata.json"
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
accelerate launch --mixed_precision="bf16" scripts/cogvideox_fun/train.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$DATASET_NAME \
--train_data_meta=$DATASET_META_NAME \
--image_sample_size=1024 \
--video_sample_size=256 \
--token_sample_size=512 \
--video_sample_stride=3 \
--video_sample_n_frames=49 \
--train_batch_size=1 \
--video_repeat=1 \
--gradient_accumulation_steps=1 \
--dataloader_num_workers=8 \
--num_train_epochs=100 \
--checkpointing_steps=50 \
--learning_rate=2e-05 \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--seed=42 \
--output_dir="output_dir" \
--gradient_checkpointing \
--mixed_precision="bf16" \
--adam_weight_decay=3e-2 \
--adam_epsilon=1e-10 \
--vae_mini_batch=1 \
--max_grad_norm=0.05 \
--random_hw_adapt \
--training_with_video_token_length \
--enable_bucket \
--use_ema \
--train_mode="inpaint" \
--trainable_modules "."

CogVideoX-Fun with DeepSpeed:
export MODEL_NAME="models/Diffusion_Transformer/CogVideoX-Fun-2b-InP"
export DATASET_NAME="datasets/internal_datasets/"
export DATASET_META_NAME="datasets/internal_datasets/metadata.json"
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
accelerate launch --use_deepspeed --deepspeed_config_file config/zero_stage2_config.json --deepspeed_multinode_launcher standard scripts/cogvideox_fun/train.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$DATASET_NAME \
--train_data_meta=$DATASET_META_NAME \
--image_sample_size=1024 \
--video_sample_size=256 \
--token_sample_size=512 \
--video_sample_stride=3 \
--video_sample_n_frames=49 \
--train_batch_size=4 \
--video_repeat=1 \
--gradient_accumulation_steps=1 \
--dataloader_num_workers=8 \
--num_train_epochs=100 \
--checkpointing_steps=50 \
--learning_rate=2e-05 \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--seed=42 \
--output_dir="output_dir" \
--gradient_checkpointing \
--mixed_precision="bf16" \
--adam_weight_decay=3e-2 \
--adam_epsilon=1e-10 \
--vae_mini_batch=1 \
--max_grad_norm=0.05 \
--random_hw_adapt \
--training_with_video_token_length \
--enable_bucket \
--use_deepspeed \
--train_mode="inpaint" \
--trainable_modules "."

CogVideoX-Fun with multiple machines:
export MODEL_NAME="models/Diffusion_Transformer/CogVideoX-Fun-2b-InP"
export DATASET_NAME="datasets/internal_datasets/"
export DATASET_META_NAME="datasets/internal_datasets/metadata.json"
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
NUM_PROCESS=$((WORLD_SIZE * 8))
echo "MASTER_ADDR: ${MASTER_ADDR} MASTER_PORT: ${MASTER_PORT} NUM_PROCESS: ${NUM_PROCESS}"
accelerate launch --main_process_ip=$MASTER_ADDR --main_process_port=$MASTER_PORT --num_machines=$WORLD_SIZE --num_processes=$NUM_PROCESS --machine_rank=$RANK scripts/cogvideox_fun/train.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$DATASET_NAME \
--train_data_meta=$DATASET_META_NAME \
--image_sample_size=1024 \
--video_sample_size=256 \
--token_sample_size=512 \
--video_sample_stride=3 \
--video_sample_n_frames=49 \
--train_batch_size=4 \
--video_repeat=1 \
--gradient_accumulation_steps=1 \
--dataloader_num_workers=8 \
--num_train_epochs=100 \
--checkpointing_steps=50 \
--learning_rate=2e-05 \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--seed=42 \
--output_dir="output_dir" \
--gradient_checkpointing \
--mixed_precision="bf16" \
--adam_weight_decay=3e-2 \
--adam_epsilon=1e-10 \
--vae_mini_batch=1 \
--max_grad_norm=0.05 \
--random_hw_adapt \
--training_with_video_token_length \
--enable_bucket \
--train_mode="inpaint" \
--trainable_modules "."
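
The multi-machine command reads `WORLD_SIZE`, `RANK`, `MASTER_ADDR`, and `MASTER_PORT` from the environment and assumes 8 processes per machine. The sketch below shows one way these values might look on the first of two machines, and how the total process count and the resulting samples per optimizer step follow from them. The address, port, and the `GPUS_PER_NODE`, `TRAIN_BATCH_SIZE`, `GRAD_ACCUM_STEPS`, and `EFFECTIVE_BATCH` names are placeholders introduced for illustration; in practice a job scheduler usually provides the environment variables.

```shell
#!/bin/sh
# Hypothetical values for the first of two machines.
WORLD_SIZE=2                # number of machines (passed to --num_machines)
RANK=0                      # index of this machine (passed to --machine_rank)
MASTER_ADDR="192.0.2.10"    # placeholder address of the rank-0 machine
MASTER_PORT=29500           # placeholder free port on the rank-0 machine
GPUS_PER_NODE=8             # the command above hard-codes 8 processes per machine

# Total number of processes across all machines.
NUM_PROCESS=$((WORLD_SIZE * GPUS_PER_NODE))

# Samples per optimizer step under data parallelism:
# per-device batch size x number of processes x gradient accumulation steps.
TRAIN_BATCH_SIZE=4
GRAD_ACCUM_STEPS=1
EFFECTIVE_BATCH=$((TRAIN_BATCH_SIZE * NUM_PROCESS * GRAD_ACCUM_STEPS))

echo "machine ${RANK} of ${WORLD_SIZE}, master ${MASTER_ADDR}:${MASTER_PORT}"
echo "NUM_PROCESS=${NUM_PROCESS} EFFECTIVE_BATCH=${EFFECTIVE_BATCH}"
```

On machines with a different number of GPUs, the factor 8 in `NUM_PROCESS=$((WORLD_SIZE * 8))` would need to change accordingly.
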