Merged
112 commits
ee8931b
Adding Cosmos 3
atharvajoshi10 May 14, 2026
da55428
removed dead code
atharvajoshi10 May 15, 2026
786dbd4
Change customer TimeEmbedding Layer to DIffusers Time Embedding
atharvajoshi10 May 15, 2026
2415b39
removed dependency on hugging face transformers
atharvajoshi10 May 15, 2026
5d4d453
refactor 1
atharvajoshi10 May 15, 2026
059644d
Fixed Attention Pattern
atharvajoshi10 May 15, 2026
277ef7b
Removed from Pretrain overrides
atharvajoshi10 May 15, 2026
2aa39f7
Removing normalization from the audio Tokenizer
atharvajoshi10 May 18, 2026
1b73e0b
fixed diffusers checkpoint
atharvajoshi10 May 18, 2026
dc6460f
fixed video save uint conversion
atharvajoshi10 May 18, 2026
6cd13c9
added forward hook for cpu offload case
atharvajoshi10 May 18, 2026
3c5b60e
removed dead params for sound tokenizer
atharvajoshi10 May 18, 2026
807344f
renaming audio encoder for readability
atharvajoshi10 May 18, 2026
722f4ee
ruff format
atharvajoshi10 May 18, 2026
0d5391a
Fix checkpoint conversion script for sound tokenizer
yzhautouskay May 19, 2026
f28e468
Audio Decoder trim and removing some dead code
atharvajoshi10 May 19, 2026
409a3a4
removing dead sequence packing code
atharvajoshi10 May 19, 2026
5f5f72a
refactor pipeline to diffusers style formatting
atharvajoshi10 May 19, 2026
da0b661
removing use of cosmos3 audio encoder
atharvajoshi10 May 19, 2026
d774d04
Revert "removing use of cosmos3 audio encoder"
atharvajoshi10 May 20, 2026
b367226
refactor audio encoder
atharvajoshi10 May 20, 2026
fda144f
inline remaining sequence packing functions and lint
atharvajoshi10 May 21, 2026
9529ac5
Removed GenerationDataClean class and Action logic
atharvajoshi10 May 21, 2026
0e8e1ae
inlined default args
atharvajoshi10 May 21, 2026
a008d0c
removed dead code and refactoring
atharvajoshi10 May 21, 2026
2ab7f6d
drop pipeline-helper @no_grad, inline derive helper, move guidance ch…
atharvajoshi10 May 21, 2026
23f6cd3
drop unused list bookkeeping from PackedSequence
atharvajoshi10 May 21, 2026
e77e917
extract pack-time build state from PackedSequence dataclass
atharvajoshi10 May 21, 2026
721bcb7
drop private-API isinstance/shape asserts
atharvajoshi10 May 21, 2026
c19ff6f
build PackedSequence tensors on target device, drop to_cuda
atharvajoshi10 May 21, 2026
24e1bfc
drop SequencePlan, skip_text_tokens, and bos_token_id branch
atharvajoshi10 May 21, 2026
bc5655a
use retrieve_latents helper in _encode_video
atharvajoshi10 May 21, 2026
e991c7c
drop get_data_and_condition + data_batch dict scaffolding
atharvajoshi10 May 21, 2026
0ecddcc
drop _load_image_as_tensor; use VideoProcessor for conditioning frame
atharvajoshi10 May 21, 2026
30ed7e6
move encode/decode helpers + transformer forward into Cosmos3OmniTran…
atharvajoshi10 May 22, 2026
6069cd7
remove Cosmos3VLTextModel; flatten transformer layout
atharvajoshi10 May 22, 2026
7249739
move save_img_or_video/save_wav to cosmos/export_utils.py
atharvajoshi10 May 22, 2026
3c8ecba
drop @torch.no_grad on pipeline __call__
atharvajoshi10 May 22, 2026
682f04e
follow standard transformer conventions in Cosmos3OmniTransformer
atharvajoshi10 May 22, 2026
3d24008
restore @torch.no_grad() on Cosmos3OmniDiffusersPipeline.__call__
atharvajoshi10 May 22, 2026
0e6a80c
drop training-only dummy projection paths in transformer decode helpers
atharvajoshi10 May 22, 2026
338144d
inline modality encode/decode helpers into transformer forward
atharvajoshi10 May 22, 2026
b789b4c
trim dead flags and idioms in Cosmos3 pipeline __call__
atharvajoshi10 May 22, 2026
6445473
introduce Cosmos3Condition for prepare_latents return shape
atharvajoshi10 May 22, 2026
793ab50
restructure pack helpers as data-returning pipeline methods
atharvajoshi10 May 22, 2026
ec698b2
collapse builder-pattern PackedSequence/ModalityData into flat datacl…
atharvajoshi10 May 22, 2026
ff9aaa0
exclude dtype from saved transformer config via ignore_for_config
atharvajoshi10 May 22, 2026
705c6e5
add Cosmos3 pipeline and transformer docs pages; retire JSON example …
atharvajoshi10 May 22, 2026
088e5f9
restore full prompts in cosmos3 docs from original example JSONs
atharvajoshi10 May 22, 2026
63a54dc
Renamed Cosmos3 module attributes
MaciejBalaNV May 25, 2026
b8ad4cd
bugfix
MaciejBalaNV May 25, 2026
0b7d8e1
Removed unnecessary helper function; added extra comments
MaciejBalaNV May 26, 2026
637a680
Moved from encode_prompt to tokenize_prompt
MaciejBalaNV May 26, 2026
0432f15
Bring back video system prompt
MaciejBalaNV May 26, 2026
a883979
Remove multi frame conditioning for now
MaciejBalaNV May 26, 2026
4211004
Removed default negative prompts from code
MaciejBalaNV May 26, 2026
f1be047
Remove Cosmos3condition; simplify sequence pack
MaciejBalaNV May 26, 2026
c3ba51f
Clean up multiple parameters
MaciejBalaNV May 26, 2026
2f0052f
Simplify decode video; remove remainings of batching
MaciejBalaNV May 26, 2026
7531305
simple renames
MaciejBalaNV May 26, 2026
4c6275e
Refactored schedulers for sound
MaciejBalaNV May 26, 2026
9e74deb
Remove unnecessary autocast
MaciejBalaNV May 26, 2026
d748309
Update sound example
MaciejBalaNV May 26, 2026
6cf0bfa
Simplify loops in transformer_cosmos3.py
MaciejBalaNV May 26, 2026
91446a6
Remove unused config attributes
MaciejBalaNV May 26, 2026
21b4f92
Cleanup audio decoder
MaciejBalaNV May 26, 2026
9067c30
Reuse encoder_video from LTX2 for Cosmos3
MaciejBalaNV May 26, 2026
9eb6081
Fixed a few nits
MaciejBalaNV May 26, 2026
d169bd4
Moved to RMSNorm for Cosmos3
MaciejBalaNV May 26, 2026
425eda6
Remove meta_tensor usage
MaciejBalaNV May 26, 2026
57ce98e
Improved rope handling
MaciejBalaNV May 26, 2026
1abd488
Improved prompt templates
MaciejBalaNV May 26, 2026
dc4c1f2
Added extra docs for templatete
MaciejBalaNV May 26, 2026
03eb98f
remove dataclasses
Zhylkaaa May 26, 2026
f7a23f1
Cleanup after merging
MaciejBalaNV May 26, 2026
84a5557
Added guardrails
MaciejBalaNV May 26, 2026
99a35cc
bugfixed guardrails
MaciejBalaNV May 26, 2026
60bc3f5
Bugfix guardrails v2
MaciejBalaNV May 26, 2026
7f3ffb0
Simplified input_timestep
MaciejBalaNV May 26, 2026
ccd24b4
Add TODO
MaciejBalaNV May 26, 2026
b59253f
simplify conditional mask generation
Zhylkaaa May 26, 2026
5c8febc
Inlined _postprocess_latents
MaciejBalaNV May 26, 2026
6bd2c58
removed pack_input_sequence helper
atharvajoshi10 May 26, 2026
981a49c
restore export utils with deprecation warning
atharvajoshi10 May 26, 2026
0144512
moved sampling rate to pipeline attribute
atharvajoshi10 May 26, 2026
22064ee
inlined sound and image condition mask
atharvajoshi10 May 26, 2026
889ed18
seperating static and timestep based sound and vision token packing
atharvajoshi10 May 26, 2026
fa018fe
unpack transformer args
atharvajoshi10 May 26, 2026
8b7c601
enabled selection of cosmos3 super
atharvajoshi10 May 26, 2026
5c9dc8d
Update src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py
atharvajoshi10 May 26, 2026
f797563
Update src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py
atharvajoshi10 May 26, 2026
c47f8d3
Apply suggestions from code review
atharvajoshi10 May 26, 2026
db31af2
fixed sound and vision conditioning use from prepare latents
atharvajoshi10 May 26, 2026
16c12c4
ruff format and doc builder
atharvajoshi10 May 27, 2026
d2b5a45
ran fix copies
atharvajoshi10 May 27, 2026
7e05679
move typing to python3.10
Zhylkaaa May 27, 2026
9afd357
move special token application from pack_text_tokens to tokenize_prompt
Zhylkaaa May 27, 2026
03811a6
rename packing methods to process methods
Zhylkaaa May 27, 2026
ac6006f
remove guidance_scale check
Zhylkaaa May 27, 2026
2b3b006
fix nits
Zhylkaaa May 27, 2026
de6ad6d
respect vae dtype in the pipeline
Zhylkaaa May 27, 2026
8808a5b
use vae dtype for vae normalization stats
Zhylkaaa May 27, 2026
8153623
skip CFG if guidance_scale is 1
Zhylkaaa May 27, 2026
94bc454
Remove unnecessary parameter
MaciejBalaNV May 27, 2026
d1084cb
fix CFG for sound
Zhylkaaa May 27, 2026
017f628
bugfix for sound CFG
Zhylkaaa May 27, 2026
83e87ea
ruff format
atharvajoshi10 May 27, 2026
d2c92e3
Fix apply_chat_template return dict arg to return BatchEncoding
yzhautouskay May 27, 2026
9f99529
added option to select attention processor
atharvajoshi10 May 27, 2026
59e4369
docs: refresh Cosmos 3 pipeline intro
atharvajoshi10 May 27, 2026
d8b503d
docs: document Cosmos3OmniPipeline.__call__ arguments
atharvajoshi10 May 27, 2026
c2eb4cb
style: doc-builder reflow on Cosmos3OmniPipeline.__call__ docstring
atharvajoshi10 May 27, 2026
3 changes: 3 additions & 0 deletions .gitignore
Expand Up @@ -170,6 +170,9 @@ tags

# RL pipelines may produce mp4 outputs
*.mp4
*.jpg
*.jpeg
*.wav

# dependencies
/transformers
Expand Down
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
Expand Up @@ -321,6 +321,8 @@
title: CogView4Transformer2DModel
- local: api/models/consisid_transformer3d
title: ConsisIDTransformer3DModel
- local: api/models/cosmos3_omni_transformer
title: Cosmos3OmniTransformer
- local: api/models/cosmos_transformer3d
title: CosmosTransformer3DModel
- local: api/models/dit_transformer2d
Expand Down Expand Up @@ -645,6 +647,8 @@
title: ConsisID
- local: api/pipelines/cosmos
title: Cosmos
- local: api/pipelines/cosmos3
title: Cosmos3
- local: api/pipelines/framepack
title: Framepack
- local: api/pipelines/helios
Expand Down
34 changes: 34 additions & 0 deletions docs/source/en/api/models/cosmos3_omni_transformer.md
@@ -0,0 +1,34 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# Cosmos3OmniTransformer

A Mixture-of-Transformer (MoT) joint vision-language transformer introduced as part of NVIDIA's Cosmos3 world foundation model family. The model runs two parallel computation pathways over a packed joint sequence:

- a **causal "understanding" pathway** that self-attends over text tokens with causal masking, and
- a **bi-directional "generation" pathway** that cross-attends from generation tokens (vision + optional sound latents) over the full understanding-plus-generation key/value set.

The two pathways share the same hidden size and number of layers but maintain **separate Q/K/V/O projections, MLPs, and RMSNorm parameters**, which is what makes the architecture a Mixture-of-Transformer rather than a standard Mixture-of-Experts. Position information is supplied through a 3D multimodal RoPE (mRoPE) that interleaves temporal / height / width frequencies for video latents and reuses the temporal axis for text and audio.
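The attention pattern described above can be sketched as a boolean mask. This is only an illustration in plain Python, not the model's implementation: the function name `build_joint_attention_mask` is hypothetical, and it assumes the packed sequence places all text (understanding) tokens first and all generation tokens after them.

```python
def build_joint_attention_mask(num_text: int, num_gen: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Assumed layout (not taken from the model code): text tokens occupy
    positions [0, num_text), generation tokens occupy the rest.
    """
    total = num_text + num_gen
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_text:
                # Understanding pathway: causal self-attention over text only.
                mask[q][k] = k <= q
            else:
                # Generation pathway: attends to every text and generation key.
                mask[q][k] = True
    return mask


mask = build_joint_attention_mask(num_text=3, num_gen=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query position. Text rows form a lower triangle that never reaches the generation columns, while generation rows are fully populated.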

The model can be loaded as follows.

```python
import torch
from diffusers import Cosmos3OmniTransformer

transformer = Cosmos3OmniTransformer.from_pretrained(
    "nvidia/Cosmos3-Nano", subfolder="transformer", torch_dtype=torch.bfloat16
)
```

## Cosmos3OmniTransformer

[[autodoc]] Cosmos3OmniTransformer
542 changes: 542 additions & 0 deletions docs/source/en/api/pipelines/cosmos3.md

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions docs/source/en/api/pipelines/ltx2.md
Expand Up @@ -38,7 +38,7 @@ from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
-from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.utils import encode_video

device = "cuda:0"
width = 768
Expand Down Expand Up @@ -124,7 +124,7 @@ import torch
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
-from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.utils import encode_video

device = "cuda"
width = 768
Expand Down Expand Up @@ -203,7 +203,7 @@ from diffusers import LTX2ConditionPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
-from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.utils import encode_video
from diffusers.utils import load_image

device = "cuda"
Expand Down Expand Up @@ -292,7 +292,7 @@ You can use both image and video conditions:
import torch
from diffusers import LTX2ConditionPipeline
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
-from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
from diffusers.utils import load_image, load_video

Expand Down Expand Up @@ -367,7 +367,7 @@ These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale`
```py
import torch
from diffusers import LTX2ImageToVideoPipeline
-from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
from diffusers.utils import load_image

Expand Down Expand Up @@ -440,7 +440,7 @@ The LTX-2.X models are sensitive to prompting style. Refer to the [official prom
import torch
from transformers import Gemma3Processor
from diffusers import LTX2Pipeline
-from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT, T2V_DEFAULT_SYSTEM_PROMPT

device = "cuda"
Expand Down
4 changes: 4 additions & 0 deletions docs/source/en/api/utilities.md
Expand Up @@ -38,6 +38,10 @@ Utility and helper functions for working with 🤗 Diffusers.

[[autodoc]] utils.export_to_video

## encode_video

[[autodoc]] utils.encode_video

## make_image_grid

[[autodoc]] utils.make_image_grid
Expand Down
63 changes: 63 additions & 0 deletions examples/cosmos3/README.md
@@ -0,0 +1,63 @@
# Cosmos3 — smoke-test runner

The canonical reference for `Cosmos3OmniPipeline` lives in the diffusers docs:
[`docs/source/en/api/pipelines/cosmos3.md`](../../docs/source/en/api/pipelines/cosmos3.md). Use the
examples there as the source of truth for application code — they cover text-to-image,
text-to-video, image-to-video, and text+sound modes.

This directory provides a small CLI wrapper (`inference_cosmos3.py`) that exercises the full
load → encode → denoise → decode path against either the Hub release or a local checkpoint
during development.

## Setup

```bash
pip install -r examples/cosmos3/requirements.txt
```

## Usage

Text-to-image:

```bash
python examples/cosmos3/inference_cosmos3.py \
  --prompt "A medium shot of a modern robotics research laboratory…" \
  --num-frames 1
```

Text-to-video:

```bash
python examples/cosmos3/inference_cosmos3.py \
  --prompt "A waterfall cascading down a rocky cliff in a lush forest."
```

Image-to-video:

```bash
python examples/cosmos3/inference_cosmos3.py \
  --prompt "The right robotic hand picks up the red sphere…" \
  --vision-path https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/assets/robot_153.jpg
```

Text-to-video-with-sound (sound-capable checkpoint only):

```bash
python examples/cosmos3/inference_cosmos3.py \
  --prompt "A waterfall in a lush forest." \
  --enable-sound
```

### Useful flags

| Flag | Default | Description |
|---|---|---|
| `--prompt` | (required) | Text prompt. |
| `--vision-path` | `None` | URL or local path for an image-conditioning frame (image-to-video). |
| `--num-frames` | `189` | `1` = image, otherwise number of video frames (`189` ≈ 7.9 s @ 24 FPS). |
| `--height` / `--width` | `720` / `1280` | Output resolution (each dimension must be a multiple of the VAE spatial scale factor). |
| `--fps` | `24.0` | Frame rate of the generated video. |
| `--enable-sound` | off | Generate a synchronized audio track. |
| `--no-duration-template` | off | Skip the duration metadata sentence appended to the prompt and negative prompt. Ignored for `--num-frames 1`. |
| `--no-resolution-template` | off | Skip the resolution metadata sentence appended to the prompt and negative prompt. |
| `--output` | `.` | Directory to write `sample.jpg` or `sample.mp4`. |
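
The duration figure in the table follows directly from `num_frames / fps`. The sketch below reproduces that arithmetic and shows one way to pre-check a resolution before launching a run. Both helper names are hypothetical, and `spatial_factor=16` is a placeholder assumption: read the real factor from the loaded VAE rather than relying on this value.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Length of the generated clip in seconds."""
    return num_frames / fps


def is_valid_resolution(height: int, width: int, spatial_factor: int) -> bool:
    """True when both dimensions are exact multiples of spatial_factor."""
    return height % spatial_factor == 0 and width % spatial_factor == 0


# Defaults from the table above: 189 frames at 24 FPS.
print(clip_duration_seconds(189, 24.0))  # 7.875

# spatial_factor=16 is an illustrative placeholder, not the checkpoint's value.
print(is_valid_resolution(720, 1280, spatial_factor=16))  # True
```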
150 changes: 150 additions & 0 deletions examples/cosmos3/inference_cosmos3.py
@@ -0,0 +1,150 @@
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Minimal smoke-test runner for the Cosmos3 diffusers pipeline.

Canonical examples live in the docs page at
``docs/source/en/api/pipelines/cosmos3.md`` — copy from there for production use.
This script exists to exercise the full load → encode → denoise → decode path
during development.

Text-to-image:
    python inference_cosmos3.py --prompt "A robot in a lab." --num-frames 1

Text-to-video:
    python inference_cosmos3.py --prompt "A waterfall in a forest."

Image-to-video:
    python inference_cosmos3.py --prompt "..." --vision-path /path/to/image.jpg

Text-to-video-with-sound (requires a sound-capable checkpoint):
    python inference_cosmos3.py --prompt "..." --enable-sound
"""

import argparse
import pathlib

import torch
from huggingface_hub import snapshot_download

from diffusers import Cosmos3OmniPipeline
from diffusers.utils import encode_video, export_to_video, load_image


HF_REPOS = {
    "nano": "nvidia/Cosmos3-Nano",
    "super": "nvidia/Cosmos3-Super",
}


def main():
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--prompt", required=True, help="Text prompt.")
    parser.add_argument(
        "--model",
        choices=sorted(HF_REPOS),
        default="nano",
        help="Which Cosmos3 checkpoint to load (maps to the corresponding nvidia/Cosmos3-* repo).",
    )
    parser.add_argument(
        "--vision-path",
        default=None,
        help="Optional URL or local path for an image-conditioning frame (enables image-to-video).",
    )
    parser.add_argument("--output", default=".", help="Directory to save generated video/image/audio files.")
    parser.add_argument("--height", type=int, default=720)
    parser.add_argument("--width", type=int, default=1280)
    parser.add_argument(
        "--num-frames",
        type=int,
        default=189,
        help="Number of frames to generate. Use 1 for text-to-image; defaults to 189 for video (≈ 7.9s @ 24 FPS).",
    )
    parser.add_argument("--fps", type=float, default=24.0)
    parser.add_argument(
        "--enable-sound",
        action="store_true",
        default=False,
        help="Generate sound alongside video (requires a sound-capable checkpoint).",
    )
    parser.add_argument(
        "--no-duration-template",
        dest="add_duration_template",
        action="store_false",
        default=True,
        help="Skip the duration metadata sentence appended to the prompt and negative prompt (video only).",
    )
    parser.add_argument(
        "--no-resolution-template",
        dest="add_resolution_template",
        action="store_false",
        default=True,
        help="Skip the resolution metadata sentence appended to the prompt and negative prompt.",
    )
    parser.add_argument(
        "--disable-safety-checker",
        action="store_true",
        default=False,
        help="Disable the Cosmos Guardrail safety checker at pipeline construction (no checker instantiated).",
    )
    parser.add_argument(
        "--no-safety-check",
        action="store_true",
        default=False,
        help="Skip the Cosmos Guardrail text/video safety checks for this call (checker still constructed).",
    )
    args = parser.parse_args()

    hf_repo = HF_REPOS[args.model]
    print(f"Downloading pipeline from {hf_repo}")
    pipeline_path = pathlib.Path(snapshot_download(repo_id=hf_repo))
    print(f"Loading pipeline from {pipeline_path} …")
    pipeline = Cosmos3OmniPipeline.from_pretrained(
        str(pipeline_path),
        torch_dtype=torch.bfloat16,
        device_map="cuda",
        enable_safety_checker=not args.disable_safety_checker,
    )
    print("Pipeline loaded successfully.")

    output_dir = pathlib.Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    image = load_image(args.vision_path) if args.vision_path is not None else None

    result = pipeline(
        prompt=args.prompt,
        image=image,
        num_frames=args.num_frames,
        height=args.height,
        width=args.width,
        fps=args.fps,
        enable_sound=args.enable_sound,
        add_resolution_template=args.add_resolution_template,
        add_duration_template=args.add_duration_template,
        enable_safety_check=not args.no_safety_check,
    )

    if args.num_frames == 1:
        save_path = output_dir / "sample.jpg"
        result.video[0].save(save_path, format="JPEG", quality=85)
    else:
        save_path = output_dir / "sample.mp4"
        if result.sound is not None:
            assert pipeline.sound_tokenizer is not None
            encode_video(
                result.video,
                fps=int(args.fps),
                audio=result.sound,
                audio_sample_rate=pipeline.sound_tokenizer.config.sampling_rate,
                output_path=str(save_path),
            )
        else:
            # macro_block_size=1 allows arbitrary frame sizes (Cosmos3 outputs are not always divisible by 16).
            export_to_video(result.video, str(save_path), fps=int(args.fps), quality=10, macro_block_size=1)
    print(f"Saved: {save_path}")


if __name__ == "__main__":
    main()
17 changes: 17 additions & 0 deletions examples/cosmos3/requirements.txt
@@ -0,0 +1,17 @@
--extra-index-url https://download.pytorch.org/whl/cu130
torch
torchvision
accelerate>=0.31.0
av
huggingface_hub
imageio
imageio-ffmpeg
transformers>=4.41.2,<5
einops
peft>=0.11.1
datasets
numpy
tqdm
sentencepiece
tensorboard
wandb