Lumina 2.0 Hyperparams
Params: 2.6B
Patch Size: 2
Dimension: 2304
Heads: 24
KV Heads: 8
Layers: 26
RMSNorm ε: 1e-5
Pos. Emb.: M-RoPE
The text encoder is Gemma2-2b. The text and image processors use lightweight single-stream blocks, and Multimodal RoPE (M-RoPE) is used to model joint text-image sequences.
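As a rough illustration of how M-RoPE can index a text-then-image sequence, here is a minimal sketch. The axis order, offsets, and the rule that text tokens share one ID across all axes are assumptions for illustration, not taken from the Lumina 2.0 code.

```python
# Hedged sketch: assigning multimodal RoPE (M-RoPE) position IDs to a
# text-then-image token sequence. Axis conventions are assumptions.

def mrope_position_ids(num_text_tokens, img_h, img_w):
    """Return (time, height, width) position triples for each token.

    Text tokens advance all three axes together (reducing to 1-D RoPE);
    image tokens share one time index and use their 2-D grid coordinates.
    """
    pos = []
    for i in range(num_text_tokens):
        pos.append((i, i, i))          # text: identical IDs on every axis
    t = num_text_tokens                # image tokens start after the text
    for h in range(img_h):
        for w in range(img_w):
            pos.append((t, t + h, t + w))
    return pos

ids = mrope_position_ids(num_text_tokens=3, img_h=2, img_w=2)
# text tokens get (0,0,0), (1,1,1), (2,2,2); image tokens are offset by 3
```

With this scheme, pure-text spans behave exactly like ordinary 1-D RoPE, while image patches get distinct height/width rotations.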
Optimizer: AdamW
Learning Rate: 2×10⁻⁴ for all three training stages (Low Res., High Res., HQ Tuning).
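For reference, a single AdamW update at the paper's learning rate can be sketched in a few lines. Only the learning rate (2×10⁻⁴) comes from the paper; the betas, epsilon, and weight decay below are common defaults, not confirmed values.

```python
# Hedged sketch of one AdamW step; only lr=2e-4 is from the paper.
import math

def adamw_step(param, grad, m, v, step, lr=2e-4,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Return updated (param, m, v) for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment EMA
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    param -= lr * weight_decay * param          # decoupled weight decay
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
# p ≈ 0.999798: the parameter moves by roughly lr, since bias correction
# makes the first step's effective update ≈ lr * sign(grad)
```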
Flow Matching Hyperparameters
The paper notes that an auxiliary loss based on the flow-matching objective is applied during high-resolution training.
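For context, the standard rectified-flow / flow-matching objective can be sketched as follows: interpolate between a noise sample x0 and a data sample x1, then regress the model's predicted velocity onto the target velocity x1 − x0. This is a generic sketch of the objective, not the Lumina 2.0 implementation; the stand-in `model` is a placeholder.

```python
# Hedged sketch of the flow-matching objective (generic, not Lumina-specific).
import random

def flow_matching_loss(model, x1, t=None):
    """Scalar MSE flow-matching loss for one data sample x1 (a list)."""
    x0 = [random.gauss(0.0, 1.0) for _ in x1]      # noise sample
    if t is None:
        t = random.random()                        # timestep in [0, 1]
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]       # velocity x1 - x0
    pred = model(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)

# A trivial zero-velocity "model" yields a non-negative loss:
loss = flow_matching_loss(lambda xt, t: [0.0] * len(xt), [1.0, -2.0])
```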
Training Hyperparameters
The training process is divided into three stages with the following configurations:
| Stage | Image Resolution | #Images | Training Steps (K) | Batch Size | GPU Days (A100) |
|---|---|---|---|---|---|
| Low Resolution | 256×256 | 100M | 144 | 1024 | 191 |
| High Resolution | 1024×1024 | 10M | 40 | 512 | 176 |
| HQ Tuning | 1024×1024 | 1M | 15 | 512 | 224 |
The training was performed on 32 A100 GPUs for all stages.
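The per-stage GPU-day figures can be combined with the 32-GPU cluster size to estimate total compute and approximate wall-clock time per stage (the conversion assumes all 32 GPUs run continuously):

```python
# Arithmetic from the stage table: total compute and approximate
# wall-clock days per stage on the stated 32-GPU A100 cluster.
gpu_days = {"low_res": 191, "high_res": 176, "hq_tuning": 224}
num_gpus = 32

total = sum(gpu_days.values())                  # 591 A100 GPU-days overall
wall_clock = {k: d / num_gpus for k, d in gpu_days.items()}
# e.g. the low-res stage takes roughly 191 / 32 ≈ 6 calendar days
```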
For efficient inference, several techniques are discussed:
Classifier-Free Guidance (CFG): The paper discusses CFG-Renormalization and CFG-Truncation.
CFG-Truncation threshold (α): A predefined threshold past which the extra velocity computation for guidance is switched off. The specific value is not mentioned.
Flow-DPM-Solver (FDPM): Achieves convergence in 14-20 NFEs (Number of Function Evaluations).
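The two CFG variants can be sketched together for a flow-matching sampler. The exact renormalization rule, which velocity branch is dropped past the truncation threshold, and the threshold value are all assumptions here (the paper does not specify them in this summary); this sketch rescales the guided velocity to the conditional velocity's norm and drops the unconditional pass after the threshold.

```python
# Hedged sketch of CFG-Renormalization and CFG-Truncation for a
# flow-matching sampler. Rules and threshold are assumptions.
import math

def guided_velocity(v_cond, v_uncond, scale, t, trunc_threshold):
    """Combine conditional/unconditional velocities (plain Python lists)."""
    if t >= trunc_threshold:
        # CFG-Truncation (assumed form): past the threshold, skip the
        # guidance combination and return the conditional velocity alone.
        return v_cond
    v = [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]
    # CFG-Renormalization (assumed form): rescale the guided velocity so
    # its norm matches the conditional velocity's norm, which keeps high
    # guidance scales from blowing up the step size.
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_c = math.sqrt(sum(x * x for x in v_cond))
    if norm_v > 0:
        v = [x * norm_c / norm_v for x in v]
    return v

v = guided_velocity([1.0, 0.0], [0.0, 0.0], scale=4.0,
                    t=0.2, trunc_threshold=0.9)
# renorm shrinks the scaled velocity [4.0, 0.0] back to unit norm: [1.0, 0.0]
```

Truncation also saves compute: once guidance is off, only one model evaluation per step is needed, which compounds with FDPM's low NFE count.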