Lumina 2.0 Hyperparams
Params: 2.6B
Patch Size: 2
Dimension: 2304
Heads: 24
KV Heads: 8
Layers: 26
RMSNorm ε: 1e-5
Pos. Emb.: M-RoPE
The text encoder is Gemma2-2b. The text and image processors use lightweight single-stream blocks, and Multimodal RoPE (M-RoPE) is used to model joint text-image sequences.
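As a rough illustration of how M-RoPE can index a text-then-image sequence, here is a minimal sketch. The axis order, offsets, and the rule that text tokens share one ID across all axes are assumptions for illustration, not taken from the Lumina 2.0 code.

```python
# Hedged sketch: assigning multimodal RoPE (M-RoPE) position IDs to a
# text-then-image token sequence. Axis conventions are assumptions.

def mrope_position_ids(num_text_tokens, img_h, img_w):
    """Return (time, height, width) position triples for each token.

    Text tokens advance all three axes together (reducing to 1-D RoPE);
    image tokens share one time index and use their 2-D grid coordinates.
    """
    pos = []
    for i in range(num_text_tokens):
        pos.append((i, i, i))          # text: identical IDs on every axis
    t = num_text_tokens                # image tokens start after the text
    for h in range(img_h):
        for w in range(img_w):
            pos.append((t, t + h, t + w))
    return pos

ids = mrope_position_ids(num_text_tokens=3, img_h=2, img_w=2)
# text tokens get (0,0,0), (1,1,1), (2,2,2); image tokens are offset by 3
```

With this scheme, pure-text spans behave exactly like ordinary 1-D RoPE, while image patches get distinct height/width rotations.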
Optimizer: AdamW
Learning Rate: 2×10⁻⁴ for all three training stages (Low Res., High Res., HQ Tuning).
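For reference, a single AdamW update at the paper's learning rate can be sketched in a few lines. Only the learning rate (2×10⁻⁴) comes from the paper; the betas, epsilon, and weight decay below are common defaults, not confirmed values.

```python
# Hedged sketch of one AdamW step; only lr=2e-4 is from the paper.
import math

def adamw_step(param, grad, m, v, step, lr=2e-4,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Return updated (param, m, v) for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment EMA
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    param -= lr * weight_decay * param          # decoupled weight decay
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
# p ≈ 0.999798: the parameter moves by roughly lr, since bias correction
# makes the first step's effective update ≈ lr * sign(grad)
```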
Flow Matching Hyperparameters
The paper notes that an auxiliary loss based on the flow-matching objective is applied during high-resolution training.
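For context, the standard rectified-flow / flow-matching objective can be sketched as follows: interpolate between a noise sample x0 and a data sample x1, then regress the model's predicted velocity onto the target velocity x1 − x0. This is a generic sketch of the objective, not the Lumina 2.0 implementation; the stand-in `model` is a placeholder.

```python
# Hedged sketch of the flow-matching objective (generic, not Lumina-specific).
import random

def flow_matching_loss(model, x1, t=None):
    """Scalar MSE flow-matching loss for one data sample x1 (a list)."""
    x0 = [random.gauss(0.0, 1.0) for _ in x1]      # noise sample
    if t is None:
        t = random.random()                        # timestep in [0, 1]
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]       # velocity x1 - x0
    pred = model(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)

# A trivial zero-velocity "model" yields a non-negative loss:
loss = flow_matching_loss(lambda xt, t: [0.0] * len(xt), [1.0, -2.0])
```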
Training Hyperparameters
The training process is divided into three stages with the following configurations:
| Stage | Image Resolution | #Images | Training Steps (K) | Batch Size | GPU Days (A100) |
|---|---|---|---|---|---|
| Low Resolution | 256×256 | 100M | 144 | 1024 | 191 |
| High Resolution | 1024×1024 | 10M | 40 | 512 | 176 |
| HQ Tuning | 1024×1024 | 1M | 15 | 512 | 224 |
The training was performed on 32 A100 GPUs for all stages.
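The per-stage GPU-day figures can be combined with the 32-GPU cluster size to estimate total compute and approximate wall-clock time per stage (the conversion assumes all 32 GPUs run continuously):

```python
# Arithmetic from the stage table: total compute and approximate
# wall-clock days per stage on the stated 32-GPU A100 cluster.
gpu_days = {"low_res": 191, "high_res": 176, "hq_tuning": 224}
num_gpus = 32

total = sum(gpu_days.values())                  # 591 A100 GPU-days overall
wall_clock = {k: d / num_gpus for k, d in gpu_days.items()}
# e.g. the low-res stage takes roughly 191 / 32 ≈ 6 calendar days
```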
For efficient inference, several techniques are discussed:
Classifier-Free Guidance (CFG): The paper discusses CFG-Renormalization and CFG-Truncation.
CFG-Truncation threshold (α): A predefined threshold past which the extra velocity computation for guidance is switched off. The specific value is not mentioned.
Flow-DPM-Solver (FDPM): Achieves convergence in 14-20 NFEs (Number of Function Evaluations).
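The two CFG variants can be sketched together for a flow-matching sampler. The exact renormalization rule, which velocity branch is dropped past the truncation threshold, and the threshold value are all assumptions here (the paper does not specify them in this summary); this sketch rescales the guided velocity to the conditional velocity's norm and drops the unconditional pass after the threshold.

```python
# Hedged sketch of CFG-Renormalization and CFG-Truncation for a
# flow-matching sampler. Rules and threshold are assumptions.
import math

def guided_velocity(v_cond, v_uncond, scale, t, trunc_threshold):
    """Combine conditional/unconditional velocities (plain Python lists)."""
    if t >= trunc_threshold:
        # CFG-Truncation (assumed form): past the threshold, skip the
        # guidance combination and return the conditional velocity alone.
        return v_cond
    v = [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]
    # CFG-Renormalization (assumed form): rescale the guided velocity so
    # its norm matches the conditional velocity's norm, which keeps high
    # guidance scales from blowing up the step size.
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_c = math.sqrt(sum(x * x for x in v_cond))
    if norm_v > 0:
        v = [x * norm_c / norm_v for x in v]
    return v

v = guided_velocity([1.0, 0.0], [0.0, 0.0], scale=4.0,
                    t=0.2, trunc_threshold=0.9)
# renorm shrinks the scaled velocity [4.0, 0.0] back to unit norm: [1.0, 0.0]
```

Truncation also saves compute: once guidance is off, only one model evaluation per step is needed, which compounds with FDPM's low NFE count.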