[BUG] Fix: Fix gradient norm calculation and dynamic shape blocking in PP+ZeRO1 collective communication #7847
Thinksky5124 wants to merge 6 commits into deepspeedai:master
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99697687fc
```diff
  if self.has_moe_layers:
-     self._average_expert_grad_norms(norm_groups)
+     self._average_expert_grad_norms(local_total_sq_norm)
```
Preserve per-group norms when averaging MoE expert grads
When has_moe_layers is true, _average_expert_grad_norms expects a per-parameter-group collection (it iterates by group index and checks self.is_moe_param_group[i]), but this call now passes a single accumulated tensor. In runs with multiple optimizer groups, that means MoE scaling is applied to at most index 0 (or to the already-mixed total), so expert and non-expert contributions are mis-scaled before clipping/overflow logic, producing incorrect global grad norms.
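A minimal sketch of the per-group contract this comment describes (names are simplified assumptions; the real helper also all-reduces over the expert-parallel process group, which plain division stands in for here):

```python
def average_expert_grad_norms(norm_groups, is_moe_param_group, ep_world_size):
    """Sketch: expects one norm entry per optimizer group.

    Passing a single accumulated tensor instead of a per-group list means
    only index 0 can ever be scaled, so expert grads are mis-weighted
    whenever there are multiple parameter groups.
    """
    for i in range(len(norm_groups)):
        if is_moe_param_group[i]:
            # stands in for all_reduce + averaging over the EP group
            norm_groups[i] = norm_groups[i] / ep_world_size
    return norm_groups
```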
```diff
-     self._average_expert_grad_norms(norm_groups)
+     self._average_expert_grad_norms(local_total_sq_norm)

+ local_total_sq_norm = local_total_sq_norm.to(torch.cuda.current_device())
```
Avoid hard-coding CUDA device for norm reduction
This forces scaled_global_norm onto torch.cuda.current_device() even though the optimizer code is written against DeepSpeed's accelerator abstraction. On non-CUDA backends (or CPU-only execution), this line raises before all_reduce, so gradient norm computation and optimizer step fail outright; the tensor should stay on self.device or use get_accelerator().current_device_name().
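A hedged sketch of the suggested direction: `get_accelerator().current_device_name()` is DeepSpeed's real accelerator-abstraction API, while the try/except fallback is an assumption added so the sketch also illustrates behavior without DeepSpeed installed:

```python
def norm_reduction_device(default_device):
    # Prefer the accelerator abstraction over torch.cuda.current_device()
    # so non-CUDA backends and CPU-only runs keep working.
    try:
        from deepspeed.accelerator import get_accelerator
        return get_accelerator().current_device_name()  # e.g. "cuda:0" or "cpu"
    except Exception:
        return default_device  # keep the tensor on the optimizer's own device
```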
…ice move in scaled_global_norm Signed-off-by: Thinksky5124 <40914433+Thinksky5124@users.noreply.github.com>
tohtana left a comment
Hi @Thinksky5124,
Thank you for the PR! However, since this contains significant changes to the core DeepSpeed engine, it is difficult to ensure that existing behavior won't be impacted. Given that you’ve listed three distinct changes, could you please separate them into individual PRs?
Additionally, I’m not entirely clear on what is currently incorrect regarding the gradient norm. While the reduction (`dist.all_reduce(total_norm, ...)`) seems redundant, the result itself appears to be correct.
```diff
  if param_id in self.norm_for_param_grads:
      param_norm = self.norm_for_param_grads[param_id]
-     total_norm += param_norm.item()**2
+     local_sq_norm += param_norm**2
```
With offloading enabled, `local_sq_norm` is on the CPU while `param_norm` is on the accelerator.
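A small sketch of one way to handle this (names are illustrative): move the addend to the accumulator's device before squaring and adding, so CPU-offloaded accumulation works:

```python
import torch

def accumulate_sq_norm(local_sq_norm, param_norm):
    # Under CPU offload, local_sq_norm may live on "cpu" while param_norm
    # sits on the accelerator; align devices before accumulating.
    return local_sq_norm + param_norm.to(local_sq_norm.device) ** 2
```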
```diff
  assert norm_type == 2, "only L2 norm supported"

- self._model_parallel_all_reduce(tensor=total_norm, op=dist.ReduceOp.SUM)
+ local_sq_norm = torch.zeros(1, device=self.device, dtype=self.gradient_accumulation_dtype)
```
We can set `self.gradient_accumulation_dtype` to fp16/bf16. Accumulating the squared norm in the lower precision is not safe.
Changed to `torch.float32` for accumulating the grad norm.
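A sketch of the accumulation pattern being discussed (function name is illustrative): keep the accumulator in float32 regardless of the gradient dtype:

```python
import torch

def local_sq_grad_norm(grads):
    # Accumulate squared L2 norms in float32 even for bf16/fp16 grads;
    # low-precision accumulation loses bits and can overflow.
    acc = torch.zeros(1, dtype=torch.float32)
    for g in grads:
        acc += g.float().norm(2) ** 2
    return acc
```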
@tohtana Thank you for your patient review! I think we do not need to separate this PR. I will provide a complete and integrated description of the bug and the corresponding fix, elaborating on both with full technical detail.

**Variable and Symbol Definitions**

To clearly illustrate the differences before and after the fix, we first define the dimensions used in a distributed training cluster: $D$ is the data-parallel (DP) world size, $M$ is the model-parallel (MP) world size, and $G$ is the number of optimizer parameter groups.

**Overall Objective of the Function**

This function computes the global L2 norm of all loss-scaled gradients across both Data Parallel (DP) and Pipeline Parallel (PP) dimensions for all parameter groups.

**Before the Fix**

The original implementation only performed aggregation across DP ranks and ignored the fact that different PP ranks may hold different numbers of parameters, which leads to inconsistent communication patterns.

**Step 1: Local Norm Contribution per Parameter Group**

For all parameters in group $g$, each rank accumulates the local squared sum:

$$n_g^{\text{local}} = \sum_{i \in g} \lVert \mathrm{grad}_i \rVert_2^2$$

**Step 2: All-Reduce Across DP Process Group**

Perform `dist.all_reduce` with `SUM` over the DP process group:

$$n_g^{\text{DP}} = \sum_{d=1}^{D} n_{g,d}^{\text{local}}$$

**Step 3: Second All-Reduce Across Model Parallel (MP) Dimension**

Use `self._model_parallel_all_reduce` to sum the result across MP ranks. Expanding fully:

$$n_g = \sum_{m=1}^{M} \sum_{d=1}^{D} n_{g,d,m}^{\text{local}}$$

**Step 4: Compute Per-Group L2 Norm Scalar**

Take the square root of the total to obtain the global L2 norm for group $g$:

$$\lVert \mathrm{grad}_g \rVert_2 = \sqrt{n_g}$$

If the result is `inf` or `nan`, it is replaced with `-1` to signal overflow.

**Step 5: Final Formula for the Global Norm**

$$\lVert \mathrm{grad} \rVert_{\text{global}} = \sqrt{\sum_{g=1}^{G} n_g}$$
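The steps above can be checked numerically with a toy sketch in which a plain Python sum stands in for the `all_reduce(SUM)` calls:

```python
import math

def global_grad_norm(local_sq_sums):
    """local_sq_sums: per-rank sums of squared gradient norms.

    Summing across ranks simulates the DP/MP all_reduce(SUM);
    one final square root yields the global L2 norm.
    """
    total = sum(local_sq_sums)
    return math.sqrt(total)
```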
| Aspect | Old Version | New Version |
|---|---|---|
| Return value of `complete_grad_norm_...` | Per-group norms (`norm_groups`) | Single accumulated squared sum (`local_total_sq_norm`) |
| Intra-group communication | Performed inside the function for each group | None — deferred to final global reduction |
| Number of `all_reduce` calls | Once per parameter group | 1 time (after summing all groups) |
| Final square root | Applied to vector norm | Applied to global scalar sum |
The mathematical result is identical, but the new version **eliminates the risk of communication deadlock due to inconsistent `all_reduce` counts** and significantly reduces communication overhead by cutting the number of collective operations from one per parameter group to a single call.
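The equivalence (and the reduced call count) can be illustrated with a toy sketch in which the reduction itself is simulated:

```python
def old_scheme(group_sq_norms):
    # one (simulated) all_reduce per parameter group
    calls = len(group_sq_norms)
    return sum(group_sq_norms) ** 0.5, calls

def new_scheme(group_sq_norms):
    # sum all groups locally, then a single (simulated) all_reduce
    calls = 1
    return sum(group_sq_norms) ** 0.5, calls
```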
Describe the bug
This commit fixes gradient normalization bugs when using DeepSpeed Pipeline Parallel (pp) together with ZeRO Stage 1 (zero1), including the following aspects:
**PipelineEngine Buffer Type Consistency with Dynamic Shapes**

In `deepspeed/runtime/pipe/engine.py`, the activation buffer previously did not enforce dtype conversion, which could lead to inconsistent types and subsequent calculation errors. The return value is now explicitly cast to the target dtype, ensuring type consistency.

**ZeRO Stage 1/2 Gradient Normalization Logic Correction**

In `deepspeed/runtime/zero/stage_1_and_2.py`, for both CPU-offload and regular scenarios, the previous gradient normalization involved redundant communication and incorrect normalization. The `inf` constant and the unnecessary `norm_type` branches were removed; only the L2 norm is supported now, which simplifies the logic. Stricter skipping of `None` gradients and pipeline-replicated parameters improves robustness.

After these fixes, gradient normalization in PP+ZeRO1 scenarios is more accurate, avoiding double counting, type inconsistency, and redundant communication, thus improving training stability and performance. Reviewers are advised to focus on the correctness and compatibility of the gradient normalization code to ensure consistent behavior across different parallel/offload scenarios.
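For the buffer fix, a minimal sketch (the function and parameter names here are assumptions, not the engine's actual API) of casting the returned buffer to the target dtype:

```python
import torch

def allocate_recv_buffer(shape, target_dtype):
    # Explicitly cast so dynamically-shaped recv buffers match the
    # dtype expected by the communication path.
    return torch.empty(shape).to(target_dtype)
```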
DeepSpeed Config
```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "steps_per_print": 1,
  "zero_optimization": { "stage": 1 }
}
```