Fix ZeRO-3 optimizer initialization validation (#7844) #7929
amadhan882 wants to merge 8 commits into deepspeedai:master
Force-pushed from 6339f8c to c7417b9
…ckward pass Signed-off-by: amadhan882 <amadhan882@gmail.com>
…F16/ZenFlow integration Signed-off-by: amadhan882 <amadhan882@gmail.com>
Signed-off-by: amadhan882 <amadhan882@gmail.com>
deepspeed/runtime/engine.py
Outdated
    raise RuntimeError(
        "DeepSpeedEngine: Optimizer initialization failed. Check for JIT compilation errors.")

    optimizer_methods = ['step', 'load_state_dict']
Please add backward to this list.
deepspeed/runtime/engine.py
Outdated
    )

    # Validate engine separately
    if not hasattr(self, "backward") or not callable(getattr(self, "backward")):
Apologies if this was not previously clear, but self.optimizer.backward needs validating, not self.backward.
deepspeed/runtime/engine.py
Outdated
    # ZeRO stage >= 2 communicates during non gradient accumulation boundaries as well
    if self.zero_optimization_partition_gradients():
        self.optimizer.overlapping_partition_gradients_reduce_epilogue()
        if hasattr(self.optimizer, 'overlapping_partition_gradients_reduce_epilogue'):
It seems this check is now redundant due to line 425.
Signed-off-by: amadhan882 <amadhan882@gmail.com>
@sfc-gh-truwase Thanks for the clarification!
Please let me know if anything else needs adjustment.
Signed-off-by: amadhan882 <amadhan882@gmail.com>
@amadhan882 can you please address the formatting issues?
Thanks for the feedback. I am currently reviewing the changes to resolve the formatting issues and will push the updated commits shortly.
I saw a lot of formatting changes; I ran into the same thing when using opencode to change code. The cause is that the edit tool call changes the formatting. If you are using opencode or cc, ask your agent to use sed instead of the edit tool call to modify the code, which will solve this issue.
Overview
This PR addresses issue #7844 by adding a validation check to ensure the ZeRO-3 optimizer is correctly initialized before training begins.
Changes
Checks for the .step attribute on the optimizer, specifically for ZeRO Stage 3 configurations.
Related Issue
Fixes #7844