Hi there, thanks for the amazing work! I found that expert parallelism is not compatible with the distributed optimizer in the forked version of Megatron-LM here:
https://github.com/stanford-futuredata/Megatron-LM/blob/85f95aef3b648075fe6f291c86714fdcbd9cd1f5/megatron/arguments.py#L352-L356
But there's no such validation in the open PR to Megatron-LM: NVIDIA/Megatron-LM#288
Does that mean the assertion is redundant and the current version of megablocks is compatible with the distributed optimizer under expert parallelism?
Thanks very much.
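For context, my reading of the linked lines is that they amount to a guard roughly like the sketch below. The argument names (`moe_expert_model_parallelism`, `use_distributed_optimizer`) are my assumptions about the fork's flags, and the comment is my own guess at the rationale, not something stated in the code:

```python
from argparse import Namespace

def validate_moe_args(args):
    """Hypothetical paraphrase of the validation at the linked arguments.py lines."""
    # Assumed flag names; the actual fork may spell these differently.
    if getattr(args, 'moe_expert_model_parallelism', False):
        # Presumably disallowed because expert parallelism places different
        # expert weights on different data-parallel ranks, while the
        # distributed optimizer assumes parameters are replicated across the
        # data-parallel group when sharding optimizer state.
        assert not args.use_distributed_optimizer, \
            'Expert model parallelism is not supported with the distributed optimizer.'

# Example: this combination would trip the assertion.
args = Namespace(moe_expert_model_parallelism=True, use_distributed_optimizer=True)
validate_moe_args(args)
```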