Remove LR scheduler for workers #1002
Conversation
Code Review
This pull request effectively removes the internal learning rate scheduler logic from the training workers, centralizing LR control under an external system like Tinker. The changes are consistent and thorough across FSDP and Megatron strategies, abstract base classes, and worker implementations. Key modifications include removing scheduler instantiation, step() calls, and state management from checkpoints. The code is now cleaner and the responsibility for learning rate scheduling is clearly externalized. The changes look solid.
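To illustrate the checkpointing change described above, here is a minimal sketch (hypothetical function and key names, not the actual skyrl_train code): once scheduler state is dropped, only model and optimizer state are persisted, and the LR is expected to be re-applied externally after a restore.

```python
import torch

# Minimal sketch with hypothetical names: the checkpoint no longer carries an
# "lr_scheduler" entry, so restoring a worker does not reinstate any internal
# schedule; the external controller re-applies the LR via set_lr() instead.
def save_checkpoint(model: torch.nn.Module, optimizer: torch.optim.Optimizer, path: str) -> None:
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            # no "lr_scheduler" key anymore -- the LR is owned externally
        },
        path,
    )

def load_checkpoint(model: torch.nn.Module, optimizer: torch.optim.Optimizer, path: str) -> None:
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
```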
```python
from skyrl_train.distributed.megatron.optimizer import (
    init_megatron_optim_config,
    get_megatron_optimizer,
    get_megatron_optimizer_param_scheduler,
)
```
|
@erictang000 Do you think this makes sense to do, and when would be the right time to merge it? As we fully standardize on the Tinker API, I feel like this should be removed, right?
Referencing #998
Remove the internal learning rate scheduler logic from the SkyRL training workers.
Since #978 added set_lr() support, Tinker can now control the LR externally. Because the workers' internally scheduled LRs could conflict with Tinker's LR, we remove the ability for workers to schedule their own LRs.
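To illustrate the intent (a sketch only; the real set_lr() from #978 and its wiring through the workers may differ), external LR control typically amounts to overwriting the optimizer's param-group LRs before each step, so there is a single source of truth for the schedule:

```python
import torch

# Hypothetical standalone version of set_lr(): an external controller such as
# Tinker computes the LR and pushes it onto every param group before the step.
def set_lr(optimizer: torch.optim.Optimizer, lr: float) -> None:
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(3):
    external_lr = 1e-5 * (1.0 - step / 10)  # schedule computed outside the worker
    set_lr(optimizer, external_lr)
    # ... forward / backward would happen here ...
    optimizer.step()
    optimizer.zero_grad()
```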