Refactor/torch autocast encapsulate global state #7946

Open
nathon-lee wants to merge 15 commits into deepspeedai:master from nathon-lee:refactor/torch-autocast-encapsulate-global-state

nathon-lee (Contributor) commented Apr 2, 2026

refactor: replace bare global vars in torch_autocast with _AutocastState

TORCH_AUTOCAST_INITIALIZED and TORCH_AUTOCAST_DTYPE were module-level
globals mutated via global statements inside init_autocast_params().
This pattern is fragile: it is invisible to type checkers, prevents
isolation between multiple engine instances, and makes the state harder
to reset in tests.

Replace them with a private _AutocastState dataclass instance
_autocast_state. The public API (is_autocast_initialized,
get_autocast_dtype) is unchanged, so no call sites are affected.
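
As a rough illustration of the refactor described above (not the PR's actual code), the two globals can be folded into one private dataclass instance while the accessor functions keep their names. The field names `initialized` and `dtype`, the one-argument `init_autocast_params` signature, and the use of a plain string in place of a `torch.dtype` are assumptions made to keep the sketch self-contained.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class _AutocastState:
    # Field names are assumed for illustration.
    initialized: bool = False
    dtype: Optional[Any] = None  # a torch.dtype in real code


# Single private instance replacing the two bare module-level globals.
_autocast_state = _AutocastState()


def init_autocast_params(dtype: Any) -> None:
    # Simplified, hypothetical signature: mutate attributes on the state
    # object instead of rebinding names through `global` statements.
    _autocast_state.initialized = True
    _autocast_state.dtype = dtype


def is_autocast_initialized() -> bool:
    return _autocast_state.initialized


def get_autocast_dtype() -> Optional[Any]:
    return _autocast_state.dtype
```

Because callers only go through `is_autocast_initialized()` and `get_autocast_dtype()`, they would not need to change. Note that the state in this form is still shared by every engine in the process.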


fix: store autocast state per-engine to support multiple engine configs

Previously, _autocast_state was a module-level singleton in
torch_autocast.py. When a second DeepSpeed engine called
init_autocast_params(), it would overwrite the first engine's dtype
and initialized state, making it impossible to run two engines with
different autocast configurations concurrently.

Fix by attaching _AutocastState directly to the engine instance
(engine._autocast_state). Update is_autocast_initialized() and
get_autocast_dtype() to accept an engine argument. For ZeRO
optimizers (which hold no engine reference), switch from the global
state query to the per-parameter has_comm_dtype() check; parameters
are already stamped by their own engine inside init_autocast_params(),
so isolation is automatic.
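
The per-engine design described in this commit message could look roughly like the sketch below. It is not taken from the PR diff: `FakeEngine`, `FakeParam`, the `comm_dtype` attribute name, and the two-argument `init_autocast_params` signature are hypothetical stand-ins, and strings stand in for `torch.dtype` values.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class _AutocastState:
    # Field names are assumed for illustration.
    initialized: bool = False
    dtype: Optional[Any] = None


class FakeParam:
    """Hypothetical stand-in for a torch.nn.Parameter."""


class FakeEngine:
    """Hypothetical stand-in for a DeepSpeed engine."""

    def __init__(self, params: Iterable[FakeParam]) -> None:
        self.params = list(params)
        self._autocast_state = _AutocastState()


def init_autocast_params(engine: FakeEngine, dtype: Any) -> None:
    # State lives on the engine, so a second engine cannot overwrite it.
    engine._autocast_state = _AutocastState(initialized=True, dtype=dtype)
    # Stamp each parameter so code without an engine reference can still
    # tell whether autocast applies (attribute name is an assumption).
    for param in engine.params:
        param.comm_dtype = dtype


def is_autocast_initialized(engine: FakeEngine) -> bool:
    return engine._autocast_state.initialized


def get_autocast_dtype(engine: FakeEngine) -> Optional[Any]:
    return engine._autocast_state.dtype


def has_comm_dtype(param: FakeParam) -> bool:
    # Per-parameter check usable from an optimizer with no engine handle.
    return hasattr(param, "comm_dtype")
```

Under these assumptions, two engines initialized with different dtypes each report their own value, and an optimizer holding only parameters can call `has_comm_dtype(param)` instead of querying shared state.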

tohtana (Collaborator) left a comment

Hi @nathon-lee, thank you for opening this PR!
_autocast_state is still global and doesn't seem to support different configs for multiple engines. Did I misunderstand something?

nathon-lee (Contributor, Author) commented Apr 3, 2026

@tohtana Thank you, good catch. I still need to make one more change.

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
nathon-lee (Contributor, Author) commented Apr 8, 2026

/ @copilot review


3 participants