
Resolve master merge conflict for #7916 #1

Merged
zhangj1an merged 7 commits into roycho96:fix/support-func-torch from tohtana:tohtana/pr7916-merge-master-resolve
Mar 30, 2026

Conversation


@tohtana tohtana commented Mar 29, 2026

This PR resolves the conflicts between fix/support-func-torch and master (deepspeed/runtime/zero/linear.py).
Conflict resolution keeps the setup_context-based torch.func fix from deepspeedai#7916.

Krishnachaitanyakc and others added 7 commits March 25, 2026 13:18
deepspeedai#7921)

### Summary

- `is_nfs_path()` in `matmul_ext.py` passes the cache directory path to
`df -T` before the directory is created, causing `df:
/root/.triton/autotune: No such file or directory` errors on stderr
- Fix by walking up to the nearest existing ancestor directory before
invoking `df`, which correctly resolves the filesystem type without
requiring the target path to exist
- Also suppress stderr via `subprocess.DEVNULL` and catch
`FileNotFoundError` for environments where `df` is unavailable (e.g.,
minimal containers)

### Root Cause

In `AutotuneCacheManager.__init__`,
`TritonCacheDir.warn_if_nfs(self.cache_dir)` is called before
`os.makedirs(self.cache_dir, exist_ok=True)`. The `is_nfs_path()`
function then runs `df -T` on a path that does not yet exist, which
causes `df` to print an error to stderr. While the `CalledProcessError`
exception was caught, the stderr output still leaked to the user's
terminal.

### Changes

- `deepspeed/ops/transformer/inference/triton/matmul_ext.py`: Walk up to
nearest existing ancestor before calling `df -T`; suppress stderr; catch
`FileNotFoundError`
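A minimal sketch of the fix described above, assuming the function names and `df -T` parsing shown here (the actual implementation lives in `deepspeed/ops/transformer/inference/triton/matmul_ext.py` and may differ in detail):

```python
import os
import subprocess


def nearest_existing_ancestor(path: str) -> str:
    """Walk up from ``path`` until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def is_nfs_path(path: str) -> bool:
    """Probe the filesystem type of ``path`` via ``df -T``.

    ``df`` is run on the nearest existing ancestor, so the target
    directory does not have to exist yet.
    """
    probe = nearest_existing_ancestor(path)
    try:
        out = subprocess.check_output(
            ["df", "-T", probe],
            stderr=subprocess.DEVNULL,  # suppress "No such file" noise
            text=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # df failed or is unavailable (e.g. minimal containers)
        return False
    lines = out.strip().splitlines()
    if len(lines) < 2:
        return False
    fields = lines[1].split()
    fstype = fields[1] if len(fields) > 1 else ""
    return fstype.startswith("nfs")
```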

### Testing

- Python syntax validation: PASS
- yapf formatting check: PASS (no diff)
- flake8: PASS (no warnings)

Fixes deepspeedai#7642

Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
…eepspeedai#7920)

`torch.amp.custom_fwd` was introduced in PyTorch 2.4, so installing
DeepSpeed from source with an older PyTorch fails because `setup.py`
triggers an import of the function.
This PR adds a fallback to `torch.cuda.amp.custom_fwd` for PyTorch <
2.4.
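The fallback is essentially a try-import pattern. A generic sketch (the helper `resolve_with_fallback` is hypothetical, not DeepSpeed or PyTorch API):

```python
import importlib


def resolve_with_fallback(primary_mod: str, fallback_mod: str, attr: str):
    """Return ``attr`` from ``primary_mod`` if it is importable and has
    the attribute; otherwise fall back to ``fallback_mod``."""
    try:
        return getattr(importlib.import_module(primary_mod), attr)
    except (ImportError, AttributeError):
        return getattr(importlib.import_module(fallback_mod), attr)


# In DeepSpeed the call would look roughly like:
# custom_fwd = resolve_with_fallback("torch.amp", "torch.cuda.amp", "custom_fwd")
```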

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Authors: @pengdurice and @PKUWZP

Created a separate PR based on deepspeedai#7798, with the same functional diff on a clean signed-off branch, to resolve DCO issues.

This draft PR adds the Muon optimizer to ZeRO stage 3:

- Created a dedicated momentum buffer in the ZeRO stage 3 optimizer to hold the momentum state that the Muon optimizer requires.
- The optimizer states can be dispatched to three devices: GPU, CPU, and NVMe. For GPU and CPU, the new buffers are placed on the same device as `self.fp32_partitioned_groups_flat`; when `device == NVME`, the momentum buffers are swapped in and out along with the other components of the optimizer state.
- The new momentum buffers are also partitioned like `self.fp32_partitioned_groups_flat` to reduce the memory footprint. Before the Muon update, each data-parallel rank therefore performs an `all_gather` to reconstruct the full buffer. The Muon parameter updates are likewise divided across the data-parallel ranks, and the results are all-gathered once all updates are complete. After the `all_gather`, the momentum buffers are partitioned and flattened again.
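The partition / all-gather / re-partition cycle described above can be sketched with plain Python lists (a toy stand-in: the real code operates on flat `torch` tensors and uses `torch.distributed` collectives):

```python
def partition(buf, world_size):
    """Split a flat buffer into equal per-rank shards."""
    n = len(buf) // world_size
    return [buf[i * n:(i + 1) * n] for i in range(world_size)]


def all_gather_sim(shards):
    """Stand-in for an all_gather across the data-parallel group."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full


world_size = 4
momentum = list(range(16))                # full flat momentum buffer
shards = partition(momentum, world_size)  # each rank stores one shard

# before the Muon update: gather the full buffer on every rank
gathered = all_gather_sim(shards)

# ... Muon update runs here, divided across the DP ranks ...

# after the update: re-partition so each rank keeps only its shard
shards = partition(gathered, world_size)
```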

Next steps:
- Explore quantization of momentum buffers for saving memory
- Explore using highly optimized Adam / AdamW Optimizers

---------

Signed-off-by: PKUWZP <zhipeng.rainbowserie@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deepspeedai#7905) (deepspeedai#7906)

Fixes deepspeedai#7905

- Preserve optimizer param-group metadata across ZeRO-3 subgroup
splitting so SuperOffload handles multiple optimizer groups correctly.
- Switch the CPU worker path to shared CPU parameter and gradient
buffers, removing the need to send updated parameters back through the
result queue.
- Make the GPU-to-CPU gradient copy asynchronous and submit CPU
optimizer work only after the copy is ready.
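The ordering constraint in the last bullet (CPU optimizer work is submitted only after the gradient copy is ready) can be sketched with a thread and an event; this is a toy model, since the real implementation synchronizes CUDA streams and shared CPU buffers rather than Python threads:

```python
import queue
import threading

grad_ready = threading.Event()
work_q = queue.Queue()
results = []


def cpu_optimizer_worker():
    # wait until the (asynchronous) GPU-to-CPU gradient copy has landed
    grad_ready.wait()
    grads = work_q.get()
    # with shared CPU buffers the update happens in place, so nothing
    # needs to be sent back through a result queue
    results.append([g * 2 for g in grads])  # toy "optimizer step"


worker = threading.Thread(target=cpu_optimizer_worker)
worker.start()

# simulate the asynchronous gradient copy completing
work_q.put([1, 2, 3])
grad_ready.set()
worker.join()
```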

The figures below compare per-iteration time and GPU memory usage against the non-offload baseline. The second figure presents a correctness check of the updated version.
<img width="977" height="364" alt="image"
src="https://github.com/user-attachments/assets/8fb2cf21-1a8c-47dd-9090-ec73acc5c9dc"
/>

<img width="3248" height="1748" alt="image"
src="https://github.com/user-attachments/assets/d8121d64-dfd9-478c-87ea-b41e98630a2a"
/>

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
…eepspeedai#7930)

## Summary
- Added explicit `pre-commit run --files <changed_files>` command to the
existing CI requirements line in AGENTS.md/CLAUDE.md
- Clarifies that only modified files should be checked, not the entire
codebase

## Changes
- Enhanced the existing pre-commit bullet point in "Commit & CI
requirements" section (no new section added)

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@zhangj1an zhangj1an merged commit 8468149 into roycho96:fix/support-func-torch Mar 30, 2026
