
Resolve master merge conflict for #7916 #1

Merged
zhangj1an merged 7 commits into roycho96:fix/support-func-torch from tohtana:tohtana/pr7916-merge-master-resolve
Mar 30, 2026

Conversation


@tohtana tohtana commented Mar 29, 2026

This PR resolves the conflicts between fix/support-func-torch and master (deepspeed/runtime/zero/linear.py).
Conflict resolution keeps the setup_context-based torch.func fix from deepspeedai#7916.

Krishnachaitanyakc and others added 7 commits March 25, 2026 13:18
deepspeedai#7921)

### Summary

- `is_nfs_path()` in `matmul_ext.py` passes the cache directory path to
`df -T` before the directory is created, causing `df:
/root/.triton/autotune: No such file or directory` errors on stderr
- Fix by walking up to the nearest existing ancestor directory before
invoking `df`, which correctly resolves the filesystem type without
requiring the target path to exist
- Also suppress stderr via `subprocess.DEVNULL` and catch
`FileNotFoundError` for environments where `df` is unavailable (e.g.,
minimal containers)

### Root Cause

In `AutotuneCacheManager.__init__`,
`TritonCacheDir.warn_if_nfs(self.cache_dir)` is called before
`os.makedirs(self.cache_dir, exist_ok=True)`. The `is_nfs_path()`
function then runs `df -T` on a path that does not yet exist, which
causes `df` to print an error to stderr. While the `CalledProcessError`
exception was caught, the stderr output still leaked to the user's
terminal.

### Changes

- `deepspeed/ops/transformer/inference/triton/matmul_ext.py`: Walk up to
nearest existing ancestor before calling `df -T`; suppress stderr; catch
`FileNotFoundError`
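A minimal sketch of the fix described above, assuming the function names and `df -T` parsing shown here (the actual implementation lives in `deepspeed/ops/transformer/inference/triton/matmul_ext.py` and may differ in detail):

```python
import os
import subprocess


def nearest_existing_ancestor(path: str) -> str:
    """Walk up from ``path`` until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def is_nfs_path(path: str) -> bool:
    """Probe the filesystem type of ``path`` via ``df -T``.

    ``df`` is run on the nearest existing ancestor, so the target
    directory does not have to exist yet.
    """
    probe = nearest_existing_ancestor(path)
    try:
        out = subprocess.check_output(
            ["df", "-T", probe],
            stderr=subprocess.DEVNULL,  # suppress "No such file" noise
            text=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # df failed or is unavailable (e.g. minimal containers)
        return False
    lines = out.strip().splitlines()
    if len(lines) < 2:
        return False
    fields = lines[1].split()
    fstype = fields[1] if len(fields) > 1 else ""
    return fstype.startswith("nfs")
```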

### Testing

- Python syntax validation: PASS
- yapf formatting check: PASS (no diff)
- flake8: PASS (no warnings)

Fixes deepspeedai#7642

Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
…eepspeedai#7920)

`torch.amp.custom_fwd` was introduced in PyTorch 2.4, so installing
DeepSpeed from source with an older PyTorch fails because `setup.py`
triggers an import of the function.
This PR adds a fallback to `torch.cuda.amp.custom_fwd` for PyTorch <
2.4.
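The fallback is essentially a try-import pattern. A generic sketch (the helper `resolve_with_fallback` is hypothetical, not DeepSpeed or PyTorch API):

```python
import importlib


def resolve_with_fallback(primary_mod: str, fallback_mod: str, attr: str):
    """Return ``attr`` from ``primary_mod`` if it is importable and has
    the attribute; otherwise fall back to ``fallback_mod``."""
    try:
        return getattr(importlib.import_module(primary_mod), attr)
    except (ImportError, AttributeError):
        return getattr(importlib.import_module(fallback_mod), attr)


# In DeepSpeed the call would look roughly like:
# custom_fwd = resolve_with_fallback("torch.amp", "torch.cuda.amp", "custom_fwd")
```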

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Authors: @pengdurice and @PKUWZP

Created a separate PR based on deepspeedai#7798, with the same functional diff on a clean signed-off branch, to resolve DCO issues.

This draft PR adds the Muon optimizer to ZeRO stage 3:

- Created a dedicated momentum buffer in the ZeRO stage 3 optimizer to hold the momentum state that the Muon optimizer requires.
- The optimizer states can be dispatched to three devices: GPU, CPU, and NVMe. For GPU and CPU, the new buffers are placed on the same device as `self.fp32_partitioned_groups_flat`; when `device == NVME`, the momentum buffers are swapped in and out along with the other components of the optimizer state.
- The new momentum buffers are also partitioned like `self.fp32_partitioned_groups_flat` to reduce the memory footprint. Before the Muon update, each data-parallel rank therefore performs an `all_gather` to reconstruct the full buffer. The Muon parameter updates are likewise divided across the data-parallel ranks, and the results are all-gathered once all updates are complete. After the `all_gather`, the momentum buffers are partitioned and flattened again.
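The partition / all-gather / re-partition cycle described above can be sketched with plain Python lists (a toy stand-in: the real code operates on flat `torch` tensors and uses `torch.distributed` collectives):

```python
def partition(buf, world_size):
    """Split a flat buffer into equal per-rank shards."""
    n = len(buf) // world_size
    return [buf[i * n:(i + 1) * n] for i in range(world_size)]


def all_gather_sim(shards):
    """Stand-in for an all_gather across the data-parallel group."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full


world_size = 4
momentum = list(range(16))                # full flat momentum buffer
shards = partition(momentum, world_size)  # each rank stores one shard

# before the Muon update: gather the full buffer on every rank
gathered = all_gather_sim(shards)

# ... Muon update runs here, divided across the DP ranks ...

# after the update: re-partition so each rank keeps only its shard
shards = partition(gathered, world_size)
```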

Next steps:
- Explore quantization of momentum buffers for saving memory
- Explore using highly optimized Adam / AdamW Optimizers

---------

Signed-off-by: PKUWZP <zhipeng.rainbowserie@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deepspeedai#7905) (deepspeedai#7906)

Fixes deepspeedai#7905

- Preserve optimizer param-group metadata across ZeRO-3 subgroup
splitting so SuperOffload handles multiple optimizer groups correctly.
- Switch the CPU worker path to shared CPU parameter and gradient
buffers, removing the need to send updated parameters back through the
result queue.
- Make the GPU-to-CPU gradient copy asynchronous and submit CPU
optimizer work only after the copy is ready.
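The ordering constraint in the last bullet (CPU optimizer work is submitted only after the gradient copy is ready) can be sketched with a thread and an event; this is a toy model, since the real implementation synchronizes CUDA streams and shared CPU buffers rather than Python threads:

```python
import queue
import threading

grad_ready = threading.Event()
work_q = queue.Queue()
results = []


def cpu_optimizer_worker():
    # wait until the (asynchronous) GPU-to-CPU gradient copy has landed
    grad_ready.wait()
    grads = work_q.get()
    # with shared CPU buffers the update happens in place, so nothing
    # needs to be sent back through a result queue
    results.append([g * 2 for g in grads])  # toy "optimizer step"


worker = threading.Thread(target=cpu_optimizer_worker)
worker.start()

# simulate the asynchronous gradient copy completing
work_q.put([1, 2, 3])
grad_ready.set()
worker.join()
```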

The figures below compare per-iteration time and GPU memory usage against the non-offload baseline. The second figure presents a correctness check of the updated version.
<img width="977" height="364" alt="image"
src="https://github.com/user-attachments/assets/8fb2cf21-1a8c-47dd-9090-ec73acc5c9dc"
/>

<img width="3248" height="1748" alt="image"
src="https://github.com/user-attachments/assets/d8121d64-dfd9-478c-87ea-b41e98630a2a"
/>

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
…eepspeedai#7930)

## Summary
- Added explicit `pre-commit run --files <changed_files>` command to the
existing CI requirements line in AGENTS.md/CLAUDE.md
- Clarifies that only modified files should be checked, not the entire
codebase

## Changes
- Enhanced the existing pre-commit bullet point in "Commit & CI
requirements" section (no new section added)

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@zhangj1an zhangj1an merged commit 8468149 into roycho96:fix/support-func-torch Mar 30, 2026
