
Conversation

@ptrendx (Member) commented Feb 9, 2026

Description

This PR fixes most compilation warnings for the PyTorch extension. The one still left is the ptxas performance advisory about the multicast.cluster modifier when compiling for architecture 120; fixing that one would most probably require a dedicated kernel for SM120.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Changed the deprecated set_is_inference call in the cuDNN frontend to set_generate_stats(false)
  • Fixed the warning about a base-class function being shadowed, rather than overridden, in the child class
  • Fixed the warning about the wrong order of the member initializers in the constructor
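
As an illustration of the third item, here is a minimal sketch of the member-initialization-order warning and its fix; the class and field names are hypothetical stand-ins, not the actual GroupedTensor members:

```cpp
#include <cstddef>

// Minimal sketch of a -Wreorder-style fix; "Grouped" and its fields are
// hypothetical stand-ins, not the real GroupedTensor members.
struct Grouped {
  // C++ initializes members in DECLARATION order, regardless of the order
  // written in the constructor's initializer list. If the list order
  // differs, compilers emit -Wreorder, and a member that reads another
  // member may see an uninitialized value.
  std::size_t num_tensors;
  std::size_t bytes_per_tensor;
  std::size_t total_bytes;

  Grouped(std::size_t n, std::size_t b)
      // Fix: keep the initializer list in declaration order.
      : num_tensors(n), bytes_per_tensor(b), total_bytes(n * b) {}
};
```

Listing `total_bytes` before `num_tensors` in the initializer list would compile but warn, because the members are still initialized in the declared order.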

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx changed the title Fix the compilation warnings [pyTorch] Fix the compilation warnings Feb 9, 2026
@greptile-apps (Contributor) greptile-apps bot commented Feb 9, 2026

Greptile Summary

This PR targets compilation warnings in the PyTorch extension and cuDNN frontend fused-attention code paths.

Changes include:

  • Reordering GroupedTensor members and its constructor initializer list to match declaration order and silence -Wreorder warnings.
  • Updating cuDNN frontend SDPA attribute configuration to replace deprecated set_is_inference(false) usage with set_generate_stats(...).
  • Fixing C++ name-hiding/overload ambiguity for copy_into_buffer in the PyTorch CommOverlap* wrappers by adding using declarations and disambiguating the pybind11 binding via static_cast.
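
A minimal sketch of the name-hiding pattern behind the third bullet; the classes and signatures here are illustrative, not the actual TE wrappers:

```cpp
#include <string>

// Illustrative sketch of C++ name hiding; Base/Derived and the argument
// types are stand-ins, not the real CommOverlap classes.
struct Base {
  int copy_into_buffer(int offset) { return offset; }
};

struct Derived : Base {
  // Without this using-declaration, the overload below hides
  // Base::copy_into_buffer(int) entirely: d.copy_into_buffer(3) would not
  // compile, and compilers warn that the base function is shadowed.
  using Base::copy_into_buffer;

  std::string copy_into_buffer(const std::string& tensor, bool local_chunk) {
    return local_chunk ? tensor + ":local" : tensor;
  }
};
```

With the using-declaration in place, both the base and derived overloads participate in overload resolution, which is also what lets the pybind11 binding pick one explicitly.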

The main remaining concern is the FP8 fused-attention path, which now hardcodes set_generate_stats(true) despite having an is_training parameter, potentially changing inference behavior and performance when stats are not needed outside training.

Confidence Score: 4/5

  • This PR is likely safe to merge after confirming FP8 stats generation behavior is correct for inference.
  • Most changes are mechanical warning fixes (member initialization order, deprecated cuDNN frontend API replacement, and pybind overload disambiguation). The only material behavioral risk is set_generate_stats(true) in the FP8 SDPA attributes, which may force stats generation when is_training is false.
  • File flagged: transformer_engine/common/fused_attn/fused_attn_fp8.cu

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/common.h | Reordered GroupedTensor members and initializer list to match declaration order, addressing -Wreorder warnings; no functional changes apparent. |
| transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu | Replaced deprecated set_is_inference(false) with set_generate_stats(generate_stats) for cuDNN frontend SDPA attributes; behavior stays parameter-driven. |
| transformer_engine/common/fused_attn/fused_attn_fp8.cu | Replaced set_is_inference(false) with set_generate_stats(true) for SDPA_fp8_attributes; may unconditionally generate stats regardless of training/inference intent. |
| transformer_engine/pytorch/csrc/extensions.h | Added using declarations to unhide base-class copy_into_buffer overloads in CommOverlap wrappers, fixing C++ shadowing warnings and enabling pybind disambiguation. |
| transformer_engine/pytorch/csrc/extensions/pybind.cpp | Disambiguated pybind bindings for overloaded copy_into_buffer via an explicit static_cast to the intended (Tensor, bool) overload. |
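
The pybind.cpp change relies on a standard C++ idiom: taking the address of an overloaded member function is ambiguous, so a static_cast to the exact pointer-to-member type selects one overload. A framework-free sketch, with illustrative names in place of the real TE signatures:

```cpp
// Framework-free sketch of the overload-disambiguation idiom used in the
// pybind11 binding; "Comm" and the signatures are illustrative.
struct Comm {
  int copy_into_buffer(int offset) { return offset; }
  int copy_into_buffer(int offset, bool local_chunk) {
    return local_chunk ? -offset : offset;
  }
};

// &Comm::copy_into_buffer alone is ambiguous; the cast picks the
// (int, bool) overload, just as the binding casts to the (Tensor, bool) one.
using TwoArgFn = int (Comm::*)(int, bool);
inline TwoArgFn pick_two_arg() {
  return static_cast<TwoArgFn>(&Comm::copy_into_buffer);
}
```

In actual pybind11 code, the same cast expression would be passed directly to `m.def(...)` (or `py::overload_cast` could be used to the same effect).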

Sequence Diagram

```mermaid
sequenceDiagram
  participant Py as PyTorch Python
  participant PB as pybind11 module
  participant CO as CommOverlap/CommOverlapP2P (C++)
  participant Base as CommOverlap*Base/Core (C++)
  participant FA as FusedAttn FP16/FP8 (CUDA)
  participant FE as cuDNN Frontend Graph

  Py->>PB: instantiate CommOverlap*(...)
  PB->>CO: constructor

  Py->>PB: copy_into_buffer(tensor, local_chunk)
  PB->>CO: copy_into_buffer(const at::Tensor&, bool)
  Note over PB,CO: Binding uses static_cast to pick (Tensor,bool)
  CO->>Base: (optional) call into base/core overlap implementation

  Py->>PB: run fused attention op
  PB->>FA: fused_attn_*_fwd_impl(...)
  FA->>FE: build SDPA attributes
  Note over FE: FP16: set_generate_stats(generate_stats)
  Note over FE: FP8: set_generate_stats(true) (current PR)
  FE-->>FA: compile/execute graph
  FA-->>PB: return outputs (incl. Stats when generated)
  PB-->>Py: return tensors
```

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
@ptrendx ptrendx requested a review from cyanguwa February 9, 2026 23:13
@ptrendx (Member, Author) commented Feb 9, 2026

/te-ci

@greptile-apps greptile-apps bot commented Feb 9, 2026

Additional Comments (1)

transformer_engine/common/fused_attn/fused_attn_fp8.cu
Unconditional stats generation

fused_attn_fp8_fwd_impl_v1 takes bool is_training (see signature around transformer_engine/common/fused_attn/fused_attn_fp8.cu:1654), but the new .set_generate_stats(true) here forces stats collection even when is_training == false. If the cuDNN frontend uses this flag to allocate or compute extra outputs, this changes inference behavior (and can add overhead) compared to a training-only stats path. It should likely be wired to is_training (or whatever condition controls whether Stats is actually needed) rather than hardcoded to true.
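
The suggested rewiring can be sketched with a stand-in for the attribute builder; SdpaAttrs here is hypothetical, and only the set_generate_stats name comes from the actual cuDNN frontend API:

```cpp
// Hypothetical stand-in for the cuDNN frontend SDPA attribute builder,
// used only to illustrate driving stats generation from is_training.
struct SdpaAttrs {
  bool generate_stats = false;
  SdpaAttrs& set_generate_stats(bool v) {
    generate_stats = v;
    return *this;
  }
};

// Suggested shape of the fix: wire the flag to is_training instead of
// hardcoding true, so inference skips the Stats output.
inline SdpaAttrs make_fp8_sdpa_attrs(bool is_training) {
  SdpaAttrs attrs;
  attrs.set_generate_stats(is_training);
  return attrs;
}
```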
