
Add ctas_per_cga for PAIR_CTA mode in TLX kernels#431

Closed
rafaykhurram wants to merge 1 commit into meta-recsys:main from rafaykhurram:export-D89439254

Conversation

@rafaykhurram
Contributor

Differential Revision: D89439254


meta-codesync Bot commented Dec 18, 2025

@rafaykhurram has exported this pull request. If you are a Meta employee, you can view the originating Diff in D89439254.

@meta-cla Bot added the CLA Signed label Dec 18, 2025
facebook-github-bot pushed a commit to meta-pytorch/tritonbench that referenced this pull request Dec 18, 2025
Summary:
X-link: meta-recsys/generative-recommenders#431

Add `ctas_per_cga` parameter to `triton.Config` when `PAIR_CTA` is enabled, following the pattern from D89389230 which introduces CUDA-native cluster launch semantics.

This change affects:
- `tritonbench/operators/gemm/tlx_matmul.py`: Added `ctas_per_cga=(2, 1, 1) if pairCTA else None` to the autotune config
- `generative_recommenders/ops/triton/triton_addmm.py`: Added `c.ctas_per_cga = (2, 1, 1) if pair_cta_compatible else None` in `_prune_configs_for_pair_cta`

The `ctas_per_cga` parameter enables CUDA-native cluster launch semantics (TLX way), which differs from Triton's `num_ctas` approach:
- **Triton's way** (`num_ctas`): Grid is multiplied by cluster_dims to get total CTAs
- **TLX/CUDA way** (`ctas_per_cga`): Grid equals total CTAs, ctas_per_cga regroups them into clusters
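The difference in launch semantics can be sketched as plain grid arithmetic (illustrative helper names, not Triton APIs):

```python
from math import prod

def total_ctas_num_ctas(grid, cluster_dims):
    # Triton's num_ctas approach: the launch grid is multiplied by the
    # cluster dims, so the total CTA count is grid * cluster_dims.
    return prod(grid) * prod(cluster_dims)

def total_ctas_ctas_per_cga(grid, ctas_per_cga):
    # TLX/CUDA-native approach: the grid already counts every CTA;
    # ctas_per_cga only regroups existing CTAs into clusters (CGAs),
    # so the total is unchanged.
    return prod(grid)
```

Under `ctas_per_cga=(2, 1, 1)` a kernel launched on a grid of 256 CTAs still runs 256 CTAs, just grouped into 128 two-CTA clusters, whereas `num_ctas=2` on a 128-element grid would double the launch to 256 CTAs.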

Reviewed By: rafaykhurram

Differential Revision: D89439254
@LinjianMa LinjianMa closed this Jan 24, 2026

Labels

CLA Signed, fb-exported, meta-exported
