
add auto tma transpose scheduler#6018

Open
liqiangxl wants to merge 13 commits into main from llu/transpose_output_smem_auto

Conversation

@liqiangxl
Collaborator

To reduce the number of transpose ops, is_output_smem_transpose is added to control input/output transpose:

1. When there are more inputs than outputs, is_output_smem_transpose = True
TMA load without swizzle, TMA store with swizzle, transpose at regs -> output cached smem

2. When there are fewer inputs than outputs, is_output_smem_transpose = False
TMA load with swizzle, register store, transpose at input cached smem -> regs

Current performance is in this doc.
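The selection rule above can be sketched as a tiny standalone function. This is an illustrative stand-in, not the actual heuristic in csrc/scheduler/transpose_tma.cpp; the struct and function names below are hypothetical:

```cpp
#include <cassert>

// Hypothetical sketch of the input/output selection rule described above.
// The real heuristic considers more factors; names are illustrative only.
struct TmaTransposeChoice {
  bool is_output_smem_transpose;
  bool use_tma_load;
  bool use_tma_store;
};

TmaTransposeChoice chooseTransposeSide(int n_input, int n_output) {
  TmaTransposeChoice c{};
  // More inputs than outputs: swizzle the (fewer) output smem buffers,
  // so the TMA store performs the swizzled write and loads stay plain.
  c.is_output_smem_transpose = n_input > n_output;
  // TMA load is used on both paths; TMA store only when the output smem
  // holds the transposed (swizzled) data.
  c.use_tma_load = true;
  c.use_tma_store = c.is_output_smem_transpose;
  return c;
}
```

Tying use_tma_store to is_output_smem_transpose matches the default heuristic described later in this thread; the combination is_output_smem_transpose=true with use_tma_store=false is exactly the one the reviewers flag as crash-prone.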

@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl marked this pull request as ready for review February 27, 2026 15:40
@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR implements the auto TMA transpose scheduler, adding is_output_smem_transpose to choose between two swizzle strategies based on tensor counts: swizzle the input smem when there are fewer inputs (TMA load + swizzled read), or swizzle the output smem when there are fewer outputs (TMA store + swizzled write). The feature is opt-in via the new EnableOption::TmaTranspose flag and falls back to the non-TMA scheduler when disabled.

Key changes:

  • transpose_tma.cpp: Full implementation of getTransposeHeuristics and scheduleTranspose for both input-smem and output-smem transpose paths.
  • transpose_heuristic.h: Four new params (use_tma_store, is_output_smem_transpose, chunks_per_thread, elements_per_chunk) with correct sameAs/hash/toString updates.
  • transpose.cpp: TMA path gated behind isOptionEnabled(TmaTranspose); dispatch extended to route on use_tma_store in addition to use_tma_load.
  • options.h/cpp: TmaTranspose added to EnableOption; all option enums switched to std::uint8_t underlying type.
  • tma.cpp: Batching eligibility check now skips trivial extent-1 IDs, avoiding spurious exclusions from TMA load batching.
  • Tests cover parameterized dtype/dim combinations, bank-conflict validation, and direct param-override combinations.

Confidence Score: 3/5

  • Mostly safe to merge, but a latent crash path exists when is_output_smem_transpose=true and use_tma_store=false that should be guarded before this lands.
  • The core heuristic and scheduling logic is well-structured and backed by parameterized tests. However, tma_store_tvs.at(0) is called unconditionally when is_output_smem_transpose=true without verifying that tma_store_tvs is non-empty — a combination the test infrastructure explicitly allows overriding. This is an unguarded crash path. The dead code in the bank-conflict test also produces compiler warnings. These issues lower confidence despite the otherwise solid implementation.
  • csrc/scheduler/transpose_tma.cpp — specifically the tma_store_tvs.at(0) access at line 260 needs a guard or assertion.

Important Files Changed

  • csrc/scheduler/transpose_tma.cpp: Core implementation of the new TMA transpose scheduler (~340 lines added). Contains a latent crash when is_output_smem_transpose=true but use_tma_store=false (unchecked tma_store_tvs.at(0)), and the tile-size heuristic always divides by n_input regardless of which side is transposed.
  • tests/cpp/test_transpose.cpp: Good breadth of parameterized tests for the new TMA scheduler (dtype, dims, param combinations). Minor issue: OutputTransposeBankconflict contains dead code with unused structured bindings (read_ways/write_ways) that will produce compiler warnings.
  • csrc/scheduler/transpose_heuristic.h: Adds use_tma_store, is_output_smem_transpose, chunks_per_thread, and elements_per_chunk fields; correctly updates sameAs, hash, and toString accordingly.
  • csrc/scheduler/transpose.cpp: TMA path is now gated behind isOptionEnabled(EnableOption::TmaTranspose), and the dispatch condition now also triggers on use_tma_store. Both changes are correct.
  • csrc/options.h: Adds TmaTranspose to EnableOption; changes all option enums to use std::uint8_t (all enums have far fewer than 256 entries, so no overflow risk); fixes the copy constructor initialization order.
  • csrc/device_lower/analysis/tma.cpp: Tightens the TMA batching check by filtering out trivial (extent=1) IDs before checking for thread-dim or serial parallelism. Intentional refinement: extent=1 dims don't cause mbarrier issues.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TransposeScheduler::computeHeuristics] --> B{TmaTranspose enabled?}
    B -- No --> C[non_tma::getTransposeHeuristics]
    B -- Yes --> D[tma::getTransposeHeuristics]
    D --> E{n_input > n_output?}
    E -- Yes --> F["is_output_smem_transpose=true\nuse_tma_load=true\nuse_tma_store=true"]
    E -- No --> G["is_output_smem_transpose=false\nuse_tma_load=true\nuse_tma_store=false"]
    F --> H[scheduleTranspose]
    G --> H
    C --> H

    H --> I[cacheInputs + cacheAndForkOutputs]
    I --> J{use_tma_load?}
    J -- Yes --> K["input → smem_cache (TMA) → reg_cache"]
    J -- No --> L[skip]
    I --> M{use_tma_store?}
    M -- Yes --> N["reg_cache → smem_cache (TMA) → output"]
    M -- No --> O[register store]

    K --> P[Step 1: Tile both transpose dims, propagate BIDx]
    L --> P
    N --> P
    O --> P
    P --> Q{is_output_smem_transpose?}
    Q -- Yes --> R["Step 2: scheduleTMAStoreForMmaOutput\n(swizzled output smem)"]
    Q -- No --> S["Step 3: applyMmaSwizzleForTMALoad\n(swizzled input smem)"]
    R --> T[Step 4: Register scheduling + TIDx propagation]
    S --> T
    T --> U[Vectorize smem reads/writes]
    U --> V[inlineMost]

Comments Outside Diff (2)

  1. csrc/scheduler/transpose_tma.cpp, line 257-262 (link)

    Crash when is_output_smem_transpose=true but use_tma_store=false

    tma_store_tvs.at(0) is called unconditionally whenever is_output_smem_transpose is true, but tma_store_tvs is only populated when use_tma_store is true. If a caller sets is_output_smem_transpose = true with use_tma_store = false (a combination not prevented by the heuristic or scheduler), this throws a std::out_of_range exception at runtime.

    The default heuristic always sets use_tma_store = is_output_smem_transpose, so the normal path is safe, but the TmaTransposeParamsTestP test explicitly overrides these params independently, and a future test or user could hit this combination. An assertion or guard should be added here:

      if (tparams->is_output_smem_transpose) {
        NVF_ERROR(
            !tma_store_tvs.empty(),
            "is_output_smem_transpose requires use_tma_store to be true");
        MmaInputSmemSwizzle swizzle =
            mma_utils::tmaSwizzleSharedMemory(tma_store_tvs.at(0));
  2. csrc/scheduler/transpose_tma.cpp, line 88-90 (link)

    Tile-size heuristic always uses n_input, ignoring n_output for the output-smem path

    The tile-size-1 heuristic divides by n_input unconditionally:

    const int64_t bytes_per_tile = bytes_per_cta / n_input;

    When is_output_smem_transpose = true the swizzled side is the output smem, and n_input > n_output by definition. The non-swizzled tile (tile1) spans the inputs, so using n_input is conceptually correct for the "data loaded per CTA" target. However, the output smem footprint scales with n_output, not n_input. If balancing smem usage across outputs is also a goal, dividing by max(n_input, n_output) (or a weighted combination) would be more symmetric. As a minimum a comment clarifying why n_input is the right divisor in both branches would help maintainability.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
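For concreteness, the divisor question can be played out with the H100-style numbers quoted elsewhere in this review (2048 threads per SM, cta_per_sm = 8, bytes_per_cta = 8192). The smem budget and threads-per-CTA below are illustrative assumptions, not values taken from the PR:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Numeric sketch of the tile-size divisor. bytes_per_sm and threads_per_cta
// are assumed values; the division chain mirrors the excerpt quoted later in
// this review. 'symmetric' models the suggested max(n_input, n_output)
// variant alongside the PR's divide-by-n_input behavior.
std::int64_t bytesPerTile(std::int64_t n_input, std::int64_t n_output,
                          bool symmetric) {
  const std::int64_t bytes_per_sm = 64 * 1024;             // assumed smem budget
  const std::int64_t threads_per_cta = 256;                // assumed CTA size
  const std::int64_t cta_per_sm = 2048 / threads_per_cta;  // 2048 threads/SM -> 8
  const std::int64_t bytes_per_cta = bytes_per_sm / cta_per_sm;  // 8192
  const std::int64_t divisor =
      symmetric ? std::max(n_input, n_output) : n_input;
  return bytes_per_cta / divisor;
}
```

With these numbers, a 1-input/2-output fusion gets the full 8192 bytes per tile under the current divide-by-n_input rule, but only 4096 under the symmetric variant, which is the asymmetry the comment is pointing at.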

Last reviewed commit: 0068fe5


@greptile-apps greptile-apps bot left a comment


7 files reviewed, 2 comments


NVF_ERROR(grouped_inputs_outputs.size() >= 2);

// When there are more inputs than outputs, output smem transpose should be
// used, however, if it is not, then input smem tranpose will be used, to

tranpose should be transpose

const int64_t cta_per_sm =
    dev_props->maxThreadsPerMultiProcessor / threads_per_cta;
const int64_t bytes_per_cta = bytes_per_sm / cta_per_sm;
const int64_t bytes_per_tile = bytes_per_cta / n_input;

Add check that n_input > 0 before this division. While the scheduler validation should prevent this, defensive programming would make the code more robust.

Suggested change
const int64_t bytes_per_tile = bytes_per_cta / n_input;
NVF_ERROR(n_input > 0, "Expected at least one TensorView input for transpose");
const int64_t bytes_per_tile = bytes_per_cta / n_input;


@liqiangxl liqiangxl requested a review from rdspring1 February 27, 2026 17:24
@github-actions

github-actions bot commented Mar 2, 2026

Review updated until commit bc772db

Description

  • Implements automatic TMA (Tensor Memory Accelerator) transpose scheduler with two paths: input smem transpose (swizzle on input) and output smem transpose (swizzle on output)

  • Adds new TmaTranspose enable option to toggle the feature; scheduler falls back to non-TMA when disabled

  • Introduces new parameters: use_tma_store, is_output_smem_transpose, chunks_per_thread, elements_per_chunk for flexible TMA configuration

  • Adds comprehensive tests covering different dtypes, transpose dimensions, and TMA parameter combinations


PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Potential TMA load restriction

The new code filters loop domains to only include non-trivial IDs (extent > 1 or non-const) before checking for thread/serial dims.
This is more restrictive than the original which checked all loop domains. This could potentially exclude valid TMA loads
where some dimensions have extent 1 but other dimensions are parallelized with threads. Need to verify this doesn't break
existing TMA use cases.

auto non_trivial_ids =
    tv->getLoopDomain() | std::views::filter([](const IterDomain* id) {
      return !id->extent()->isConstScalar() ||
          id->extent()->evaluate().as<int64_t>() > 1;
    });
if (std::ranges::any_of(non_trivial_ids, [](const IterDomain* id) {
      return id->isThreadDim() ||
          id->getParallelType() == ParallelType::Serial;
    })) {
  return {};
}
Missing null check

In scheduleTranspose, when setting up TMA store (lines 165-172), the code accesses fusion->outputs()[output_idx] without
checking if output_idx is within bounds. While cached_outputs should correspond to outputs, a bounds check would be safer.

for (auto [cached_output, output_idx] : cached_outputs) {
  auto output = fusion->outputs()[output_idx]->as<TensorView>();
  output->definition()->as<LoadStoreOp>()->setOpType(
      LoadStoreOpType::CpAsyncBulkTensorTile);
  cached_output->setMemoryType(MemoryType::Shared);
  cached_output->cacheBefore();
  tma_store_tvs.push_back(cached_output);
}
Thread safety consideration

The copy constructor was modified to use a lambda that captures other.mutex_ and returns other.options_. While this appears
correct, the original implementation directly assigned options_. The new approach should be verified to maintain the same
thread-safety semantics under concurrent access patterns.

Options(const Options& other)
    : options_([&other]() {
        std::lock_guard<std::mutex> lock_other(other.mutex_);
        return other.options_;
      }()) {}
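The idiom in question, copy-under-lock via an immediately invoked lambda in the member initializer list, can be checked in isolation with a minimal mock of Options (members simplified to a string-to-int map; illustrative only, not the real class):

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Minimal mock of the Options copy constructor pattern: the lambda runs in
// the member initializer list, locks the source object's mutex just long
// enough to copy its map, and returns the copy by value.
class Options {
 public:
  Options() = default;
  Options(const Options& other)
      : options_([&other]() {
          std::lock_guard<std::mutex> lock(other.mutex_);
          return other.options_;
        }()) {}

  void set(const std::string& key, int value) {
    std::lock_guard<std::mutex> lock(mutex_);
    options_[key] = value;
  }

  int get(const std::string& key) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return options_.at(key);
  }

 private:
  mutable std::mutex mutex_;  // mutable so const methods and the copy can lock
  std::map<std::string, int> options_;
};
```

Semantically this matches a direct assignment of options_ under a lock, but doing it in the initializer list avoids a default-construct-then-assign and keeps the lock scoped to exactly the duration of the copy.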

@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Additional Comments (2)

csrc/scheduler/transpose_tma.cpp, line 106
Infinite loop when estimated_tile_size1 starts at zero

If bytes_per_tile < kTmaSwizzleBytes (line 91-92), integer division yields estimated_tile_size1 = 0. The while loop (line 104) then spins forever because 0 * 2 == 0 and get_chunks_per_thread() (line 98-102) stays at 0, which is always less than min_chunks_per_thread = 4.

On an H100 (maxThreadsPerMultiProcessor = 2048, cta_per_sm = 8, bytes_per_cta = 8192), this triggers when n_input > 64. Add an initialization guard before the loop:

  // Ensure we start from at least 1 to avoid multiplying 0 forever.
  if (estimated_tile_size1 == 0) {
    estimated_tile_size1 = 1;
  }
  while (get_chunks_per_thread() < min_chunks_per_thread) {
    estimated_tile_size1 *= 2;
  }

tests/cpp/test_transpose.cpp, line 1947
Unconditional debug output will pollute test logs

The std::cout block (lines 1945–1947) prints every bank conflict unconditionally. This makes test runner output noisy, especially since the BFloat16 path is expected to have bank conflicts. Consider wrapping the print in a debug flag or removing it:

      if (auto* ke = dynamic_cast<KernelExecutor*>(executor.get())) {
        auto bank_conflicts = getBankConflictInfo(ke->compiledKernel()->kernel());
        if (dtype == DataType::Float) {
          EXPECT_TRUE(bank_conflicts.empty());
        } else {
          // TODO: update to EXPECT_TRUE once bf16 bank conflicts are resolved.
          EXPECT_FALSE(bank_conflicts.empty());
        }
      }

@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Additional Comments (4)

csrc/scheduler/transpose_tma.cpp, line 107
Potential infinite loop when estimated_tile_size1 initializes to zero

If bytes_per_tile < kTmaSwizzleBytes (128), integer division yields estimated_tile_size1 = 0. The while loop then evaluates get_chunks_per_thread() as 0 (because the numerator is 0 * tile_size2 = 0) and multiplies: 0 * 2 = 0 — the loop never terminates.

This happens when bytes_per_cta / n_input < 128. With an SM90 GPU (maxThreadsPerMultiProcessor = 2048), cta_per_sm = 8, giving bytes_per_cta = 8192. So the loop infinite-hangs when n_input > 64.

While unlikely for typical transpose fusions (1–2 inputs), this is an unbounded loop with no guard. A simple fix is to initialise estimated_tile_size1 to at least 1:

int64_t estimated_tile_size1 =
    std::max(int64_t(1), bytes_per_tile / kTmaSwizzleBytes);

csrc/scheduler/transpose_tma.cpp, line 267
Missing guard before accessing tma_store_tvs when use_tma_store may be false

tma_store_tvs is only populated when tparams->use_tma_store == true (lines 164–173), but this block checks only tparams->is_output_smem_transpose. If is_output_smem_transpose = true but use_tma_store = false, then tma_store_tvs will be empty and .at(0) throws std::out_of_range.

Note the asymmetry: Step 3 already guards the analogous constraint with an explicit NVF_ERROR(tparams->use_tma_load, ...) at line 286-288. Adding the same guard here would be consistent:

if (tparams->is_output_smem_transpose) {
    NVF_ERROR(
        tparams->use_tma_store,
        "TMA store must be used when output smem is transposed");
    MmaInputSmemSwizzle swizzle =
        mma_utils::tmaSwizzleSharedMemory(tma_store_tvs.at(0));

tests/cpp/test_transpose.cpp, line 1949
Debug std::cout in test code — use GTest facilities instead

These std::cout lines will only fire when bank conflicts are detected (when the test is already failing). However, raw std::cout in tests is unconventional — GTest's ADD_FAILURE() / SCOPED_TRACE or just the EXPECT_TRUE failure message would be more idiomatic:

      for (auto& [expr, ways] : bank_conflicts) {
        auto [read_ways, write_ways] = ways;
        ADD_FAILURE() << "Bank conflict in: " << expr->toString()
                      << "  read=" << read_ways << "-way"
                      << ", write=" << write_ways << "-way";
      }



tests/cpp/test_transpose.cpp, line 1969
Typo "tranapose" should be "transpose" in multiple lines

// Test different combinations of TMA transpose parameters:
// (is_output_smem, use_tma_load, use_tma_store)
//   (false, true, false)  - input smem transpose, TMA load only
//   (false, true, true)   - input smem transpose, TMA load + TMA store
//   (true,  true, true)   - output smem transpose, TMA load + TMA store
//   (true,  false, true)  - output smem transpose, TMA store only

if (std::ranges::any_of(non_trivial_ids, [](const IterDomain* id) {
      return id->isThreadDim() ||
          id->getParallelType() == ParallelType::Serial;
    })) {
Collaborator Author

A trivial optimization for multiple TMA loads; it doesn't have to be in this PR.


// When not using output smem transpose but inputs > outputs, swap groups
// so group 2 remains the swizzled side.
if (!tparams->is_output_smem_transpose &&
Collaborator Author

This branch is not used in the current heuristics, but may be used in future tuning.

@liqiangxl
Copy link
Collaborator Author

!test

2 similar comments
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

auto bank_conflicts = getBankConflictInfo(ke->compiledKernel()->kernel());
for (auto& [expr, ways] : bank_conflicts) {
  auto [read_ways, write_ways] = ways;
  std::cout << " Bank conflict: " << expr->toString()
Collaborator

Is std::cout necessary?

Collaborator Author

removed


@rdspring1 rdspring1 left a comment


Are the "Infinite loop in heuristics" and "Out-of-range crash with inconsistent params" greptile concerns valid?

@liqiangxl
Collaborator Author

Are the "Infinite loop in heuristics" and "Out-of-range crash with inconsistent params" greptile concerns valid?

Yes, they may happen in theory, added checks.
