Skip to content

Reproduce/issue 485#838

Open
kuri780 wants to merge 1 commit into
hw-native-sys:mainfrom
kuri780:reproduce/issue-485
Open

Reproduce/issue 485#838
kuri780 wants to merge 1 commit into
hw-native-sys:mainfrom
kuri780:reproduce/issue-485

Conversation

@kuri780

@kuri780 kuri780 commented Jun 18, 2026

Copy link
Copy Markdown

新增 PTOUnrollSIMTFor pass,规避 :SIMTVF kernel 中 scf.for + scf.if 常量分支导致 AICore 后端生成带谓词的 END @!P0

根因

SCF→CF→LLVM lowering 后 ret 自然在 cond_br 可达的 block 中,IR 本身正确。Bisheng AICore 后端未正确处理 SimtEntry calling convention,给 END 加上了谓词。

方案

pto.simt_entry 函数内常量 scf.for 完全展开,由下游 SCCP + canonicalize 自动消除 scf.if 分支,生成无分支直线代码。

  • 显式标注 {pto.unroll = "full"}:无条件展开
  • 自动检测:常量 bounds 且 trip count ≤ 64(可通过 --max-trip-count 调整)

涉及文件

  • lib/PTO/Transforms/PTOUnrollSIMTForPass.cpp — pass 实现(新增)
  • include/PTO/Transforms/Passes.td — 定义
  • include/PTO/Transforms/Passes.h — 声明
  • lib/PTO/Transforms/CMakeLists.txt — 注册
  • tools/ptoas/ptoas.cpp — 接入 prepareVPTOForEmission

@reedhecre

reedhecre commented Jun 18, 2026

Copy link
Copy Markdown

Codex Review

该评论由 review 机器人自动更新。

  • PR: Reproduce/issue 485 #838 Reproduce/issue 485
  • Author: kuri780
  • Base/Head: main / reproduce/issue-485
  • Head SHA: 89a33e767026
  • Trigger: PR 有新提交
  • Generated At: 2026-06-18T10:06:07Z
  • Previous Head SHA: 04d8a0c68f9a
  • Status: completed

Summary

PR #838 存在一个确定的构建阻断问题;另外,issue #485 的 workaround 只覆盖了 ptoas 驱动路径,没有覆盖公开的 VPTO lowering API,并且缺少对应回归测试。

Findings

  1. P1 新 pass 引入了 SCF utility 依赖,但 CMake 没有补齐链接库 lib/PTO/Transforms/CMakeLists.txt:104

PTOUnrollSIMTForPass.cpp 新增了对 loopUnrollByFactormlir/Dialect/SCF/Utils/Utils.h)的调用,但 PTOTransformsLINK_LIBS 仍然没有补上承载该实现的 SCF utility 库。上游 MLIR 中这个符号不在 MLIRSCFDialect / MLIRSCFToControlFlow / MLIRTransforms 里,因此这次改动会把未解析符号带到 PTOTransforms/ptoas 的链接阶段,导致 CI/本地构建直接失败。

  1. P2 修复只接到了驱动侧 pipeline,公开的 VPTO lowering API 仍会绕过它 tools/ptoas/ptoas.cpp:1465

workaround 只在 tools/ptoas/ptoas.cppprepareVPTOForEmission() 里接入。也就是说,走 ptoas CLI 的路径会执行这个 pass,但对外公开的 lowerVPTOModuleToLLVMModules*() lowering API 本身没有任何对应更新。仓库已经导出/安装了 PTOTransforms,因此直接嵌入该库的调用方会绕过这次修复,仍然命中 issue #485,造成“同一份 VPTO IR 经 CLI 可规避、经库 API 仍失败”的行为分叉。

  1. P2 缺少 issue [Bug] decode_attention_incore_7 kernel hangs on A5 but passes on A3 (ptoas 0.24) #485 的回归测试,CI 无法守住这次 workaround

这个 PR 改动了所有 pto.simt_entry 函数的 VPTO emission pipeline,但没有新增任何针对 issue #485 的回归用例,也没有测试覆盖新的 pto-unroll-simt-for pass。这样 CI 无法证明 scf.for + 依赖 IV 的 scf.if 这个关键模式已经被消除,也无法在后续 pass 顺序变更时第一时间发现 workaround 失效。对于一个面向后端误编译的修复,这个测试缺口是明显的。

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the PTOUnrollSIMTFor pass to unroll small constant-trip-count scf.for loops inside pto.simt_entry functions, addressing a bug (Issue #485) where structured control flow can lead to predicated END instructions in the AICore backend. Feedback on the implementation highlights a critical issue in PTOUnrollSIMTForPass.cpp where using applyPatternsAndFoldGreedily alongside loopUnrollByFactor can corrupt the greedy driver's internal state due to direct IR mutations. It is recommended to replace the pattern-based rewrite with a post-order walk-based unroller directly in runOnOperation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +93 to +173
namespace {

struct UnrollSIMTForPattern : public OpRewritePattern<scf::ForOp> {
UnrollSIMTForPattern(MLIRContext *ctx, int64_t maxTripCount)
: OpRewritePattern(ctx), maxTripCount(maxTripCount) {}

LogicalResult matchAndRewrite(scf::ForOp forOp,
PatternRewriter &rewriter) const override {
// Only apply inside SIMT entry functions.
auto func = forOp->getParentOfType<func::FuncOp>();
if (!func || !isSIMTEntry(func))
return failure();

// Check explicit annotation first.
if (hasUnrollFullAttr(forOp)) {
return unrollFull(forOp, rewriter, /*isAnnotated=*/true);
}

// Auto-detect: constant bounds and small trip count.
auto tripCount = computeConstantTripCount(forOp);
if (!tripCount || *tripCount > maxTripCount)
return failure();

return unrollFull(forOp, rewriter, /*isAnnotated=*/false);
}

private:
LogicalResult unrollFull(scf::ForOp forOp, PatternRewriter &rewriter,
bool isAnnotated) const {
auto tripCount = computeConstantTripCount(forOp);
if (!tripCount || *tripCount <= 0)
return failure();

LLVM_DEBUG(llvm::dbgs()
<< "PTOUnrollSIMTFor: unrolling scf.for "
<< (isAnnotated ? "(annotated)" : "(auto)") << " tripCount="
<< *tripCount << " at " << forOp.getLoc() << "\n");

// loopUnrollByFactor returns failure if the loop carries iteration
// arguments that have uses outside the loop (live-out values). In that
// case we cannot unroll.
if (failed(loopUnrollByFactor(forOp, static_cast<uint64_t>(*tripCount))))
return failure();

return success();
}

int64_t maxTripCount;
};

} // namespace

// ---------------------------------------------------------------------------
// Pass definition
// ---------------------------------------------------------------------------

namespace {

struct PTOUnrollSIMTFor : public pto::impl::PTOUnrollSIMTForBase<PTOUnrollSIMTFor> {
using pto::impl::PTOUnrollSIMTForBase<
PTOUnrollSIMTFor>::PTOUnrollSIMTForBase;

void runOnOperation() override {
func::FuncOp func = getOperation();
if (!isSIMTEntry(func))
return;

MLIRContext *ctx = &getContext();
RewritePatternSet patterns(ctx);
patterns.add<UnrollSIMTForPattern>(ctx, maxTripCount);

GreedyRewriteConfig config;
config.maxIterations = 10; // loops may nest
config.strictMode = GreedyRewriteStrictness::ExistingOps;

if (failed(applyPatternsAndFoldGreedily(func, std::move(patterns), config)))
signalPassFailure();
}
};

} // namespace

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Dangerous IR Mutation in Greedy Pattern Rewrite Driver

Using applyPatternsAndFoldGreedily with an OpRewritePattern that calls loopUnrollByFactor is highly dangerous and can lead to compiler crashes or undefined behavior in MLIR.

Why this is an issue:

  1. Greedy Driver Contracts: The greedy pattern rewrite driver relies on the PatternRewriter to track all IR modifications (such as operation creation, erasure, and replacement) to keep its internal worklist and dead-node tracking structures consistent.
  2. Direct Mutation: loopUnrollByFactor directly mutates the IR (erasing the scf.for loop and its blocks) without using or notifying a PatternRewriter.
  3. State Corruption: When an OpRewritePattern performs direct IR mutations outside the rewriter, the greedy driver's internal state becomes corrupted, leading to use-after-free bugs or assertion failures (especially in debug builds).

Solution:

Instead of a pattern-based approach with applyPatternsAndFoldGreedily, implement a simple post-order walk-based unroller directly in runOnOperation. Since walk defaults to post-order (innermost first), we can safely collect and unroll nested loops without invalidating outer loop pointers or corrupting any driver state.

namespace {

struct PTOUnrollSIMTFor : public pto::impl::PTOUnrollSIMTForBase<PTOUnrollSIMTFor> {
  using pto::impl::PTOUnrollSIMTForBase<
      PTOUnrollSIMTFor>::PTOUnrollSIMTForBase;

  void runOnOperation() override {
    func::FuncOp func = getOperation();
    if (!isSIMTEntry(func))
      return;

    // Collect all loops in post-order (innermost first) to safely unroll nested loops.
    SmallVector<scf::ForOp> loops;
    func.walk([&](scf::ForOp forOp) {
      loops.push_back(forOp);
    });

    for (scf::ForOp forOp : loops) {
      // Check explicit annotation first.
      if (hasUnrollFullAttr(forOp)) {
        (void)unrollFull(forOp, /*isAnnotated=*/true);
        continue;
      }

      // Auto-detect: constant bounds and small trip count.
      auto tripCount = computeConstantTripCount(forOp);
      if (tripCount && *tripCount <= maxTripCount) {
        (void)unrollFull(forOp, /*isAnnotated=*/false);
      }
    }
  }

private:
  LogicalResult unrollFull(scf::ForOp forOp, bool isAnnotated) const {
    auto tripCount = computeConstantTripCount(forOp);
    if (!tripCount || *tripCount <= 0)
      return failure();

    LLVM_DEBUG(llvm::dbgs()
               << "PTOUnrollSIMTFor: unrolling scf.for "
               << (isAnnotated ? "(annotated)" : "(auto)") << " tripCount="
               << *tripCount << " at " << forOp.getLoc() << "\\n");

    if (failed(loopUnrollByFactor(forOp, static_cast<uint64_t>(*tripCount))))
      return failure();

    return success();
  }
};

} // namespace

Issue: mouliangyu#485

When a pto.simt_entry function contains scf.for + scf.if with constant
conditions (e.g. for i in 0..16: if i < 10: store), the AICore backend
may emit a predicated END instruction, violating SIMTVF constraints.

This pass unrolls small constant-trip-count scf.for loops inside
pto.simt_entry functions, then relies on SCCP + canonicalize to
constant-fold the induction-variable-dependent scf.if branches,
producing straight-line code without divergent control flow.

Two strategies:
1. Explicit: scf.for {pto.unroll = "full"} — always unrolls
2. Automatic: constant trip count <= max-trip-count (default 64)

Pass runs in prepareVPTOForEmission, before SCF->CF lowering.

Co-Authored-By: Claude <noreply@anthropic.com>
@kuri780 kuri780 force-pushed the reproduce/issue-485 branch from 2e0bbe2 to 89a33e7 Compare June 18, 2026 06:45

@Zhendong404 Zhendong404 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果已知这个循环次数是常量,用for static_range应该就可以在DSL层展平控制流。
而且我感觉这个PR不能真正解决问题,只是这个场景被规避了,如果真的有动态for循环+if怎么办

@KurrinQu

Copy link
Copy Markdown
Contributor

如果已知这个循环次数是常量,用for static_range应该就可以在DSL层展平控制流。 而且我感觉这个PR不能真正解决问题,只是这个场景被规避了,如果真的有动态for循环+if怎么办

  1. 对于issue上的这个场景,unroll的方案只是规避这个问题,真正问题是否为PTO生成的IR问题或者是CCEC问题,需要进一步对比分析,这个是issue上第3点问题。
  2. 现在SIMT是通过PTOAS对接,而非DSL对接,因此无法使用static_range展开规避。
  3. 抛开上边的issue,我认为在PTOAS中增加循环unroll是必要的,原因是我理解只通过DSL的static_range无法完全解决循环unroll的问题。例如:
if (runtime condition):
   // do something
else:
  for i in static_range(n):
    // do other things

如果是runtime才能确定if分支执行路径,我理解else中的循环无法展开,就需要PTOAS在编译时展开。


.simt
// .weak simt_vf_0_simt_entry
.p2align 3

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ll文件和汇编文件建议不要上传了

@kuri780 kuri780 force-pushed the reproduce/issue-485 branch from 04d8a0c to 89a33e7 Compare June 18, 2026 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants