Reproduce/issue 485 by kuri780 · Pull Request #838 · hw-native-sys/PTOAS

kuri780 · 2026-06-18T06:39:50Z

新增 PTOUnrollSIMTFor pass，规避：SIMTVF kernel 中 scf.for + scf.if 常量分支导致 AICore 后端生成带谓词的 END @!P0。

根因

SCF→CF→LLVM lowering 后 ret 自然在 cond_br 可达的 block 中，IR 本身正确。Bisheng AICore 后端未正确处理 SimtEntry calling convention，给 END 加上了谓词。

方案

将 pto.simt_entry 函数内常量 scf.for 完全展开，由下游 SCCP + canonicalize 自动消除 scf.if 分支，生成无分支直线代码。

显式标注 {pto.unroll = "full"}：无条件展开
自动检测：常量 bounds 且 trip count ≤ 64（可通过 --max-trip-count 调整）

涉及文件

lib/PTO/Transforms/PTOUnrollSIMTForPass.cpp — pass 实现（新增）
include/PTO/Transforms/Passes.td — 定义
include/PTO/Transforms/Passes.h — 声明
lib/PTO/Transforms/CMakeLists.txt — 注册
tools/ptoas/ptoas.cpp — 接入 prepareVPTOForEmission

reedhecre · 2026-06-18T06:40:27Z

Codex Review

该评论由 review 机器人自动更新。

PR: Reproduce/issue 485 #838 Reproduce/issue 485
Author: kuri780
Base/Head: main / reproduce/issue-485
Head SHA: 89a33e767026
Trigger: PR 有新提交
Generated At: 2026-06-18T10:06:07Z
Previous Head SHA: 04d8a0c68f9a
Status: completed

Summary

PR #838 存在一个确定的构建阻断问题；另外，issue #485 的 workaround 只覆盖了 ptoas 驱动路径，没有覆盖公开的 VPTO lowering API，并且缺少对应回归测试。

Findings

P1 新 pass 引入了 SCF utility 依赖，但 CMake 没有补齐链接库 lib/PTO/Transforms/CMakeLists.txt:104

PTOUnrollSIMTForPass.cpp 新增了对 loopUnrollByFactor（mlir/Dialect/SCF/Utils/Utils.h）的调用，但 PTOTransforms 的 LINK_LIBS 仍然没有补上承载该实现的 SCF utility 库。上游 MLIR 中这个符号不在 MLIRSCFDialect / MLIRSCFToControlFlow / MLIRTransforms 里，因此这次改动会把未解析符号带到 PTOTransforms/ptoas 的链接阶段，导致 CI/本地构建直接失败。

P2 修复只接到了驱动侧 pipeline，公开的 VPTO lowering API 仍会绕过它 tools/ptoas/ptoas.cpp:1465

workaround 只在 tools/ptoas/ptoas.cpp 的 prepareVPTOForEmission() 里接入。也就是说，走 ptoas CLI 的路径会执行这个 pass，但对外公开的 lowerVPTOModuleToLLVMModules*() lowering API 本身没有任何对应更新。仓库已经导出/安装了 PTOTransforms，因此直接嵌入该库的调用方会绕过这次修复，仍然命中 issue #485，造成“同一份 VPTO IR 经 CLI 可规避、经库 API 仍失败”的行为分叉。

P2 缺少 issue [Bug] decode_attention_incore_7 kernel hangs on A5 but passes on A3 (ptoas 0.24) #485 的回归测试，CI 无法守住这次 workaround

这个 PR 改动了所有 pto.simt_entry 函数的 VPTO emission pipeline，但没有新增任何针对 issue #485 的回归用例，也没有测试覆盖新的 pto-unroll-simt-for pass。这样 CI 无法证明 scf.for + 依赖 IV 的 scf.if 这个关键模式已经被消除，也无法在后续 pass 顺序变更时第一时间发现 workaround 失效。对于一个面向后端误编译的修复，这个测试缺口是明显的。

gemini-code-assist

Code Review

This pull request introduces the PTOUnrollSIMTFor pass to unroll small constant-trip-count scf.for loops inside pto.simt_entry functions, addressing a bug (Issue #485) where structured control flow can lead to predicated END instructions in the AICore backend. Feedback on the implementation highlights a critical issue in PTOUnrollSIMTForPass.cpp where using applyPatternsAndFoldGreedily alongside loopUnrollByFactor can corrupt the greedy driver's internal state due to direct IR mutations. It is recommended to replace the pattern-based rewrite with a post-order walk-based unroller directly in runOnOperation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-18T06:41:36Z

+namespace {
+
+struct UnrollSIMTForPattern : public OpRewritePattern<scf::ForOp> {
+  UnrollSIMTForPattern(MLIRContext *ctx, int64_t maxTripCount)
+      : OpRewritePattern(ctx), maxTripCount(maxTripCount) {}
+
+  LogicalResult matchAndRewrite(scf::ForOp forOp,
+                                PatternRewriter &rewriter) const override {
+    // Only apply inside SIMT entry functions.
+    auto func = forOp->getParentOfType<func::FuncOp>();
+    if (!func || !isSIMTEntry(func))
+      return failure();
+
+    // Check explicit annotation first.
+    if (hasUnrollFullAttr(forOp)) {
+      return unrollFull(forOp, rewriter, /*isAnnotated=*/true);
+    }
+
+    // Auto-detect: constant bounds and small trip count.
+    auto tripCount = computeConstantTripCount(forOp);
+    if (!tripCount || *tripCount > maxTripCount)
+      return failure();
+
+    return unrollFull(forOp, rewriter, /*isAnnotated=*/false);
+  }
+
+private:
+  LogicalResult unrollFull(scf::ForOp forOp, PatternRewriter &rewriter,
+                           bool isAnnotated) const {
+    auto tripCount = computeConstantTripCount(forOp);
+    if (!tripCount || *tripCount <= 0)
+      return failure();
+
+    LLVM_DEBUG(llvm::dbgs()
+               << "PTOUnrollSIMTFor: unrolling scf.for "
+               << (isAnnotated ? "(annotated)" : "(auto)") << " tripCount="
+               << *tripCount << " at " << forOp.getLoc() << "\n");
+
+    // loopUnrollByFactor returns failure if the loop carries iteration
+    // arguments that have uses outside the loop (live-out values).  In that
+    // case we cannot unroll.
+    if (failed(loopUnrollByFactor(forOp, static_cast<uint64_t>(*tripCount))))
+      return failure();
+
+    return success();
+  }
+
+  int64_t maxTripCount;
+};
+
+} // namespace
+
+// ---------------------------------------------------------------------------
+// Pass definition
+// ---------------------------------------------------------------------------
+
+namespace {
+
+struct PTOUnrollSIMTFor : public pto::impl::PTOUnrollSIMTForBase<PTOUnrollSIMTFor> {
+  using pto::impl::PTOUnrollSIMTForBase<
+      PTOUnrollSIMTFor>::PTOUnrollSIMTForBase;
+
+  void runOnOperation() override {
+    func::FuncOp func = getOperation();
+    if (!isSIMTEntry(func))
+      return;
+
+    MLIRContext *ctx = &getContext();
+    RewritePatternSet patterns(ctx);
+    patterns.add<UnrollSIMTForPattern>(ctx, maxTripCount);
+
+    GreedyRewriteConfig config;
+    config.maxIterations = 10; // loops may nest
+    config.strictMode = GreedyRewriteStrictness::ExistingOps;
+
+    if (failed(applyPatternsAndFoldGreedily(func, std::move(patterns), config)))
+      signalPassFailure();
+  }
+};
+
+} // namespace


Dangerous IR Mutation in Greedy Pattern Rewrite Driver

Using applyPatternsAndFoldGreedily with an OpRewritePattern that calls loopUnrollByFactor is highly dangerous and can lead to compiler crashes or undefined behavior in MLIR.

Why this is an issue:

Greedy Driver Contracts: The greedy pattern rewrite driver relies on the PatternRewriter to track all IR modifications (such as operation creation, erasure, and replacement) to keep its internal worklist and dead-node tracking structures consistent.

Direct Mutation: loopUnrollByFactor directly mutates the IR (erasing the scf.for loop and its blocks) without using or notifying a PatternRewriter.

State Corruption: When an OpRewritePattern performs direct IR mutations outside the rewriter, the greedy driver's internal state becomes corrupted, leading to use-after-free bugs or assertion failures (especially in debug builds).

Solution:

Instead of a pattern-based approach with applyPatternsAndFoldGreedily, implement a simple post-order walk-based unroller directly in runOnOperation. Since walk defaults to post-order (innermost first), we can safely collect and unroll nested loops without invalidating outer loop pointers or corrupting any driver state.

namespace { struct PTOUnrollSIMTFor : public pto::impl::PTOUnrollSIMTForBase<PTOUnrollSIMTFor> { using pto::impl::PTOUnrollSIMTForBase< PTOUnrollSIMTFor>::PTOUnrollSIMTForBase; void runOnOperation() override { func::FuncOp func = getOperation(); if (!isSIMTEntry(func)) return; // Collect all loops in post-order (innermost first) to safely unroll nested loops. SmallVector<scf::ForOp> loops; func.walk([&](scf::ForOp forOp) { loops.push_back(forOp); }); for (scf::ForOp forOp : loops) { // Check explicit annotation first. if (hasUnrollFullAttr(forOp)) { (void)unrollFull(forOp, /*isAnnotated=*/true); continue; } // Auto-detect: constant bounds and small trip count. auto tripCount = computeConstantTripCount(forOp); if (tripCount && *tripCount <= maxTripCount) { (void)unrollFull(forOp, /*isAnnotated=*/false); } } } private: LogicalResult unrollFull(scf::ForOp forOp, bool isAnnotated) const { auto tripCount = computeConstantTripCount(forOp); if (!tripCount || *tripCount <= 0) return failure(); LLVM_DEBUG(llvm::dbgs() << "PTOUnrollSIMTFor: unrolling scf.for " << (isAnnotated ? "(annotated)" : "(auto)") << " tripCount=" << *tripCount << " at " << forOp.getLoc() << "\\n"); if (failed(loopUnrollByFactor(forOp, static_cast<uint64_t>(*tripCount)))) return failure(); return success(); } }; } // namespace

Issue: mouliangyu#485 When a pto.simt_entry function contains scf.for + scf.if with constant conditions (e.g. for i in 0..16: if i < 10: store), the AICore backend may emit a predicated END instruction, violating SIMTVF constraints. This pass unrolls small constant-trip-count scf.for loops inside pto.simt_entry functions, then relies on SCCP + canonicalize to constant-fold the induction-variable-dependent scf.if branches, producing straight-line code without divergent control flow. Two strategies: 1. Explicit: scf.for {pto.unroll = "full"} — always unrolls 2. Automatic: constant trip count <= max-trip-count (default 64) Pass runs in prepareVPTOForEmission, before SCF->CF lowering. Co-Authored-By: Claude <noreply@anthropic.com>

Zhendong404

如果已知这个循环次数是常量，用for static_range应该就可以在DSL层展平控制流。
而且我感觉这个PR不能真正解决问题，只是这个场景被规避了，如果真的有动态for循环+if怎么办

KurrinQu · 2026-06-18T09:05:36Z

如果已知这个循环次数是常量，用for static_range应该就可以在DSL层展平控制流。而且我感觉这个PR不能真正解决问题，只是这个场景被规避了，如果真的有动态for循环+if怎么办

对于issue上的这个场景，unroll的方案只是规避这个问题，真正问题是否为PTO生成的IR问题或者是CCEC问题，需要进一步对比分析，这个是issue上第3点问题。
现在SIMT是通过PTOAS对接，而非DSL对接，因此无法使用static_range展开规避。
抛开上边的issue，我认为在PTOAS中增加循环unroll是必要的，原因是我理解只通过DSL的static_range无法完全解决循环unroll的问题。例如：

if (runtime condition):
   // do something
else:
  for i in static_range(n):
    // do other things

如果是runtime才能确定if分支执行路径，我理解else中的循环无法展开，就需要PTOAS在编译时展开。

KurrinQu · 2026-06-18T09:31:11Z

+
+	.simt
+	// .weak	simt_vf_0_simt_entry
+	.p2align	3


ll文件和汇编文件建议不要上传了

gemini-code-assist Bot reviewed Jun 18, 2026

View reviewed changes

kuri780 force-pushed the reproduce/issue-485 branch from 2e0bbe2 to 89a33e7 Compare June 18, 2026 06:45

Zhendong404 suggested changes Jun 18, 2026

View reviewed changes

Zhendong404 approved these changes Jun 18, 2026

View reviewed changes

KurrinQu reviewed Jun 18, 2026

View reviewed changes

kuri780 force-pushed the reproduce/issue-485 branch from 04d8a0c to 89a33e7 Compare June 18, 2026 09:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproduce/issue 485#838

Reproduce/issue 485#838
kuri780 wants to merge 1 commit into
hw-native-sys:mainfrom
kuri780:reproduce/issue-485

kuri780 commented Jun 18, 2026

Uh oh!

reedhecre commented Jun 18, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Uh oh!

Zhendong404 left a comment

Uh oh!

KurrinQu commented Jun 18, 2026

Uh oh!

KurrinQu Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

kuri780 commented Jun 18, 2026

根因

方案

涉及文件

Uh oh!

reedhecre commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Review

Summary

Findings

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Choose a reason for hiding this comment

Dangerous IR Mutation in Greedy Pattern Rewrite Driver

Why this is an issue:

Solution:

Uh oh!

Zhendong404 left a comment

Choose a reason for hiding this comment

Uh oh!

KurrinQu commented Jun 18, 2026

Uh oh!

KurrinQu Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

reedhecre commented Jun 18, 2026 •

edited

Loading