Add shape inference.#841
Conversation
There was a problem hiding this comment.
Code Review
This pull request enhances the tile fusion pipeline to support operations with computed dynamic shapes by introducing a ShapeConstraintSolver and structural-equivalence-based shape canonicalization. It also adds support for row and column reduction operations, handles epilogue operations in low-level loop fusion, and updates scheduling dependencies for tile allocations. Feedback on these changes highlights two main issues: first, arith.divi is not a valid MLIR Arith operation and should be replaced with arith.divsi and arith.divui; second, there is a safety gap in the alias analysis where prior stages' epilogue operations are reordered across later stages' operations without being checked for dependencies.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if (opName.starts_with("arith.")) { | ||
| return opName == "arith.minsi" || opName == "arith.maxsi" || | ||
| opName == "arith.minui" || opName == "arith.maxui" || | ||
| opName == "arith.cmpi" || opName == "arith.select" || | ||
| opName == "arith.addi" || opName == "arith.subi" || | ||
| opName == "arith.muli" || opName == "arith.divi" || | ||
| opName == "arith.index_cast"; | ||
| } |
There was a problem hiding this comment.
In the MLIR Arith dialect, there is no arith.divi operation. The integer division operations are arith.divsi (signed division) and arith.divui (unsigned division). Please update the check to include arith.divsi and arith.divui instead of arith.divi to ensure integer division operations can participate in shape canonicalization.
| if (opName.starts_with("arith.")) { | |
| return opName == "arith.minsi" || opName == "arith.maxsi" || | |
| opName == "arith.minui" || opName == "arith.maxui" || | |
| opName == "arith.cmpi" || opName == "arith.select" || | |
| opName == "arith.addi" || opName == "arith.subi" || | |
| opName == "arith.muli" || opName == "arith.divi" || | |
| opName == "arith.index_cast"; | |
| } | |
| if (opName.starts_with("arith.")) { | |
| return opName == "arith.minsi" || opName == "arith.maxsi" || | |
| opName == "arith.minui" || opName == "arith.maxui" || | |
| opName == "arith.cmpi" || opName == "arith.select" || | |
| opName == "arith.addi" || opName == "arith.subi" || | |
| opName == "arith.muli" || opName == "arith.divsi" || | |
| opName == "arith.divui" || opName == "arith.index_cast"; | |
| } |
| // Epilogue ops are cloned after prior-stage leaf/epilogue ops in | ||
| // the fused loop, so they must not alias prior-stage memory ops. | ||
| for (Operation *op : level.epilogueOps) { | ||
| if (!canMovePreludeAcrossPriorStages(op, priorStages, debugOS)) | ||
| return false; | ||
| } |
There was a problem hiding this comment.
There is an asymmetry and potential safety gap in the alias analysis for epilogue operations:
- Redundancy: Checking
canMovePreludeAcrossPriorStages(op, priorStages)for the current stage's epilogue operations (level.epilogueOps) is redundant because they are cloned after all prior stages' leaf and epilogue operations in the fused loop. They do not move across them. - Omission: Conversely, prior stages' epilogue operations (
priorStage.epilogueOps) are reordered to execute after the current stage's leaf operations (level.leafOps/level.preludeOps). This means prior stages' epilogue operations are moved across later stages' operations, but we never check if they alias or have dependencies. While the fusion plan's domain-split checks mitigate this for direct dataflow, we should still perform a proper safety check to ensure prior stages' epilogue operations do not alias later stages' leaf/prelude operations.
Codex Review该评论由 review 机器人自动更新。
Summary检查到 2 个问题:新的 reduce-epilogue loop fusion 有副作用重排风险,且 Findings
这次改动允许 stage 在 child loop 后带
|
Summary
Two related improvements to the tile fusion pipeline that enable elementwise + reduce op chains (e.g.
tadd → tmul → trowsum) to be fused into a singlefusion_regionwith computed dynamic shapes, instead of generating separate elementwise and reduce loops.1. Shape inference via structural-equivalence canonicalization
Replaces the previous SSA-value-only
canonicalizeValuewith aStructuralSignatureMapthat deduplicates computable shape expressions (e.g.mins(vrow, c32)computed independently by different stages) into the same solver dimension. Two arith ops with the same name, attributes, and structurally equivalent operands are now treated as one representative, enabling stages that use separately materialized but structurally identical dynamic shapes to share loop bounds in the fused loop nest.Key changes:
FusionAnalysis.cpp: IntroduceStructuralSignatureMapwith structural-equivalence key (op name, canonical operand list, attribute dictionary). The solver merges structurally equivalent shape expressions into a single dimension, while preserving differences (e.g.mins(vrow0, c32)vsmins(vrow1, c32)wherevrow0andvrow1are different block arguments remain separate).PTOFusionPlan.cpp: Mark reduce ops (trowsum, trowmax, trowmin, etc.) asElementwisefor fusion grouping — they share the row iteration domain with their producers, and the solver correctly binds their output domain (e.g. 32×1 for row-reduce).PTOOpScheduling.cpp: Handle reduce output domains in scheduling.2. Epilogue ops support in PTOLowLevelLoopFusion
Previously,
analyzeStagerejected any ops appearing after the child loop within a loop level. Reduce ops like trowmax/trowsum produce a row-result store (avstswriting the 32×1 output tile) after the inner col loop — these were classified as "unsupported" and prevented fusion.This change introduces
epilogueOpstoLoopLevelInfo, collecting ops that appear after the child loop. These are:isSupportedPreludeOp(no regions, pure or alias-analyzable memory ops)canMovePreludeAcrossPriorStagesbuildFusedLoopNestAtLevel, ensuring they appear after the fused inner-loop computation for their stageResult:
tadd → trowsumchains now produce a singlefusion_regioncontaining a row loop withiter_args(reduce accumulator), an inner col loop with both elementwise and reduce computations, and an epiloguevstsfor the reduce row-result store.