feat(ptodsl): add flash-attention demos and mixed-backend subkernel pipeline support#816
feat(ptodsl): add flash-attention demos and mixed-backend subkernel pipeline support#816Zhendong404 wants to merge 13 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces backend-partitioned container support, a new pass to normalize uncovered tile sections, and helper inlining for PTODSL subkernel calls. It also updates the EmitC lowering pipeline to handle pointer-like tile buffer addresses and adds several new examples and tests. The reviewer feedback focuses on improving robustness and correctness: refactoring the uncovered segment collection to prevent over-aggressive splitting, adding defensive checks for memory space attributes in EmitC lowering to avoid potential crashes, ensuring deterministic symbol resolution during peer lookups, and using a more robust check for static dimensions.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| static void collectUncoveredTopLevelSegments( | ||
| func::FuncOp funcOp, SmallVectorImpl<UncoveredTopLevelSegment> &segments) { | ||
| if (!funcOp || funcOp.isDeclaration() || !funcOp.getBody().hasOneBlock()) | ||
| return; | ||
|
|
||
| Block &entryBlock = funcOp.getBody().front(); | ||
| UncoveredTopLevelSegment current; | ||
|
|
||
| auto flushCurrent = [&]() { | ||
| if (!current.firstOp) | ||
| return; | ||
| segments.push_back(current); | ||
| current = {}; | ||
| }; | ||
|
|
||
| for (Operation &op : entryBlock.getOperations()) { | ||
| if (isa<func::ReturnOp>(op)) { | ||
| flushCurrent(); | ||
| continue; | ||
| } | ||
|
|
||
| if (isExplicitSection(&op)) { | ||
| flushCurrent(); | ||
| continue; | ||
| } | ||
|
|
||
| UncoveredTopLevelSegment opSummary = summarizeTopLevelOperation(&op); | ||
| if (!opSummary.containsTileOp) { | ||
| flushCurrent(); | ||
| continue; | ||
| } | ||
|
|
||
| if (!current.firstOp) { | ||
| current = std::move(opSummary); | ||
| continue; | ||
| } | ||
|
|
||
| std::optional<InferredSectionKind> currentKind = inferSegmentKind(current); | ||
| std::optional<InferredSectionKind> opKind = inferSegmentKind(opSummary); | ||
| bool mustSplit = current.containsNestedExplicitSection || | ||
| opSummary.containsNestedExplicitSection || !currentKind || | ||
| !opKind || *currentKind != *opKind; | ||
| if (mustSplit) { | ||
| flushCurrent(); | ||
| current = std::move(opSummary); | ||
| continue; | ||
| } | ||
|
|
||
| mergeSegmentSummary(current, opSummary); | ||
| } | ||
|
|
||
| flushCurrent(); | ||
| } |
There was a problem hiding this comment.
The current implementation of collectUncoveredTopLevelSegments flushes and splits the active segment immediately upon encountering any top-level operation that does not contain a tile op (e.g., a scalar calculation or a subkernel call). This results in unnecessary splitting of adjacent tile operations of the same kind, creating multiple small sections and potentially duplicating neutral top-level operations (like subkernel calls) into both Cube and Vector child modules during splitting.\n\nInstead, we should use a cohesive partitioning algorithm that accumulates neutral operations into the current segment, and only flushes/splits when there is an actual conflict between Vector and Cube kinds, or when hitting an explicit section/return.
static void collectUncoveredTopLevelSegments(
func::FuncOp funcOp, SmallVectorImpl<UncoveredTopLevelSegment> &segments) {
if (!funcOp || funcOp.isDeclaration() || !funcOp.getBody().hasOneBlock())
return;
Block &entryBlock = funcOp.getBody().front();
UncoveredTopLevelSegment current;
auto flushCurrent = [&]() {
if (!current.firstOp)
return;
if (current.containsTileOp)
segments.push_back(current);
current = {};
};
for (Operation &op : entryBlock.getOperations()) {
if (isa<func::ReturnOp>(op) || isExplicitSection(&op)) {
flushCurrent();
continue;
}
UncoveredTopLevelSegment opSummary = summarizeTopLevelOperation(&op);
if (opSummary.containsNestedExplicitSection) {
flushCurrent();
segments.push_back(opSummary);
continue;
}
if (!current.firstOp) {
current = std::move(opSummary);
continue;
}
std::optional<InferredSectionKind> currentKind = inferSegmentKind(current);
std::optional<InferredSectionKind> opKind = inferSegmentKind(opSummary);
if (currentKind && opKind && *currentKind != *opKind) {
flushCurrent();
current = std::move(opSummary);
} else {
mergeSegmentSummary(current, opSummary);
}
}
flushCurrent();
}| std::string qualifier = | ||
| addrSpaceQualifier(type.getMemorySpace().getAddressSpace()); |
There was a problem hiding this comment.
Calling type.getMemorySpace().getAddressSpace() directly without checking if the memory space attribute is present and is of type AddressSpaceAttr can lead to a crash or undefined behavior. It is safer to use a defensive check similar to other parts of this file.
| std::string qualifier = | |
| addrSpaceQualifier(type.getMemorySpace().getAddressSpace()); | |
| std::string qualifier = "__gm__"; | |
| if (auto ms = type.getMemorySpace()) { | |
| if (auto ptoAttr = dyn_cast<pto::AddressSpaceAttr>(ms)) | |
| qualifier = addrSpaceQualifier(ptoAttr.getAddressSpace()); | |
| } |
| if (auto ptrTy = dyn_cast<pto::PtrType>(originalCalleeArgTy)) { | ||
| elemTy = ptrTy.getElementType(); | ||
| as = ptrTy.getMemorySpace().getAddressSpace(); |
There was a problem hiding this comment.
Defensively check ptrTy.getMemorySpace() using dyn_cast_or_null<pto::AddressSpaceAttr> before calling getAddressSpace() to prevent potential crashes if the memory space attribute is null or of an unexpected type.
if (auto ptrTy = dyn_cast<pto::PtrType>(originalCalleeArgTy)) {
elemTy = ptrTy.getElementType();
if (auto asAttr = dyn_cast_or_null<pto::AddressSpaceAttr>(ptrTy.getMemorySpace()))
as = asAttr.getAddressSpace();
}| SmallVector<func::FuncOp> fallbackMatches; | ||
| outerModule.walk([&](func::FuncOp funcOp) { | ||
| auto visibility = funcOp->getAttrOfType<StringAttr>("sym_visibility"); | ||
| if (visibility && visibility.getValue() == "private") | ||
| return WalkResult::advance(); | ||
|
|
||
| StringRef symbolName = funcOp.getSymName(); | ||
| if (symbolName == target) { | ||
| fallbackMatches.clear(); | ||
| fallbackMatches.push_back(funcOp); | ||
| return WalkResult::interrupt(); | ||
| } | ||
| if (funcOp->hasAttr(kPTODSLLogicalNameAttrName) && | ||
| getPTODSLLogicalNameOrSymbolName(funcOp) == target) | ||
| fallbackMatches.push_back(funcOp); | ||
| return WalkResult::advance(); | ||
| }); |
There was a problem hiding this comment.
Interrupting the walk immediately upon finding an exact symbol name match can lead to non-deterministic or incorrect symbol resolution if multiple sibling modules contain public functions with the same name. It is safer to collect all matches (both exact and logical) without interrupting, and then verify that exactly one unique match exists.
SmallVector<func::FuncOp> fallbackMatches;
outerModule.walk([&](func::FuncOp funcOp) {
auto visibility = funcOp->getAttrOfType<StringAttr>("sym_visibility");
if (visibility && visibility.getValue() == "private")
return WalkResult::advance();
StringRef symbolName = funcOp.getSymName();
if (symbolName == target ||
(funcOp->hasAttr(kPTODSLLogicalNameAttrName) &&
getPTODSLLogicalNameOrSymbolName(funcOp) == target)) {
fallbackMatches.push_back(funcOp);
}
return WalkResult::advance();
});| bool isStatic = llvm::all_of(shape, [](int64_t dim) { | ||
| return dim != ShapedType::kDynamic; | ||
| }); |
There was a problem hiding this comment.
Using dim != ShapedType::kDynamic to check for static dimensions is less robust than checking dim >= 0, as any negative dimension size in MLIR represents a dynamic or invalid dimension. Checking dim >= 0 prevents potential issues with other negative dimension representations.
| bool isStatic = llvm::all_of(shape, [](int64_t dim) { | |
| return dim != ShapedType::kDynamic; | |
| }); | |
| bool isStatic = llvm::all_of(shape, [](int64_t dim) { | |
| return dim >= 0; | |
| }); |
Codex Review该评论由 review 机器人自动更新。
SummaryReview failed at stage Findings未生成结构化 findings,因为 review 过程提前失败。 Log Tail |
4ab1ce5 to
912e855
Compare
|
/run a3 |
|
/run a5 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
A3 板测失败
日志尾部 |
A5 板测失败
失败用例
|
A5 板测失败详情:PR #816test_tmov_row_major_1x16_control_a5
test_tmov_col_major_16x1_align_a5
test_dynamic_valid_shape
test_barrier_sync
test_auto_sync_tail_hint
rmsnorm_incore_0
rar_optimization_test
nested_loop_confliect
matmul
decode_projection_incore_0
compensation_test
add_double_dynamic
rems
rem
rope_kv_cache
rmsnorm
qwen3_decode_incore_7
qwen3_decode_incore_6
qwen3_decode_incore_5
qwen3_decode_incore_4
qwen3_decode_incore_2
qwen3_decode_incore_1
qwen3_decode_incore_12
qwen3_decode_incore_11
qwen3_decode_incore_10
post_rmsnorm
vector_example_dag_kernel_mul
vector_example_dag_kernel_add_scalar
vector_example_dag_kernel_add
paged_attention_example_kernel_softmax_prepare
paged_attention_example_kernel_qk_matmul
paged_attention_example_kernel_pv_matmul
paged_attention_example_kernel_online_update
paged_attention_example_kernel_init_inplace
orchestration_example_kernel_mul
orchestration_example_kernel_add_scalar
orchestration_example_kernel_add
prelu
plan_memory_reuse_sequential
plan_memory_peak_exact_capacity
plan_memory_peak_8_overlapping
plan_memory_no_reuse_overlap
plan_memory_nested_loops
plan_memory_loop_no_reuse_outer_live
plan_memory_loop_in_if
plan_memory_if_yield
plan_memory_if_in_loop
plan_memory_fragmentation_two_holes
plan_memory_fragmentation_hole_fit
plan_memory_for_iter_args_yield
plan_memory_bind_tile_alias_liveness
partition_view_verify_valid
partition_view_verify_rank_mismatch_valid
partition5d_dynamic_a5
partition5d_a5
tensor_view_layout_dn
sparse_attn_test_incore_7
decode_swa_test_incore_40
decode_hca_test_incore_54
decode_csa_test_incore_81
attention_swa_test_incore_40
attention_hca_test_incore_54
attention_csa_test_refresh_incore_81
tbroadcast_root_binding
cmps
cmp
|
26c8e46 to
356a943
Compare
|
/run a3 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
|
/run a5 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
A5 板测失败
日志尾部 |
A3 板测失败
日志尾部 |
… output LLVM IR format
356a943 to
68069a0
Compare
Summary
This PR brings in a PTODSL/ptoas update centered on flash-attention demos, mixed-kernel or mixed-backend compilation, and subkernel/backend helper lowering fixes.
Compared with
hw-native-sys/PTOAS:main, this branch is ahead by 12 commits and includes both frontend PTODSL improvements and the supporting PTOAS lowering/runtime/test updates.Main Changes
ptodsl/examples/flash_attention/flash_attention_cv_split.pyptodsl/examples/flash_attention/flash_attention_vf_fusion.pyptodsl/examples/flash_attention/ptodsl/examples/fa_dn_ptodsl.pygm_tensorslot support for PTODSL pipe flowPTOAS__TILE_DATAreads afterTASSIGNemitc.verbatimtrailing semicolons in emitted C++ as a workaroundpto.section.*regionsvmulscvtemission in the CANN900 LLVM emitterptodsl/docs/user_guide/01-introduction.mdptodsl/docs/user_guide/03-kernel-entry-and-subkernels.mdtest/lit/lit.cfg.pyScope
Diff summary:
Validation
fa-demo-restackis already pushed to forkZhendong404/PTOAShw-native-sys/PTOAS:mainZhendong404:fa-demo-restack