[TIRX] Bind parallel loops to GPU threads before VerifyMemory #19363
zhils wants to merge 4 commits into apache:main from
Conversation
`VerifyMemory` on GPU targets treats direct accesses outside thread environments as illegal. In the ScatterValue CUDA lowering path, `topi.scatter_elements` emits `ForKind::kParallel` loops without explicit thread bindings, which triggers false host-memory access failures (e.g. "Did you forget to bind?") during TIR verification. This change adds a new `tirx` pass (`BindParallelLoopsToThreads`) and inserts it before `VerifyMemory` in the `s_tir` pipelines (including adreno). The pass rewrites parallel loops into `blockIdx.x/threadIdx.x` thread-extent regions, substitutes loop vars with global thread indices, and adds bounds checks for non-divisible extents. This preserves correctness while ensuring GPU kernels pass memory verification for this path.
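The rewrite described above can be sketched in plain Python (this is an illustration, not actual TIR; `THREADS_PER_BLOCK` and the helper names are assumed for the example, not taken from the pass):

```python
# Sketch of the index mapping that BindParallelLoopsToThreads produces:
# a parallel loop of extent N becomes a blockIdx.x/threadIdx.x grid, with
# a bounds check guarding the body for non-divisible extents.
import math

THREADS_PER_BLOCK = 256  # assumed threadIdx.x extent for illustration

def simulate_bound_loop(extent, body):
    """Run body(i) for i in [0, extent) the way the rewritten kernel would."""
    num_blocks = math.ceil(extent / THREADS_PER_BLOCK)  # blockIdx.x extent
    for block_idx in range(num_blocks):
        for thread_idx in range(THREADS_PER_BLOCK):
            # Loop var is substituted with the global thread index:
            global_idx = block_idx * THREADS_PER_BLOCK + thread_idx
            # Bounds check inserted because extent need not divide evenly:
            if global_idx < extent:
                body(global_idx)

visited = []
simulate_bound_loop(1000, visited.append)  # 1000 is not divisible by 256
assert visited == list(range(1000))        # every index visited exactly once
```

The guard is what makes the rewrite safe when `extent % THREADS_PER_BLOCK != 0`: trailing threads in the last block simply do nothing.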
Code Review
This pull request introduces the `BindParallelLoopsToThreads` pass, which converts `ForKind::kParallel` loops into GPU block and thread bindings, and integrates this pass into the S-TIR pipelines. Additionally, it provides a configuration option to allow unsupported host compilers for NVCC on Windows and adds a functional test for scatter operations on CUDA. Review feedback identifies a critical issue regarding the handling of nested parallel loops, which could lead to invalid GPU register bindings, an inconsistency in GPU device type definitions between files, and a minor code redundancy in the loop variable substitution logic.
Fix three correctness/configuration issues in the GPU parallel-loop binding path used before `VerifyMemory`. First, preserve non-zero loop mins by mapping parallel indices as `min + global_idx` instead of `global_idx`. Second, avoid rewriting parallel loops when already inside a thread environment, to prevent invalid nested bindings. Third, register `cuda.nvcc_allow_unsupported_compiler` as a valid `PassContext` key so the NVCC workaround can be enabled via config without raising "Invalid config option".

Made-with: Cursor
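The first fix can be sketched in plain Python (illustrative only; the function name is ours, not the pass's):

```python
# Sketch of the min-preserving index mapping: a parallel loop with a
# non-zero min must visit [loop_min, loop_min + extent), so the loop
# variable is substituted with loop_min + global_idx, not global_idx.
def mapped_indices(loop_min, extent):
    """Indices a thread-bound loop should visit."""
    return [loop_min + global_idx for global_idx in range(extent)]

# A parallel loop over i in [5, 13) must still start at 5:
assert mapped_indices(5, 8) == [5, 6, 7, 8, 9, 10, 11, 12]
# With the bug (dropping the min), the kernel would wrongly visit [0, 8).
```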
- Add `kDLWebGPU` to `IsGPUDevice` in `verify_memory.cc`
- Remove redundant `Var` wrapper in `loop_partition.cc`
- Fix nested parallel loop handling in `bind_parallel_loops_to_threads.cc`
tlopex
left a comment
A few things to address here:
- Please remove `.tmp_scatter_cuda_check.py`. This looks like a local debugging script and should not be committed. It should be replaced with a proper pytest test instead.
- This PR currently mixes several unrelated changes: the NVCC `-allow-unsupported-compiler` workaround, the `loop_partition.cc` `Var{}` fix, and the `kDLWebGPU` change in `verify_memory.cc` are all independent and should be split into separate PRs.
- Please clarify the semantics for nested parallel loops. When `in_parallel_loop_` is already true, inner parallel loops are left as `kParallel`, but once they are inside a GPU kernel they effectively become serial. If that is intentional, it would be good to document it explicitly; otherwise it may be better to reject this case.
- The `else` branch in `IfThenElse` looks unnecessary. `Evaluate(IntImm(DataType::Int(32), 0))` is a no-op, and TIR already supports `IfThenElse` without an `else`, so this can just be `IfThenElse(global_idx < extent, mapped_body)`.
- For target-less `PrimFunc`s, please avoid falling back to `Target::Current()`. If the function has no target attribute, it is better to leave it unchanged rather than guessing from ambient context. That would also be consistent with how other recent passes handle target-less functions.
- This PR also needs proper pytest coverage for the new pass.