
perf(metax): vectorize elementwise kernel for contiguous aligned tensors #1353

Open
LindseyMei wants to merge 1 commit into InfiniTensor:main from LindseyMei:feat/metax-elementwise-vec

MetaX Elementwise Vectorization Performance Report

Environment

  • GPU: MetaX C500 64GB
  • MACA: 3.3.0.15
  • Repository: InfiniCore
  • File changed: src/infiniop/elementwise/metax/elementwise_metax.h
  • Benchmark script: vec_bench.py (silu, synchronized timing)

What changed

Added a 16-byte contiguous vector fast path to the shared MetaX elementwise kernel. When output and all inputs are contiguous, aligned, non-broadcasted, and use the same floating-point dtype, the kernel loads/stores float4 / Pack<half,8> / Pack<cuda_bfloat16,8> / Pack<double,2> packs and applies the existing scalar Op{} per component. Strided, broadcasted, unaligned, integer, or mixed-dtype cases fall back to the original scalar kernel unchanged.
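
To illustrate the idea, here is a minimal host-side C++ sketch of the pack-and-apply pattern described above. It is not the code from elementwise_metax.h: `Pack`, `SiluOp`, `apply_scalar`, and `apply_packed` are hypothetical names, the scalar tail loop is an assumption about how leftover elements might be handled, and a device kernel would typically use a single aligned 16-byte load where this sketch uses `std::memcpy`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>

// Hypothetical 16-byte pack type; the PR's actual Pack may be defined differently.
template <typename T, int N>
struct alignas(sizeof(T) * N) Pack {
    T v[N];
};

// Scalar functor in the style of the existing Op{}: one element in, one element out.
struct SiluOp {
    float operator()(float x) const { return x / (1.0f + std::exp(-x)); }
};

// Reference scalar path: one element per iteration.
template <typename Op>
void apply_scalar(float *out, const float *in, std::size_t n, Op op) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = op(in[i]);
    }
}

// Packed path: move 16 bytes (4 floats) per iteration and apply the
// unchanged scalar functor to each component of the pack.
template <typename Op>
void apply_packed(float *out, const float *in, std::size_t n, Op op) {
    using P = Pack<float, 4>;
    static_assert(sizeof(P) == 16, "pack must be exactly 16 bytes");
    const std::size_t n_packs = n / 4;
    for (std::size_t p = 0; p < n_packs; ++p) {
        P a, r;
        std::memcpy(&a, in + 4 * p, sizeof(P));   // stands in for one wide load
        for (int k = 0; k < 4; ++k) {
            r.v[k] = op(a.v[k]);
        }
        std::memcpy(out + 4 * p, &r, sizeof(P));  // stands in for one wide store
    }
    // Assumed tail handling: leftover elements go through the scalar functor.
    for (std::size_t i = 4 * n_packs; i < n; ++i) {
        out[i] = op(in[i]);
    }
}
```

Because the functor is reused unchanged, the packed path should produce the same values as the scalar path; only the width of each memory transaction differs.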

Correctness

Regression tests passed on MetaX:

  • silu: PASS
  • add: PASS
  • mul: PASS
  • reciprocal: PASS
  • gelu: PASS
  • swiglu: PASS
  • clip: PASS

tanh.py failed with INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED because the MetaX registration in tanh/operator.cc is commented out on this branch; this is unrelated to this change. hardtanh.py has a pre-existing GPU crash in the scalar kernel, which is also unrelated.

Performance (silu)

Scalar baseline (before vectorization)

| shape | dtype | lib_ms | Gelem/s | GB/s |
|---|---|---|---|---|
| (4096, 4096) | F16 | 0.399 | 42.05 | 168.2 |
| (4096, 4096) | F32 | 0.403 | 41.66 | 333.3 |
| (4096, 4096) | BF16 | 0.371 | 45.25 | 181.0 |
| (8192, 8192) | F16 | 1.156 | 58.05 | 232.2 |
| (8192, 8192) | F32 | 1.222 | 54.93 | 439.4 |
| (8192, 8192) | BF16 | 1.150 | 58.37 | 233.5 |
| (16384, 16384) | F16 | 4.208 | 63.79 | 255.1 |
| (16384, 16384) | F32 | 4.555 | 58.94 | 471.5 |
| (16384, 16384) | BF16 | 4.210 | 63.77 | 255.1 |

Vectorized (after)

| shape | dtype | lib_ms | Gelem/s | GB/s | speedup |
|---|---|---|---|---|---|
| (4096, 4096) | F16 | 0.146 | 114.66 | 458.6 | 2.73x |
| (4096, 4096) | F32 | 0.151 | 110.76 | 886.1 | 2.66x |
| (4096, 4096) | BF16 | 0.143 | 117.18 | 468.7 | 2.59x |
| (8192, 8192) | F16 | 0.352 | 190.87 | 763.5 | 3.29x |
| (8192, 8192) | F32 | 0.458 | 146.60 | 1172.8 | 2.67x |
| (8192, 8192) | BF16 | 0.405 | 165.53 | 662.1 | 2.84x |
| (16384, 16384) | F16 | 1.100 | 243.98 | 975.9 | 3.82x |
| (16384, 16384) | F32 | 1.516 | 177.01 | 1416.1 | 3.00x |
| (16384, 16384) | BF16 | 1.246 | 215.44 | 861.8 | 3.38x |
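
For readers checking the tables, the columns appear to be related as follows: Gelem/s is the element count divided by the kernel time, and GB/s matches Gelem/s multiplied by twice the element size, i.e. one read plus one write per element. The C++ sketch below reproduces that arithmetic; the function names are hypothetical and not taken from vec_bench.py. Values recomputed from the rounded lib_ms figures can differ slightly from the table (for example about 177.07 versus 177.01 Gelem/s for F32 at 16384 x 16384), presumably because the script works from unrounded timings.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Throughput in Gelem/s: elements processed per second, divided by 1e9.
double gelems_per_s(std::int64_t rows, std::int64_t cols, double ms) {
    const double elements = static_cast<double>(rows * cols);
    return elements / (ms * 1e-3) / 1e9;
}

// Effective bandwidth in GB/s, assuming each element is read once and
// written once (true for a unary op such as silu).
double effective_gb_per_s(double gelems, int dtype_bytes) {
    return gelems * 2.0 * static_cast<double>(dtype_bytes);
}

// Speedup of the vectorized kernel over the scalar baseline.
double speedup(double scalar_ms, double vector_ms) {
    return scalar_ms / vector_ms;
}
```

For the F32 16384 x 16384 row, `effective_gb_per_s(177.01, 4)` gives about 1416.1 GB/s and `speedup(4.555, 1.516)` gives about 3.00, matching the table.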

Observations

  • Large contiguous tensors (16384 x 16384) show a 3.0-3.8x speedup over the scalar kernel.
  • F32 effective bandwidth reaches ~1.4 TB/s on the C500, suggesting the vector path uses a large fraction of device memory bandwidth.
  • Smaller tensors (4096 x 4096) show ~2.6-2.7x, likely because fixed kernel-launch overhead accounts for a larger share of the runtime.
  • All fallback cases (strided, broadcast, integer dtypes, mixed dtype) continue to use the original scalar path and were verified to pass correctness tests.

Known limitations

  • Only same-dtype floating-point elementwise ops are vectorized. Integer/bool/fp8 and mixed-dtype ops use the scalar fallback.
  • Input and output pointers must be 16-byte aligned. Tensors freshly allocated by PyTorch's default CUDA allocator typically satisfy this; views that start at an offset into a buffer may not, and then take the scalar fallback.
  • The MACA compiler emits lld warnings about #pragma unroll being unable to unroll some loops; these are non-fatal and do not affect correctness or the measured speedup.
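
The eligibility conditions listed above can be pictured as a single host-side predicate. The following C++ sketch is an assumption about how such a check could look, not the dispatch code in this PR: `DType`, `TensorView`, and `can_vectorize` are hypothetical, and the real implementation presumably derives contiguity and broadcast flags from its own tensor metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical dtype subset, for illustration only.
enum class DType { F16, BF16, F32, F64, I32 };

// Hypothetical tensor descriptor carrying just what the check needs.
struct TensorView {
    const void *data;
    DType dtype;
    bool contiguous;
    bool broadcasted;
};

// True if the pointer sits on a 16-byte boundary.
inline bool is_aligned16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Only floating-point dtypes take the vector path in this sketch.
inline bool is_float_dtype(DType d) {
    return d == DType::F16 || d == DType::BF16 || d == DType::F32 || d == DType::F64;
}

// A tensor qualifies if it is contiguous, not broadcasted, and aligned.
inline bool view_ok(const TensorView &t) {
    return t.contiguous && !t.broadcasted && is_aligned16(t.data);
}

// The vector path is taken only when the output and every input qualify
// and all of them share one floating-point dtype; otherwise use the scalar kernel.
inline bool can_vectorize(const TensorView &out, const TensorView *inputs, std::size_t n_inputs) {
    if (!is_float_dtype(out.dtype) || !view_ok(out)) {
        return false;
    }
    for (std::size_t i = 0; i < n_inputs; ++i) {
        if (inputs[i].dtype != out.dtype || !view_ok(inputs[i])) {
            return false;
        }
    }
    return true;
}
```

A view that starts one float into an aligned buffer fails `is_aligned16`, so a predicate of this shape would route it to the scalar fallback.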

Next steps

  • Apply the same pattern to the NVIDIA/CUDA elementwise backend for parity.
  • Extend to integer types (int32_t, int64_t) with 16-byte packs once validated.
  • Investigate whether compute-bound operators (e.g. gelu, swiglu) can additionally benefit from vector ALU intrinsics (half2).

Add a 16-byte vector fast path to the shared MetaX elementwise template.
When output and all inputs are contiguous, aligned, non-broadcasted, and
share the same floating-point dtype, load/store packed values and apply
the existing scalar Op functor per component. Falls back to the original
scalar kernel for all other cases.

Supported dtypes: float (float4), half (Pack<half,8>),
cuda_bfloat16 (Pack<cuda_bfloat16,8>), double (double2).
Integer/bool/fp8 and mixed-dtype ops continue to use the scalar path.

Benchmark (silu, MetaX C500):
- F32 16384x16384: ~59 Gelem/s -> ~177 Gelem/s (~3x)
- F16 16384x16384: ~64 Gelem/s -> ~244 Gelem/s (~3.8x)
- BF16 16384x16384: ~64 Gelem/s -> ~215 Gelem/s (~3.4x)

Regression tests passed: silu, add, mul, reciprocal, gelu, swiglu, clip.

Signed-off-by: LindseyMei <648816901@qq.com>
LindseyMei requested a review from a team on June 30, 2026 10:27