perf(metax): vectorize elementwise kernel for contiguous aligned tensors by LindseyMei · Pull Request #1353 · InfiniTensor/InfiniCore

LindseyMei · 2026-06-30T10:27:24Z

MetaX Elementwise Vectorization Performance Report

Environment

GPU: MetaX C500 64GB
MACA: 3.3.0.15
Repository: InfiniCore
File changed: src/infiniop/elementwise/metax/elementwise_metax.h
Benchmark script: vec_bench.py (silu, synchronized timing)

What changed

Added a 16-byte contiguous vector fast path to the shared MetaX elementwise kernel. When output and all inputs are contiguous, aligned, non-broadcasted, and use the same floating-point dtype, the kernel loads/stores float4 / Pack<half,8> / Pack<cuda_bfloat16,8> / Pack<double,2> packs and applies the existing scalar Op{} per component. Strided, broadcasted, unaligned, integer, or mixed-dtype cases fall back to the original scalar kernel unchanged.
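The pack-load, per-component apply, pack-store pattern described above can be illustrated with a minimal host-side C++ sketch. It is not the PR's device kernel: the `Pack` layout, the `SiluOp` functor, the `elementwise_packed` name, and the scalar tail loop are all assumptions made for illustration, and the real code in elementwise_metax.h may handle indexing and leftover elements differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical 16-byte pack of N elements of type T (assumed layout).
template <typename T, int N>
struct alignas(sizeof(T) * N) Pack {
    T v[N];
};

// Stand-in for the existing scalar functor: silu(x) = x * sigmoid(x).
struct SiluOp {
    float operator()(float x) const { return x / (1.0f + std::exp(-x)); }
};

// Processes n elements: whole packs first, then a scalar tail (assumed).
template <typename T, int N, typename Op>
void elementwise_packed(T *out, const T *in, std::size_t n, Op op) {
    static_assert(sizeof(Pack<T, N>) == 16, "pack must be 16 bytes wide");
    assert(out != nullptr && in != nullptr);

    const std::size_t n_packs = n / N;
    for (std::size_t p = 0; p < n_packs; ++p) {
        Pack<T, N> pk;
        std::memcpy(&pk, in + p * N, sizeof(pk));   // one 16-byte load
        for (int i = 0; i < N; ++i) {
            pk.v[i] = op(pk.v[i]);                  // scalar op per component
        }
        std::memcpy(out + p * N, &pk, sizeof(pk));  // one 16-byte store
    }
    // Elements that do not fill a whole pack go through the scalar op.
    for (std::size_t i = n_packs * N; i < n; ++i) {
        out[i] = op(in[i]);
    }
}
```

Calling `elementwise_packed<float, 4>(y, x, n, SiluOp{})` corresponds to the float4 case; on a GPU each thread would handle one pack rather than looping over all of them.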

Correctness

Regression tests passed on MetaX:

silu: PASS
add: PASS
mul: PASS
reciprocal: PASS
gelu: PASS
swiglu: PASS
clip: PASS

Two failures are unrelated to this change. tanh.py failed with INFINI_STATUS_DEVICE_TYPE_NOT_SUPPORTED because the MetaX registration in tanh/operator.cc is commented out on this branch. hardtanh.py has a pre-existing GPU crash in the scalar kernel.

Performance (silu)

Scalar baseline (before vectorization)

shape	dtype	lib_ms	Gelem/s	GB/s
(4096, 4096)	F16	0.399	42.05	168.2
(4096, 4096)	F32	0.403	41.66	333.3
(4096, 4096)	BF16	0.371	45.25	181.0
(8192, 8192)	F16	1.156	58.05	232.2
(8192, 8192)	F32	1.222	54.93	439.4
(8192, 8192)	BF16	1.150	58.37	233.5
(16384, 16384)	F16	4.208	63.79	255.1
(16384, 16384)	F32	4.555	58.94	471.5
(16384, 16384)	BF16	4.210	63.77	255.1

Vectorized (after)

shape	dtype	lib_ms	Gelem/s	GB/s	speedup
(4096, 4096)	F16	0.146	114.66	458.6	2.73x
(4096, 4096)	F32	0.151	110.76	886.1	2.66x
(4096, 4096)	BF16	0.143	117.18	468.7	2.59x
(8192, 8192)	F16	0.352	190.87	763.5	3.29x
(8192, 8192)	F32	0.458	146.60	1172.8	2.67x
(8192, 8192)	BF16	0.405	165.53	662.1	2.84x
(16384, 16384)	F16	1.100	243.98	975.9	3.82x
(16384, 16384)	F32	1.516	177.01	1416.1	3.00x
(16384, 16384)	BF16	1.246	215.44	861.8	3.38x
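The derived columns in both tables can be reproduced from lib_ms. The sketch below assumes one read and one write per element, which is consistent with the reported figures (for example 42.05 Gelem/s x 4 bytes = 168.2 GB/s for F16). The struct and function names are hypothetical and not taken from vec_bench.py.

```cpp
#include <cstddef>

struct Throughput {
    double gelem_per_s;  // billions of elements processed per second
    double gb_per_s;     // effective bandwidth, read + write traffic
};

// Unary elementwise op: each element is read once and written once.
inline Throughput throughput(std::size_t rows, std::size_t cols,
                             std::size_t elem_bytes, double lib_ms) {
    const double elems = static_cast<double>(rows) * static_cast<double>(cols);
    const double seconds = lib_ms * 1e-3;
    const double gelem = elems / seconds / 1e9;
    return {gelem, gelem * 2.0 * static_cast<double>(elem_bytes)};
}

// Speedup is the ratio of scalar to vectorized time for the same case.
inline double speedup(double scalar_ms, double vector_ms) {
    return scalar_ms / vector_ms;
}
```

Because lib_ms is shown with three decimals, values recomputed this way can differ from the table in the last digit (for example the (4096, 4096) F32 speedup is 2.67x from the displayed times versus 2.66x as listed).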

Observations

The largest contiguous tensors (16384 x 16384) show 3.0-3.8x speedup over the scalar kernel.
F32 effective bandwidth reaches ~1.4 TB/s on the C500 (1416.1 GB/s at 16384 x 16384, up from 471.5 GB/s), indicating the vector path uses a large fraction of device memory bandwidth.
Smaller tensors (4Kx4K) show ~2.6-2.7x; fixed kernel-launch overhead reduces the relative gain.
All fallback cases (strided, broadcast, integer dtypes, mixed dtype) continue to use the original scalar path and were verified to pass correctness tests.

Known limitations

Only same-dtype floating-point elementwise ops are vectorized. Integer/bool/fp8 and mixed-dtype ops use the scalar fallback.
Input and output pointers must be 16-byte aligned. PyTorch CUDA allocations typically satisfy this for tensors created with the default allocator.
The MACA compiler emits lld warnings about #pragma unroll being unable to unroll some loops; these are non-fatal and do not affect correctness or the measured speedup.
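The eligibility conditions listed above (same floating-point dtype, contiguous, non-broadcasted, 16-byte aligned) amount to a host-side predicate evaluated before launch. The following is a sketch under stated assumptions: `DType`, `TensorView`, and `can_use_vector_path` are hypothetical names, not InfiniCore's actual descriptor types or functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dtype tag; the real library uses its own enum.
enum class DType { F16, BF16, F32, F64, I32, I64, Bool };

// Hypothetical minimal tensor descriptor used only for this sketch.
struct TensorView {
    const void *data;
    DType dtype;
    bool contiguous;
    bool broadcasted;
};

inline bool is_floating(DType d) {
    return d == DType::F16 || d == DType::BF16 ||
           d == DType::F32 || d == DType::F64;
}

inline bool aligned16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// True only when every condition for the packed path holds;
// any failure means the caller should use the scalar kernel.
inline bool can_use_vector_path(const TensorView &out,
                                const std::vector<TensorView> &inputs) {
    if (!is_floating(out.dtype) || !out.contiguous || !aligned16(out.data)) {
        return false;
    }
    for (const TensorView &t : inputs) {
        if (t.dtype != out.dtype || !t.contiguous || t.broadcasted ||
            !aligned16(t.data)) {
            return false;
        }
    }
    return true;
}
```

A view that starts at a non-multiple-of-16 byte offset into an otherwise aligned allocation fails `aligned16` and takes the scalar path, even though the underlying buffer is aligned.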

Next steps

Apply the same pattern to the NVIDIA/CUDA elementwise backend for parity.
Extend to integer types (int32_t, int64_t) with 16-byte packs once validated.
Investigate whether compute-bound operators (e.g. gelu, swiglu) can additionally benefit from vector ALU intrinsics (half2).

Commit message:

Add a 16-byte vector fast path to the shared MetaX elementwise template. When output and all inputs are contiguous, aligned, non-broadcasted, and share the same floating-point dtype, load/store packed values and apply the existing scalar Op functor per component. Falls back to the original scalar kernel for all other cases.

Supported dtypes: float (float4), half (Pack<half,8>), cuda_bfloat16 (Pack<cuda_bfloat16,8>), double (double2). Integer/bool/fp8 and mixed-dtype ops continue to use the scalar path.

Benchmark (silu, MetaX C500):
- F32 16384x16384: ~59 Gelem/s -> ~177 Gelem/s (~3x)
- F16 16384x16384: ~64 Gelem/s -> ~244 Gelem/s (~3.8x)
- BF16 16384x16384: ~64 Gelem/s -> ~215 Gelem/s (~3.4x)

Regression tests passed: silu, add, mul, reciprocal, gelu, swiglu, clip.

Signed-off-by: LindseyMei <648816901@qq.com>

LindseyMei requested a review from a team June 30, 2026 10:27

LindseyMei wants to merge 1 commit into InfiniTensor:main from LindseyMei:feat/metax-elementwise-vec
