Skip to content

Redesign intrinsic-test to use simple comparison#2063

Draft
sayantn wants to merge 5 commits intorust-lang:mainfrom
sayantn:intrinsic-test
Draft

Redesign intrinsic-test to use simple comparison#2063
sayantn wants to merge 5 commits intorust-lang:mainfrom
sayantn:intrinsic-test

Conversation

@sayantn
Copy link
Contributor

@sayantn sayantn commented Mar 16, 2026

Currently intrinsic-test prints the outputs and then compares the outputs manually. This PR uses a different approach -- generate C wrappers for the intrinsics, link to them from Rust, and then just use simple rust tests to compare outputs

@sayantn sayantn force-pushed the intrinsic-test branch 3 times, most recently from fc52b8d to feb1dcd Compare March 16, 2026 00:27
@sayantn
Copy link
Contributor Author

sayantn commented Mar 16, 2026

---- test_vdupq_n_f16 stdout ----

thread 'test_vdupq_n_f16' (2187) panicked at mod_0/src/lib.rs:13773:17:
assertion `left == right` failed: 
  left: [NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0)]
 right: [NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(1.43e-5), NiceF16(0.0), NiceF16(-50430.0), NiceF16(1.79e-5)]

This seems weird (left is the Rust output, right is the C one, and NiceF16 is a wrapper which implements PartialEq as a == b || (a.is_nan() && b.is_nan())). This looks like ABI-related issue. For reference, the declaration looks like

unsafe extern "C" {
    fn vdup_n_f16_wrapper(value: f16) -> float16x4_t;
}

In fact most f16 tests fail in armv7. @folkertdev can you help?

Edit:

To work around this issue I have modified the tool to communicate with C via pointers (e.g. the C wrapper for _mm_add_ps looks like void _mm_add_ps_wrapper(__m128 *dst, const __m128* a, const __m128* b). This fixed the AArch64 and ARMv7 problems, but now the AArch64BE tests are failing, because apparently C and Rust have different pointer load semantics for matrix-like vectors (e.g. uint64x2x2_t) https://godbolt.org/z/j1d16z1P9

@Amanieu is this intended behavior or a bug?

@sayantn sayantn force-pushed the intrinsic-test branch 4 times, most recently from e2346ff to db1b2ca Compare March 16, 2026 06:16
@sayantn
Copy link
Contributor Author

sayantn commented Mar 16, 2026

Btw the time gains are significant, it reduces the Arm and aarch64 times to 2-3 minutes, and the full x86 run (we did 20% previously) to around 12 mins for release and 17 mins for dev

@folkertdev
Copy link
Contributor

Great work. Quick sanity check on f16, perhaps we're still using LLVM 21 to compile the C? If LLVM 21 is used (and also on windows apparently still with LLVM 22) then on some targets the ABI is inconsistent.

@sayantn
Copy link
Contributor Author

sayantn commented Mar 16, 2026

@folkertdev ooh, that makes sense. I don't particularly care about windows, but we are using LLVM20 in the CI. I can change it to use the build from kernel.org

@folkertdev
Copy link
Contributor

I'm seeing clang-18 here even https://triage.rust-lang.org/gha-logs/rust-lang/stdarch/67182959392?pr=2063. I'm not sure what the best solution is really. You could ask T-infra if they have ideas.

@sayantn
Copy link
Contributor Author

sayantn commented Mar 16, 2026

yeah, but I can use the LLVM github builds or the kernel.org builds

@sayantn sayantn force-pushed the intrinsic-test branch 2 times, most recently from ce53e81 to 76dd339 Compare March 16, 2026 20:48
@tgross35
Copy link
Contributor

Can f16 tests just be gated with #[cfg(target_has_reliable_f16)]? That's likely easier than working around the Windows and old LLVM failures.

@sayantn
Copy link
Contributor Author

sayantn commented Mar 16, 2026

@tgross35 the f16 tests are mostly fine now. More concerning is that a lot of tests are failing in all 3 arm archs, e.g. vzipq. The C version seems to return all zeros

edit: sorry, my mistake, they are still failing in ARMv7. I will gate them against the flag

@folkertdev
Copy link
Contributor

With LLVM 22 f16 should work on armv7 though?

@tgross35
Copy link
Contributor

FTZ/DAZ-related perhaps?

@sayantn
Copy link
Contributor Author

sayantn commented Mar 17, 2026

FTZ/DAZ-related perhaps?

I don't really think so, the outputs seem completely distinct.

I noticed that vzipq etc was failing so I tried out the assemblies.
In aarch64_be-unknown-linux-gnu,

use core::arch::aarch64::*;

#[unsafe(no_mangle)]
#[target_feature(enable = "neon")]
pub unsafe extern "C" fn foo(dst: *mut uint8x16x2_t, a: *const uint8x16_t, b: *const uint8x16_t) {
    unsafe {
        *dst = vzipq_u8(*a, *b);
    }
}

produces

foo:
        ld1 { v0.16b }, [x1]
        ld1 { v1.16b }, [x2]
        add x8, x0, #16
        zip1 v2.16b, v0.16b, v1.16b
        zip2 v0.16b, v0.16b, v1.16b
        st1 { v2.16b }, [x0]
        st1 { v0.16b }, [x8]
        ret

But the C code seemingly has different behavior on GCC and clang https://godbolt.org/z/T3YnrejjG

@adamgemmell can you help in this?

@adamgemmell
Copy link
Contributor

I'm not sure it will fix your issue but the difference in instructions comes from the fact that in arm_neon.h, they reverse every vector before and after the operation on big endian. It's not always actually necessary so we only do it if it's broken without it - however, the intrinsic test tool doesn't detect the difference in behaviour because both arguments it picks are identical.

e.g.:

| a               | 15 | 14 | 13 | ... | 2  | 1  | 0  |
| b               | 31 | 30 | 29 | ... | 18 | 17 | 16 |
| a = rev(a)      | 0  | 1  | 2  | ... | 13 | 14 | 15 |
| b = rev(b)      | 16 | 17 | 18 | ... | 29 | 30 | 31 |
| ret = zip(a, b) | 0  | 16 | 1  | ... | 30 | 15 | 31 |
| rev(ret)        | 31 | 15 | 30 | ... | 1  | 16 | 0  |

@adamgemmell
Copy link
Contributor

adamgemmell commented Mar 18, 2026

You can try adding big_endian_inverse: true to the definition in the yaml, regenerating to see if it changes anything. We can probably offset the pointers to the values array for arguments slightly to ensure they're different

Also I don't actually see vzipq_u8 on the latest CI run, why is that?

@sayantn
Copy link
Contributor Author

sayantn commented Mar 18, 2026

Also I don't actually see vzipq_u8 on the latest CI run, why is that?

I have no idea, I can confirm that locally the test is generated and run.

You can try adding big_endian_inverse: true to the definition in the yaml, regenerating to see if it changes anything. We can probably offset the pointers to the values array for arguments slightly to ensure they're different

I will check. Thanks

Edit: @adamgemmell adding big_endian_inverse: true to the functions seem to work. The only question remains is that is it the correct behavior, or is clang buggy here

Edit2: sorry, vzipq_u8 is not getting tested even locally. I will look into it

@adamgemmell
Copy link
Contributor

None of the unsigned variants of vzipq seem to be seen there, weird.

I'd quite like to know why this patch detects the difference - when I looked locally the codegen of the tests seemed very similar

@sayantn
Copy link
Contributor Author

sayantn commented Mar 18, 2026

Yeah I fixed the test not being included, I used / instead of div_ceil when computing the chunk sizes, so the last module wasn't getting included 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants