Fix uninitialized variable in float_to_bf16_rtn_asm() causing incorrect rounding under -O3 #3715

Open
zhyajie wants to merge 1 commit into ROCm:develop from zhyajie:fix/bf16-asm-uninitialized-tmp

@zhyajie zhyajie commented Feb 24, 2026

Summary

Initialize the tmp variable in float_to_bf16_rtn_asm(). Left uninitialized, it lets the compiler assign tmp and an input operand to the same register under -O3, which produces wrong BF16 conversion results for ~50% of inputs.

Problem

float_to_bf16_rtn_asm() in bfloat16.hpp (used when CK_TILE_FLOAT_TO_BFLOAT16_DEFAULT=3 / standard_asm) produces incorrect Round-to-Nearest-Even results when compiled with -O3.
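For reference, the expected Round-to-Nearest-Even behavior can be sketched on the host. This is a minimal sketch, not the CK implementation: the function name float_to_bf16_rne_ref is hypothetical, and the 0x7fc0 NaN result is an assumption chosen only to keep NaN inputs away from the bias add.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical host-side reference for FP32 -> BF16 round-to-nearest-even.
// Not taken from bfloat16.hpp; intended only as a comparison baseline.
inline uint16_t float_to_bf16_rne_ref(float f)
{
    if(std::isnan(f))
    {
        // Assumption: any quiet-NaN bit pattern is acceptable for this sketch.
        return 0x7fc0u;
    }

    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits)); // reinterpret the float as raw bits

    // lsb is the lowest bit that survives truncation to 16 bits.
    const uint32_t lsb = (bits >> 16) & 1u;

    // Adding lsb + 0x7fff rounds up above the halfway point and
    // resolves exact ties toward the even result.
    return static_cast<uint16_t>((bits + lsb + 0x7fffu) >> 16);
}
```

Exact ties show the "even" part: 1.00390625f (0x3f808000) stays at 0x3f80, while 1.01171875f (0x3f818000) rounds up to 0x3f82.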

Environment: hipcc (AMD clang 19.0.0, ROCm 6.4.3), -O3, gfx942

Root Cause

The inline assembly declares tmp with a "+v" (read+write) constraint but never initializes it:

uint32_t tmp;  // uninitialized
asm volatile("..."
    : "=s"(check_nan), "+v"(tmp), "+v"(u.fp32)   // %0, %1, %2
    : "v"(ROUND_BIAS_FOR_BF16), "v"(FP32_NAN));  // %3, %4

Under -O3, the compiler's register allocator treats tmp (%1) as having an undefined initial value and aggressively reuses registers, assigning %1 (tmp) and %3 (ROUND_BIAS_FOR_BF16 = 0x7fff) to the same VGPR.

-O3 generated assembly (BROKEN): %1 and %3 both mapped to v5

v_bfe_u32 v5, v4, 16, 1         ; v5 = lsb (overwrites the 0x7fff value!)
v_add3_u32 v5, v4, v5, v5       ; v5 = bits + lsb + lsb  (should be bits + lsb + 0x7fff)

-O0 generated assembly (CORRECT): %1 mapped to v9, %3 mapped to v12

v_bfe_u32 v9, v8, 16, 1         ; v9 = lsb
v_add3_u32 v9, v8, v9, v12      ; v9 = bits + lsb + 0x7fff

The v_bfe_u32 instruction writes to %1, destroying the value of %3 when they share the same register. This breaks the v_add3_u32 rounding computation, causing ~50% of FP32 to BF16 conversions to be off by 1 ULP.
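The effect of the aliasing on the arithmetic can be reproduced on the host without any inline assembly. The sketch below models only the two v_add3_u32 sums shown above (bits + lsb + 0x7fff versus bits + lsb + lsb) and counts how often they disagree. The function names are hypothetical, and NaN handling is left out.

```cpp
#include <cstdint>

// Intended rounding step: bits + lsb + 0x7fff, then keep the upper 16 bits.
inline uint16_t round_intended(uint32_t bits)
{
    const uint32_t lsb = (bits >> 16) & 1u;
    return static_cast<uint16_t>((bits + lsb + 0x7fffu) >> 16);
}

// Aliased case: the register that should hold 0x7fff holds lsb instead,
// so the sum becomes bits + lsb + lsb.
inline uint16_t round_aliased(uint32_t bits)
{
    const uint32_t lsb = (bits >> 16) & 1u;
    return static_cast<uint16_t>((bits + lsb + lsb) >> 16);
}

// For one fixed upper half, count how many of the 65536 possible
// lower halves produce different BF16 results in the two variants.
inline uint32_t count_mismatches(uint16_t upper)
{
    uint32_t mismatches = 0;
    for(uint32_t low = 0; low <= 0xffffu; ++low)
    {
        const uint32_t bits = (static_cast<uint32_t>(upper) << 16) | low;
        if(round_intended(bits) != round_aliased(bits))
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

For the upper half 0x3f80, count_mismatches returns 32767 out of 65536, and for 0x3f81 it returns 32766, which is consistent with the ~50% figure: the aliased sum behaves almost like truncation, so nearly every input that should round up is off by one.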

…s under -O3

Initialize `tmp` to 0 in the inline assembly of `float_to_bf16_rtn_asm()`.
Without initialization, the compiler under -O3 may alias the `tmp` operand
(%1) with the ROUND_BIAS_FOR_BF16 input operand (%3) in the same VGPR,
causing v_bfe_u32 to overwrite the 0x7fff bias before v_add3_u32 reads it.
This produces incorrect BF16 rounding for ~50% of inputs.