Skip to content

Three Hexagon bugs found by Gemini Pro.#8989

Open
mcourteaux wants to merge 3 commits intohalide:mainfrom
mcourteaux:hvx-bugs
Open

Three Hexagon bugs found by Gemini Pro.#8989
mcourteaux wants to merge 3 commits intohalide:mainfrom
mcourteaux:hvx-bugs

Conversation

@mcourteaux
Copy link
Contributor

@mcourteaux mcourteaux commented Mar 6, 2026

Below the relevant pieces of explanation of the LLM:


The Context

When Halide deinterleaves an RGB vector (stride 3), it translates to chained shuffle_vectors operations. Because a stride of 3 does not fit the power-of-two permutations required by Hexagon's butterfly network (vdelta/vrdelta), shuffle_vectors falls back to the generalized table lookup function: vlut(lut, indices).

For standard Hexagon operations (like 128-byte vectors), the concatenated input vectors (a and b) result in a lut containing 256 elements. Deinterleaving might even push this max_index higher if double-vectors are involved.

The Bugs

Inside CodeGen_Hexagon::vlut(Value *lut, Value *idx, int min_index, int max_index), when max_index >= 256, the code breaks the lookup into manageable 256-element chunks. However, there are two critical off-by-one errors in the chunking loop:

1. Truncating the Highest Index (range_extent_i vs size)

// Existing buggy code
int range_extent_i = std::min(max_index - min_index_i, 255);
Value *range_i = vlut256(slice_vector(lut, min_index_i, range_extent_i),
                         indices, 0, range_extent_i);

Here, range_extent_i is calculated to be a maximum of 255. It is then passed as the size argument to slice_vector.
If you slice 255 elements, the valid indices for that slice are 0 through 254. The element at relative index 255 is completely dropped.
When a vector lane requests index 255, it attempts an out-of-bounds access on the 255-element slice, loading garbage/undef memory into that lane.

2. Premature Loop Termination (Skipping the Final Chunk)

// Existing buggy code
for (int min_index_i = 0; min_index_i < max_index; min_index_i += 256) {

The loop condition uses strictly less-than (<). If max_index happens to be exactly a multiple of 256 (e.g., exactly 256), the loop reaches min_index_i = 256, evaluates 256 < 256 as false, and exits.
Because the chunk responsible for indices 256–511 is never generated, any lane requesting index 256 relies on the first chunk (0–255). But wait! The first chunk truncates 16-bit indices to 8-bit (256 & 255 = 0). Consequently, the lane requesting index 256 will silently read the value for index 0.

The Fix

To fix the interleave_rgb failure, the vlut chunking loop needs to be adjusted so that it operates inclusively (<= max_index) and slices the correct capacity (local_max_index + 1):

// Corrected loop
for (int min_index_i = 0; min_index_i <= max_index; min_index_i += 256) {
    // Make a vector of the indices shifted such that the min of
    // this range is at 0. Use 16-bit indices for this.
    Value *min_index_i_val = create_vector(i16x_t, min_index_i);
    Value *indices = builder->CreateSub(idx16, min_index_i_val);

    // Create a condition value for which elements of the range are valid
    // for this index.
    Value *use_index = builder->CreateICmpSGT(indices, minus_one);

    // After we've eliminated the invalid elements, we can
    // truncate to 8 bits, as vlut requires.
    indices = call_intrin(i8x_t, "halide.hexagon.pack.vh", {indices});

    // FIX: Calculate the max local index (up to 255) and the required slice size (+ 1)
    int local_max_index = std::min(max_index - min_index_i, 255);
    int slice_size = local_max_index + 1;
    
    Value *range_i = vlut256(slice_vector(lut, min_index_i, slice_size),
                             indices, 0, local_max_index);
    ranges.emplace_back(range_i, use_index);
}

This perfectly aligns the size of the sliced vector with the max_index expected internally by vlut256, preventing both the garbage memory reads and the index aliasing.


Based on an analysis of the provided Halide Hexagon backend code, there are two critical bugs and one missed optimization (potential logical bug) specifically related to how Shuffles are handled.

Both primary bugs revolve around the mishandling of -1 sentinel values, which Halide uses in shuffle masks to represent undef elements (i.e., "don't care" indices).

1. min index calculation ignores undef elements, breaking optimizations

In CodeGen_Hexagon::shuffle_vectors, there is an attempt to optimize shuffles that strictly access the second vector (b) by ignoring a.

// Try to rewrite shuffles that only access the elements of b.
int min = indices[0];
for (size_t i = 1; i < indices.size(); i++) {
    if (indices[i] != -1 && indices[i] < min) {
        min = indices[i];
    }
}
if (min >= a_elements) {
    ... // Re-write optimization
}

The Bug:
If the very first element of the shuffle mask happens to be an undef (i.e. indices[0] == -1), min is initialized to -1. Because all valid indices are >= 0, the loop condition indices[i] < min evaluates to indices[i] < -1, which is always false.

Consequently, min will remain -1 forever, and the condition if (min >= a_elements) will fail (since a_elements > 0). This quietly and permanently disables the b-only rewrite optimization for any shuffle mask that starts with an undef.

The Fix:
Initialize min to a sufficiently large value (e.g., INT_MAX) instead of indices[0].


3. Missing Optimization Fallback (Delegating to the Wrong Overload)

Going back to CodeGen_Hexagon::shuffle_vectors:

if (min >= a_elements) {
    vector<int> shifted_indices(indices);
    for (int &i : shifted_indices) {
        if (i != -1) { i -= a_elements; }
    }
    return shuffle_vectors(b, shifted_indices);
}

The Bug / Missed Optimization:
Notice that it recursively calls shuffle_vectors(b, shifted_indices). That is the two-argument overload. However, CodeGen_Hexagon only implements the three-argument override shuffle_vectors(Value *a, Value *b, ...).

Because of the using CodeGen_Posix::shuffle_vectors; directive at the top of the class, this delegates completely out of the Hexagon backend and directly to the parent LLVM POSIX implementation. As a result, the compiler emits standard, generic LLVM IR shufflevector instructions instead of trying to further lower this single-vector shuffle using Hexagon-specific optimized instructions (like vdelta, vlut, etc.) that are mapped inside the CodeGen_Hexagon scope.

The Fix:
This should likely be updated to call shuffle_vectors(b, b, shifted_indices); (or similar) so it stays within the CodeGen_Hexagon logic.

mcourteaux and others added 2 commits March 6, 2026 17:20
Co-authored-by: Gemini 3 Pro <gemini@aistudio.google.com>
Co-authored-by: Gemini 3 Pro <gemini@aistudio.google.com>
@mcourteaux mcourteaux changed the title Two bugs found by Gemini Pro. Three bugs found by Gemini Pro. Mar 6, 2026
@mcourteaux mcourteaux changed the title Three bugs found by Gemini Pro. Three Hexagon bugs found by Gemini Pro. Mar 6, 2026
Copy link
Member

@alexreinking alexreinking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if Gemini could also derive regression tests for these bugs.

Comment on lines +1161 to 1166
int min = INT_MAX;
for (int idx : indices) {
if (idx != -1 && idx < min) {
min = idx;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly a "premature optimization is the root of all evil" situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do miss Haskell sometimes... minimum (filter (/= -1) indices)

@mcourteaux
Copy link
Contributor Author

I tried to get it to produce a test, but not much luck. It basically independently rewrote interleave_rgb, being convinced that would trigger it. I don't know if I'm missing a small difference between interleave_rgb and the new LLM-provided test, but I wouldn't expect them to behave differently. I should maybe still try that test on main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants