Skip to content

GH-49310: [C++][Compute] Fix segmentation fault in pyarrow.compute.if_else#49375

Open
Anakin100100 wants to merge 3 commits intoapache:mainfrom
Anakin100100:fix/compute-if-else-array-overflow
Open

GH-49310: [C++][Compute] Fix segmentation fault in pyarrow.compute.if_else#49375
Anakin100100 wants to merge 3 commits intoapache:mainfrom
Anakin100100:fix/compute-if-else-array-overflow

Conversation

@Anakin100100
Copy link
Contributor

@Anakin100100 Anakin100100 commented Feb 23, 2026

Rationale for this change

Fixing #49310

What changes are included in this PR?

I added a check for the if else compute kernels that verify that the result will fit in the offset of the result. The check asserts that no matter which values are chosen from either the left or right arrays/scalars there will be no overflow in order to use the UnsafeAppend which skips almost all safety checks.

Are these changes tested?

Yes, There are tests for each of the kernels that include an array that the values that would overflow the 32 bit index trigger the check. These tests don't actually allocate the data to save on time and compute. There are also two tests that verify an overflow for a 32 bit offset type and trigger an error and the second one that ensures that for a data type with a 64 bit offset there is no overflow error.

Are there any user-facing changes?

Yes, if the result may not fit in a type that relies on 32 bit offsets the user will receive a helpful error message instead of a segfault.

@github-actions
Copy link

⚠️ GitHub issue #49310 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @Anakin100100 . Some comments below.

auto right_data_length = right_offsets[right.length] - right_offsets[0];
int64_t right_data_length = static_cast<int64_t>(right_offsets[right.length]) -
static_cast<int64_t>(right_offsets[0]);
ARROW_RETURN_NOT_OK(ValidateCapacityForOffsetType(right_data_length));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is required? We know right already fits in 2GiB, so the result here cannot possibly overflow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if the array Already holds 2GB then assigning from the scalar won't overflow,

CheckIfElseOutput(cond, left, right, expected_data);
}

std::shared_ptr<Array> MakeOversizedBuffer(int32_t data_length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is creating an array rather than a buffer, perhaps change the name to reflect that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

Comment on lines 615 to 616
return MakeArray(
ArrayData::Make(binary(), 1, {nullptr, std::move(offsets), std::move(values)}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so the array is technically invalid as the data buffer is too short, add a comment mentioning that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll document this

CallFunction("if_else", {cond, left, right}));
}

TEST_F(TestIfElseKernel, IfElseLargeStringAAADoesNotCapacityOverflow) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can re-use a common helper with the test above, to cut down on code duplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

MakeAllocatedVarBinaryArray(large_string_type, kPerSide));

auto maybe_out = CallFunction("if_else", {cond, left, right});
ASSERT_FALSE(maybe_out.status().IsCapacityError()) << maybe_out.status();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should simply be ok, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that check can just be ASSERT_OK

constexpr int32_t capacity_limit = std::numeric_limits<int32_t>::max();

auto cond = ArrayFromJSON(boolean(), "[true]");
auto left = MakeOversizedBuffer(capacity_limit);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use MakeAllocatedVarBinaryArray?

Copy link
Contributor Author

@Anakin100100 Anakin100100 Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted the first helper

ASSERT_FALSE(maybe_out.status().IsCapacityError()) << maybe_out.status();
}

TEST_F(TestIfElseKernel, IfElseBinaryCapacityErrorAAA) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test checking something different than IfElseStringAAAAllocatedDataCapacityError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we checked that if the sum of results could overflow then checking that if they are both at capacity also triggers the check isn't really needed

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 24, 2026
@Anakin100100 Anakin100100 requested a review from pitrou February 28, 2026 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants