GH-49310: [C++][Compute] Fix segmentation fault in pyarrow.compute.if_else#49375
GH-49310: [C++][Compute] Fix segmentation fault in pyarrow.compute.if_else#49375Anakin100100 wants to merge 3 commits intoapache:mainfrom
Conversation
|
|
pitrou
left a comment
There was a problem hiding this comment.
Thanks for doing this @Anakin100100 . Some comments below.
| auto right_data_length = right_offsets[right.length] - right_offsets[0]; | ||
| int64_t right_data_length = static_cast<int64_t>(right_offsets[right.length]) - | ||
| static_cast<int64_t>(right_offsets[0]); | ||
| ARROW_RETURN_NOT_OK(ValidateCapacityForOffsetType(right_data_length)); |
There was a problem hiding this comment.
I don't think this is required? We know right already fits in 2GiB, so the result here cannot possibly overflow.
There was a problem hiding this comment.
Yeah, if the array Already holds 2GB then assigning from the scalar won't overflow,
| CheckIfElseOutput(cond, left, right, expected_data); | ||
| } | ||
|
|
||
| std::shared_ptr<Array> MakeOversizedBuffer(int32_t data_length) { |
There was a problem hiding this comment.
This is creating an array rather than a buffer, perhaps change the name to reflect that?
| return MakeArray( | ||
| ArrayData::Make(binary(), 1, {nullptr, std::move(offsets), std::move(values)})); |
There was a problem hiding this comment.
Ok, so the array is technically invalid as the data buffer is too short, add a comment mentioning that?
There was a problem hiding this comment.
I'll document this
| CallFunction("if_else", {cond, left, right})); | ||
| } | ||
|
|
||
| TEST_F(TestIfElseKernel, IfElseLargeStringAAADoesNotCapacityOverflow) { |
There was a problem hiding this comment.
You can re-use a common helper with the test above, to cut down on code duplication.
| MakeAllocatedVarBinaryArray(large_string_type, kPerSide)); | ||
|
|
||
| auto maybe_out = CallFunction("if_else", {cond, left, right}); | ||
| ASSERT_FALSE(maybe_out.status().IsCapacityError()) << maybe_out.status(); |
There was a problem hiding this comment.
Yes, that check can just be ASSERT_OK
| constexpr int32_t capacity_limit = std::numeric_limits<int32_t>::max(); | ||
|
|
||
| auto cond = ArrayFromJSON(boolean(), "[true]"); | ||
| auto left = MakeOversizedBuffer(capacity_limit); |
There was a problem hiding this comment.
Why not use MakeAllocatedVarBinaryArray?
There was a problem hiding this comment.
I deleted the first helper
| ASSERT_FALSE(maybe_out.status().IsCapacityError()) << maybe_out.status(); | ||
| } | ||
|
|
||
| TEST_F(TestIfElseKernel, IfElseBinaryCapacityErrorAAA) { |
There was a problem hiding this comment.
Is this test checking something different than IfElseStringAAAAllocatedDataCapacityError?
There was a problem hiding this comment.
If we checked that if the sum of results could overflow then checking that if they are both at capacity also triggers the check isn't really needed
Rationale for this change
Fixing #49310
What changes are included in this PR?
I added a check for the if else compute kernels that verify that the result will fit in the offset of the result. The check asserts that no matter which values are chosen from either the left or right arrays/scalars there will be no overflow in order to use the UnsafeAppend which skips almost all safety checks.
Are these changes tested?
Yes, There are tests for each of the kernels that include an array that the values that would overflow the 32 bit index trigger the check. These tests don't actually allocate the data to save on time and compute. There are also two tests that verify an overflow for a 32 bit offset type and trigger an error and the second one that ensures that for a data type with a 64 bit offset there is no overflow error.
Are there any user-facing changes?
Yes, if the result may not fit in a type that relies on 32 bit offsets the user will receive a helpful error message instead of a segfault.