Description
Describe the bug, including details regarding any error messages, version, and platform.
Coming from a pandas report, I could reproduce this with just pyarrow and a very specific usage of if_else. I call it twice, but the example is set up such that each result should be equal to the original array:
import pyarrow as pa
import pyarrow.compute as pc
arr1 = pa.chunked_array([
pa.array([None, "x", "x"], type=pa.string()),
pa.array([None, "x", "x"], type=pa.string()),
])
mask = arr1.is_null()
expected = arr1
# first if_else call
arr2 = pc.if_else(True, arr1.combine_chunks(), arr1)
arr2.validate(full=True)
assert arr2.equals(expected)
# second if_else call
arr3 = pc.if_else(mask, None, arr2)
arr3.validate(full=True)
assert arr3.equals(expected)
When running the above, the first if_else call (presumably a short-circuit, since the condition is a scalar True) "works" fine, but the second call (this time with an actual mask) returns an invalid array, which fails the validate() check.
But when checking the returned arrays in detail, it seems that already after the first if_else call, the returned array has a strange second chunk:
>>> import nanoarrow
>>> nanoarrow.Array(arr1.chunk(1)).inspect()
<ArrowArray string>
- length: 3
- offset: 0
- null_count: 1
- buffers[3]:
- validity <bool[1 b] 01100000>
- data_offset <int32[16 b] 0 0 1 2>
- data <string[2 b] b'xx'>
- dictionary: NULL
- children[0]:
>>> nanoarrow.Array(arr2.chunk(1)).inspect()
<ArrowArray string>
- length: 3
- offset: 3
- null_count: 1
- buffers[3]:
- validity <bool[1 b] 01101100>
- data_offset <int32[28 b] 0 0 1 2 2 3 4>
- data <string[4 b] b'xxxx'>
- dictionary: NULL
- children[0]:
The chunk has "grown" and gotten an offset. This array is still valid and represents the correct data, although I don't see a reason why it returns data like this. EDIT: I suppose this is because if_else first creates a contiguous result and then slices it up to retain the input chunking?
But passing such an array with an offset to the second if_else call then seems to produce an entirely invalid second chunk:
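To illustrate why the sliced chunk is still valid, here is a minimal pure-Python sketch of the string layout. It is only a model of the buffers shown in the dump, not pyarrow's implementation, and the `decode` helper is hypothetical. It reads the shared buffers of arr2.chunk(1) once with offset 0 and once with offset 3:

```python
# Pure-Python model of the Arrow string layout (assumption: a simplified
# stand-in for illustration, not how pyarrow is implemented).

def decode(validity, offsets, data, offset, length):
    """Decode `length` logical values starting at logical index `offset`."""
    out = []
    for i in range(offset, offset + length):
        if not validity[i]:
            out.append(None)
        else:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
    return out

# Buffers copied from the arr2.chunk(1) dump: one contiguous result
# covering all six values.
validity = [False, True, True, False, True, True]  # bits 011011
offsets = [0, 0, 1, 2, 2, 3, 4]
data = b"xxxx"

# Slicing keeps the buffers and only changes (offset, length).
first_chunk = decode(validity, offsets, data, offset=0, length=3)
second_chunk = decode(validity, offsets, data, offset=3, length=3)
print(first_chunk)
print(second_chunk)
```

Both windows decode to the same three values as the original chunk, which is consistent with arr2 passing validation and comparing equal to arr1.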
>>> nanoarrow.Array(arr3.chunk(1)).inspect()
<ArrowArray string>
- length: 3
- offset: 0
- null_count: 1
- buffers[3]:
- validity <bool[1 b] 01100000>
- data_offset <int32[16 b] 2 2 3 4>
- data <string[4 b] b'xx\x00\x00'>
- dictionary: NULL
- children[0]:
This happens specifically when we pass the concatenated array to the first if_else call, i.e. when doing arr2 = pc.if_else(True, arr1.combine_chunks(), arr1) and not arr2 = pc.if_else(True, arr1, arr1), so only when the chunks of the inputs are not aligned.
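One possible reading of the arr3.chunk(1) dump, which is an assumption on my side and not a confirmed diagnosis: its offsets are exactly the window of arr2.chunk(1)'s offsets starting at the array offset, without being rebased to start at 0. The following pure-Python sketch only compares the numbers from the dumps above (the `decode` helper is hypothetical and does not call pyarrow):

```python
# Pure-Python comparison of the dumped buffers (assumption: illustrative
# model only; it does not show what the if_else kernel actually does).

def decode(validity, offsets, data):
    """Decode a zero-offset string array from its three buffers."""
    out = []
    for i, valid in enumerate(validity):
        if not valid:
            out.append(None)
        else:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
    return out

# arr3.chunk(1) as dumped.
arr3_validity = [False, True, True]
arr3_offsets = [2, 2, 3, 4]
arr3_data = b"xx\x00\x00"
decoded = decode(arr3_validity, arr3_offsets, arr3_data)

# arr2.chunk(1) as dumped: offset 3, length 3.
arr2_offsets = [0, 0, 1, 2, 2, 3, 4]
arr2_offset, length = 3, 3
window = arr2_offsets[arr2_offset:arr2_offset + length + 1]

# Rebasing the window so that it starts at 0 gives the offsets of arr1.chunk(1).
rebased = [o - window[0] for o in window]
fixed = decode(arr3_validity, rebased, arr3_data)

print(decoded)   # values read through the dumped offsets
print(window)    # same numbers as arr3_offsets
print(fixed)     # values read through the rebased offsets
```

With the dumped offsets, the two non-null values point past the "xx" bytes, while the rebased offsets decode to the expected values. This would fit an offset not being taken into account somewhere in the second call, but I have not verified that in the C++ code.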
Component(s)
Python, C++