[C++][Python] if_else kernel creates invalid results with chunked array string inputs with non-aligned chunks #49410

@jorisvandenbossche

Description

Describe the bug, including details regarding any error messages, version, and platform.

Coming from a pandas report, I could reproduce this with pyarrow alone, using a very specific sequence of if_else calls. I call it twice, but the example is set up so that each result should be equal to the original array:

import pyarrow as pa
import pyarrow.compute as pc

arr1 = pa.chunked_array([
    pa.array([None, "x", "x"], type=pa.string()),
    pa.array([None, "x", "x"], type=pa.string()),
])
mask = arr1.is_null()
expected = arr1

# first if_else call
arr2 = pc.if_else(True, arr1.combine_chunks(), arr1)
arr2.validate(full=True)
assert arr2.equals(expected)

# second if_else call
arr3 = pc.if_else(mask, None, arr2)
arr3.validate(full=True)
assert arr3.equals(expected)

When running the above, the first if_else call (essentially a short-circuit, since the condition is a scalar True?) "works" fine, but the second call (this time with an actual mask) returns an invalid array, which fails the validate() check.

But on closer inspection, the array returned by the first if_else call already has a strange second chunk:

>>> import nanoarrow
>>> nanoarrow.Array(arr1.chunk(1)).inspect()
<ArrowArray string>
- length: 3
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 01100000>
  - data_offset <int32[16 b] 0 0 1 2>
  - data <string[2 b] b'xx'>
- dictionary: NULL
- children[0]:
>>> nanoarrow.Array(arr2.chunk(1)).inspect()
<ArrowArray string>
- length: 3
- offset: 3
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 01101100>
  - data_offset <int32[28 b] 0 0 1 2 2 3 4>
  - data <string[4 b] b'xxxx'>
- dictionary: NULL
- children[0]:

The chunk has "grown" and gained an offset. The array itself is still valid and represents the correct data, although I don't see a reason why it returns data like this. (EDIT: I suppose this is because if_else first creates a contiguous result and then slices it up to retain the input chunking?)
But passing such an array with an offset to the second if_else call seems to produce an entirely invalid second chunk:

>>> nanoarrow.Array(arr3.chunk(1)).inspect()
<ArrowArray string>
- length: 3
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 01100000>
  - data_offset <int32[16 b] 2 2 3 4>
  - data <string[4 b] b'xx\x00\x00'>
- dictionary: NULL
- children[0]:
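
To make the three dumps above easier to compare, here is a minimal pure-Python sketch that decodes a string chunk from its validity, offsets and data buffers. The helper name decode_string_chunk is hypothetical (it is not part of pyarrow or nanoarrow), and it assumes the validity bits are printed with element 0 first, which matches the data shown here.

```python
def decode_string_chunk(validity, offsets, data, length, offset=0):
    """Decode a string chunk from the buffers printed by inspect().

    validity: bit string as printed (assumed: element 0 first)
    offsets:  list of int32 offsets into the data buffer
    data:     raw bytes of the data buffer
    """
    values = []
    for i in range(offset, offset + length):
        if validity[i] == "0":
            # A cleared validity bit means the slot is null.
            values.append(None)
        else:
            # A valid slot spans data[offsets[i]:offsets[i + 1]].
            values.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
    return values


# Buffers copied from the three dumps (second chunk of arr1, arr2 and arr3).
chunk_arr1 = decode_string_chunk("01100000", [0, 0, 1, 2], b"xx", length=3)
chunk_arr2 = decode_string_chunk(
    "01101100", [0, 0, 1, 2, 2, 3, 4], b"xxxx", length=3, offset=3
)
chunk_arr3 = decode_string_chunk(
    "01100000", [2, 2, 3, 4], b"xx\x00\x00", length=3
)

print(chunk_arr1)
print(chunk_arr2)
print(chunk_arr3)
```

Under these assumptions, the first two chunks decode to the same values, while the third does not: its offsets start at 2, past the two 'x' bytes at the start of its data buffer, so the non-null slots read the trailing zero bytes instead.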

This happens specifically when the concatenated array is passed to the first if_else call, i.e. with arr2 = pc.if_else(True, arr1.combine_chunks(), arr1) and not with arr2 = pc.if_else(True, arr1, arr1), so only when the chunks of the two inputs are not aligned.
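
As an illustration of what "not aligned" means here, the sketch below splits several chunk layouts on their common boundaries and reports, for each piece, which chunk of each input it comes from and at which offset. This is only a model of the idea, written under the assumption that the kernel processes chunked inputs piece by piece. The function split_on_common_boundaries is hypothetical and is not Arrow's actual implementation.

```python
from itertools import accumulate


def split_on_common_boundaries(*layouts):
    """Split chunk layouts (lists of chunk lengths) into aligned pieces.

    Returns a list of pieces; each piece holds one
    (chunk_index, offset_in_chunk, length) tuple per input layout.
    """
    if len({sum(layout) for layout in layouts}) > 1:
        raise ValueError("all inputs must have the same total length")
    # Every chunk end of any input is a boundary for all inputs.
    boundaries = sorted(
        {b for layout in layouts for b in accumulate(layout) if b > 0}
    )
    pieces = []
    start = 0
    for end in boundaries:
        piece = []
        for layout in layouts:
            chunk_start = 0
            for index, n in enumerate(layout):
                if start < chunk_start + n:
                    # This chunk contains the piece starting at `start`.
                    piece.append((index, start - chunk_start, end - start))
                    break
                chunk_start += n
        pieces.append(tuple(piece))
        start = end
    return pieces


# arr1.combine_chunks() has one chunk of 6; arr1 has two chunks of 3.
non_aligned = split_on_common_boundaries([6], [3, 3])
# pc.if_else(True, arr1, arr1): both inputs have two chunks of 3.
aligned = split_on_common_boundaries([3, 3], [3, 3])

print(non_aligned)
print(aligned)
```

In the non-aligned case, the second piece of the combined array is a slice at offset 3 of its single chunk, which would be consistent with the offset: 3 and the full-size buffers seen in arr2.chunk(1). In the aligned case, every piece starts at offset 0 of its own chunk.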

Component(s)

Python, C++
