fix: raise NotImplementedError when filtering by UUID column #2881
ndrluis wants to merge 1 commit into apache:main
Conversation
This seems to be a revert of #2007
I think this is a new issue with Spark, Iceberg, and UUID. The previous fix (apache/iceberg#13324) and its Spark 4 backport (apache/iceberg#13573) are already included in 1.10.0.
I've included the stack trace from Spark for debugging.
EDIT:
The Java-side UUID fix in 1.10.1 is actually apache/iceberg#14027
I had to run make test-integration-rebuild to update the docker image cache
The new stack trace is:
> result = table.scan(row_filter=EqualTo("uuid_col", uuid.UUID("00000000-0000-0000-0000-000000000000").bytes)).to_arrow()
tests/integration/test_writes/test_writes.py:2588:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyiceberg/table/__init__.py:2027: in to_arrow
).to_table(self.plan_files())
pyiceberg/io/pyarrow.py:1730: in to_table
first_batch = next(batches)
pyiceberg/io/pyarrow.py:1781: in to_record_batches
for batches in executor.map(batches_for_task, tasks):
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
pyiceberg/io/pyarrow.py:1778: in batches_for_task
return list(self._record_batches_from_scan_tasks_and_deletes([task], deletes_per_file))
pyiceberg/io/pyarrow.py:1818: in _record_batches_from_scan_tasks_and_deletes
for batch in batches:
pyiceberg/io/pyarrow.py:1600: in _task_to_record_batches
fragment_scanner = ds.Scanner.from_fragment(
pyarrow/_dataset.pyx:3792: in pyarrow._dataset.Scanner.from_fragment
???
pyarrow/_dataset.pyx:3547: in pyarrow._dataset._populate_builder
???
pyarrow/_compute.pyx:2884: in pyarrow._compute._bind
???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.uuid>, extension<arrow.uuid>)
pyarrow/error.pxi:92: ArrowNotImplementedError
😄
"""
identifier = "default.test_write_uuid_in_pyiceberg_and_scan"

catalog = load_catalog("default", type="in-memory")
nit: do we need these lines with the session catalog?
geruh
left a comment
LGTM! Just have a small nit on the test. One minor note: this changes how UUID columns are returned from Arrow, as users get bytes instead of UUID objects. Worth noting in the release notes.
ping @Fokko @kevinjqliu
I share the concern raised by @geruh.
Another option would be to add a flag to fall back to the binary representation. Ideally, we fix this upstream in Arrow. Once that is fixed, we would break the API again by changing from Binary back to UUID. @ndrluis WDYT?
I agree that the API change is a problem. I think it's better for us to raise an exception when a user tries to filter by a UUID column. This approach maintains the current API and ensures consistency between files written by the Java and Python clients, since Java writes UUIDs to the metadata. Once it's fixed on the Arrow side, we can simply remove the exception and upgrade the Arrow version. WDYT @Fokko @geruh @kevinjqliu?
@ndrluis Yes, that sounds like a good approach. I would refrain from changing the APIs back and forth 👍 |
PyArrow does not support filtering on UUID-typed columns. This commit raises a NotImplementedError with a clear message when such a filter is attempted.
@geruh @kevinjqliu @Fokko, could you please review it again?
Closes #2372
Rationale for this change
The Python and Rust Arrow implementations don't recognize Java's UUID metadata for filtering. Reading works, but filtering returns the following error:

ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.uuid>, extension<arrow.uuid>)

While one approach would be to change the UUIDType Arrow schema conversion from pa.uuid() to pa.binary(16), this alters the returned data representation, breaking the existing API contract. Instead, this change raises an explicit exception when UUID filtering is attempted, preserving API compatibility without changing how UUID data is returned.
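The approach can be sketched with a small guard that runs before the filter expression reaches PyArrow. The classes and function names below are simplified stand-ins for illustration, not pyiceberg's actual bound-expression machinery:

```python
from dataclasses import dataclass


class UUIDType:
    """Stand-in for pyiceberg.types.UUIDType."""


@dataclass
class BoundReference:
    """Stand-in for a bound column reference in a row filter."""
    name: str
    field_type: object


def assert_filterable(ref: BoundReference) -> None:
    # Fail fast with a clear message instead of letting PyArrow raise
    # "Function 'equal' has no kernel matching input types
    # (extension<arrow.uuid>, extension<arrow.uuid>)" deep inside
    # scan planning.
    if isinstance(ref.field_type, UUIDType):
        raise NotImplementedError(
            f"Filtering by UUID column '{ref.name}' is not supported by PyArrow"
        )


try:
    assert_filterable(BoundReference("uuid_col", UUIDType()))
except NotImplementedError as exc:
    print(exc)  # Filtering by UUID column 'uuid_col' is not supported by PyArrow
```

Once Arrow grows comparison kernels for the UUID extension type, the guard can simply be deleted without any other API change.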
Are these changes tested?
Yes
Are there any user-facing changes?
No