fix: raise NotImplementedError when filtering by UUID column #2881
ndrluis wants to merge 1 commit into apache:main
Conversation
This seems to be a revert of #2007
I think this is a new issue with Spark, Iceberg, and UUID. The previous fix (apache/iceberg#13324) and its Spark 4 backport (apache/iceberg#13573) are already included in 1.10.0.
I've included the stack trace from Spark for debugging.
EDIT:
The Java-side UUID fix in 1.10.1 is actually apache/iceberg#14027
I had to run make test-integration-rebuild to update the docker image cache
The new stack trace is:
> result = table.scan(row_filter=EqualTo("uuid_col", uuid.UUID("00000000-0000-0000-0000-000000000000").bytes)).to_arrow()
tests/integration/test_writes/test_writes.py:2588:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyiceberg/table/__init__.py:2027: in to_arrow
).to_table(self.plan_files())
pyiceberg/io/pyarrow.py:1730: in to_table
first_batch = next(batches)
pyiceberg/io/pyarrow.py:1781: in to_record_batches
for batches in executor.map(batches_for_task, tasks):
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
yield _result_or_cancel(fs.pop())
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
return fut.result(timeout)
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:456: in result
return self.__get_result()
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
raise self._exception
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/thread.py:59: in run
result = self.fn(*self.args, **self.kwargs)
pyiceberg/io/pyarrow.py:1778: in batches_for_task
return list(self._record_batches_from_scan_tasks_and_deletes([task], deletes_per_file))
pyiceberg/io/pyarrow.py:1818: in _record_batches_from_scan_tasks_and_deletes
for batch in batches:
pyiceberg/io/pyarrow.py:1600: in _task_to_record_batches
fragment_scanner = ds.Scanner.from_fragment(
pyarrow/_dataset.pyx:3792: in pyarrow._dataset.Scanner.from_fragment
???
pyarrow/_dataset.pyx:3547: in pyarrow._dataset._populate_builder
???
pyarrow/_compute.pyx:2884: in pyarrow._compute._bind
???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.uuid>, extension<arrow.uuid>)
pyarrow/error.pxi:92: ArrowNotImplementedError
😄
"""
identifier = "default.test_write_uuid_in_pyiceberg_and_scan"

catalog = load_catalog("default", type="in-memory")
nit: do we need these lines with the session catalog?
geruh
left a comment
LGTM! Just have a small nit on the test. One minor note: this changes how UUID columns are returned from Arrow, as users get bytes instead of UUID objects. Worth noting in the release notes.
ping @Fokko @kevinjqliu
I share the concern raised by @geruh.
Another option would be to add a flag to fall back to the binary representation. Ideally, we fix this upstream in Arrow. Once that is fixed, we would break the API again by changing from Binary back to UUID. @ndrluis WDYT?
I agree that the API change is a problem. I think it's better for us to raise an exception when a user tries to filter by a UUID column. This approach maintains the current API and ensures consistency between files written by the Java and Python clients, since Java writes UUIDs to the metadata. Once it's fixed on the Arrow side, we can simply remove the exception and upgrade the Arrow version. WDYT @Fokko @geruh @kevinjqliu?
@ndrluis Yes, that sounds like a good approach. I would refrain from changing the APIs back and forth 👍 |
PyArrow does not support filtering on UUID-typed columns. This commit raises a NotImplementedError with a clear message when such a filter is attempted.
@geruh @kevinjqliu @Fokko, could you please review it again?
Closes #2372
Rationale for this change
The Python and Rust Arrow implementations don't recognize Java's UUID metadata for filtering. Reading works, but filtering returns the following error:

ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.uuid>, extension<arrow.uuid>)

While one approach would be to change the UUIDType Arrow schema conversion from pa.uuid() to pa.binary(16), this alters the returned data representation, breaking the existing API contract. Instead, this change raises an explicit exception when UUID filtering is attempted, preserving API compatibility without changing how UUID data is returned.
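The approach can be sketched with a small guard that runs before the filter expression reaches PyArrow. The classes and function names below are simplified stand-ins for illustration, not pyiceberg's actual bound-expression machinery:

```python
from dataclasses import dataclass


class UUIDType:
    """Stand-in for pyiceberg.types.UUIDType."""


@dataclass
class BoundReference:
    """Stand-in for a bound column reference in a row filter."""
    name: str
    field_type: object


def assert_filterable(ref: BoundReference) -> None:
    # Fail fast with a clear message instead of letting PyArrow raise
    # "Function 'equal' has no kernel matching input types
    # (extension<arrow.uuid>, extension<arrow.uuid>)" deep inside
    # scan planning.
    if isinstance(ref.field_type, UUIDType):
        raise NotImplementedError(
            f"Filtering by UUID column '{ref.name}' is not supported by PyArrow"
        )


try:
    assert_filterable(BoundReference("uuid_col", UUIDType()))
except NotImplementedError as exc:
    print(exc)  # Filtering by UUID column 'uuid_col' is not supported by PyArrow
```

Once Arrow grows comparison kernels for the UUID extension type, the guard can simply be deleted without any other API change.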
Are these changes tested?
Yes
Are there any user-facing changes?
No