
fix: raise NotImplementedError when filtering by UUID column #2881

Open

ndrluis wants to merge 1 commit into apache:main from ndrluis:uuid-fix

Conversation

@ndrluis
Collaborator

@ndrluis ndrluis commented Jan 5, 2026

Closes #2372

Rationale for this change

Python and Rust Arrow implementations don't recognize Java's UUID metadata for filtering. Reading works, but filtering returns the following error:

ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.uuid>, extension<arrow.uuid>)

While one approach would be to change the UUIDType Arrow schema conversion from pa.uuid() to pa.binary(16), this alters the returned data representation, breaking the existing API contract.
Instead, this change raises an explicit exception when UUID filtering is attempted, preserving API compatibility without changing how UUID data is returned.
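
A minimal, self-contained sketch of the guard this change introduces (the class names and the helper here are simplified stand-ins, not pyiceberg's actual visitor or type classes, which hook into the bound-expression machinery):

```python
# Hypothetical, simplified stand-ins for pyiceberg's type classes; the real
# change plugs into the bound filter expression visitor.
class UUIDType:
    pass

class LongType:
    pass

def check_filter_supported(column_type: object) -> None:
    """Raise early instead of letting PyArrow fail deep inside the scan."""
    if isinstance(column_type, UUIDType):
        raise NotImplementedError(
            "Filtering on UUID columns is not supported: PyArrow has no "
            "'equal' kernel for extension<arrow.uuid> types."
        )

# Filtering a non-UUID column passes through unchanged:
check_filter_supported(LongType())

# Filtering a UUID column fails fast, before any files are scanned:
try:
    check_filter_supported(UUIDType())
except NotImplementedError as exc:
    print(type(exc).__name__)  # NotImplementedError
```

The point of raising eagerly is that the user sees an explicit, actionable message instead of an ArrowNotImplementedError surfacing from inside the fragment scanner.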

Are these changes tested?

Yes

Are there any user-facing changes?

No

Contributor

@kevinjqliu kevinjqliu left a comment


This seems to be a revert of #2007.
I think this is a new issue with Spark, Iceberg, and UUID. The previous fix (apache/iceberg#13324) and its Spark 4 backport (apache/iceberg#13573) are already included in 1.10.0.

I've included the stacktrace from Spark for debugging.

EDIT:
The Java-side UUID fix in 1.10.1 is actually apache/iceberg#14027.
I had to run make test-integration-rebuild to update the docker image cache.

the new stacktrace is

    
>       result = table.scan(row_filter=EqualTo("uuid_col", uuid.UUID("00000000-0000-0000-0000-000000000000").bytes)).to_arrow()

tests/integration/test_writes/test_writes.py:2588: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyiceberg/table/__init__.py:2027: in to_arrow
    ).to_table(self.plan_files())
pyiceberg/io/pyarrow.py:1730: in to_table
    first_batch = next(batches)
pyiceberg/io/pyarrow.py:1781: in to_record_batches
    for batches in executor.map(batches_for_task, tasks):
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
pyiceberg/io/pyarrow.py:1778: in batches_for_task
    return list(self._record_batches_from_scan_tasks_and_deletes([task], deletes_per_file))
pyiceberg/io/pyarrow.py:1818: in _record_batches_from_scan_tasks_and_deletes
    for batch in batches:
pyiceberg/io/pyarrow.py:1600: in _task_to_record_batches
    fragment_scanner = ds.Scanner.from_fragment(
pyarrow/_dataset.pyx:3792: in pyarrow._dataset.Scanner.from_fragment
    ???
pyarrow/_dataset.pyx:3547: in pyarrow._dataset._populate_builder
    ???
pyarrow/_compute.pyx:2884: in pyarrow._compute._bind
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.uuid>, extension<arrow.uuid>)

pyarrow/error.pxi:92: ArrowNotImplementedError

@kevinjqliu
Contributor

The UUID support is a gift that keeps on giving.

😄

"""
identifier = "default.test_write_uuid_in_pyiceberg_and_scan"

catalog = load_catalog("default", type="in-memory")
Member


nit: do we need these lines with the session catalog?

Collaborator Author


No, I fixed it in 1751f0c

Member

@geruh geruh left a comment


LGTM! Just have a small nit on the test. One minor note: this changes how UUID columns are returned from Arrow, as users get bytes instead of UUIDs. Worth noting in the release notes.

@ndrluis
Collaborator Author

ndrluis commented Feb 4, 2026

ping @Fokko @kevinjqliu

@Fokko
Contributor

Fokko commented Feb 7, 2026

I share the concern raised by @geruh:

One minor note, this changes how UUID columns are returned from Arrow as users get bytes instead of UUID. Worth noting in release notes.

Another option to fix this would be to add a flag to fallback to binary[16], similar to the backward compatibility flag: https://py.iceberg.apache.org/configuration/#backward-compatibility

Ideally we would fix this upstream in Arrow. Once that is fixed, we would break the API again by changing from binary back to UUID. @ndrluis WDYT?
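
The flag-based fallback suggested above could, roughly, look like the following (the property name and helper are hypothetical illustrations, not an existing PyIceberg configuration key):

```python
import uuid

# Hypothetical flag name, modeled on PyIceberg's existing
# backward-compatibility properties; it is not currently defined.
PYARROW_UUID_AS_BINARY = "pyarrow.uuid-as-fixed-size-binary"

def uuid_to_storage(value: uuid.UUID, properties: dict) -> object:
    """Return the UUID as-is, or as its 16-byte big-endian storage when the
    (hypothetical) compatibility flag is enabled."""
    if properties.get(PYARROW_UUID_AS_BINARY, False):
        # fixed-size binary is filterable by Arrow's comparison kernels
        return value.bytes
    # extension<arrow.uuid> keeps the current API but cannot be filtered
    return value

u = uuid.UUID("00000000-0000-0000-0000-000000000001")
as_binary = uuid_to_storage(u, {PYARROW_UUID_AS_BINARY: True})  # 16 raw bytes
as_uuid = uuid_to_storage(u, {})                                # the UUID itself
```

With such a flag, users who need Spark-compatible filtering could opt into the binary[16] representation, while the default keeps today's UUID extension type and API contract.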

@ndrluis
Collaborator Author

ndrluis commented Feb 8, 2026

I agree that the API change is a problem. I think it's better for us to raise an exception when a user tries to filter by a UUID column. I believe this approach is better for maintaining the current API and ensuring consistency between files written by the Java and Python clients, since Java writes UUIDs to the metadata.

Once it’s fixed on the Arrow side, we can simply remove the exception and upgrade the Arrow version.

WDYT @Fokko @geruh @kevinjqliu?

@Fokko
Contributor

Fokko commented Feb 18, 2026

@ndrluis Yes, that sounds like a good approach. I would refrain from changing the APIs back and forth 👍

PyArrow does not support filtering on UUID-typed columns. This commit
raises a NotImplementedError with a clear message when such a filter
is attempted.
@ndrluis ndrluis changed the title fix: Use binary(16) for UUID type to ensure Spark compatibility fix: raise NotImplementedError when filtering by UUID column Feb 28, 2026
@ndrluis ndrluis requested review from geruh and kevinjqliu February 28, 2026 16:38
@ndrluis
Collaborator Author

ndrluis commented Feb 28, 2026

@geruh @kevinjqliu @Fokko, could you please review it again?



Development

Successfully merging this pull request may close these issues.

Error when filtering by UUID in table scan

4 participants