Description
What happens?
Title
Regression in 1.3.0+: union_by_name fails with "Can't change source type (NULL) to target type (VARCHAR[])" when reading parquet files with mixed NULL/LIST types
DuckDB Version
- Working version: 1.2.2
- Broken versions: 1.3.0, 1.3.1 (and later)
Environment
- OS: Linux
- Python: 3.12.9
- pandas: (latest)
Description
Starting with DuckDB 1.3.0, reading multiple parquet files with union_by_name=True fails when:
- Some parquet files have a column stored as NULL type (because all values are null in that file)
- Other parquet files have the same column properly typed as VARCHAR[] (array/list of strings)
This worked correctly in DuckDB 1.2.2 but now throws:
BinderException: Binder Error: Can't change source type ("NULL") to target type (VARCHAR[]), type conversion not allowed
Expected Behavior
When union_by_name=True is set, DuckDB should merge schemas gracefully, treating NULL-typed columns as compatible with any target type (similar to how pandas handles this).
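The expected merge rule can be illustrated with a small pure-Python sketch. This is not DuckDB's implementation; the function name `merge_schemas_by_name` and the use of plain strings for type names are assumptions made only for illustration. It shows the requested semantics: a NULL-typed column yields to whatever concrete type another file provides for the same column name.

```python
NULL_TYPE = "NULL"  # stand-in for an untyped, all-null column


def merge_schemas_by_name(schemas):
    """Union column names across schemas; NULL yields to any concrete type.

    Each schema is a dict mapping column name -> type name (hypothetical
    representation, used only for this illustration).
    """
    merged = {}
    for schema in schemas:
        for name, col_type in schema.items():
            current = merged.get(name)
            if current is None or current == NULL_TYPE:
                # First sighting, or only NULL seen so far: take the new type.
                merged[name] = col_type
            elif col_type != NULL_TYPE and col_type != current:
                # Two different concrete types are a genuine conflict.
                raise TypeError(
                    f"column {name!r}: cannot merge {current} with {col_type}"
                )
    return merged


if __name__ == "__main__":
    file_a = {"id": "BIGINT", "tags": "NULL"}       # all values null
    file_b = {"id": "BIGINT", "tags": "VARCHAR[]"}  # properly typed
    print(merge_schemas_by_name([file_a, file_b]))
```

Under this rule the order of the files does not matter: the merged type of `tags` is `VARCHAR[]` whether the all-null file is read first or last.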
Actual Behavior
DuckDB 1.3.0+ throws a BinderException and refuses to read the files, even though union_by_name=True is explicitly designed to handle schema variations across multiple files.
Root Cause Analysis
Investigation shows:
- When a parquet file has ALL NULL values for a column, it's stored with NULL type (e.g., INT32 with NullType() logical type)
- Other files with actual data store the same column as BYTE_ARRAY with StringType() or complex types like ListType()
- The error specifically mentions VARCHAR[] (array type), suggesting it happens with nested/complex types
- This regression appeared between versions 1.2.2 and 1.3.0
How to Reproduce
Test files are attached; see duckdbtest.zip.
import duckdb

print(f"DuckDB version: {duckdb.__version__}")

# Fails with 1.3.0+
try:
    result = duckdb.read_parquet(
        "duckdb_bug_test_files/*.parquet",
        union_by_name=True
    ).df()
    print(f"SUCCESS: Read {len(result)} rows")
except Exception as e:
    print(f"FAILED: {type(e).__name__}: {e}")
To Reproduce
This only occurs in the Python SDK.
OS:
Linux x86
DuckDB Version:
v1.2.2, v1.3.0 and later
DuckDB Client:
Python
Hardware:
No response
Full Name:
Zack Dai
Affiliation:
Zack Dai
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
- Yes, I have
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant data sets for reproducing the issue?
Yes