Python SDK read_parquet union_by_name fail after version 1.2.2 #259

@DaiZack

duckdbtest.zip

What happens?

Title

Regression in 1.3.0+: union_by_name fails with "Can't change source type (NULL) to target type (VARCHAR[])" when reading parquet files with mixed NULL/LIST types

DuckDB Version

  • Working version: 1.2.2
  • Broken versions: 1.3.0, 1.3.1 (and later)

Environment

  • OS: Linux
  • Python: 3.12.9
  • pandas: (latest)

Description

Starting with DuckDB 1.3.0, reading multiple parquet files with union_by_name=True fails when:

  1. Some parquet files have a column stored as NULL type (because all values are null in that file)
  2. Other parquet files have the same column properly typed as VARCHAR[] (array/list of strings)

This worked correctly in DuckDB 1.2.2 but now throws:

BinderException: Binder Error: Can't change source type ("NULL") to target type (VARCHAR[]), type conversion not allowed

Expected Behavior

When union_by_name=True is set, DuckDB should merge schemas gracefully, treating NULL-typed columns as compatible with any target type (similar to how pandas handles this).

Actual Behavior

DuckDB 1.3.0+ throws a BinderException and refuses to read the files, even though union_by_name=True is explicitly designed to handle schema variations across multiple files.

Root Cause Analysis

Investigation shows:

  • When a parquet file has ALL NULL values for a column, it's stored with NULL type (e.g., INT32 with NullType() logical type)
  • Other files with actual data store the same column as BYTE_ARRAY with StringType() or complex types like ListType()
  • The error specifically mentions VARCHAR[] (array type) suggesting it happens with nested/complex types
  • This regression appeared between versions 1.2.2 and 1.3.0

How to Reproduce

Test files are attached; see duckdbtest.zip.

import duckdb
print(f"DuckDB version: {duckdb.__version__}")

# Fails with 1.3.0+
try:
    result = duckdb.read_parquet(
        "duckdb_bug_test_files/*.parquet",
        union_by_name=True
    ).df()
    print(f"SUCCESS: Read {len(result)} rows")
except Exception as e:
    print(f"FAILED: {type(e).__name__}: {e}")

To Reproduce

This occurs only in the Python SDK.

OS:

Linux x86

DuckDB Version:

v1.2.2 (works); v1.3.0 and later (fails)

DuckDB Client:

Python

Hardware:

No response

Full Name:

Zack Dai

Affiliation:

Zack Dai

Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?

  • Yes, I have

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant data sets for reproducing the issue?

Yes
