Support pluggable cast resolution to support casting to and from extension types #20755

@paleolimbot

Is your feature request related to a problem or challenge?

After #18136 we can represent casts to an extension type in the logical plan and Substrait, and after #20676 we will be able to represent casts to an extension type in SQL. While these can now be intercepted by an optimizer rule or logical plan modification, they will currently error if passed to the default planner.

Expr::Cast(Cast { expr, field }) => {
    if !field.metadata().is_empty() {
        let (_, src_field) = expr.to_field(input_dfschema)?;
        return plan_err!(
            "Cast from {} to {} is not supported",
            format_type_and_metadata(
                src_field.data_type(),
                Some(src_field.metadata()),
            ),
            format_type_and_metadata(field.data_type(), Some(field.metadata()))
        );
    }
    expressions::cast(
        create_physical_expr(expr, input_dfschema, execution_props)?,
        input_schema,
        field.data_type().clone(),
    )
}

At this point we have access to the ExecutionProps, which has a reference to the ConfigOptions. Given some ability to plug in casting behaviour, we could resolve this expression to provide built-in support for things like casting UUID strings into a UUID value and vice versa.
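
For concreteness, here is a minimal sketch of the per-value conversion such a pluggable cast might eventually resolve to. It uses only the standard library; `parse_uuid` and `format_uuid` are hypothetical helper names, and a real kernel would presumably operate on Arrow arrays (string input, 16-byte fixed-size binary storage for `arrow.uuid`) rather than on single values.

```rust
/// Parse the canonical 8-4-4-4-12 hex form into 16 bytes.
/// Returns None on wrong length, misplaced hyphens, or non-hex digits.
fn parse_uuid(s: &str) -> Option<[u8; 16]> {
    let bytes = s.as_bytes();
    if bytes.len() != 36 {
        return None;
    }
    let mut out = [0u8; 16];
    let mut nibbles = 0usize;
    for (i, &b) in bytes.iter().enumerate() {
        // Hyphens are only valid at these four positions.
        if matches!(i, 8 | 13 | 18 | 23) {
            if b != b'-' {
                return None;
            }
            continue;
        }
        let v = (b as char).to_digit(16)? as u8;
        out[nibbles / 2] = (out[nibbles / 2] << 4) | v;
        nibbles += 1;
    }
    Some(out)
}

/// Format 16 bytes back into the lowercase 8-4-4-4-12 form.
fn format_uuid(v: &[u8; 16]) -> String {
    let mut s = String::with_capacity(36);
    for (i, b) in v.iter().enumerate() {
        if matches!(i, 4 | 6 | 8 | 10) {
            s.push('-');
        }
        s.push_str(&format!("{b:02x}"));
    }
    s
}
```

The two functions round-trip lowercase input; uppercase hex is accepted when parsing but is written back in lowercase.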

Related to this is the desire to customize how a cast between two non-extension types happens (e.g., for a Spark-compatible cast, #11201).

Describe the solution you'd like

I would personally like to resolve this using an extension type registry of some kind, where we add members to the trait proposed in #20312 for things like can_cast_explicit() and cast_to_explicit(). We could do something simpler (e.g., resolve a ScalarUDF), but I don't think this would scale to some of the other types of casts that happen internally (e.g., implicit casts as part of function argument inputs, #20748). It would also be nice to keep this as a "cast" so that physical optimizations that special-case the cast keep working.
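
To make the registry idea a bit more tangible, here is a rough std-only sketch. Only the method names `can_cast_explicit` and `cast_to_explicit` come from the text above; the trait shape, their signatures, `StorageType`, and `ExtensionTypeRegistry` are all assumptions for illustration and are not the trait proposed in #20312. The cast "result" is just a descriptive string standing in for whatever a real implementation would return (presumably a physical expression).

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow storage type.
#[derive(Debug, Clone, PartialEq)]
enum StorageType {
    Utf8,
    FixedSizeBinary(usize),
}

/// Hypothetical extension type trait (signatures are assumptions).
trait ExtensionType: Send + Sync {
    fn name(&self) -> &'static str;
    /// Whether an explicit cast from this type to `to` is allowed.
    fn can_cast_explicit(&self, to: &StorageType) -> bool;
    /// Describe the cast; a real version would build an expression.
    fn cast_to_explicit(&self, to: &StorageType) -> Result<String, String>;
}

struct Uuid;

impl ExtensionType for Uuid {
    fn name(&self) -> &'static str {
        "arrow.uuid"
    }
    fn can_cast_explicit(&self, to: &StorageType) -> bool {
        matches!(to, StorageType::Utf8 | StorageType::FixedSizeBinary(16))
    }
    fn cast_to_explicit(&self, to: &StorageType) -> Result<String, String> {
        if self.can_cast_explicit(to) {
            Ok(format!("cast({} -> {:?})", self.name(), to))
        } else {
            Err(format!("Cast from {} to {:?} is not supported", self.name(), to))
        }
    }
}

/// Hypothetical registry keyed by extension name.
#[derive(Default)]
struct ExtensionTypeRegistry {
    types: HashMap<&'static str, Arc<dyn ExtensionType>>,
}

impl ExtensionTypeRegistry {
    fn register(&mut self, ty: Arc<dyn ExtensionType>) {
        self.types.insert(ty.name(), ty);
    }
    fn get(&self, name: &str) -> Option<&Arc<dyn ExtensionType>> {
        self.types.get(name)
    }
}
```

In this shape the planner would look up the extension name found in the field metadata and ask the registered type whether (and how) the cast can be performed.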

The extension registry approach won't handle anything about non-extension types. We could separate the concept of resolving a cast from the extension name (e.g., CastResolver::resolve_cast(from: &Field, to: &Field, options: _), where the CastResolver implementation is responsible for peeking at the extension names or not).
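
A sketch of what the decoupled resolver could look like, again std-only. `Field`, `DataType`, `ResolvedCast`, `UuidCastResolver`, and `plan_cast` are hypothetical stand-ins (not the arrow or DataFusion types), the `options` argument is omitted for brevity, and the kernel names are placeholders. The one detail taken from Arrow itself is the `ARROW:extension:name` metadata key.

```rust
use std::collections::HashMap;

const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical stand-in for arrow's DataType.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int64,
    FixedSizeBinary(usize),
}

/// Hypothetical stand-in for arrow's Field (type plus metadata).
#[derive(Debug, Clone)]
struct Field {
    data_type: DataType,
    metadata: HashMap<String, String>,
}

impl Field {
    fn new(data_type: DataType) -> Self {
        Self { data_type, metadata: HashMap::new() }
    }
    fn with_extension(mut self, name: &str) -> Self {
        self.metadata.insert(EXTENSION_NAME_KEY.to_string(), name.to_string());
        self
    }
    fn extension_name(&self) -> Option<&str> {
        self.metadata.get(EXTENSION_NAME_KEY).map(String::as_str)
    }
}

#[derive(Debug, PartialEq)]
enum ResolvedCast {
    /// Fall back to the built-in cast on the storage types.
    Default,
    /// Use a custom kernel, identified here only by a placeholder name.
    Custom(&'static str),
}

trait CastResolver {
    /// Ok(None) means this resolver has no opinion about the pair.
    fn resolve_cast(&self, from: &Field, to: &Field) -> Result<Option<ResolvedCast>, String>;
}

struct UuidCastResolver;

impl CastResolver for UuidCastResolver {
    fn resolve_cast(&self, from: &Field, to: &Field) -> Result<Option<ResolvedCast>, String> {
        // The resolver itself decides whether to look at extension names.
        match (from.extension_name(), &from.data_type, to.extension_name(), &to.data_type) {
            (None, DataType::Utf8, Some("arrow.uuid"), DataType::FixedSizeBinary(16)) => {
                Ok(Some(ResolvedCast::Custom("utf8_to_uuid")))
            }
            (Some("arrow.uuid"), DataType::FixedSizeBinary(16), None, DataType::Utf8) => {
                Ok(Some(ResolvedCast::Custom("uuid_to_utf8")))
            }
            _ => Ok(None),
        }
    }
}

/// Ask each resolver in turn; keep today's behaviour when none matches.
fn plan_cast(
    resolvers: &[Box<dyn CastResolver>],
    from: &Field,
    to: &Field,
) -> Result<ResolvedCast, String> {
    for resolver in resolvers {
        if let Some(resolved) = resolver.resolve_cast(from, to)? {
            return Ok(resolved);
        }
    }
    if from.metadata.is_empty() && to.metadata.is_empty() {
        Ok(ResolvedCast::Default)
    } else {
        Err(format!("Cast from {:?} to {:?} is not supported", from.data_type, to.data_type))
    }
}
```

Because the resolver receives whole fields, the same hook could also cover casts between two non-extension types (for example, a resolver that overrides a plain string-to-integer cast).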

I also may be missing some internal infrastructure for this that is already in place!

Describe alternatives you've considered

A workaround is to just use a scalar function to do casting and avoid internal casts and cast expressions. This is roughly what we do in SedonaDB (e.g., we don't use a Signature and do our own type matching/internal casting).

Additional context

For reference, DuckDB's implementation of a cast registry is here:

https://github.com/duckdb/duckdb/blob/dc11eadd8f0a7c600f0034810706605ebe10d5b9/src/include/duckdb/function/cast/default_casts.hpp

https://github.com/duckdb/duckdb/blob/dc11eadd8f0a7c600f0034810706605ebe10d5b9/src/function/cast/cast_function_set.cpp

My reading of this is that it's a series of default casts and a flat list of other cast overloads that are tried in reverse until one of them matches.
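
The lookup strategy described in that reading can be sketched as follows. This is an illustration of "defaults first, later registrations tried first", not a port of DuckDB's code: `LogicalType`, `CastFunctionSet`, `default_casts`, and `lenient_casts` are hypothetical names, and values are modelled as strings to keep the example std-only.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    Varchar,
    Integer,
    Uuid,
}

/// A cast kernel over a single value, modelled on strings for brevity.
type CastFn = fn(&str) -> Result<String, String>;

/// A binder returns a kernel if it handles the (from, to) pair.
type BindCast = fn(LogicalType, LogicalType) -> Option<CastFn>;

fn strict_parse(s: &str) -> Result<String, String> {
    s.parse::<i64>().map(|v| v.to_string()).map_err(|e| e.to_string())
}

fn lenient_parse(s: &str) -> Result<String, String> {
    // Same as the strict kernel, but tolerates surrounding whitespace.
    strict_parse(s.trim())
}

/// Built-in casts, always registered first.
fn default_casts(from: LogicalType, to: LogicalType) -> Option<CastFn> {
    match (from, to) {
        (LogicalType::Varchar, LogicalType::Integer) => Some(strict_parse as CastFn),
        _ => None,
    }
}

/// A user-supplied overload for the same pair.
fn lenient_casts(from: LogicalType, to: LogicalType) -> Option<CastFn> {
    match (from, to) {
        (LogicalType::Varchar, LogicalType::Integer) => Some(lenient_parse as CastFn),
        _ => None,
    }
}

struct CastFunctionSet {
    bind_functions: Vec<BindCast>,
}

impl CastFunctionSet {
    fn new() -> Self {
        Self { bind_functions: vec![default_casts as BindCast] }
    }

    fn register(&mut self, bind: BindCast) {
        self.bind_functions.push(bind);
    }

    /// Try binders in reverse registration order; the first match wins,
    /// so later registrations shadow the defaults.
    fn get_cast(&self, from: LogicalType, to: LogicalType) -> Option<CastFn> {
        self.bind_functions.iter().rev().find_map(|bind| bind(from, to))
    }
}
```

With only the defaults registered, a string with surrounding whitespace fails to cast to an integer; after `register(lenient_casts)` the same lookup returns the lenient kernel instead.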

Metadata

Labels: enhancement (New feature or request)