Description
Is your feature request related to a problem or challenge?
After #18136 we can represent casts to an extension type in the logical plan and Substrait, and after #20676 we will be able to represent casts to an extension type in SQL. While these can now be intercepted by an optimizer rule or logical plan modification, they will currently error if passed to the default planner.
datafusion/datafusion/physical-expr/src/planner.rs
Lines 291 to 309 in 678d1ad

    Expr::Cast(Cast { expr, field }) => {
        if !field.metadata().is_empty() {
            let (_, src_field) = expr.to_field(input_dfschema)?;
            return plan_err!(
                "Cast from {} to {} is not supported",
                format_type_and_metadata(
                    src_field.data_type(),
                    Some(src_field.metadata()),
                ),
                format_type_and_metadata(field.data_type(), Some(field.metadata()))
            );
        }
        expressions::cast(
            create_physical_expr(expr, input_dfschema, execution_props)?,
            input_schema,
            field.data_type().clone(),
        )
    }

At this point we have access to the ExecutionProps, which has a reference to the ConfigOptions. Given some ability to plug in casting behaviour, we could resolve this expression here and provide built-in support for things like casting UUID strings into a UUID value and vice versa.
Related to this is the desire to customize how a cast between two non-extension types happens (e.g., for a Spark compatible cast #11201).
Describe the solution you'd like
I would personally like to resolve this using an extension type registry of some kind, where we add members to the trait proposed in #20312 for things like can_cast_explicit() and cast_to_explicit(). We could do something simpler (e.g., resolve a ScalarUDF), but I don't think this would scale to some of the other types of casts that happen internally (e.g., implicit casts as part of function argument inputs, #20748). It would also be nice to keep this as a "cast" so that physical optimizations that special-case the cast keep working.
The extension registry approach won't handle anything about non-extension types. We could separate the concept of resolving a cast from the extension name (e.g., CastResolver::resolve_cast(from: &Field, to: &Field, options: _), where the CastResolver implementation is responsible for peeking at the extension names or not).
I also may be missing some internal infrastructure for this that is already in place!
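As a rough sketch of the CastResolver idea above (not a proposal for the final API), the resolver could look at both fields and either return a cast or decline so the default planner path runs. Everything here is hypothetical and simplified to the standard library: `SimpleField` stands in for an Arrow `Field`, `ResolvedCast` stands in for whatever physical cast expression would actually be returned, the `options` argument is omitted, and the "UUID value" is just a normalized hex string rather than a 16-byte binary. The only names taken from Arrow are the `ARROW:extension:name` metadata key and the `arrow.uuid` extension name.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow `Field`: storage type name plus metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleField {
    pub data_type: String,
    pub metadata: HashMap<String, String>,
}

impl SimpleField {
    pub fn new(data_type: &str) -> Self {
        Self { data_type: data_type.to_string(), metadata: HashMap::new() }
    }

    /// Tag the field with an extension name using Arrow's metadata key.
    pub fn with_extension(mut self, name: &str) -> Self {
        self.metadata
            .insert("ARROW:extension:name".to_string(), name.to_string());
        self
    }

    pub fn extension_name(&self) -> Option<&str> {
        self.metadata.get("ARROW:extension:name").map(String::as_str)
    }
}

/// Hypothetical resolved cast: converts one string-encoded value.
pub type ResolvedCast = fn(&str) -> Result<String, String>;

/// Hypothetical trait: the implementation decides whether to look at
/// extension names, storage types, or both.
pub trait CastResolver {
    /// Returns `None` to fall through to the default cast behaviour.
    fn resolve_cast(&self, from: &SimpleField, to: &SimpleField) -> Option<ResolvedCast>;
}

/// Normalizes a UUID string to 32 lowercase hex digits (illustrative only).
fn parse_uuid(s: &str) -> Result<String, String> {
    let hex: String = s.chars().filter(|c| *c != '-').collect();
    if hex.len() == 32 && hex.chars().all(|c| c.is_ascii_hexdigit()) {
        Ok(hex.to_ascii_lowercase())
    } else {
        Err(format!("invalid UUID string: {s}"))
    }
}

pub struct UuidCastResolver;

impl CastResolver for UuidCastResolver {
    fn resolve_cast(&self, from: &SimpleField, to: &SimpleField) -> Option<ResolvedCast> {
        match (from.extension_name(), to.extension_name()) {
            // Plain Utf8 -> arrow.uuid is the only cast this resolver handles.
            (None, Some("arrow.uuid")) if from.data_type == "Utf8" => {
                Some(parse_uuid as ResolvedCast)
            }
            _ => None,
        }
    }
}
```

A resolver that handles casts between two non-extension types (e.g., the Spark-compatible case) would have the same shape and simply ignore the extension names.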
Describe alternatives you've considered
A workaround is to just use a scalar function to do casting and avoid internal casts and cast expressions. This is roughly what we do in SedonaDB (e.g., we don't use a Signature and do our own type matching/internal casting).
Additional context
For reference, DuckDB's implementation of a cast registry is here:
My reading of this is that it's a series of default casts and a flat list of other cast overloads that are tried in reverse until one of them matches.
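To make that reading concrete, here is a minimal sketch of a flat overload list searched in reverse, so that the most recently registered cast wins and the defaults (registered first) are tried last. This is based only on the description above, not on DuckDB's actual code; `CastRegistry`, `CastOverload`, and the string-based type names and values are all hypothetical.

```rust
/// Hypothetical cast over string-encoded values; `None` means the cast failed.
type CastImpl = fn(&str) -> Option<String>;

/// One (from, to) overload in the flat list.
struct CastOverload {
    from: &'static str,
    to: &'static str,
    cast: CastImpl,
}

/// Hypothetical registry: defaults are registered first, user casts later.
pub struct CastRegistry {
    overloads: Vec<CastOverload>,
}

impl CastRegistry {
    pub fn new() -> Self {
        Self { overloads: Vec::new() }
    }

    /// Appends an overload; later registrations shadow earlier ones.
    pub fn register(&mut self, from: &'static str, to: &'static str, cast: CastImpl) {
        self.overloads.push(CastOverload { from, to, cast });
    }

    /// Walks the list in reverse and returns the first matching overload.
    pub fn resolve(&self, from: &str, to: &str) -> Option<CastImpl> {
        self.overloads
            .iter()
            .rev()
            .find(|o| o.from == from && o.to == to)
            .map(|o| o.cast)
    }
}

/// Example default cast: rejects surrounding whitespace.
pub fn strict_str_to_int(s: &str) -> Option<String> {
    s.parse::<i32>().ok().map(|v| v.to_string())
}

/// Example override: trims whitespace before parsing.
pub fn lenient_str_to_int(s: &str) -> Option<String> {
    s.trim().parse::<i32>().ok().map(|v| v.to_string())
}
```

Registering `strict_str_to_int` and then `lenient_str_to_int` for the same `("Utf8", "Int32")` pair makes `resolve` return the lenient one, which is the "tried in reverse" behaviour described above.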