-
Notifications
You must be signed in to change notification settings - Fork 2k
Description
Is your feature request related to a problem or challenge?
(In a recent community sync I offered to review cast usage with an eye towards supporting customizable rules a la Spark and/or extension type supported casting)
After #18136 we can represent casts to an extension type in the logical plan and Substrait, and after #20676 we will be able to represent casts to an extension type in SQL. While these can now be intercepted by an optimizer rule or logical plan modification, they will currently error if passed to the default planner. In both of these contexts we have ConfigOptions and/or a session, so custom cast rules (e.g., #11201 ) or extension type-aware casts (e.g., discussed in #20312 ) could be looked up or applied.
While there are many places that create logical or physical casts whose execution is handled during the normal sequence of expression evaluation, there are a few high traffic places that depend on casting outside execution:
Signatures that "coerce" (i.e. implicitly cast with special rules) values before they are passed to UDFsColumnarValue::cast_to()andScalarValue::cast_to(), used within many scalar functions (some of them could probably use the signature and possibly predate some recent improvements)- Raw calls to
arrow::compute::cast()
Casting implementations are mostly funnelled through arrow::compute::cast(); however, there are a number of custom tweaks and/or certain types of casts that affect eligibility for certain optimizations:
try_cast_literal_to_type(): https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/casts.rs- The physical cast expression has a list of types:
datafusion/datafusion/physical-expr/src/expressions/cast.rs
Lines 118 to 144 in 678d1ad
/// Check if casting from the specified source type to the target type is a /// widening cast (e.g. from `Int8` to `Int16`). pub fn check_bigger_cast(cast_type: &DataType, src: &DataType) -> bool { if cast_type.eq(src) { return true; } matches!( (src, cast_type), (Int8, Int16 | Int32 | Int64) | (Int16, Int32 | Int64) | (Int32, Int64) | (UInt8, UInt16 | UInt32 | UInt64) | (UInt16, UInt32 | UInt64) | (UInt32, UInt64) | ( Int8 | Int16 | Int32 | UInt8 | UInt16 | UInt32, Float32 | Float64 ) | (Int64 | UInt64, Float64) | (Utf8, LargeUtf8) ) } /// Check if the cast is a widening cast (e.g. from `Int8` to `Int16`). pub fn is_bigger_cast(&self, src: &DataType) -> bool { Self::check_bigger_cast(&self.cast_type, src) } ScalarValue::cast_to()has accumulated a few modifications on top of the arrow cast kernel:datafusion/datafusion/common/src/scalar/mod.rs
Lines 3888 to 3920 in 678d1ad
/// Try to cast this value to a ScalarValue of type `data_type` with [`CastOptions`] pub fn cast_to_with_options( &self, target_type: &DataType, cast_options: &CastOptions<'static>, ) -> Result<Self> { let source_type = self.data_type(); if let Some(multiplier) = date_to_timestamp_multiplier(&source_type, target_type) && let Some(value) = self.date_scalar_value_as_i64() { ensure_timestamp_in_bounds(value, multiplier, &source_type, target_type)?; } let scalar_array = self.to_array()?; // For struct types, use name-based casting logic that matches fields by name // and recursively casts nested structs. The field name wrapper is arbitrary // since cast_column only uses the DataType::Struct field definitions inside. let cast_arr = match target_type { DataType::Struct(_) => { // Field name is unused; only the struct's inner field names matter let target_field = Field::new("_", target_type.clone(), true); crate::nested_struct::cast_column( &scalar_array, &target_field, cast_options, )? } _ => cast_with_options(&scalar_array, target_type, cast_options)?, }; ScalarValue::try_from_array(&cast_arr, 0) } ColumnarValue::cast_to()has the same modifications except with an independent implementation (?) of timestamp bounds ()datafusion/datafusion/expr-common/src/columnar_value.rs
Lines 277 to 304 in 678d1ad
/// Cast this [ColumnarValue] to the specified `DataType` /// /// # Struct Casting Behavior /// /// When casting struct types, fields are matched **by name** rather than position: /// - Source fields are matched to target fields using case-sensitive name comparison /// - Fields are reordered to match the target schema /// - Missing target fields are filled with null arrays /// - Extra source fields are ignored /// /// For non-struct types, uses Arrow's standard positional casting. pub fn cast_to( &self, cast_type: &DataType, cast_options: Option<&CastOptions<'static>>, ) -> Result<ColumnarValue> { let cast_options = cast_options.cloned().unwrap_or(DEFAULT_CAST_OPTIONS); match self { ColumnarValue::Array(array) => { let casted = cast_array_by_name(array, cast_type, &cast_options)?; Ok(ColumnarValue::Array(casted)) } ColumnarValue::Scalar(scalar) => Ok(ColumnarValue::Scalar( scalar.cast_to_with_options(cast_type, &cast_options)?, )), } } } cast_struct_column(), which is used for the last two plus thePhysicalExprAdapterfor name-based struct casting ()datafusion/datafusion/common/src/nested_struct.rs
Lines 26 to 53 in 678d1ad
/// Cast a struct column to match target struct fields, handling nested structs recursively. /// /// This function implements struct-to-struct casting with the assumption that **structs should /// always be allowed to cast to other structs**. However, the source column must already be /// a struct type - non-struct sources will result in an error. /// /// ## Field Matching Strategy /// - **By Name**: Source struct fields are matched to target fields by name (case-sensitive) /// - **No Positional Mapping**: Structs with no overlapping field names are rejected /// - **Type Adaptation**: When a matching field is found, it is recursively cast to the target field's type /// - **Missing Fields**: Target fields not present in the source are filled with null values /// - **Extra Fields**: Source fields not present in the target are ignored /// /// ## Nested Struct Handling /// - Nested structs are handled recursively using the same casting rules /// - Each level of nesting follows the same field matching and null-filling strategy /// - This allows for complex struct transformations while maintaining data integrity /// /// # Arguments /// * `source_col` - The source array to cast (must be a struct array) /// * `target_fields` - The target struct field definitions to cast to /// /// # Returns /// A `Result<ArrayRef>` containing the cast struct array /// /// # Errors /// Returns a `DataFusionError::Plan` if the source column is not a struct type fn cast_struct_column( - Intervals probably should have a custom cast to ensure that narrowing casts preserve containment although I didn't see that here:
datafusion/datafusion/expr-common/src/interval_arithmetic.rs
Lines 423 to 433 in 678d1ad
/// Casts this interval to `data_type` using `cast_options`. pub fn cast_to( &self, data_type: &DataType, cast_options: &CastOptions, ) -> Result<Self> { Self::try_new( cast_scalar_value(&self.lower, data_type, cast_options)?, cast_scalar_value(&self.upper, data_type, cast_options)?, ) } - "Type coercion", which is sort of like an "implicit" cast (or a cast with a cost of 0 in DuckDB terms): https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr/src/type_coercion/functions.rs /
datafusion/datafusion/expr-common/src/signature.rs
Lines 1037 to 1052 in 678d1ad
#[derive(Debug, Clone, Eq, PartialOrd)] pub enum Coercion { /// Coercion that only accepts arguments exactly matching the desired type. Exact { /// The required type for the argument desired_type: TypeSignatureClass, }, /// Coercion that accepts the desired type and can implicitly coerce from other types. Implicit { /// The primary desired type for the argument desired_type: TypeSignatureClass, /// Rules for implicit coercion from other types implicit_coercion: ImplicitCoercion, }, } - "default cast for" in the LogicalType:
datafusion/datafusion/common/src/types/native.rs
Lines 275 to 281 in 678d1ad
/// Returns the default casted type for the given arrow type /// /// For types like String or Date, multiple arrow types mapped to the same logical type /// If the given arrow type is one of them, we return the same type /// Otherwise, we define the default casted type for the given arrow type fn default_cast_for(&self, origin: &DataType) -> Result<DataType> { use DataType::*;
Many of these have a short circuit along the lines of:
// If types are already equal, no cast needed
if array.data_type() == cast_type {
return Ok(Arc::clone(array));
}...which is problematic in the context of extension types as we don't have anything that can actually check whether two extension types are "equal" (equality of field refs is too strict; storage type equality is too lax) until we have an extension type registry.
Describe the solution you'd like
I would personally love some consolidation around casting so that a wider range of internal casts can consider extension metadata. A motivating use case might be to support the new datetime extension type as many of the departures from arrow::compute::cast() are related to datetime handling.
A good first step might be to create a drop-in replacement to arrow::compute::cast() in datafusion-common whose main purpose would be to track usage (and perhaps consolidate some of the specific datetime-related departures from arrow::compute::cast()).
A good next step might be to create a drop-in replacement that operates on Fields or FieldRefs (we kind of have this in datafusion-common already to handle special rules around casting structs).
After these are consolidated and we have a way to look up registered casts for extension types, we can perhaps add a variant that pipes through a ConfigOptions or session reference (which would let those internal casts behave identically to cast expressions).
Describe alternatives you've considered
No response
Additional context
No response