
Consolidate non-expression cast behaviour #20748

@paleolimbot

Description


Is your feature request related to a problem or challenge?

(In a recent community sync I offered to review cast usage with an eye towards supporting customizable rules a la Spark and/or extension-type-supported casting)

After #18136 we can represent casts to an extension type in the logical plan and Substrait, and after #20676 we will be able to represent casts to an extension type in SQL. While these can now be intercepted by an optimizer rule or logical plan modification, they will currently error if passed to the default planner. In both of these contexts we have ConfigOptions and/or a session, so custom cast rules (e.g., #11201) or extension-type-aware casts (e.g., discussed in #20312) could be looked up or applied.

While there are many places that create logical or physical casts whose execution is handled during the normal sequence of expression evaluation, there are a few high-traffic places that depend on casting outside execution:

  • Signatures that "coerce" (i.e. implicitly cast with special rules) values before they are passed to UDFs
  • ColumnarValue::cast_to() and ScalarValue::cast_to(), used within many scalar functions (some of them could probably use the signature and possibly predate some recent improvements)
  • Raw calls to arrow::compute::cast()

Casting implementations are mostly funnelled through arrow::compute::cast(); however, a number of custom tweaks have accumulated on top of it, and certain kinds of casts affect eligibility for certain optimizations:

  • try_cast_literal_to_type(): https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/casts.rs
  • The physical cast expression has a hard-coded list of widening casts:

    ```rust
    /// Check if casting from the specified source type to the target type is a
    /// widening cast (e.g. from `Int8` to `Int16`).
    pub fn check_bigger_cast(cast_type: &DataType, src: &DataType) -> bool {
        if cast_type.eq(src) {
            return true;
        }
        matches!(
            (src, cast_type),
            (Int8, Int16 | Int32 | Int64)
                | (Int16, Int32 | Int64)
                | (Int32, Int64)
                | (UInt8, UInt16 | UInt32 | UInt64)
                | (UInt16, UInt32 | UInt64)
                | (UInt32, UInt64)
                | (
                    Int8 | Int16 | Int32 | UInt8 | UInt16 | UInt32,
                    Float32 | Float64
                )
                | (Int64 | UInt64, Float64)
                | (Utf8, LargeUtf8)
        )
    }

    /// Check if the cast is a widening cast (e.g. from `Int8` to `Int16`).
    pub fn is_bigger_cast(&self, src: &DataType) -> bool {
        Self::check_bigger_cast(&self.cast_type, src)
    }
    ```
  • ScalarValue::cast_to() has accumulated a few modifications on top of the arrow cast kernel:

    ```rust
    /// Try to cast this value to a ScalarValue of type `data_type` with [`CastOptions`]
    pub fn cast_to_with_options(
        &self,
        target_type: &DataType,
        cast_options: &CastOptions<'static>,
    ) -> Result<Self> {
        let source_type = self.data_type();
        if let Some(multiplier) = date_to_timestamp_multiplier(&source_type, target_type)
            && let Some(value) = self.date_scalar_value_as_i64()
        {
            ensure_timestamp_in_bounds(value, multiplier, &source_type, target_type)?;
        }

        let scalar_array = self.to_array()?;

        // For struct types, use name-based casting logic that matches fields by name
        // and recursively casts nested structs. The field name wrapper is arbitrary
        // since cast_column only uses the DataType::Struct field definitions inside.
        let cast_arr = match target_type {
            DataType::Struct(_) => {
                // Field name is unused; only the struct's inner field names matter
                let target_field = Field::new("_", target_type.clone(), true);
                crate::nested_struct::cast_column(&scalar_array, &target_field, cast_options)?
            }
            _ => cast_with_options(&scalar_array, target_type, cast_options)?,
        };

        ScalarValue::try_from_array(&cast_arr, 0)
    }
    ```
  • ColumnarValue::cast_to() applies the same modifications, but with what appears to be an independent implementation of the timestamp bounds check:

    ```rust
    /// Cast this [ColumnarValue] to the specified `DataType`
    ///
    /// # Struct Casting Behavior
    ///
    /// When casting struct types, fields are matched **by name** rather than position:
    /// - Source fields are matched to target fields using case-sensitive name comparison
    /// - Fields are reordered to match the target schema
    /// - Missing target fields are filled with null arrays
    /// - Extra source fields are ignored
    ///
    /// For non-struct types, uses Arrow's standard positional casting.
    pub fn cast_to(
        &self,
        cast_type: &DataType,
        cast_options: Option<&CastOptions<'static>>,
    ) -> Result<ColumnarValue> {
        let cast_options = cast_options.cloned().unwrap_or(DEFAULT_CAST_OPTIONS);
        match self {
            ColumnarValue::Array(array) => {
                let casted = cast_array_by_name(array, cast_type, &cast_options)?;
                Ok(ColumnarValue::Array(casted))
            }
            ColumnarValue::Scalar(scalar) => Ok(ColumnarValue::Scalar(
                scalar.cast_to_with_options(cast_type, &cast_options)?,
            )),
        }
    }
    ```
  • cast_struct_column(), which backs the previous two items plus the PhysicalExprAdapter's name-based struct casting:

    ```rust
    /// Cast a struct column to match target struct fields, handling nested structs recursively.
    ///
    /// This function implements struct-to-struct casting with the assumption that **structs should
    /// always be allowed to cast to other structs**. However, the source column must already be
    /// a struct type - non-struct sources will result in an error.
    ///
    /// ## Field Matching Strategy
    /// - **By Name**: Source struct fields are matched to target fields by name (case-sensitive)
    /// - **No Positional Mapping**: Structs with no overlapping field names are rejected
    /// - **Type Adaptation**: When a matching field is found, it is recursively cast to the target field's type
    /// - **Missing Fields**: Target fields not present in the source are filled with null values
    /// - **Extra Fields**: Source fields not present in the target are ignored
    ///
    /// ## Nested Struct Handling
    /// - Nested structs are handled recursively using the same casting rules
    /// - Each level of nesting follows the same field matching and null-filling strategy
    /// - This allows for complex struct transformations while maintaining data integrity
    ///
    /// # Arguments
    /// * `source_col` - The source array to cast (must be a struct array)
    /// * `target_fields` - The target struct field definitions to cast to
    ///
    /// # Returns
    /// A `Result<ArrayRef>` containing the cast struct array
    ///
    /// # Errors
    /// Returns a `DataFusionError::Plan` if the source column is not a struct type
    fn cast_struct_column(
    ```
  • Intervals probably should have a custom cast to ensure that narrowing casts preserve containment, although I didn't see that here:

    ```rust
    /// Casts this interval to `data_type` using `cast_options`.
    pub fn cast_to(
        &self,
        data_type: &DataType,
        cast_options: &CastOptions,
    ) -> Result<Self> {
        Self::try_new(
            cast_scalar_value(&self.lower, data_type, cast_options)?,
            cast_scalar_value(&self.upper, data_type, cast_options)?,
        )
    }
    ```
  • "Type coercion", which is sort of like an "implicit" cast (or a cast with a cost of 0 in DuckDB terms): https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr/src/type_coercion/functions.rs

    ```rust
    #[derive(Debug, Clone, Eq, PartialOrd)]
    pub enum Coercion {
        /// Coercion that only accepts arguments exactly matching the desired type.
        Exact {
            /// The required type for the argument
            desired_type: TypeSignatureClass,
        },
        /// Coercion that accepts the desired type and can implicitly coerce from other types.
        Implicit {
            /// The primary desired type for the argument
            desired_type: TypeSignatureClass,
            /// Rules for implicit coercion from other types
            implicit_coercion: ImplicitCoercion,
        },
    }
    ```
  • default_cast_for() on LogicalType:

    ```rust
    /// Returns the default casted type for the given arrow type
    ///
    /// For types like String or Date, multiple arrow types mapped to the same logical type
    /// If the given arrow type is one of them, we return the same type
    /// Otherwise, we define the default casted type for the given arrow type
    fn default_cast_for(&self, origin: &DataType) -> Result<DataType> {
        use DataType::*;
    ```

Many of these have a short circuit along the lines of:

```rust
// If types are already equal, no cast needed
if array.data_type() == cast_type {
    return Ok(Arc::clone(array));
}
```

...which is problematic in the context of extension types: until we have an extension type registry, nothing can actually check whether two extension types are "equal" (equality of field refs is too strict; storage type equality is too lax).
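The "too strict / too lax" problem can be illustrated with minimal stand-in types (these are not the real arrow-rs `Field`/`DataType`; the metadata key follows the Arrow extension convention):

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    FixedSizeBinary(i32),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    /// Extension metadata, e.g. ("ARROW:extension:name", "arrow.uuid")
    pub metadata: Vec<(String, String)>,
}

/// Too strict: full field equality also compares the field name,
/// so two fields carrying the same extension type can compare unequal.
pub fn fields_equal(a: &Field, b: &Field) -> bool {
    a == b
}

/// Too lax: storage-type equality cannot see extension metadata at all.
pub fn storage_equal(a: &Field, b: &Field) -> bool {
    a.data_type == b.data_type
}

fn main() {
    let uuid = |name: &str| Field {
        name: name.into(),
        data_type: DataType::FixedSizeBinary(16),
        metadata: vec![("ARROW:extension:name".into(), "arrow.uuid".into())],
    };
    let raw = Field { metadata: vec![], ..uuid("a") };

    // Same extension type under different field names: short circuit misses it.
    assert!(!fields_equal(&uuid("a"), &uuid("b")));
    // UUID vs. plain 16-byte binary: short circuit wrongly fires.
    assert!(storage_equal(&uuid("a"), &raw));
    println!("ok");
}
```

A registry-backed comparison would instead resolve both fields' extension names and consult the registered type's own equality rule.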

Describe the solution you'd like

I would personally love some consolidation around casting so that a wider range of internal casts can consider extension metadata. A motivating use case might be to support the new datetime extension type as many of the departures from arrow::compute::cast() are related to datetime handling.

A good first step might be to create a drop-in replacement to arrow::compute::cast() in datafusion-common whose main purpose would be to track usage (and perhaps consolidate some of the specific datetime-related departures from arrow::compute::cast()).
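A rough sketch of what that funnel could look like, using stand-in types and a hypothetical `df_cast` name (the bounds check mirrors the Date32-to-Timestamp guard quoted above; the real function would operate on arrays and delegate to `arrow::compute::cast()`):

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Date32,
    TimestampNanosecond,
}

/// Stand-in for the arrow cast kernel the funnel would delegate to.
fn kernel_cast(value: i64, _from: &DataType, _to: &DataType) -> i64 {
    value
}

/// Hypothetical drop-in replacement for `arrow::compute::cast()`: every
/// internal cast goes through one place, so the datetime-specific
/// departures can be consolidated (and usage can be tracked) here.
pub fn df_cast(value: i64, from: &DataType, to: &DataType) -> Result<i64, String> {
    // Consolidated datetime departure: reject Date32 -> Timestamp(ns)
    // casts whose nanosecond value would overflow i64.
    if *from == DataType::Date32 && *to == DataType::TimestampNanosecond {
        const NANOS_PER_DAY: i64 = 86_400_000_000_000;
        value
            .checked_mul(NANOS_PER_DAY)
            .ok_or_else(|| format!("Date32 value {value} overflows Timestamp(ns)"))?;
    }
    Ok(kernel_cast(value, from, to))
}

fn main() {
    assert!(df_cast(20_000, &DataType::Date32, &DataType::TimestampNanosecond).is_ok());
    assert!(df_cast(i64::MAX, &DataType::Date32, &DataType::TimestampNanosecond).is_err());
    println!("ok");
}
```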

A good next step might be to create a drop-in replacement that operates on Fields or FieldRefs (we kind of have this in datafusion-common already to handle special rules around casting structs).

After these are consolidated and we have a way to look up registered casts for extension types, we can perhaps add a variant that pipes through a ConfigOptions or session reference (which would let those internal casts behave identically to cast expressions).
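One possible shape for that lookup, sketched with illustrative names (`ExtensionCastRegistry`, string extension names, and i64 stand-ins for arrays; nothing here is an existing DataFusion API):

```rust
use std::collections::HashMap;

pub type CastFn = fn(i64) -> Result<i64, String>;

/// Hypothetical session-scoped registry mapping (source, target)
/// extension names to user-registered cast rules.
pub struct ExtensionCastRegistry {
    rules: HashMap<(String, String), CastFn>,
}

impl ExtensionCastRegistry {
    pub fn new() -> Self {
        Self { rules: HashMap::new() }
    }

    pub fn register(&mut self, from: &str, to: &str, f: CastFn) {
        self.rules.insert((from.to_string(), to.to_string()), f);
    }

    /// Internal casts would consult the registry first and fall back to
    /// the default kernel path (stand-in: identity) when no rule matches.
    pub fn cast(&self, from: &str, to: &str, value: i64) -> Result<i64, String> {
        match self.rules.get(&(from.to_string(), to.to_string())) {
            Some(f) => f(value),
            None => Ok(value),
        }
    }
}

fn main() {
    let mut reg = ExtensionCastRegistry::new();
    // e.g. a user-registered rule for a hypothetical datetime extension:
    reg.register("my.datetime", "arrow.timestamp", |v| Ok(v * 1_000));
    assert_eq!(reg.cast("my.datetime", "arrow.timestamp", 5), Ok(5_000));
    // No rule registered: takes the default path.
    assert_eq!(reg.cast("other", "other2", 7), Ok(7));
    println!("ok");
}
```

Threading a ConfigOptions or session reference through to such a registry is what would let internal casts behave identically to planned cast expressions.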

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)