Improve type coercion and casting #8302

@jayzhan211

Is your feature request related to a problem or challenge?

I think there is room for improvement in type coercion and casting.

Background

comparison_coercion is widely used in DataFusion. It is a lossless conversion.
https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/binary.rs

can_coerce_from is used mainly for function signatures. It is also a lossless conversion.
https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/functions.rs

can_cast_types comes from arrow-cast and is a lossy conversion. It is also used in some of the comparison_coercion building blocks. https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76-L273

I'm not sure whether there are other coercions I missed.

Proposal

comparison_coercion and can_coerce_from seem to do similar things, so maybe we can have just one lossless conversion. If a lossless conversion is useful for arrow-rs, we can introduce a lossless version of can_cast_types there and then rely on it in DataFusion.

Lossy conversion vs Lossless

I think a conversion is lossy if the original value cannot be recovered by casting back; otherwise it is lossless.

Lossy

  • Int32 to Int16 / Int8

Lossless

  • Int32 to Int64
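The round-trip definition above can be sketched with plain Rust integer casts. This is only an illustration: the helper names are hypothetical, and Rust's `as` wraps on overflow, whereas Arrow's cast kernels may handle overflow differently (for example by producing nulls or errors).

```rust
/// Hypothetical helper: true if `v` survives Int32 -> Int16 -> Int32.
fn round_trips_via_i16(v: i32) -> bool {
    // `as` truncates to the low 16 bits when the value does not fit.
    (v as i16) as i32 == v
}

/// Hypothetical helper: true if `v` survives Int32 -> Int64 -> Int32.
fn round_trips_via_i64(v: i32) -> bool {
    // Widening keeps every i32 value, so casting back restores it.
    (v as i64) as i32 == v
}

fn main() {
    // 70_000 is outside the i16 range, so the narrowing cast loses it.
    println!("via i16: {}", round_trips_via_i16(70_000));
    // The same value is preserved when widening to i64.
    println!("via i64: {}", round_trips_via_i64(70_000));
}
```

Under this definition, Int32 to Int16 is lossy because at least one value fails the round trip, while Int32 to Int64 is lossless because no value does.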

Describe the solution you'd like

  1. Replace can_coerce_from with the building blocks of comparison_coercion: numeric coercion, list coercion, string coercion, null coercion, etc.
  2. Split list_coercion out of string_coercion so that each building block has a single clear task: list_coercion handles List / FixedSizeList / LargeList coercion, and string_coercion handles Utf8 / LargeUtf8 coercion.
  3. Introduce these lossless coercions to arrow-rs?
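A rough sketch of what separate building blocks could look like, chained into a single lossless coercion. Everything here is an assumption for illustration: the `Ty` enum is a tiny stand-in for Arrow's `DataType`, and the function bodies are simplified and do not reproduce the actual DataFusion rules.

```rust
// Hypothetical, minimal stand-in for arrow's DataType.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int16,
    Int32,
    Int64,
    Utf8,
    LargeUtf8,
    List(Box<Ty>),
    LargeList(Box<Ty>),
}

fn int_width(t: &Ty) -> Option<u8> {
    match t {
        Ty::Int16 => Some(16),
        Ty::Int32 => Some(32),
        Ty::Int64 => Some(64),
        _ => None,
    }
}

/// Numeric block: pick the wider of two signed integer types.
fn numeric_coercion(l: &Ty, r: &Ty) -> Option<Ty> {
    let (lw, rw) = (int_width(l)?, int_width(r)?);
    Some(if lw >= rw { l.clone() } else { r.clone() })
}

/// String block: handles only Utf8 / LargeUtf8.
fn string_coercion(l: &Ty, r: &Ty) -> Option<Ty> {
    use Ty::*;
    match (l, r) {
        (Utf8, Utf8) => Some(Utf8),
        (Utf8, LargeUtf8) | (LargeUtf8, Utf8) | (LargeUtf8, LargeUtf8) => Some(LargeUtf8),
        _ => None,
    }
}

/// List block: handles only list variants and coerces element types recursively.
fn list_coercion(l: &Ty, r: &Ty) -> Option<Ty> {
    use Ty::*;
    match (l, r) {
        (List(a), List(b)) => Some(List(Box::new(coerce(a, b)?))),
        (List(a), LargeList(b)) | (LargeList(a), List(b)) | (LargeList(a), LargeList(b)) => {
            Some(LargeList(Box::new(coerce(a, b)?)))
        }
        _ => None,
    }
}

/// Single entry point: try each block in turn, return None if none applies.
fn coerce(l: &Ty, r: &Ty) -> Option<Ty> {
    if l == r {
        return Some(l.clone());
    }
    numeric_coercion(l, r)
        .or_else(|| string_coercion(l, r))
        .or_else(|| list_coercion(l, r))
}

fn main() {
    let l = Ty::List(Box::new(Ty::Int32));
    let r = Ty::LargeList(Box::new(Ty::Int64));
    println!("{:?}", coerce(&l, &r));
    println!("{:?}", coerce(&Ty::Utf8, &Ty::Int32));
}
```

With the blocks kept separate, each one can be tested and reused on its own, and a signature check in the style of can_coerce_from could in principle be expressed as "does `coerce(from, to)` return `to`".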

Known issues and questions I have

  • Introduce list_coercion, which currently exists inside string_concat_coercion
  • There is no list coercion in can_coerce_from
  • Decimal128 can be cast to Float64 in can_coerce_from. Why?
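To illustrate why the Decimal128 to Float64 case looks suspicious under the round-trip definition, here is a small sketch. It assumes only that a Decimal128 value is stored as an unscaled i128 and ignores scale entirely; the helper name is hypothetical and this is not the Arrow cast kernel.

```rust
/// Hypothetical helper: true if an unscaled decimal value survives
/// i128 -> f64 -> i128 unchanged.
fn round_trips_via_f64(unscaled: i128) -> bool {
    (unscaled as f64) as i128 == unscaled
}

fn main() {
    // f64 has a 53-bit significand, so 2^53 + 1 cannot be represented exactly.
    let big: i128 = (1i128 << 53) + 1;
    println!("2^53 + 1 round-trips: {}", round_trips_via_f64(big));
    // Small values fit comfortably and are preserved.
    println!("12345 round-trips: {}", round_trips_via_f64(12_345));
}
```

Because some Decimal128 values fail this round trip, the conversion would count as lossy by the definition used in this issue.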

Describe alternatives you've considered

If many custom conversions are needed, this change might not help at all, and we would need another approach to make type casting / coercion easy to use.



Labels: enhancement