Prototype implementing DataFusion functions / operators using arrow-udf library #11413

@alamb

Is your feature request related to a problem or challenge?

Related to the discussion on #11192 with @Xuanwo

RisingWave has a library that automatically generates vectorized implementations of functions (i.e. ones that operate on Arrow arrays) from scalar implementations.

The library is here: https://github.com/risingwavelabs/arrow-udf

A blog post describing it is here: https://risingwave.com/blog/simplifying-sql-function-implementation-with-rust-procedural-macro/
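To make the idea concrete, here is a minimal, dependency-free sketch of what "generating a vectorized implementation from a scalar one" means. It does not use arrow-udf or Arrow: `map_binary` and `gcd_batch` are hypothetical names, and a slice of `Option<T>` stands in for a nullable Arrow array. The author writes only `gcd`; the wrapper is roughly the kind of code a macro could generate.

```rust
/// Scalar implementation: the only part a function author writes by hand.
fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Hypothetical helper (not part of arrow-udf): applies a two-argument
/// scalar function row by row over two nullable columns. A row is null in
/// the output if either input is null.
fn map_binary<A: Copy, B: Copy, R>(
    left: &[Option<A>],
    right: &[Option<B>],
    f: impl Fn(A, B) -> R,
) -> Vec<Option<R>> {
    assert_eq!(left.len(), right.len(), "columns must have equal length");
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| match (*l, *r) {
            (Some(a), Some(b)) => Some(f(a, b)),
            _ => None,
        })
        .collect()
}

/// Roughly what a generated vectorized wrapper could look like
/// (an assumption for illustration, not arrow-udf's actual output).
fn gcd_batch(left: &[Option<i32>], right: &[Option<i32>]) -> Vec<Option<i32>> {
    map_binary(left, right, gcd)
}
```

For example, `gcd_batch(&[Some(12), None], &[Some(18), Some(4)])` evaluates to `vec![Some(6), None]`: the scalar logic stays free of null handling and downcasting, which live in the wrapper.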

DataFusion uses macros to do something similar in binary.rs, but they are pretty hard to read / understand in my opinion:

macro_rules! compute_utf8_op {
    ($LEFT:expr, $RIGHT:expr, $OP:ident, $DT:ident) => {{
        let ll = $LEFT
            .as_any()
            .downcast_ref::<$DT>()
            .expect("compute_op failed to downcast left side array");
        let rr = $RIGHT
            .as_any()
            .downcast_ref::<$DT>()
            .expect("compute_op failed to downcast right side array");
        Ok(Arc::new(paste::expr! {[<$OP _utf8>]}(&ll, &rr)?))
    }};
}

One main benefit I can see to switching to https://github.com/risingwavelabs/arrow-udf is that we could then extend arrow-udf to support Dictionary and StringView and maybe other types to generate fast kernels for multiple different array layouts.
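As a rough illustration of why layout-specific kernels could pay off, the sketch below applies the same scalar function to a plain column and to a dictionary-encoded one. Everything here is a stand-in: `DictColumn`, `map_plain`, and `map_dict` are hypothetical names, not Arrow or arrow-udf types. The dictionary kernel evaluates the scalar function once per distinct value rather than once per row.

```rust
/// Hypothetical stand-in for a dictionary-encoded string column: each row
/// stores an index into `values`, or None for a null row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

/// Scalar implementation written by the function author.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

/// Kernel for a plain layout: calls `f` once per non-null row.
fn map_plain<R>(rows: &[Option<String>], f: impl Fn(&str) -> R) -> Vec<Option<R>> {
    rows.iter()
        .map(|row| row.as_deref().map(|s| f(s)))
        .collect()
}

/// Kernel for the dictionary layout: calls `f` once per distinct value,
/// then resolves each row by looking up its key.
fn map_dict<R: Clone>(col: &DictColumn, f: impl Fn(&str) -> R) -> Vec<Option<R>> {
    let computed: Vec<R> = col.values.iter().map(|v| f(v.as_str())).collect();
    col.keys
        .iter()
        .copied()
        .map(|key| key.map(|i| computed[i].clone()))
        .collect()
}
```

Both kernels share the scalar `char_len`; only the generated wrapper differs per layout, which is the part a macro could produce for each supported array type.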

Describe the solution you'd like

I think it would be great if someone could evaluate the feasibility of using the macros in https://github.com/risingwavelabs/arrow-udf to implement DataFusion's operators (and maybe eventually functions, etc.).

Describe alternatives you've considered

I suggest a POC that picks one or two functions (maybe string equality or regexp_match or something) and tries to use arrow-udf's function macro instead.

Here is an example of how to use it: https://docs.rs/arrow-udf/0.3.0/arrow_udf/

Additional context

No response

Labels: enhancement
