Skip to content

ARROW-10002: [Rust] Remove trait specialization from arrow crate#8485

Closed
jorgecarleitao wants to merge 2 commits intoapache:masterfrom
jorgecarleitao:simp_types
Closed

ARROW-10002: [Rust] Remove trait specialization from arrow crate#8485
jorgecarleitao wants to merge 2 commits intoapache:masterfrom
jorgecarleitao:simp_types

Conversation

@jorgecarleitao
Copy link
Copy Markdown
Member

@jorgecarleitao jorgecarleitao commented Oct 18, 2020

This PR removes trait specialization by leveraging the compiler to remove trivial if statements.

I verified that the assembly code was the same in a simple example. I do not know if this generalizes to our use-case, but I suspect so as LLVM is (hopefully) removing trivial branches like if a != a.

The change get_data_type() to DATA_TYPE is not necessary. I did it before realizing this. IMO it makes it more explicit that this is not a function, but a constant, but we can revert it.

@jorgecarleitao
Copy link
Copy Markdown
Member Author

fyi @andygrove @alamb

}
}

fn as_datetime<T: ArrowPrimitiveType>(v: i64) -> Option<NaiveDateTime> {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This places these functions outside the impl, so that we can use them:

When used with ArrowNumericType, it can convert itself to i64 and thus this is simple.

When we use it with PrimitiveType, we use to_usize to convert to i64.

Comment thread rust/arrow/src/array/array.rs
Comment thread rust/arrow/src/array/array.rs
Comment thread rust/arrow/src/array/builder.rs
let buffer = MutableBuffer::new(capacity * mem::size_of::<T::Native>());
#[inline]
fn new(capacity: usize) -> Self {
let buffer = if T::DATA_TYPE == DataType::Boolean {
Copy link
Copy Markdown
Member Author

@jorgecarleitao jorgecarleitao Oct 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that after templating, this will be either if a == a or if a == b, which I am hoping the compiler to optimize out. I placed an inline to help the compiler a bit, but I am not sure it is really needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using an if for specialization rather than the compiler is totally legit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I used a similar trick for the simd sum aggregation that the compiler was also able to optimize away (https://github.com/apache/arrow/pull/8370/files#diff-53a8522c4a482c9566b9032e1bd2c7990bab29640857cb40411beb7158189e75R625)

Copy link
Copy Markdown
Contributor

@Dandandan Dandandan Dec 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao

I see this in profiling for the BufferBuilderTrait append function:

image

Which accounts for ~9% of the instruction fetches according to callgrind when running the TPHC 12 benchmark.

So it seems the if T::DATA_TYPE == DataType::Boolean is not being optimized out and goes into the partial eq implementation (or the profiler is wrong), maybe because the implementation there is not easy to optimize / specialize for this case. I will check if another implementation (pattern match?) solves this.

return false;
}

if T::DATA_TYPE == DataType::Boolean {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we use the same principle: this will be a constant if a == a or if a == b.

T: ArrowNumericType,
{
let array_type = T::get_data_type();
let array_type = T::DATA_TYPE;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could actually be removed by making filter_array_impl a generic over T. I left for a future exercise.

Comment thread rust/arrow/src/lib.rs
//!
//! The parquet implementation is on a [separate crate](https://crates.io/crates/parquet)

#![feature(specialization)]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Copy Markdown
Contributor

@nevi-me nevi-me Oct 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#arewestableyet? This is great, I've just compiled Arrow on stable Rust. I think it's just parquet that still uses it after this PR. If DataFusion doesn't use any nightly features, one could make parquet optional, and then Parquet will be the only one left.

@github-actions
Copy link
Copy Markdown

@jorgecarleitao jorgecarleitao changed the title ARROW-10002: [Rust] Remove trait specialization ARROW-10002: [Rust] Remove trait specialization fro arrow crate Oct 18, 2020
@jorgecarleitao jorgecarleitao changed the title ARROW-10002: [Rust] Remove trait specialization fro arrow crate ARROW-10002: [Rust] Remove trait specialization from arrow crate Oct 18, 2020
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really nice change -- it makes sense to me and I (non bindingly) vote to :shipit: !

write!(f, "PrimitiveArray<{:?}>\n[\n", T::DATA_TYPE)?;
print_long_array(self, f, |array, index, f| match T::DATA_TYPE {
DataType::Date32(_) | DataType::Date64(_) => {
match array.value_as_date(index) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the primary difference here is that all types get converted to i64 and then converted back to the appropriate (potentially smaller) native type by the code (and the compiler is presumably smart enough to make it all happen efficiently)

And by structuring the code in that way, we only need a single type parameter (i64->T rather than all combinations of input and output types?)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The journey to usize via num-traits, then to i64, feels long, but I see the compiler optimises the function nicely in release mode.

pub extern fn convert_u32_to_i64(value: u32) -> i64 {
    num::ToPrimitive::to_usize(&value).unwrap() as i64
}
playground::convert_u32_to_i64:
	movl	%edi, %eax
	retq

let buffer = MutableBuffer::new(capacity * mem::size_of::<T::Native>());
#[inline]
fn new(capacity: usize) -> Self {
let buffer = if T::DATA_TYPE == DataType::Boolean {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using an if for specialization rather than the compiler is totally legit

@vertexclique
Copy link
Copy Markdown
Contributor

For the sake of creating less friction at the user side, can you make get_data_type const fn with the same name rather than const with capitals? Still it will generate exactly the same code at the call site.

@nevi-me
Copy link
Copy Markdown
Contributor

nevi-me commented Oct 18, 2020

For the sake of creating less friction at the user side, can you make get_data_type const fn with the same name rather than const with capitals? Still it will generate exactly the same code at the call site.

I think this won't be possible as functions in traits can't be const

@vertexclique
Copy link
Copy Markdown
Contributor

@nevi-me true :/. We go to datatype from generics. const fn also accepts only Sized any other trait bound won't work.

Copy link
Copy Markdown
Contributor

@vertexclique vertexclique left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need this got merged. I have some ideas on top of this. 🚀 great stuff here.

Copy link
Copy Markdown
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed this and compiled it with nightly, thanks for this @jorgecarleitao. Arrow will be stable in the next release (if one doesn't use simd) 🎆

@nevi-me nevi-me closed this in a6c6ef6 Oct 18, 2020
kszucs pushed a commit that referenced this pull request Oct 19, 2020
This PR removes trait specialization by leveraging the compiler to remove trivial `if` statements.

I verified that the assembly code was the same in a [simple example](https://rust.godbolt.org/z/qrcW8W). I do not know if this generalizes to our use-case, but I suspect so as LLVM is (hopefully) removing trivial branches like `if a != a`.

The change `get_data_type()` to `DATA_TYPE` is not necessary. I did it before realizing this. IMO it makes it more explicit that this is not a function, but a constant, but we can revert it.

Closes #8485 from jorgecarleitao/simp_types

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@jorgecarleitao jorgecarleitao deleted the simp_types branch October 28, 2020 04:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants