Fix compute_record_batch_statistics wrong with projection #8489

alamb merged 40 commits into apache:main
# Conflicts: # datafusion/physical-plan/src/joins/hash_join_utils.rs
```rust
    .iter()
    .flatten()
    .map(|b| {
        b.columns()
```
The `RecordBatch::project` method is not used here because it clones the columns internally, so there is no need to generate a new `RecordBatch`.
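The point can be sketched with stand-in types (`Batch` and `Column` below are hypothetical simplifications, not arrow's API): a `project`-style method allocates a new batch of cloned column `Arc`s, while summing over the projection indices reads the existing columns in place.

```rust
use std::sync::Arc;

// Toy stand-in for an arrow column; byte_size plays the role of
// get_array_memory_size().
struct Column {
    byte_size: usize,
}

// Toy stand-in for a RecordBatch: a list of shared columns.
struct Batch {
    columns: Vec<Arc<Column>>,
}

impl Batch {
    // Analogue of RecordBatch::project: allocates a new Batch and
    // clones every selected Arc.
    fn project(&self, indices: &[usize]) -> Batch {
        Batch {
            columns: indices
                .iter()
                .map(|&i| Arc::clone(&self.columns[i]))
                .collect(),
        }
    }

    // The approach the comment favors: no new batch, just read the
    // projected columns directly.
    fn projected_byte_size(&self, indices: &[usize]) -> usize {
        indices.iter().map(|&i| self.columns[i].byte_size).sum()
    }
}

fn main() {
    let batch = Batch {
        columns: vec![
            Arc::new(Column { byte_size: 100 }),
            Arc::new(Column { byte_size: 200 }),
            Arc::new(Column { byte_size: 300 }),
        ],
    };
    // Both routes agree on the size; the second avoids the allocation.
    let via_project: usize = batch
        .project(&[0, 2])
        .columns
        .iter()
        .map(|c| c.byte_size)
        .sum();
    assert_eq!(via_project, batch.projected_byte_size(&[0, 2]));
    println!("{}", batch.projected_byte_size(&[0, 2]));
}
```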
```diff
 let expected = Statistics {
     num_rows: Precision::Exact(3),
-    total_byte_size: Precision::Exact(464), // this might change a bit if the way we compute the size changes
+    total_byte_size: Precision::Exact(byte_size),
```
I'm not sure if this is appropriate; if you have any good suggestions, please leave a message.
I think this is ok and a nice way to make the code less brittle to future changes in arrow's layout
I'm curious as to why the previous code was Precision::Exact(464).
I think it happens to be the (current) size of the record batch in the test:
```rust
let batch = RecordBatch::try_new(
    Arc::clone(&schema),
    vec![
        Arc::new(Float32Array::from(vec![1., 2., 3.])),
        Arc::new(Float64Array::from(vec![9., 8., 7.])),
        Arc::new(UInt64Array::from(vec![4, 5, 6])),
    ],
)?;
```
alamb left a comment:
This makes sense to me -- thank you @Asura7969 and I apologize for the delay in reviewing. cc @Dandandan
```
--------------------CoalesceBatchesExec: target_batch_size=8192
----------------------RepartitionExec: partitioning=Hash([col0@0], 4), input_partitions=1
------------------------MemoryExec: partitions=1, partition_sizes=[3]
----------------ProjectionExec: expr=[col0@2 as col0, col1@3 as col1, col2@4 as col2, col0@0 as col0, col1@1 as col1]
```
Looks to me like the change is due to the fact that the join inputs were reordered, and this projection puts the columns back in the expected order.
Same for the projection below.
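As a sketch of this reading (the indices are taken from the `ProjectionExec` line in the plan above; `apply_projection` is a toy helper, not DataFusion code), the projection is just a permutation that restores the original column order after the join inputs were swapped:

```rust
// Reorder columns by index, the way expr=[col0@2, col1@3, col2@4,
// col0@0, col1@1] does in the plan: output column k is input column
// indices[k].
fn apply_projection<T: Clone>(columns: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| columns[i].clone()).collect()
}

fn main() {
    // Hypothetical join output whose columns came out reordered.
    let joined = ["r.col0", "r.col1", "c", "a", "b"];
    // Applying the plan's indices permutes them back.
    let restored = apply_projection(&joined, &[2, 3, 4, 0, 1]);
    println!("{:?}", restored);
}
```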
```rust
let total_byte_size = batches
    .iter()
    .flatten()
    .map(|b| {
        projection
            .iter()
            .map(|index| b.column(*index).get_array_memory_size())
            .sum::<usize>()
    })
    .sum();
```
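The shape of the fixed computation can be mirrored with plain numbers standing in for column byte sizes (`total_byte_size` below is a toy re-implementation, not the PR's code): batches arrive grouped per partition, `flatten` turns them into one stream, and only the projected columns contribute to the sum.

```rust
// batches[partition][batch] is a list of per-column byte sizes;
// the numbers stand in for get_array_memory_size().
fn total_byte_size(batches: &[Vec<Vec<usize>>], projection: &[usize]) -> usize {
    batches
        .iter()
        .flatten() // flatten partitions into one stream of batches
        .map(|column_sizes| {
            projection
                .iter()
                .map(|&index| column_sizes[index]) // only projected columns
                .sum::<usize>()
        })
        .sum()
}

fn main() {
    // Two partitions; each batch has three columns of sizes 10/20/30.
    let batches = vec![vec![vec![10, 20, 30]], vec![vec![10, 20, 30]]];
    // Projecting columns 0 and 2 counts 40 bytes per batch, 80 total.
    assert_eq!(total_byte_size(&batches, &[0, 2]), 80);
}
```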
Thanks again @Asura7969

Thanks @Asura7969 and @alamb @jackwener for the review
Which issue does this PR close?
Closes #8406.
Rationale for this change
The projected byte size was not calculated correctly: statistics counted all columns even when a projection selected only a subset.
What changes are included in this PR?
Compute statistics over only the projected columns, not all columns.
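A minimal before/after illustration of the rationale, with hypothetical helpers (not the PR's actual functions, and plain numbers standing in for column byte sizes): the old path summed every column regardless of the projection, so any query that projected a subset overcounted.

```rust
// Before the fix: every column's size is counted.
fn size_ignoring_projection(column_sizes: &[usize]) -> usize {
    column_sizes.iter().sum()
}

// After the fix: only the projected columns are counted.
fn size_with_projection(column_sizes: &[usize], projection: &[usize]) -> usize {
    projection.iter().map(|&i| column_sizes[i]).sum()
}

fn main() {
    let sizes = [100, 200, 300];
    // Projecting only column 0: the old path reports 600, the fixed
    // path reports 100.
    assert_eq!(size_ignoring_projection(&sizes), 600);
    assert_eq!(size_with_projection(&sizes, &[0]), 100);
}
```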
Are these changes tested?
test_compute_record_batch_statistics
Are there any user-facing changes?