deduplicate variadic buffers in MutableArrayData::extend for ByteView arrays #6808
onursatici wants to merge 4 commits into apache:main
            _ => vec![],
        let (variadic_data_buffers, buffer_to_idx) = match &data_type {
            DataType::BinaryView | DataType::Utf8View => {
                let mut buffer_to_idx = HashMap::new();
I wonder if building a hashmap / vec would be overly expensive (though we would need to run benchmarks to be sure)
Happy to run benchmarks. Do you have any particular ones in mind, or should I create one with criterion specific to this?
I think the ones in cast are probably a good place to start
@alamb @tustvold I added a string view case to the interleave benchmark and ran it on main, on this PR (interleave-deduplicated), and on #6779 (interleave-specific-impl). I believe the penalty introduced by this PR would be mitigated in interleave's case if we also merge #6779. For other cases, it feels like the read / transfer-over-the-wire improvements might outweigh the cost. Happy to hear your thoughts.
Thank you @onursatici -- I hope to find time to review this PR this weekend or early next week
alamb left a comment
I (again) apologize for the delay in reviewing this PR. We are stretched quite thin, as always.
In general, I think this PR needs some tests to show it is working as well as ensure we don't break this functionality with some future PR.
Thank you for running the benchmarks. They seem promising and I will give them a more careful look if we proceed with this PR
@alamb no worries, and thank you for having a look. I have now added some tests checking the deduplication and remapping behaviour. Let me know whenever you have time if this looks good. Happy holidays!
I have merged #6779 now. I think one of the potential performance concerns is that
Thank you for looking into this, I am inclined to agree with your assessment that the returns of this are probably not worthwhile to include as part of the general purpose MutableArrayData. I do think this sort of optimisation is possibly relevant in some places, e.g. DataFusion when coalescing multiple RecordBatch, but potentially something to be included as part of a more holistic rework of how StringViewArray "compaction" occurs. I am not sure where that leaves this PR, but I would be inclined to close it.
Which issue does this PR close?
Closes #.
Rationale for this change
MutableArrayData adds all variadic buffers from input arrays together, potentially duplicating the same buffers in the output array.
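The duplication and the remapping idea can be illustrated with a minimal standalone sketch. It is not the code from this PR and does not use the arrow-rs API: `View` and `extend_deduplicated` are hypothetical names, the view is simplified to three fields (real ByteView values also carry a prefix and store short strings inline), and buffers are identified by the address of their shared allocation.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified, hypothetical stand-in for a view that points into an
/// out-of-line data buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    buffer_index: u32,
    offset: u32,
    length: u32,
}

/// Hypothetical helper: appends the buffers of one input array to
/// `out_buffers`, skipping any buffer that is already present, and returns
/// the input views rewritten to use the output buffer indices.
fn extend_deduplicated(
    out_buffers: &mut Vec<Arc<Vec<u8>>>,
    buffer_to_idx: &mut HashMap<*const Vec<u8>, u32>,
    input_buffers: &[Arc<Vec<u8>>],
    input_views: &[View],
) -> Vec<View> {
    // For each input-local buffer index, find (or assign) the output index.
    // Two `Arc`s that share an allocation have the same pointer, so a buffer
    // seen before is not pushed a second time.
    let remap: Vec<u32> = input_buffers
        .iter()
        .map(|buf| {
            *buffer_to_idx.entry(Arc::as_ptr(buf)).or_insert_with(|| {
                out_buffers.push(Arc::clone(buf));
                (out_buffers.len() - 1) as u32
            })
        })
        .collect();

    // Rewrite only the buffer index; offset and length are unchanged because
    // the underlying bytes are the same.
    input_views
        .iter()
        .map(|v| View {
            buffer_index: remap[v.buffer_index as usize],
            ..*v
        })
        .collect()
}
```

Calling the helper once per input array with the same `out_buffers` and `buffer_to_idx` keeps a single copy of each shared buffer in the output. The extra hash lookup per input buffer is the kind of cost the reviewers ask to benchmark.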
What changes are included in this PR?
extend now checks whether the same buffer has already been added from another input array, and rewrites the views being appended so that they point to the deduplicated buffer indices.
Are there any user-facing changes?