ARROW-9420: [Rust][DataFusion] Added repartion physical plan#7729
ARROW-9420: [Rust][DataFusion] Added repartion physical plan#7729jorgecarleitao wants to merge 5 commits intoapache:masterfrom jorgecarleitao:repartition
Conversation
|
One thing that is not clear to me yet is the idiom to handle RecordBatch and partitions. My understanding is that a Partition can be executed in parallel (thread), but a RecordBatch is generally executed on the same thread, i.e. we normally loop through each RecordBatch using the same thread. Is the goal of RecordBatch to split a partition in smaller chunks of data to avoid too much memory usage? In this PR, I have not merged all the RecordBatches within a given partition in a single batch, and instead kept them separate. I am not sure if this is the correct approach here. |
|
Another point of contest here is that I have not tested what happens to rows with one key whose value is null. |
This reduces * the runtime complexity of this operation from O(N*(1 + M)) to O(N*M) (N=number of rows, M=number of aggregations), * the memory footprint from O(N*M) acumulators to O(M) accumulators * the code complexity via DRY.
This is currently not very performant as there is no threading involved, but it is nevertheless an important plan to have.
|
I don't think we are ready to take repartitioning at this point. |
This is written on top of #7687, so we should merge the other first.
This does not include any optimization to actually use this operation. We need to work out in a future PR.