refactor count_distinct to not to have update and merge#5408
refactor count_distinct to not to have update and merge#5408Dandandan merged 1 commit intoapache:mainfrom
Conversation
| let arr = &values[0]; | ||
| (0..arr.len()).try_for_each(|index| { | ||
| let scalar = ScalarValue::try_from_array(arr, index)?; | ||
| let scalar = vec![scalar]; |
There was a problem hiding this comment.
We can avoid creating a Vec here
There was a problem hiding this comment.
Hello @Dandandan, thank you for reviewing my work. I'm currently having difficulty finding a way to avoid using vec in this context. Could you please provide me with some guidance on how to refactor it?
There was a problem hiding this comment.
We can modify DistinctScalarValues from
struct DistinctScalarValues(Vec<ScalarValue>)
to
struct DistinctScalarValues(ScalarValue)
and update the code accordingly.
This might provide some small performance increase.
| let arr = &states[0]; | ||
| (0..arr.len()).try_for_each(|index| { | ||
| let scalar = ScalarValue::try_from_array(arr, index)?; | ||
| let scalar = vec![scalar]; |
There was a problem hiding this comment.
We can rewrite the code to not create the Vec
04eeb2b to
8158d6b
Compare
| let arr = &values[0]; | ||
| (0..arr.len()).try_for_each(|index| { | ||
| let scalar = ScalarValue::try_from_array(arr, index)?; | ||
| if !ScalarValue::is_null(&scalar) { |
There was a problem hiding this comment.
This check can be done on the array arr already instead of on the scalar.
8158d6b to
16cb4c1
Compare
|
@Weijun-H The tests are failing, could you take a look? |
| } | ||
| fn state(&self) -> Result<Vec<ScalarValue>> { | ||
| let mut cols_out = self | ||
| .state_data_types |
There was a problem hiding this comment.
state_data_types can be simplified to be of type DataType too instead of Vec<DataType>
There was a problem hiding this comment.
(and code to be updated accordingly
There was a problem hiding this comment.
state_data_types can be simplified to be of type
DataTypetoo instead ofVec<DataType>
Do you mean state_data_types in DistinctCountAccumulator and DistinctCount?
There was a problem hiding this comment.
Yes, that doesn't need to be a Vec, as it always use a single state column, this will simplify the code some more.
|
@Dandandan, I reviewed the failed test and identified the root cause of the issue. It appears that the problem occurred because there was a modification made to the structure of the |
Ok. I think we have to update (or remove) the tests to update the expectations that we no longer support multiple columns. |
16cb4c1 to
11798fc
Compare
| state_data_types: Vec<DataType>, | ||
| state_data_types: DataType, | ||
| /// The input arguments | ||
| exprs: Vec<Arc<dyn PhysicalExpr>>, |
There was a problem hiding this comment.
I think we should also update exprs: Vec<Arc<dyn PhysicalExpr>> to exprs: Arc<dyn PhysicalExpr> and DistinctCount::new to communicate that multiple columns are no longer supported.
There was a problem hiding this comment.
Maybe input_data_types: Vec<DataType> also need to be changed to DataType, because it also need one DataType?
There was a problem hiding this comment.
Yeah, as well.
If you like to do some more refactoring, you could remove data_type: DataType from the DistinctCount struct as well and just keep a int64 in the DistinctCountAccumulator instead of using a ScalarValue + count_data_type .
11798fc to
2517567
Compare
| values: HashSet<DistinctScalarValues, RandomState>, | ||
| state_data_types: Vec<DataType>, | ||
| state_data_types: DataType, | ||
| count_data_type: DataType, |
There was a problem hiding this comment.
Can you remove this as well (it's basically unused by now)?
|
TY @Weijun-H I think it's looking great! |
|
|
||
| #[derive(Debug, PartialEq, Eq, Hash, Clone)] | ||
| struct DistinctScalarValues(Vec<ScalarValue>); | ||
| struct DistinctScalarValues(ScalarValue); |
There was a problem hiding this comment.
We could use HashSet<DistinctScalarValues, RandomState> instead and remove this struct.
2517567 to
a328f9c
Compare
| data_type: DataType, | ||
| /// The DataType used to hold the state for each input | ||
| state_data_types: Vec<DataType>, | ||
| state_data_types: DataType, |
There was a problem hiding this comment.
| state_data_types: DataType, | |
| state_data_type: DataType, |
| state_data_types: DataType, | ||
| /// The input arguments | ||
| exprs: Vec<Arc<dyn PhysicalExpr>>, | ||
| exprs: Arc<dyn PhysicalExpr>, |
There was a problem hiding this comment.
| exprs: Arc<dyn PhysicalExpr>, | |
| expr: Arc<dyn PhysicalExpr>, |
| input_data_types: Vec<DataType>, | ||
| exprs: Vec<Arc<dyn PhysicalExpr>>, | ||
| input_data_types: DataType, | ||
| exprs: Arc<dyn PhysicalExpr>, |
There was a problem hiding this comment.
| exprs: Arc<dyn PhysicalExpr>, | |
| expr: Arc<dyn PhysicalExpr>, |
| pub fn new( | ||
| input_data_types: Vec<DataType>, | ||
| exprs: Vec<Arc<dyn PhysicalExpr>>, | ||
| input_data_types: DataType, |
There was a problem hiding this comment.
| input_data_types: DataType, | |
| input_data_type: DataType, |
| values: HashSet<DistinctScalarValues, RandomState>, | ||
| state_data_types: Vec<DataType>, | ||
| count_data_type: DataType, | ||
| state_data_types: DataType, |
There was a problem hiding this comment.
| state_data_types: DataType, | |
| state_data_type: DataType, |
a328f9c to
98fcc97
Compare
98fcc97 to
fb4a17a
Compare
Dandandan
left a comment
There was a problem hiding this comment.
Nice, I think this is a great step forward 👍
Thank you for your patient guidance. |
|
FYI @alamb this has some backwards-incompatible changes. I think it removes some unnecessary complexity, making it easier to improve performance later on. |
|
Merging this in 24 hours if no other comments |
alamb
left a comment
There was a problem hiding this comment.
Looks great to me -- thank you @Weijun-H and @Dandandan
|
Benchmark runs are scheduled for baseline = ddd64e7 and contender = d11820a. d11820a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
| } | ||
|
|
||
| fn size(&self) -> usize { | ||
| if self.count_data_type.is_primitive() { |
There was a problem hiding this comment.
I think we lost the size support for variable sized values during this PR, is it expected? @alamb
There was a problem hiding this comment.
It was not intended -- if someone has time to fix it that would be great, otherwise I will try and get a PR up later today
Which issue does this PR close?
Parts #1598
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?