Is your feature request related to a problem or challenge?
GenericDistinctBuffer was introduced by #18348 to make it easier to write distinct accumulators. Currently it uses Hashable to represent the items being stored:
|
pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> { |
|
pub values: HashSet<Hashable<T::Native>, RandomState>, |
|
data_type: DataType, |
|
} |
|
/// A wrapper around a type to provide hash for floats |
|
#[derive(Copy, Clone, Debug)] |
|
pub struct Hashable<T>(pub T); |
|
|
|
impl<T: ToByteSlice> std::hash::Hash for Hashable<T> { |
|
fn hash<H: std::hash::Hasher>(&self, state: &mut H) { |
|
self.0.to_byte_slice().hash(state) |
|
} |
|
} |
Hashable is mainly required because f32/f64 aren't hashable natively, so this wrapper lets us hash them + all other native types (e.g. i32, u64). However I wonder if there is a way to use Hashable only for floats, whilst using the native Hash implementation for other native types (potential performance benefit?).
Describe the solution you'd like
Be able to use GenericDistinctBuffer generically over both f32 and i32 types without needing to force both through Hashable interface. This could be done by having duplicated versions, one that takes Hashable and one that takes natively hashable types, but ideally want a single solution.
- If we had specialization this would be much easier 😅
Describe alternatives you've considered
No response
Additional context
If we could get this to work, we could fold PrimitiveDistinctCountAccumulator together with FloatDistinctCountAccumulator:
|
pub struct PrimitiveDistinctCountAccumulator<T> |
|
where |
|
T: ArrowPrimitiveType + Send, |
|
T::Native: Eq + Hash, |
|
{ |
|
values: HashSet<T::Native, RandomState>, |
|
data_type: DataType, |
|
} |
|
#[derive(Debug)] |
|
pub struct FloatDistinctCountAccumulator<T: ArrowPrimitiveType> { |
|
values: GenericDistinctBuffer<T>, |
|
} |
Also when making these changes we'd need to benchmark as likely there are performance implications; I initially tried forcing PrimitiveDistinctCountAccumulator to use Hashable but I kept hitting regressions on micro benchmark with count(distinct) queries, so need to keep that in mind.
Is your feature request related to a problem or challenge?
GenericDistinctBufferwas introduced by #18348 to make it easier to write distinct accumulators. Currently it usesHashableto represent the items being stored:datafusion/datafusion/functions-aggregate-common/src/utils.rs
Lines 185 to 188 in e42a0b6
datafusion/datafusion/functions-aggregate-common/src/utils.rs
Lines 73 to 81 in e42a0b6
Hashableis mainly required becausef32/f64aren't hashable natively, so this wrapper lets us hash them + all other native types (e.g.i32,u64). However I wonder if there is a way to useHashableonly for floats, whilst using the nativeHashimplementation for other native types (potential performance benefit?).Describe the solution you'd like
Be able to use
GenericDistinctBuffergenerically over bothf32andi32types without needing to force both throughHashableinterface. This could be done by having duplicated versions, one that takesHashableand one that takes natively hashable types, but ideally want a single solution.Describe alternatives you've considered
No response
Additional context
If we could get this to work, we could fold
PrimitiveDistinctCountAccumulatortogether withFloatDistinctCountAccumulator:datafusion/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
Lines 44 to 51 in e42a0b6
datafusion/datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs
Lines 126 to 129 in e42a0b6
Also when making these changes we'd need to benchmark as likely there are performance implications; I initially tried forcing
PrimitiveDistinctCountAccumulatorto useHashablebut I kept hitting regressions on micro benchmark withcount(distinct)queries, so need to keep that in mind.