feat: multiple columns in count distinct#20460
Conversation
There was a problem hiding this comment.
Thanks @Mark1626 for driving this 💪
Before going to code review lets expand tests a little bit to support possible cases, specifically:
- mixed nulls in values
- different column datatypes
- 3+ cols
- different col order
- duplicates like
select count(distinct a, a), select count(distinct a, a, b, b)`
Once we have tests passed, we most likely got the code is stable and ready for review
|
@comphead Sure I'll expand the tests, should all these new one be in @Dandandan I'll try using |
There was a problem hiding this comment.
Does sliding accumulator support distinct on multi column? We should add a test for it and block if it doesn't work. (ex. count(distinct a, b) over ...)
| .iter() | ||
| .map(|field| { | ||
| Arc::new(Field::new( | ||
| format_state_name(args.name, "count distinct"), |
There was a problem hiding this comment.
same column names will look identical here. we should include original field name or col index to differentiate
There was a problem hiding this comment.
Does this comment still need to be addressed?
I dont have strong opinion tbh, lets have in one |
|
I've addressed the review comments. The It's using a datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs Lines 65 to 67 in 6713439 |
|
I fixed the issue with The slliding accumulator isn't supporting distinct on multi column at the moment and is showing an incorrect result right now, I'll see if I can re-use the new accumulator there |
|
Bumping this up, any review comments on this? |
| .iter() | ||
| .map(|field| { | ||
| Arc::new(Field::new( | ||
| format_state_name(args.name, "count distinct"), |
There was a problem hiding this comment.
Does this comment still need to be addressed?
| // De-duplicate args so that e.g. count(distinct c, c) | ||
| // is treated as count(distinct c). | ||
| // is_single_distinct_agg already verified that all | ||
| // unique distinct args across aggregates refer to the | ||
| // same single field. | ||
| let mut seen = HashSet::new(); | ||
| args.retain(|arg| { | ||
| seen.insert(arg.schema_name().to_string()) | ||
| }); |
There was a problem hiding this comment.
This seems a bit odd to handle here; what happens in the case this rule doesn't fire (e.g. theres another aggregate which causes this rule to not do rewrite)
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> N/A ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Some PRs are being omitted from stale check because they were in a cache, and the workflow appears to not have permission to delete cache so they are forever stuck as unprocessed. For example in this run: https://github.com/apache/datafusion/actions/runs/24756695077/job/72431314533 Seeing this in logs: ``` [apache#20473] issue skipped due being processed during the previous run [apache#20460] pull request skipped due being processed during the previous run [apache#20448] issue skipped due being processed during the previous run [apache#20443] issue skipped due being processed during the previous run [apache#20435] issue skipped due being processed during the previous run [apache#20418] issue skipped due being processed during the previous run [apache#20417] pull request skipped due being processed during the previous run [apache#20416] pull request skipped due being processed during the previous run [apache#20403] pull request skipped due being processed during the previous run ``` And at the end we see this warning: ``` Warning: Error delete _state: [403] Resource not accessible by integration - https://docs.github.com/rest/actions/cache#delete-github-actions-caches-for-a-repository-using-a-cache-key ``` stale workflow uses a cache in case it hits the `operations-per-run` limit meant to prevent API rate limiting (we have default of 30), so it seems we previously hit this limit and some issues/PRs were cached, and have never been uncached since so are never processed again. See: https://github.com/actions/stale#operations-per-run ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Give permission to stale workflow to run github actions (like delete cache). See recommended permissions: https://github.com/actions/stale#recommended-permissions ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Which issue does this PR close?
Closes #5619
What changes are included in this PR?
MultiColumnDistinctCountAccumulatorAre these changes tested?