feat: multiple columns in count distinct #20460

Open
Mark1626 wants to merge 7 commits into apache:main from Mark1626:feat/count-distinct-multi

@Mark1626
Contributor

Which issue does this PR close?

Closes #5619

What changes are included in this PR?

  1. Introduce a separate accumulator for multi-column distinct count, MultiColumnDistinctCountAccumulator
  2. I used parts of Count distinct support multiple expressions #5939 for reference; however, it was old, so I had to reimplement this

Are these changes tested?

  1. Unit tests have been added
  2. I've tested this with a couple of queries in the CLI:
with data AS (
  select * from (values
    ('a', 1, 'x'),
    ('a', 2, 'x'),
    ('b', 2, 'y'),
    ('b', 2, 'z'),
    ('c', 3, 'z')
  ) AS t(col1, col2, col3)
)
select count(distinct (col1, col2)) FROM data;
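
As a rough cross-check of what this query should count, here is a minimal Rust sketch that puts each `(col1, col2)` pair into a `HashSet`. The helper names `count_distinct_pairs` and `sample_rows` are made up for illustration and are not part of DataFusion or this PR; the sketch assumes plain tuple equality and has no NULLs.

```rust
use std::collections::HashSet;

/// Counts distinct (col1, col2) pairs, mirroring `count(distinct (col1, col2))`.
/// Hypothetical helper for illustration only; col3 is ignored.
fn count_distinct_pairs(rows: &[(&str, i64, &str)]) -> usize {
    let mut seen: HashSet<(&str, i64)> = HashSet::new();
    for &(col1, col2, _col3) in rows {
        seen.insert((col1, col2));
    }
    seen.len()
}

/// The five rows from the CLI query above.
fn sample_rows() -> Vec<(&'static str, i64, &'static str)> {
    vec![
        ("a", 1, "x"),
        ("a", 2, "x"),
        ("b", 2, "y"),
        ("b", 2, "z"),
        ("c", 3, "z"),
    ]
}
```

Calling `count_distinct_pairs(&sample_rows())` gives 4 under these assumptions, since `('b', 2)` appears twice and the other pairs are unique.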

@github-actions github-actions Bot added core Core DataFusion crate functions Changes to functions implementation labels Feb 21, 2026
@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Feb 21, 2026
Comment thread datafusion/functions-aggregate/src/count.rs Outdated
Comment thread datafusion/core/tests/sql/aggregates/basic.rs
Contributor

@comphead comphead left a comment

Thanks @Mark1626 for driving this 💪

Before going to code review, let's expand the tests a bit to cover the possible cases, specifically:

  • mixed nulls in values
  • different column datatypes
  • 3+ cols
  • different col order
  • duplicates like select count(distinct a, a), select count(distinct a, a, b, b)

Once the tests pass, the code is most likely stable and ready for review.
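
To make the column-order and duplicate-column cases concrete, here is a small Rust model of the expected counts. It is only a sketch: `count_distinct` and `sample` are hypothetical names, columns are plain `i64` values, and NULL handling is deliberately left out because the thread does not state which semantics the PR follows.

```rust
use std::collections::HashSet;

/// Counts distinct combinations of the selected columns.
/// `cols` lists column indexes in the order written in `count(distinct ...)`.
fn count_distinct(rows: &[Vec<i64>], cols: &[usize]) -> usize {
    let seen: HashSet<Vec<i64>> = rows
        .iter()
        .map(|row| cols.iter().map(|&c| row[c]).collect::<Vec<i64>>())
        .collect();
    seen.len()
}

/// Columns are (a, b); the last row repeats the first.
fn sample() -> Vec<Vec<i64>> {
    vec![vec![1, 10], vec![1, 20], vec![2, 10], vec![1, 10]]
}
```

In this model `count_distinct(&sample(), &[0, 1])` and `count_distinct(&sample(), &[1, 0])` agree, and repeating a column (`&[0, 0]`) gives the same count as listing it once (`&[0]`), which is the behaviour the duplicate tests would pin down.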

@Mark1626
Contributor Author

@comphead Sure, I'll expand the tests. Should all of these new ones be in .slt?

@Dandandan I'll try using struct.Row; I was wondering how I could improve performance

Contributor

Does the sliding accumulator support distinct on multiple columns? We should add a test for it, and block it if it doesn't work (e.g. count(distinct a, b) over ...).
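
For context on why the sliding case is harder: a window frame removes rows as well as adding them, so a plain set of seen values is not enough. One common approach, sketched below under the assumption of two `i64` columns, is to keep an occurrence count per combination. `SlidingDistinctCount` and its methods are hypothetical names, not the accumulator in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical distinct counter that supports removing rows,
/// as a sliding window frame needs when rows leave the frame.
#[derive(Default)]
struct SlidingDistinctCount {
    // How many rows currently in the frame carry each (a, b) combination.
    occurrences: HashMap<(i64, i64), usize>,
}

impl SlidingDistinctCount {
    /// A row enters the frame.
    fn update(&mut self, key: (i64, i64)) {
        *self.occurrences.entry(key).or_insert(0) += 1;
    }

    /// A row leaves the frame; forget the key once no rows carry it.
    fn retract(&mut self, key: (i64, i64)) {
        let now_empty = match self.occurrences.get_mut(&key) {
            Some(n) => {
                *n -= 1;
                *n == 0
            }
            None => false, // retracting an unseen key is ignored in this sketch
        };
        if now_empty {
            self.occurrences.remove(&key);
        }
    }

    /// Number of distinct combinations currently in the frame.
    fn evaluate(&self) -> usize {
        self.occurrences.len()
    }
}
```

With this shape, retracting one of two identical rows leaves the distinct count unchanged, and retracting the second one lowers it by one.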

.iter()
.map(|field| {
Arc::new(Field::new(
format_state_name(args.name, "count distinct"),
Contributor

The same column names will look identical here. We should include the original field name or column index to differentiate them.

Contributor

Does this comment still need to be addressed?
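
One way to address the naming concern in this thread is to put the column index into each state field name. The sketch below is an assumption-heavy illustration: `format_state_name` is a local stand-in whose `name[state]` output format is assumed rather than confirmed by the thread, and `state_field_names` is a hypothetical helper.

```rust
/// Local stand-in for the `format_state_name` helper seen in the diff.
/// The `name[state]` output format is an assumption of this sketch.
fn format_state_name(name: &str, state_name: &str) -> String {
    format!("{name}[{state_name}]")
}

/// Hypothetical: one state field name per input column, made unique by index.
fn state_field_names(agg_name: &str, num_columns: usize) -> Vec<String> {
    (0..num_columns)
        .map(|i| format_state_name(agg_name, &format!("count distinct {i}")))
        .collect()
}
```

Using the index rather than the original field name keeps the names unique even when the same column is listed twice.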

Comment thread datafusion/functions-aggregate/src/count.rs Outdated
Comment thread datafusion/functions-aggregate/src/count.rs Outdated
@comphead
Contributor

@comphead Sure, I'll expand the tests. Should all of these new ones be in .slt?

I don't have a strong opinion tbh, let's keep them in one .slt for now

@Mark1626
Contributor Author

Mark1626 commented Mar 1, 2026

I've addressed the review comments. The single_distinct_to_groupby.rs optimizer rule throws an error for select count(distinct c, c).

It uses a HashSet, so distinct c, c is treated as a single column. I'm trying to see if something can be done about this.

fn is_single_distinct_agg(aggr_expr: &[Expr]) -> Result<bool> {
let mut fields_set = HashSet::new();
let mut aggregate_count = 0;
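
The mismatch described above can be reproduced in isolation: a `HashSet` of argument names has one entry for `c, c`, while the aggregate itself still carries two arguments. This is a simplified sketch in which arguments are plain strings and `arg_counts` is a hypothetical helper, not code from the optimizer rule.

```rust
use std::collections::HashSet;

/// Hypothetical simplification: distinct arguments are represented by name.
/// Returns (number of arguments as written, number of unique names).
fn arg_counts(args: &[&str]) -> (usize, usize) {
    let unique: HashSet<&str> = args.iter().copied().collect();
    (args.len(), unique.len())
}
```

For `count(distinct c, c)` this returns `(2, 1)`, so a check based on the set sees a single field even though two arguments are present.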

@github-actions github-actions Bot added the optimizer Optimizer rules label Mar 1, 2026
@Mark1626
Contributor Author

Mark1626 commented Mar 1, 2026

I fixed the issue with single_distinct_to_groupby, count(distinct a, a) is rewritten as count(distinct a).

The sliding accumulator doesn't support distinct on multiple columns at the moment and currently returns an incorrect result. I'll see if I can reuse the new accumulator there.

@Mark1626 Mark1626 requested review from Dandandan and comphead March 3, 2026 05:47
@Mark1626
Contributor Author

Mark1626 commented Apr 2, 2026

Bumping this up, any review comments on this?

Comment on lines +193 to +201
// De-duplicate args so that e.g. count(distinct c, c)
// is treated as count(distinct c).
// is_single_distinct_agg already verified that all
// unique distinct args across aggregates refer to the
// same single field.
let mut seen = HashSet::new();
args.retain(|arg| {
seen.insert(arg.schema_name().to_string())
});
Contributor

This seems a bit odd to handle here. What happens when this rule doesn't fire (e.g. there's another aggregate that prevents this rule from doing the rewrite)?
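
The de-duplication under discussion relies on `HashSet::insert` returning `false` for a value that is already present, combined with `Vec::retain`. A standalone sketch of that idiom follows; arguments are plain strings here, whereas the diff keys on `arg.schema_name()`, and `dedup_args` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Removes repeated argument names while keeping first-seen order.
fn dedup_args(mut args: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    // `insert` returns false when the value was already present,
    // so `retain` drops every repeat after the first occurrence.
    args.retain(|arg| seen.insert(arg.clone()));
    args
}
```

The idiom itself is order-preserving, but it only helps where it runs. If it lives in a single optimizer rule, a plan for which that rule does not fire would still carry the duplicated arguments, which is the concern raised in the comment above.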

zzcclp pushed a commit to zzcclp/arrow-datafusion that referenced this pull request Apr 23, 2026
## Which issue does this PR close?

N/A

## Rationale for this change

Some PRs are being omitted from the stale check because they were in a
cache, and the workflow appears not to have permission to delete the cache,
so they are stuck as unprocessed forever.

For example in this run:
https://github.com/apache/datafusion/actions/runs/24756695077/job/72431314533

Seeing this in logs:

```
[apache#20473]            issue skipped due being processed during the previous run
[apache#20460]            pull request skipped due being processed during the previous run
[apache#20448]            issue skipped due being processed during the previous run
[apache#20443]            issue skipped due being processed during the previous run
[apache#20435]            issue skipped due being processed during the previous run
[apache#20418]            issue skipped due being processed during the previous run
[apache#20417]            pull request skipped due being processed during the previous run
[apache#20416]            pull request skipped due being processed during the previous run
[apache#20403]            pull request skipped due being processed during the previous run
```

And at the end we see this warning:

```
Warning: Error delete _state: [403] Resource not accessible by integration - https://docs.github.com/rest/actions/cache#delete-github-actions-caches-for-a-repository-using-a-cache-key
```

The stale workflow uses a cache in case it hits the `operations-per-run`
limit, which is meant to prevent API rate limiting (we have the default of 30).
It seems we previously hit this limit, so some issues/PRs were cached and
have never been uncached since, which means they are never processed again. See:
https://github.com/actions/stale#operations-per-run

## What changes are included in this PR?

Give the stale workflow permission to perform GitHub Actions operations (such as
deleting the cache). See recommended permissions:

https://github.com/actions/stale#recommended-permissions

## Are these changes tested?

## Are there any user-facing changes?

Labels

core Core DataFusion crate functions Changes to functions implementation optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Count() and Count(Distinct )should accept multiple exprs

6 participants