feat: multiple columns in count distinct by Mark1626 · Pull Request #20460 · apache/datafusion

Mark1626 · 2026-02-21T11:38:49Z

Which issue does this PR close?

What changes are included in this PR?

Introduce a separate accumulator for multi-column distinct count, MultiColumnDistinctCountAccumulator.
I used parts of "Count distinct support multiple expressions" #5939 for reference; however, it was old, so I had to reimplement it.

Are these changes tested?

Unit tests have been added.
I've also tested this with a couple of queries in the CLI, for example:

with data AS (
  select * from (values
    ('a', 1, 'x'),
    ('a', 2, 'x'),
    ('b', 2, 'y'),
    ('b', 2, 'z'),
    ('c', 3, 'z')
  ) AS t(col1, col2, col3)
)
select count(distinct (col1, col2)) FROM data;
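The expected result of the query above can be checked by hand. The following is a minimal sketch in plain Rust, not the PR's accumulator: it treats each `(col1, col2)` pair as a single key in a `HashSet`, which is the behaviour the query relies on. The helper name `count_distinct_pairs` is hypothetical.

```rust
use std::collections::HashSet;

/// Counts distinct (col1, col2) pairs; col3 is ignored, as in the query.
/// Hypothetical helper for illustration only.
fn count_distinct_pairs(rows: &[(&str, i64, &str)]) -> usize {
    let mut seen: HashSet<(&str, i64)> = HashSet::new();
    for &(col1, col2, _col3) in rows {
        seen.insert((col1, col2));
    }
    seen.len()
}

fn main() {
    let data = [
        ("a", 1, "x"),
        ("a", 2, "x"),
        ("b", 2, "y"),
        ("b", 2, "z"),
        ("c", 3, "z"),
    ];
    // ("b", 2) appears twice, so the five rows collapse to fewer distinct pairs.
    println!("{}", count_distinct_pairs(&data));
}
```

With this data the pair `('b', 2)` occurs twice, so the five rows yield four distinct pairs.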

comphead

Thanks @Mark1626 for driving this 💪

Before going to code review, let's expand the tests a little to cover the possible cases, specifically:

mixed nulls in values
different column datatypes
3+ cols
different col order
duplicates like select count(distinct a, a), select count(distinct a, a, b, b)

Once those tests pass, we can be reasonably confident the code is stable and ready for review.
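The cases listed above can be modelled outside the engine to decide what the .slt expectations should be. The sketch below is a reference model only, under one loudly stated assumption: a row is skipped when any counted column is NULL. The thread does not confirm that this is the semantics the PR implements. The function `count_distinct` and the fixed three-column `i64` layout are hypothetical, so the "different column datatypes" case is not covered here.

```rust
use std::collections::HashSet;

/// Counts distinct rows over the selected column indices.
/// ASSUMPTION (not confirmed by this PR): a row is skipped when any selected
/// column is NULL, mirroring how single-column COUNT(DISTINCT x) ignores NULLs.
fn count_distinct(rows: &[[Option<i64>; 3]], cols: &[usize]) -> usize {
    let mut seen: HashSet<Vec<i64>> = HashSet::new();
    for row in rows {
        // Collecting into Option<Vec<_>> yields None as soon as one value is NULL.
        let key: Option<Vec<i64>> = cols.iter().map(|&c| row[c]).collect();
        if let Some(key) = key {
            seen.insert(key);
        }
    }
    seen.len()
}

fn main() {
    let rows = [
        [Some(1), Some(2), Some(3)],
        [Some(1), Some(2), Some(3)], // exact duplicate
        [Some(1), None, Some(3)],    // mixed NULL
        [None, None, None],          // all NULL
        [Some(1), Some(5), Some(3)],
    ];
    println!("a, b, c: {}", count_distinct(&rows, &[0, 1, 2]));
    println!("c, b, a: {}", count_distinct(&rows, &[2, 1, 0]));
    println!("a, a:    {}", count_distinct(&rows, &[0, 0]));
    println!("a:       {}", count_distinct(&rows, &[0]));
}
```

Under that assumption, reordering the columns leaves the count unchanged, and `count(distinct a, a)` agrees with `count(distinct a)`, because a repeated column adds no information to the key.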

Mark1626 · 2026-02-23T06:02:06Z

@comphead Sure, I'll expand the tests. Should all these new ones be in .slt?

@Dandandan I'll try using struct.Row; I was wondering how I could improve performance.

jonathanc-n · 2026-02-23T08:00:20Z

Does the sliding accumulator support distinct on multiple columns? We should add a test for it, and block it if it doesn't work (e.g. count(distinct a, b) over ...).

jonathanc-n · 2026-02-23T08:00:26Z

+                    .iter()
+                    .map(|field| {
+                        Arc::new(Field::new(
+                            format_state_name(args.name, "count distinct"),


The same column names will look identical here. We should include the original field name or the column index to differentiate them.
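One way to address this is sketched below, with two caveats. First, `format_state_name` is a local stand-in written so the example compiles on its own; its `name[state]` output format is an assumption about the real DataFusion helper. Second, `indexed_state_names` is a hypothetical function, not code from the PR. It appends the argument index to the state name so that every state field gets a unique name.

```rust
/// Local stand-in for DataFusion's `format_state_name`.
/// ASSUMPTION: the real helper produces "name[state_name]".
fn format_state_name(name: &str, state_name: &str) -> String {
    format!("{name}[{state_name}]")
}

/// Hypothetical: builds one state field name per argument, using the
/// argument index to keep the names distinct.
fn indexed_state_names(agg_name: &str, num_args: usize) -> Vec<String> {
    (0..num_args)
        .map(|i| format_state_name(agg_name, &format!("count distinct {i}")))
        .collect()
}

fn main() {
    for name in indexed_state_names("count(DISTINCT a, b)", 2) {
        println!("{name}");
    }
}
```

Using the index rather than the original field name also keeps the names distinct when the same column is passed twice.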

comphead · 2026-02-23T16:07:26Z

> @comphead Sure, I'll expand the tests. Should all these new ones be in .slt?

I don't have a strong opinion tbh, let's keep them in one .slt for now.

Mark1626 · 2026-03-01T11:49:35Z

I've addressed the review comments. The single_distinct_to_groupby.rs optimizer rule throws an error for select count(distinct c, c).

It uses a HashSet, so distinct c, c is treated as a single column. I'm trying to see if something can be done about this.

datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs

Lines 65 to 67 in 6713439

    
    fn is_single_distinct_agg(aggr_expr: &[Expr]) -> Result<bool> {
        let mut fields_set = HashSet::new();
        let mut aggregate_count = 0;

Mark1626 · 2026-03-01T12:11:03Z

I fixed the issue with single_distinct_to_groupby: count(distinct a, a) is now rewritten as count(distinct a).

The sliding accumulator doesn't support distinct on multiple columns at the moment and currently returns an incorrect result. I'll see if I can reuse the new accumulator there.
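A sliding window needs more than a set of seen keys, because rows leave the frame as well as enter it. The sketch below shows the general technique and is not the PR's code: it keeps a reference count per multi-column key and drops a key only when its count reaches zero. The type `SlidingDistinctCount` and its methods are hypothetical, and keys are simplified to `Vec<i64>`.

```rust
use std::collections::HashMap;

/// Hypothetical sliding distinct counter keyed on a whole row of values.
#[derive(Default)]
struct SlidingDistinctCount {
    counts: HashMap<Vec<i64>, usize>,
}

impl SlidingDistinctCount {
    /// A row enters the window frame.
    fn update(&mut self, key: Vec<i64>) {
        *self.counts.entry(key).or_insert(0) += 1;
    }

    /// A row leaves the window frame; unknown keys are ignored.
    fn retract(&mut self, key: &[i64]) {
        let now_zero = match self.counts.get_mut(key) {
            Some(n) => {
                *n -= 1;
                *n == 0
            }
            None => false,
        };
        if now_zero {
            self.counts.remove(key);
        }
    }

    /// Number of distinct keys currently inside the frame.
    fn evaluate(&self) -> usize {
        self.counts.len()
    }
}

fn main() {
    let mut acc = SlidingDistinctCount::default();
    acc.update(vec![1, 2]);
    acc.update(vec![1, 2]);
    acc.update(vec![3, 4]);
    println!("in frame: {}", acc.evaluate());
    acc.retract(&[1, 2]); // one copy of (1, 2) is still in the frame
    acc.retract(&[1, 2]); // now (1, 2) is gone
    println!("after retracting: {}", acc.evaluate());
}
```

A plain `HashSet` would give the wrong answer after the first retraction here, since it cannot tell that a second copy of `(1, 2)` is still in the frame.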

Mark1626 · 2026-04-02T04:25:44Z

Bumping this up, any review comments on this?

Jefffrey · 2026-04-18T09:04:25Z

+                    .iter()
+                    .map(|field| {
+                        Arc::new(Field::new(
+                            format_state_name(args.name, "count distinct"),


Does this comment still need to be addressed?

Jefffrey · 2026-04-18T09:07:32Z

+                                // De-duplicate args so that e.g. count(distinct c, c)
+                                // is treated as count(distinct c).
+                                // is_single_distinct_agg already verified that all
+                                // unique distinct args across aggregates refer to the
+                                // same single field.
+                                let mut seen = HashSet::new();
+                                args.retain(|arg| {
+                                    seen.insert(arg.schema_name().to_string())
+                                });


This seems a bit odd to handle here. What happens when this rule doesn't fire (e.g. there's another aggregate that prevents the rewrite)?
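The de-duplication idiom in the diff above can be tried in isolation. The sketch below uses plain strings in place of expressions (the diff keys on `arg.schema_name().to_string()`), and the helper name `dedup_preserving_order` is hypothetical. It relies on `HashSet::insert` returning `false` for a value that is already present, so `Vec::retain` keeps only the first occurrence of each argument and preserves the original order.

```rust
use std::collections::HashSet;

/// Removes repeated arguments, keeping the first occurrence of each.
/// Hypothetical helper; strings stand in for expression names.
fn dedup_preserving_order(args: &mut Vec<String>) {
    let mut seen = HashSet::new();
    // `insert` returns false for a repeat, so `retain` drops it.
    args.retain(|arg| seen.insert(arg.clone()));
}

fn main() {
    let mut args: Vec<String> = ["c", "c", "a", "c"].iter().map(|s| s.to_string()).collect();
    dedup_preserving_order(&mut args);
    println!("{args:?}");
}
```

Because the helper does not depend on any optimizer state, the same step could in principle run wherever the aggregate's arguments are first resolved, which is one way to read the concern about the rule not firing.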

## Which issue does this PR close?

N/A

## Rationale for this change

Some PRs are being omitted from stale check because they were in a cache, and the workflow appears to not have permission to delete cache so they are forever stuck as unprocessed. For example in this run: https://github.com/apache/datafusion/actions/runs/24756695077/job/72431314533

Seeing this in logs:

```
[apache#20473] issue skipped due being processed during the previous run
[apache#20460] pull request skipped due being processed during the previous run
[apache#20448] issue skipped due being processed during the previous run
[apache#20443] issue skipped due being processed during the previous run
[apache#20435] issue skipped due being processed during the previous run
[apache#20418] issue skipped due being processed during the previous run
[apache#20417] pull request skipped due being processed during the previous run
[apache#20416] pull request skipped due being processed during the previous run
[apache#20403] pull request skipped due being processed during the previous run
```

And at the end we see this warning:

```
Warning: Error delete _state: [403] Resource not accessible by integration - https://docs.github.com/rest/actions/cache#delete-github-actions-caches-for-a-repository-using-a-cache-key
```

stale workflow uses a cache in case it hits the `operations-per-run` limit meant to prevent API rate limiting (we have default of 30), so it seems we previously hit this limit and some issues/PRs were cached, and have never been uncached since so are never processed again. See: https://github.com/actions/stale#operations-per-run

## What changes are included in this PR?

Give permission to stale workflow to run github actions (like delete cache). See recommended permissions: https://github.com/actions/stale#recommended-permissions

## Are these changes tested?

## Are there any user-facing changes?

feat: multiple columns in count distinct

ac48a2b

github-actions Bot added the core (Core DataFusion crate) and functions (Changes to functions implementation) labels Feb 21, 2026

fix: clippy and slt expected result

183f2fc

github-actions Bot added the sqllogictest (SQL Logic Tests (.slt)) label Feb 21, 2026

fix: slt typo

ec37334

Dandandan reviewed Feb 21, 2026

Comment thread datafusion/functions-aggregate/src/count.rs Outdated

comphead mentioned this pull request Feb 21, 2026

Add support for COUNT(DISTINCT expr, expr1, ...) apache/datafusion-comet#2292

Open

comphead reviewed Feb 21, 2026

Comment thread datafusion/core/tests/sql/aggregates/basic.rs

comphead reviewed Feb 21, 2026

jonathanc-n reviewed Feb 23, 2026

zhangxffff mentioned this pull request Feb 24, 2026

perf: Use Arrow vectorized eq kernel for IN list with column references #20528

Merged

feat: Use arrow-row and add more UTs for MultiColumnDistinctCountAccumulator

cc63fc7

Mark1626 added 2 commits March 1, 2026 17:26

lint: Fix clippy errors

343a9ff

feat: Handle single_distinct_to_groupby for distinct with multi columns

16f7785

github-actions Bot added the optimizer (Optimizer rules) label Mar 1, 2026

feat: Support multi count accumulator for sliding window

1ec2237

Mark1626 requested review from Dandandan and comphead March 3, 2026 05:47

Jefffrey reviewed Apr 18, 2026
