perf: Use Hashbrown for array_distinct (#20538)

Merged: mbutrovich merged 1 commit into apache:main from neilconway:neilc/array-distinct-hashbrown on Feb 25, 2026
@neilconway
Contributor

Which issue does this PR close?

N/A

Rationale for this change

#20364 recently optimized `array_distinct` to use batched row conversion. As part of that PR, `std::HashSet` was used. This PR just replaces `std::HashSet` with `hashbrown::HashSet`, which measurably improves performance.
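As a rough illustration of the pattern (not the actual DataFusion code), de-duplicating encoded rows with a hash set might look like the sketch below. The helper name `distinct_rows` and the `&[u8]` row representation are assumptions made for this example. The set is generic over the hasher `S`, so choosing a different hasher is a change at the call site only. The sketch uses only the standard library.

```rust
use std::collections::HashSet;
use std::hash::BuildHasher;

/// Hypothetical helper (not DataFusion's code): returns the distinct rows
/// in first-seen order, using a hash set parameterized by the hasher `S`.
pub fn distinct_rows<'a, S>(rows: &[&'a [u8]]) -> Vec<&'a [u8]>
where
    S: BuildHasher + Default,
{
    // Pre-size the set for the worst case where every row is unique.
    let mut seen: HashSet<&'a [u8], S> =
        HashSet::with_capacity_and_hasher(rows.len(), S::default());
    let mut out = Vec::new();
    for &row in rows {
        // `insert` returns true only the first time a value is seen.
        if seen.insert(row) {
            out.push(row);
        }
    }
    out
}
```

Calling `distinct_rows::<std::collections::hash_map::RandomState>(&rows)` uses the standard library's default hasher. Any other type implementing `BuildHasher + Default` can be substituted without touching the function body.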

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added the functions Changes to functions implementation label Feb 25, 2026
@neilconway
Contributor Author

Benchmarks:

  ┌────────────────────┬──────────────┬───────────┬─────────────┐
  │     Benchmark      │ std::HashSet │ hashbrown │ Improvement │
  ├────────────────────┼──────────────┼───────────┼─────────────┤
  │ high_duplicate/10  │ 145.00 µs    │ 68.94 µs  │ -52.4%      │
  ├────────────────────┼──────────────┼───────────┼─────────────┤
  │ high_duplicate/50  │ 719.46 µs    │ 350.68 µs │ -51.3%      │
  ├────────────────────┼──────────────┼───────────┼─────────────┤
  │ high_duplicate/100 │ 1.404 ms     │ 674.83 µs │ -52.0%      │
  ├────────────────────┼──────────────┼───────────┼─────────────┤
  │ low_duplicate/10   │ 180.33 µs    │ 107.87 µs │ -40.0%      │
  ├────────────────────┼──────────────┼───────────┼─────────────┤
  │ low_duplicate/50   │ 849.26 µs    │ 530.89 µs │ -37.6%      │
  ├────────────────────┼──────────────┼───────────┼─────────────┤
  │ low_duplicate/100  │ 1.706 ms     │ 987.72 µs │ -42.0%      │
  └────────────────────┴──────────────┴───────────┴─────────────┘

Contributor

@comphead comphead left a comment

Thanks @neilconway it is amazing how 1 line can change so much

@mbutrovich mbutrovich added this pull request to the merge queue Feb 25, 2026
Merged via the queue into apache:main with commit e894a03 Feb 25, 2026
28 checks passed
@neilconway neilconway deleted the neilc/array-distinct-hashbrown branch February 25, 2026 18:58
@neilconway
Contributor Author

> Thanks @neilconway it is amazing how 1 line can change so much

Indeed; the choice of hash function in std seems a bit silly, at least for our use-case. #19869 might be worth considering...
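For context, the standard library's default hasher is currently SipHash 1-3, which favors resistance to hash-flooding attacks over raw speed. The hasher is pluggable, though. The sketch below wires a hand-written 64-bit FNV-1a hasher into `std::collections::HashSet`. FNV-1a is only a simple stand-in chosen for this example; it is not what hashbrown uses by default, and the names `Fnv1a` and `FnvHashSet` are hypothetical.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Illustrative 64-bit FNV-1a hasher; a stand-in, not hashbrown's default.
pub struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            // XOR the byte in first, then multiply by the FNV 64-bit prime.
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A std `HashSet` that hashes with `Fnv1a` instead of the default SipHash.
pub type FnvHashSet<T> = HashSet<T, BuildHasherDefault<Fnv1a>>;
```

A set is then created with `FnvHashSet::default()` and used like any other `HashSet`. Whether a non-DoS-resistant hasher is acceptable depends on whether the keys can be controlled by an untrusted party.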

de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026