perf: optimize `HashTableLookupExpr::evaluate` by UBarney · Pull Request #19602 · apache/datafusion

UBarney · 2026-01-02T08:19:38Z

Which issue does this PR close?

Closes #.

Rationale for this change

The previous implementation of HashTableLookupExpr::evaluate relied on per-row calls to get_matched_indices, which incurred unnecessary performance overhead:

Memory Overhead: Each per-row call triggered small Vec allocations and potential resizes, leading to pressure on the memory allocator.
Redundant Computation: get_matched_indices traverses the entire hash chain to find all matches, which is unnecessary when we only need to verify the existence of a key.

Performance Results (TPC-H)

The following TPC-H results were obtained with DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ baseline@9a9ff ┃  optimized ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │      679.51 ms │  728.06 ms │  1.07x slower │
│ QQuery 2     │      388.33 ms │  384.11 ms │     no change │
│ QQuery 3     │      864.38 ms │  856.27 ms │     no change │
│ QQuery 4     │      458.46 ms │  468.26 ms │     no change │
│ QQuery 5     │     1614.26 ms │ 1525.65 ms │ +1.06x faster │
│ QQuery 6     │      611.20 ms │  610.06 ms │     no change │
│ QQuery 7     │      950.39 ms │  940.13 ms │     no change │
│ QQuery 8     │     1214.86 ms │ 1218.21 ms │     no change │
│ QQuery 9     │     2657.61 ms │ 2482.09 ms │ +1.07x faster │
│ QQuery 10    │     1050.70 ms │ 1001.96 ms │     no change │
│ QQuery 11    │      383.92 ms │  347.27 ms │ +1.11x faster │
│ QQuery 12    │      963.14 ms │  920.78 ms │     no change │
│ QQuery 13    │      473.68 ms │  480.97 ms │     no change │
│ QQuery 14    │      363.36 ms │  345.27 ms │     no change │
│ QQuery 15    │      960.56 ms │  955.05 ms │     no change │
│ QQuery 16    │      281.95 ms │  267.34 ms │ +1.05x faster │
│ QQuery 17    │     5306.43 ms │ 4983.21 ms │ +1.06x faster │
│ QQuery 18    │     3415.11 ms │ 3016.52 ms │ +1.13x faster │
│ QQuery 19    │      761.67 ms │  759.49 ms │     no change │
│ QQuery 20    │      650.20 ms │  642.40 ms │     no change │
│ QQuery 21    │     3111.85 ms │ 2833.05 ms │ +1.10x faster │
│ QQuery 22    │      141.75 ms │  143.06 ms │     no change │
└──────────────┴────────────────┴────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary             ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (baseline@9a9ff)   │ 27303.30ms │
│ Total Time (optimized)        │ 25909.21ms │
│ Average Time (baseline@9a9ff) │  1241.06ms │
│ Average Time (optimized)      │  1177.69ms │
│ Queries Faster                │          7 │
│ Queries Slower                │          1 │
│ Queries with No Change        │         14 │
│ Queries with Failure          │          0 │
└───────────────────────────────┴────────────┘

Note that Q1 does not involve HashJoin.

Note on Configuration

Benchmarks were conducted with DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true because HashTableLookupExpr::evaluate is NOT invoked under default settings.

I manually added dbg!(&num_rows) at L335 in partitioned_hash_eval.rs and confirmed that the logic path is only triggered when this flag is enabled. Under default settings, HashTableLookupExpr::evaluate is not called; . I am uncertain if this current behavior is intentional.

What changes are included in this PR?

Added JoinHashMapType::contain_hashes: A new trait method that processes
a batch of hashes and updates a bitmask for existing keys.
Refactored HashTableLookupExpr::evaluate: Switched from per-row lookups to
the new batch API.

Are these changes tested?

Yes

Are there any user-facing changes?

NO

…ists

Copilot

Pull request overview

This PR optimizes HashTableLookupExpr::evaluate by replacing per-row hash table lookups with a batch processing API. The previous implementation made individual get_matched_indices calls for each row, causing unnecessary memory allocations and redundant computations. The new approach uses a single batch call that sets bits in a buffer for all matching hashes at once, resulting in notable performance improvements across multiple TPC-H queries (up to 1.13x faster on Q18).

Key Changes:

Introduced set_bits_if_exists trait method for batch hash lookups
Refactored HashTableLookupExpr::evaluate to use the new batch API
Added comprehensive test coverage for the new functionality

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
`datafusion/physical-plan/src/joins/join_hash_map.rs`	Adds `set_bits_if_exists` trait method to `JoinHashMapType` and implements it for `JoinHashMapU32` and `JoinHashMapU64` with corresponding helper function and tests
`datafusion/physical-plan/src/joins/stream_join_utils.rs`	Implements `set_bits_if_exists` for `PruningJoinHashMap` to support the new batch lookup API
`datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs`	Refactors `HashTableLookupExpr::evaluate` to use the new batch API instead of per-row lookups

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Dandandan · 2026-01-02T11:55:41Z

-            }
-        }
+        self.hash_map
+            .set_bits_if_exists(hash_array.values(), buf.as_slice_mut());


I think we could use with_hashes / reuse hashes buffer instead of allocating a new one each time.

Done. Reusing the hashes buffer now. Thanks!

Dandandan · 2026-01-02T11:56:27Z

Very nice! I added 2 suggestions which could maybe improve it even further.

adriangb · 2026-01-02T22:49:34Z

Under default settings, HashTableLookupExpr::evaluate is not called; . I am uncertain if this current behavior is intentional.

Yes that's per design. It's only used when the filter is pushed down into and evaluated row by row by the Parquet machinery. There's work to make that the default: #19477

…er: &mut [u8])` -> `contain_hashes(&self, hash_values: &[u64]) -> BooleanArray`

UBarney · 2026-01-03T15:10:26Z

Very nice! I added 2 suggestions which could maybe improve it even further.

Thanks for reviewing! I've applied both suggestions. Could you please take another look? @Dandandan

Dandandan · 2026-01-03T15:52:05Z


+        // Optimization: if hash_expr is HashExpr, compute hashes directly into callback
+        // to avoid redundant allocations and copies.
+        if let Some(hash_expr) = self.hash_expr.as_any().downcast_ref::<HashExpr>() {


Isn't this always the case? We can remove the hashexpr (only store the inner expressions) while it is constructed to simplify the code?

That's a great point. Storing the inner expressions directly is indeed a better approach, especially since the perfect hash join(which is being worked on in that separate pending PR) needs direct access to on_columns to identify the join key columns.

I initially considered changing HashTableLookupExpr.hash_expr to a concrete HashExpr type (which I've actually already done in another unmerged PR for perfect hash join). However, I think that accessing the columns through hash_expr.on_columns feels a bit clunky 😂.

adriangb · 2026-01-04T15:25:47Z

It might be interesting to re-run https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters and see if the numbers are even better now!

UBarney · 2026-01-05T06:59:46Z

It might be interesting to re-run https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters and see if the numbers are even better now!

I re-ran the benchmark(SELECT * FROM small_table JOIN large_table ON small_table.k = large_table.k WHERE small_table.v >= 50;), but the execution time remains almost unchanged at around 7ms. It seems the min-max filter (k@0 >= 50 AND k@0 <= 1000) is already filtering out the majority of the data, which is likely why we don't see a significant difference.
@adriangb

Dandandan · 2026-01-06T10:12:59Z

Thank you @UBarney!

## Which issue does this PR close?  - Closes #. ## Rationale for this change  The previous implementation of `HashTableLookupExpr::evaluate` relied on per-row calls to `get_matched_indices`, which incurred unnecessary performance overhead: 1. **Memory Overhead**: Each per-row call triggered small `Vec` allocations and potential resizes, leading to pressure on the memory allocator. 2. **Redundant Computation**: `get_matched_indices` traverses the entire hash chain to find all matches, which is unnecessary when we only need to verify the existence of a key. ### Performance Results (TPC-H) The following TPC-H results were obtained with **`DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true`:** ``` ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Query ┃ baseline@9a9ff ┃ optimized ┃ Change ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ QQuery 1 │ 679.51 ms │ 728.06 ms │ 1.07x slower │ │ QQuery 2 │ 388.33 ms │ 384.11 ms │ no change │ │ QQuery 3 │ 864.38 ms │ 856.27 ms │ no change │ │ QQuery 4 │ 458.46 ms │ 468.26 ms │ no change │ │ QQuery 5 │ 1614.26 ms │ 1525.65 ms │ +1.06x faster │ │ QQuery 6 │ 611.20 ms │ 610.06 ms │ no change │ │ QQuery 7 │ 950.39 ms │ 940.13 ms │ no change │ │ QQuery 8 │ 1214.86 ms │ 1218.21 ms │ no change │ │ QQuery 9 │ 2657.61 ms │ 2482.09 ms │ +1.07x faster │ │ QQuery 10 │ 1050.70 ms │ 1001.96 ms │ no change │ │ QQuery 11 │ 383.92 ms │ 347.27 ms │ +1.11x faster │ │ QQuery 12 │ 963.14 ms │ 920.78 ms │ no change │ │ QQuery 13 │ 473.68 ms │ 480.97 ms │ no change │ │ QQuery 14 │ 363.36 ms │ 345.27 ms │ no change │ │ QQuery 15 │ 960.56 ms │ 955.05 ms │ no change │ │ QQuery 16 │ 281.95 ms │ 267.34 ms │ +1.05x faster │ │ QQuery 17 │ 5306.43 ms │ 4983.21 ms │ +1.06x faster │ │ QQuery 18 │ 3415.11 ms │ 3016.52 ms │ +1.13x faster │ │ QQuery 19 │ 761.67 ms │ 759.49 ms │ no change │ │ QQuery 20 │ 650.20 ms │ 642.40 ms │ no change │ │ QQuery 21 │ 3111.85 ms │ 2833.05 ms │ +1.10x faster │ │ QQuery 22 │ 141.75 ms │ 143.06 ms │ no change │ └──────────────┴────────────────┴────────────┴───────────────┘ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ Benchmark Summary ┃ ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ Total Time (baseline@9a9ff) │ 27303.30ms │ │ Total Time (optimized) │ 25909.21ms │ │ Average Time (baseline@9a9ff) │ 1241.06ms │ │ Average Time (optimized) │ 1177.69ms │ │ Queries Faster │ 7 │ │ Queries Slower │ 1 │ │ Queries with No Change │ 14 │ │ Queries with Failure │ 0 │ └───────────────────────────────┴────────────┘ ``` Note that Q1 does not involve `HashJoin`. #### Note on Configuration Benchmarks were conducted with `DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true` because `HashTableLookupExpr::evaluate` is **NOT** invoked under default settings. I manually added `dbg!(&num_rows)` at [L335 in `partitioned_hash_eval.rs`](https://github.com/apache/datafusion/blob/9a9ff8d6162b7391736b0b7c82c00cb35b0652a1/datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs#L335) and confirmed that the logic path is only triggered when this flag is enabled. Under default settings, `HashTableLookupExpr::evaluate` is not called; . I am uncertain if this current behavior is intentional. ## What changes are included in this PR? - Added `JoinHashMapType::contain_hashes`: A new trait method that processes a batch of hashes and updates a bitmask for existing keys. - Refactored `HashTableLookupExpr::evaluate`: Switched from per-row lookups to the new batch API.  ## Are these changes tested? Yes  ## Are there any user-facing changes? NO

github-actions Bot added the physical-plan Changes to the physical-plan crate label Jan 2, 2026

perf: optimize HashTableLookupExpr::evaluate by adding set_bits_if_ex…

e49ed78

…ists

UBarney marked this pull request as ready for review January 2, 2026 08:23

Dandandan reviewed Jan 2, 2026

View reviewed changes

Comment thread datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs Outdated

Dandandan requested a review from Copilot January 2, 2026 10:53

Copilot started reviewing on behalf of Dandandan January 2, 2026 10:53 View session

Copilot AI reviewed Jan 2, 2026

View reviewed changes

apache deleted a comment from alamb-ghbot Jan 2, 2026

Dandandan reviewed Jan 2, 2026

View reviewed changes

UBarney added 2 commits January 3, 2026 09:45

`JoinHashMapType::set_bits_if_exists(&self, hash_values: &[u64], buff…

07bb819

…er: &mut [u8])` -> `contain_hashes(&self, hash_values: &[u64]) -> BooleanArray`

use with_hashes to compute hash in HashTableLookupExpr

2ab1a49

UBarney force-pushed the opt_htlExpr_eval branch from 1e684a1 to 2ab1a49 Compare January 3, 2026 14:39

UBarney requested a review from Dandandan January 3, 2026 15:09

Dandandan reviewed Jan 3, 2026

View reviewed changes

Comment thread datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs Outdated

github-actions Bot added the proto Related to proto crate label Jan 4, 2026

inline HashExpr in HashTableLookupExpr

97864c9

UBarney force-pushed the opt_htlExpr_eval branch from 898905f to 97864c9 Compare January 4, 2026 09:53

Dandandan approved these changes Jan 4, 2026

View reviewed changes

UBarney mentioned this pull request Jan 6, 2026

perfect hash join #19411

Merged

Dandandan added this pull request to the merge queue Jan 6, 2026

Merged via the queue into apache:main with commit 5c2ee36 Jan 6, 2026
32 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: optimize `HashTableLookupExpr::evaluate`#19602

perf: optimize `HashTableLookupExpr::evaluate`#19602
Dandandan merged 4 commits intoapache:mainfrom
UBarney:opt_htlExpr_eval

UBarney commented Jan 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Dandandan Jan 2, 2026

Uh oh!

UBarney Jan 3, 2026

Uh oh!

Dandandan commented Jan 2, 2026

Uh oh!

adriangb commented Jan 2, 2026

Uh oh!

UBarney commented Jan 3, 2026

Uh oh!

Dandandan Jan 3, 2026 •

edited

Loading

Uh oh!

UBarney Jan 4, 2026

Uh oh!

Uh oh!

adriangb commented Jan 4, 2026

Uh oh!

UBarney commented Jan 5, 2026

Uh oh!

Uh oh!

Dandandan commented Jan 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

UBarney commented Jan 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

Performance Results (TPC-H)

Note on Configuration

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Dandandan Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

UBarney Jan 3, 2026

Choose a reason for hiding this comment

Uh oh!

Dandandan commented Jan 2, 2026

Uh oh!

adriangb commented Jan 2, 2026

Uh oh!

UBarney commented Jan 3, 2026

Uh oh!

Dandandan Jan 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

UBarney Jan 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adriangb commented Jan 4, 2026

Uh oh!

UBarney commented Jan 5, 2026

Uh oh!

Uh oh!

Dandandan commented Jan 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

UBarney commented Jan 2, 2026 •

edited

Loading

Dandandan Jan 3, 2026 •

edited

Loading