Questionable hash seed reuse between RepartitionExec and HashJoinExec #15620

@ctsk

Describe the bug

HashJoinExec and RepartitionExec both instantiate an ahash::RandomState with the same seed:

Partitioning::Hash(exprs, num_partitions) => BatchPartitionerState::Hash {
    exprs,
    num_partitions,
    // Use fixed random hash
    random_state: ahash::RandomState::with_seeds(0, 0, 0, 0),
    hash_buffer: vec![],
},

let random_state = RandomState::with_seeds(0, 0, 0, 0);

This means that both operators compute the same hash function. This is problematic when a HashJoinExec has a RepartitionExec as a child: because the RepartitionExec partitions based on the lowest k bits of the hash, each HashJoinStream finds that all the hashes it computes have the same k lowest bits! In theory, this could make the HashTable work less efficiently or have more collisions than expected.
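To make the effect concrete, here is a minimal sketch using only the standard library. `seeded_hash` is a hypothetical stand-in (the SplitMix64 finalizer) for a seeded hasher such as `ahash::RandomState::with_seeds`; it is not the hash DataFusion uses, and the helper names are made up for illustration. The sketch assumes partitioning by `hash % num_partitions` with a power-of-two partition count, which is what "lowest k bits" amounts to.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a seeded hash (SplitMix64 finalizer),
// NOT the ahash function that DataFusion actually uses.
fn seeded_hash(seed: u64, key: u64) -> u64 {
    let mut z = (key ^ seed).wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}

/// Keys that a hash repartitioner using `seed` would route to `partition`.
fn keys_in_partition(seed: u64, keys: &[u64], num_partitions: u64, partition: u64) -> Vec<u64> {
    keys.iter()
        .copied()
        .filter(|&k| seeded_hash(seed, k) % num_partitions == partition)
        .collect()
}

/// Number of distinct values taken by the lowest `k` bits of the hashes
/// that a downstream operator using `seed` computes for `keys`.
fn distinct_low_bits(seed: u64, keys: &[u64], k: u32) -> usize {
    let mask = (1u64 << k) - 1;
    keys.iter()
        .map(|&key| seeded_hash(seed, key) & mask)
        .collect::<HashSet<u64>>()
        .len()
}
```

With 8 partitions (k = 3), calling `distinct_low_bits` with the same seed on the output of `keys_in_partition` yields a single distinct value: every hash the downstream operator sees ends in the same three bits.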

Why in theory? Because despite my best attempts, I could not construct a benchmark that showed that changing the hash seed for HJ made a performance difference. I suspect this is due to a combination of factors: the underlying hash table uses open addressing, hashbrown stores a bitmask based on the highest bits of the hash in the buckets to do some early filtering, and there are other bottlenecks in the HJ.

Is there anything I missed?

To Reproduce

Use this branch: https://github.com/ctsk/datafusion/tree/experiment/collision-reproducer

I patched the HashJoinExec so that it emits a warning when all row hashes of a batch share common least significant bits.
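A check of this kind can be sketched as follows. This is not the code from the branch above; `common_low_bits` is a hypothetical helper that only illustrates how one might count the least significant bits shared by all hashes in a batch.

```rust
/// Number of least significant bits that are identical across all `hashes`.
/// Returns 0 for an empty slice and 64 when every hash is the same.
fn common_low_bits(hashes: &[u64]) -> u32 {
    let first = match hashes.first() {
        Some(&h) => h,
        None => return 0,
    };
    // OR together the bit differences against the first hash; the lowest
    // set bit of the result is the first position where two hashes disagree.
    let diff = hashes.iter().fold(0u64, |acc, &h| acc | (h ^ first));
    diff.trailing_zeros()
}
```

A warning could then be emitted whenever the result for a reasonably large batch is greater than zero, since independent hashes would almost never agree on their lowest bit.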

Run RUST_LOG=warn ./bench.sh run tpch and see lots of warnings emitted.

Expected behavior

No response

Additional context

The fix is simple: change the hash seed in HashJoinExec, or don't provide any seed (use Default::default() like we do in AggregationExec).
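A rough sketch of why a different seed should help, assuming a hash table that picks its bucket from the low bits of the hash. `seeded_hash` is a hypothetical stand-in (the SplitMix64 finalizer), not ahash, and the partition and bucket counts are arbitrary illustration values.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a seeded hash (SplitMix64 finalizer),
// NOT the ahash function that DataFusion actually uses.
fn seeded_hash(seed: u64, key: u64) -> u64 {
    let mut z = (key ^ seed).wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}

/// Take the keys that land in partition 0 of 8 under `partition_seed`,
/// hash them again with `join_seed`, and count how many of 64 buckets
/// (bucket index = low 6 bits of the hash) are hit at least once.
fn buckets_used(partition_seed: u64, join_seed: u64) -> usize {
    const NUM_PARTITIONS: u64 = 8;
    const NUM_BUCKETS: u64 = 64;
    (0..10_000u64)
        .filter(|&k| seeded_hash(partition_seed, k) % NUM_PARTITIONS == 0)
        .map(|k| seeded_hash(join_seed, k) & (NUM_BUCKETS - 1))
        .collect::<HashSet<u64>>()
        .len()
}
```

With the same seed on both sides, `buckets_used(0, 0)` can reach at most 8 of the 64 buckets, because the low 3 bits of every hash are fixed by the partitioning. With a different join seed, for example `buckets_used(0, 0x5eed)`, the keys spread over many more buckets. Whether this translates into a measurable speedup in a real join is exactly the open question raised above.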
