Perf: Window topn optimisation by SubhamSinghal · Pull Request #21479 · apache/datafusion

SubhamSinghal · 2026-04-08T17:50:08Z

Which issue does this PR close?

Related to Optimize "per partition" top-k : ROW_NUMBER < 5 / TopK #6899.

Rationale for this change

Queries like SELECT *, ROW_NUMBER() OVER (PARTITION BY pk ORDER BY val) as rn FROM t WHERE rn <= K are extremely common in analytics ("top N per group"). The current plan sorts the entire dataset O(N log N), computes ROW_NUMBER for all rows, then filters. With 10M rows, 1K partitions, and K=3, we sort all 10M rows but only keep 3K.

This PR introduces a PartitionedTopKExec operator that replaces the SortExec, maintaining a per-partition TopK heap (reusing DataFusion's existing TopK implementation). Cost drops to O(N log K) time and O(K × P × row_size) memory, where P is the number of partitions.
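To put rough numbers on the example above, here is a back-of-the-envelope sketch. The function names are hypothetical and not part of the PR, constant factors are ignored, and `topk_cost` assumes each row costs at most one heap update.

```rust
/// Comparison-count estimate for sorting all `n` rows: n * log2(n).
/// Constant factors are ignored, so this is only a scale estimate.
fn full_sort_cost(n: u64) -> f64 {
    let n = n as f64;
    n * n.log2()
}

/// Comparison-count estimate for feeding `n` rows through heaps bounded at `k`:
/// n * log2(k), clamped so that k = 1 still counts one comparison per row.
fn topk_cost(n: u64, k: u64) -> f64 {
    let n = n as f64;
    n * (k as f64).log2().max(1.0)
}

/// Upper bound on rows held in memory: `k` rows for each of `partitions` heaps.
fn retained_rows(k: u64, partitions: u64) -> u64 {
    k * partitions
}
```

With N = 10M, P = 1K and K = 3, `retained_rows` gives the 3K rows mentioned above, and the estimated comparison count drops by roughly an order of magnitude.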

What changes are included in this PR?

New physical operator: PartitionedTopKExec (physical-plan/src/sorts/partitioned_topk.rs)

Reads unsorted input, groups rows by partition key using RowConverter, feeds sub-batches to a per-partition TopK heap
Emits only the top-K rows per partition in sorted (partition_keys, order_keys) order
Reuses the existing TopK implementation for heap management, sort key comparison, eviction, and batch compaction
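The behaviour described above can be sketched with the standard library alone. This is a simplified illustration, not the operator's code: it assumes a single integer partition key and a single ascending integer order key, and it uses `BTreeMap` and `BinaryHeap` in place of RowConverter and DataFusion's TopK. The function name is hypothetical.

```rust
use std::collections::{BTreeMap, BinaryHeap};

/// Keep the `k` smallest `val`s for each partition key `pk`,
/// then emit rows sorted by (pk, val).
fn partitioned_top_k(rows: &[(i32, i32)], k: usize) -> Vec<(i32, i32)> {
    if k == 0 {
        return Vec::new();
    }
    // One bounded max-heap per partition; BTreeMap keeps partitions ordered by key.
    let mut heaps: BTreeMap<i32, BinaryHeap<i32>> = BTreeMap::new();
    for &(pk, val) in rows {
        let heap = heaps.entry(pk).or_default();
        if heap.len() < k {
            heap.push(val);
        } else if let Some(mut top) = heap.peek_mut() {
            // The heap root is the largest retained value; evict it if `val` is smaller.
            if val < *top {
                *top = val; // PeekMut restores heap order when dropped
            }
        }
    }
    let mut out = Vec::new();
    for (pk, heap) in heaps {
        // into_sorted_vec yields ascending order, matching ORDER BY val ASC.
        for val in heap.into_sorted_vec() {
            out.push((pk, val));
        }
    }
    out
}
```

Each input row costs at most one O(log K) heap update, and at most K values are retained per partition.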

New optimizer rule: WindowTopN (physical-optimizer/src/window_topn.rs)

Detects the pattern:

FilterExec(rn <= K)
  [optional ProjectionExec]
    BoundedWindowAggExec(ROW_NUMBER PARTITION BY ... ORDER BY ...)
      SortExec(partition_keys, order_keys)

And replaces it with:

[optional ProjectionExec]
  BoundedWindowAggExec(ROW_NUMBER PARTITION BY ... ORDER BY ...)
    PartitionedTopKExec(fetch=K)

Both FilterExec and SortExec are removed.

Supported predicates: rn <= K, rn < K, K >= rn, K > rn.
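As an illustration of how the four predicate shapes could map to a per-partition fetch, here is a minimal sketch. The `Op` and `RnSide` enums and the function name are hypothetical stand-ins, not the types used in the rule. Since ROW_NUMBER starts at 1, a strict bound `rn < K` keeps K - 1 rows.

```rust
/// Comparison operators relevant to the rewrite (hypothetical stand-in type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Which side of the comparison holds the ROW_NUMBER column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RnSide {
    Left,
    Right,
}

/// Returns the per-partition row limit implied by the predicate,
/// or None when the predicate is not an upper bound on rn.
fn fetch_from_predicate(side: RnSide, op: Op, k: u64) -> Option<u64> {
    match (side, op) {
        // rn <= K  and  K >= rn
        (RnSide::Left, Op::LtEq) | (RnSide::Right, Op::GtEq) => Some(k),
        // rn < K  and  K > rn: strict bound, so one row fewer
        (RnSide::Left, Op::Lt) | (RnSide::Right, Op::Gt) => Some(k.saturating_sub(1)),
        // rn > K, rn >= K, and their flipped forms are lower bounds
        _ => None,
    }
}
```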

The rule only fires for ROW_NUMBER with a PARTITION BY clause. Global top-K (no PARTITION BY) is already handled by SortExec with fetch.

Config flag: datafusion.optimizer.enable_window_topn (default: false)

Benchmark results (H2O groupby Q8, 10M rows, top-2 per partition):

cargo run --release --example h2o_window_topn_bench

Scenario	Enabled (ms)	Disabled (ms)	Speedup
100 partitions (100K rows/part)	43	174	4.0x
1K partitions (10K rows/part)	71	146	2.1x
10K partitions (1K rows/part)	619	128	0.2x (regression)
100K partitions (100 rows/part)	4368	135	0.03x (regression)

The regressions at 10K and 100K partitions are expected: per-partition TopK overhead (RowConverter, MemoryReservation per instance) dominates when partitions are very numerous with few rows each. For moderate partition cardinality (100 to 1K partitions here), the optimization provides a 2-4x speedup.
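The Speedup column is the disabled time divided by the enabled time. A small sketch that recomputes it from the table and flags the regressions (the constant and function names are hypothetical; the timings are copied from the table):

```rust
/// (scenario, enabled ms, disabled ms), copied from the benchmark table.
const RESULTS: [(&str, f64, f64); 4] = [
    ("100 partitions", 43.0, 174.0),
    ("1K partitions", 71.0, 146.0),
    ("10K partitions", 619.0, 128.0),
    ("100K partitions", 4368.0, 135.0),
];

/// Speedup of the optimized plan: values below 1.0 are regressions.
fn speedup(enabled_ms: f64, disabled_ms: f64) -> f64 {
    disabled_ms / enabled_ms
}

/// Names of the scenarios where enabling the rule made the query slower.
fn regressions() -> Vec<&'static str> {
    RESULTS
        .iter()
        .filter(|(_, on, off)| speedup(*on, *off) < 1.0)
        .map(|(name, _, _)| *name)
        .collect()
}
```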

Are these changes tested?

Yes:

7 unit tests (core/tests/physical_optimizer/window_topn.rs): basic ROW_NUMBER, rn < K, flipped predicates, non-window column filter, config disabled, no partition by, projection between filter and window
5 SLT tests (sqllogictest/test_files/window_topn.slt): correctness verification, EXPLAIN plan validation, rn < K, no-partition-by case, config disabled fallback

Are there any user-facing changes?

No breaking API changes. The optimization is disabled by default and transparent to users. It can be enabled via:

SET datafusion.optimizer.enable_window_topn = true;

2010YOUY01

Thank you — this PR looks really nice.

I took a quick look and left a few suggestions. I’ll review the optimizer rewrite and execution side more carefully later.

2010YOUY01 · 2026-04-09T04:23:25Z

+// specific language governing permissions and limitations
+// under the License.
+
+// Standalone H2O groupby Q8 benchmark: PartitionedTopKExec enabled vs disabled


We could keep this benchmark in this PR, but it would be great to clean it up later.
To make benchmark maintenance easier, we could directly add queries representing this workload to h2o window benchmark, so that similar benchmarks won't get scattered to multiple places.

datafusion/benchmarks/bench.sh

Line 123 in e1ad871

h2o_small_window: Extended h2oai benchmark with small dataset (1e7 rows) for window, default file format is csv

One issue, though: the h2o benchmark currently counts dataset loading time, so we can't isolate the target executor's processing time. We could add an option to exclude the data loading time later 🤔

Shall I keep the benchmark query in the h2o benchmark in this PR, or shall we do it once we eliminate the data loading time?

I prefer to move the benchmark in this PR into the h2o framework. We could directly add queries to the h2o-window queries, since it's not a standard benchmark.

2010YOUY01 · 2026-04-09T04:34:08Z

+        // Step 1: Match FilterExec at the top
+        let filter = plan.downcast_ref::<FilterExec>()?;
+
+        // Don't handle filters with projections


I'm curious why this case is skipped.

The filter's column indices would point to the projected schema, not the window exec's output schema, so our index-based matching for the ROW_NUMBER column would be wrong without resolving the projection mapping. Skipping this case for simplicity right now.

Yes, it's a good idea to keep things simple at the start.

Could you file a PR for this follow-up work? I'm happy to do it also.

2010YOUY01 · 2026-04-09T04:39:24Z

+        )?))
+    }
+
+    fn apply_expressions(


Not related to this PR, but I’m curious why this is a required ExecutionPlan API and when it is used, given that different operators can hold expressions for very different purposes 🤔

2010YOUY01 · 2026-04-09T04:48:11Z

+# Tests for Window TopN optimization: PartitionedTopKExec
+
+statement ok
+CREATE TABLE window_topn_t (id INT, pk INT, val INT) AS VALUES


I suggest moving the main test coverage here, instead of keeping it in unit tests across different layers such as optimizer tests. Once we have solid coverage here, it is less likely to get lost during local refactors.

We can also extend the coverage with more edge cases, for example:

predicates such as rn < 2, 2 > rn, etc.

mixing other window expressions with row_number()

empty or overlapping partition / order keys, such as ... OVER (ORDER BY id) or ... OVER (PARTITION BY id ORDER BY id, customer)

different sort options such as ASC, DESC, and NULLS FIRST

the QUALIFY clause https://datafusion.apache.org/user-guide/sql/select.html#qualify-clause

and more

added tests for these cases

Dandandan · 2026-04-10T07:01:40Z

datafusion.optimizer.enable_window_topn

If it has regressions as large as 0.03x, it should be off by default (and we should look at whether we can enable it automatically via a heuristic or stats based on partition cardinality / rows).

2010YOUY01

I have reviewed it carefully, and it looks good to me.

I think it’s ready to go once the output batch coalescing is addressed (see comment). The other suggestions are best handled in follow-up PRs to keep this PR simple and focused.

2010YOUY01 · 2026-04-12T03:31:01Z

+        }};
+    }
+
+    // ---------- Accumulation phase ----------


Optimization to try as follow-up:
To make it faster, we might want to add a fast path for a single partition key such as PARTITION BY a, since row conversion is unnecessary in that case.

Co-authored-by: Yongting You <2010youy01@gmail.com>

2010YOUY01

Thanks again!

I plan to merge it after 1-2 days, in case others want to review it again.

mbutrovich · 2026-04-16T02:01:30Z

This seems to have compilation issues against main, despite not having merge conflicts. I've opened a revert PR, sorry about that @SubhamSinghal. I think we definitely want this optimization in.

2010YOUY01 · 2026-04-16T03:19:31Z

Thanks for catching this promptly; we got it fixed.

Probably we should manually re-trigger CI before merging for large PRs, until the merge queue is able to handle this 🤔

Which issue does this PR close? `main` does not compile because of a merge race between #21479 and #21573. This PR fixes the conflict.


Subham Singhal added 2 commits April 8, 2026 22:42

Benchmark window topn optimisation

38fa07a

Lint fix

52147dd

github-actions Bot added the optimizer, core, sqllogictest, common, and physical-plan labels Apr 8, 2026

2010YOUY01 reviewed Apr 9, 2026

2010YOUY01 changed the title from "Benchmark: Window topn optimisation" to "Perf: Window topn optimisation" Apr 9, 2026

Resolve comment

48fd178

github-actions Bot added the documentation label Apr 9, 2026

Subham Singhal added 2 commits April 9, 2026 19:59

Adds UT

5c2c0fb

Fix build failure

ca5a1ae

2010YOUY01 reviewed Apr 12, 2026

Subham Singhal and others added 5 commits April 12, 2026 11:21

Adds BatchCoaleser

ad73410

Apply suggestions from code review

ec15954

Co-authored-by: Yongting You <2010youy01@gmail.com>

Fix linting

c03de69

Fix build failure

26076d9

Merge branch 'main' into window-topn-partitioned-topk-exec

87c9e84

This was referenced Apr 13, 2026

Simplify WindowTopN by moving it before EnforceSorting #21594

Open

Support FilterExec with embedded projections in WindowTopN optimization #21596

Open

2010YOUY01 approved these changes Apr 14, 2026

Subham Singhal added 2 commits April 14, 2026 19:37

Adds h2o benchmark

da2cb09

Merge branch 'main' into window-topn-partitioned-topk-exec

edcf73f

2010YOUY01 added this pull request to the merge queue Apr 16, 2026

Merged via the queue into apache:main with commit 936db37 Apr 16, 2026
36 checks passed

mbutrovich added a commit that referenced this pull request Apr 16, 2026

Revert "Perf: Window topn optimisation (#21479)"

469ef1b

This reverts commit 936db37.

mbutrovich mentioned this pull request Apr 16, 2026

Revert "Perf: Window topn optimisation" #21661

Closed

2010YOUY01 mentioned this pull request Apr 16, 2026

fix: Fix compilation error on main #21664

Merged

SubhamSinghal deleted the window-topn-partitioned-topk-exec branch April 18, 2026 10:17
