Is your feature request related to a problem or challenge?
Currently datafusion supports topk_aggregation(#7192 ), and I've noticed two potential areas for optimization:
topk_aggregation only supports ordering by aggregations (order by agg), but doesn't support ordering by columns
topk_aggregation cannot be used in cases where there is no order by clause
Describe the solution you'd like
-
For the first one, I think we can directly use the current priority queue solution
-
For the second one:
The following query which without an ORDER BY clause is a non-deterministic query where returning aggregate results for any 10 keys is valid. Therefore, the simplest optimization method is to make the aggregate operation work in an ordered manner.
SELECT "UserID", MIN("AdvEngineID") FROM hits GROUP BY "UserID" order by MIN("AdvEngineID") LIMIT 10;
When using ORDER BY, it runs faster
> SELECT "UserID", MIN("AdvEngineID") FROM hits GROUP BY "UserID" order by MIN("AdvEngineID") LIMIT 10;
+---------------------+-----------------------+
| UserID | min(hits.AdvEngineID) |
+---------------------+-----------------------+
| 8265925904823819813 | 0 |
| 4187744066815097140 | 0 |
| 7970073217952173230 | 0 |
| 1427747163043990597 | 0 |
| 844431718317972727 | 0 |
| 5777095312382298493 | 0 |
| 1726996514541263413 | 0 |
| 9126602271142633613 | 0 |
| 137961262398427389 | 0 |
| 1015195936484711824 | 0 |
+---------------------+-----------------------+
10 row(s) fetched.
Elapsed 0.389 seconds.
> SELECT "UserID", MIN("AdvEngineID") FROM hits GROUP BY "UserID" LIMIT 10;
+--------------------+-----------------------+
| UserID | min(hits.AdvEngineID) |
+--------------------+-----------------------+
| 572919489234519776 | 0 |
| 573305205053433738 | 0 |
| 573316808606264402 | 0 |
| 573764012887855352 | 0 |
| 573878311420505425 | 0 |
| 574390391777319344 | 0 |
| 574747808491966822 | 0 |
| 574822330152391359 | 0 |
| 574882842092706724 | 0 |
| 576132580282803970 | 0 |
+--------------------+-----------------------+
10 row(s) fetched.
Elapsed 0.632 seconds.
Describe alternatives you've considered
No response
Additional context
Just found that the TopKAggregation only support group by one column, I'm confused.
|
let group_key = aggr.group_expr().expr().iter().exactly_one().ok()?; |
Is your feature request related to a problem or challenge?
Currently datafusion supports topk_aggregation(#7192 ), and I've noticed two potential areas for optimization:
topk_aggregationonly supports ordering by aggregations (order by agg), but doesn't support ordering by columnstopk_aggregationcannot be used in cases where there is noorder byclauseDescribe the solution you'd like
For the first one, I think we can directly use the current priority queue solution
For the second one:
The following query which without an
ORDER BYclause is a non-deterministic query where returning aggregate results for any 10 keys is valid. Therefore, the simplest optimization method is to make the aggregate operation work in an ordered manner.When using ORDER BY, it runs faster
Describe alternatives you've considered
No response
Additional context
Just found that the
TopKAggregationonly support group by one column, I'm confused.datafusion/datafusion/physical-optimizer/src/topk_aggregation.rs
Line 57 in 0f5634e