Skip to content

Optimize "LIMIT" queries for speed / memory with special TopK operator #7196

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

This pattern is common:

SELECT c1, c2
FROM t
ORDER BY c3
LIMIT 10

For example we have queries in IOx like the following (this is the same pattern @NGA-TRAN describes on #7162)

SELECT tag, value1, ...
FROM t
WHERE other_column = 'foo'
ORDER BY time
LIMIT 10

Describe the solution you'd like

If the data IS NOT already sorted, what happens today is a plan like

LIMIT(fetch=10)
  SORT(sort_exprs=[c3] fetch=10)
    SCAN(...)

And the Sort can take partial advantage of the fetch -- and it will be better after @gruuya 's change in #7180

We can probably do better still with a special operator like the following that uses some specialized structure (perhaps some type of heap)

TOPK(fetch=10, sort_exprs=[c3])
    SCAN(...)

Describe alternatives you've considered

If the data is already sorted the right way, DataFusion can just read first N values and stop as described on #7162

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions