Don't scan first column on empty projection #3214

@Dandandan

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Depends on: #2603

When we run a query that does not need any column values, like SELECT COUNT(1) FROM table, the plan still always reads the first column (whichever one that is). This is inefficient: for formats like Parquet we can avoid scanning / reading the column and just produce the row counts. For non-columnar formats we can avoid unnecessary parsing (or implement a fast path, e.g. only counting lines).

Projection: Count(1)
  TableScan: test projection=[a]

Should become:

Projection: Count(1)
  TableScan: test projection=[]
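
To illustrate the "only counting lines" fast path for non-columnar formats, here is a minimal sketch using only the Rust standard library. `count_rows_fast` is a hypothetical helper, not an existing DataFusion function, and it assumes newline-delimited records with no newlines embedded inside quoted fields and no trailing blank lines.

```rust
use std::io::{self, BufRead};

/// Hypothetical fast path: count data rows in newline-delimited input
/// without splitting or parsing any fields.
///
/// Assumptions: no record contains an embedded newline (e.g. inside quotes)
/// and the input has no trailing blank lines.
pub fn count_rows_fast<R: BufRead>(mut reader: R, has_header: bool) -> io::Result<usize> {
    let mut buf = Vec::new();
    let mut lines = 0usize;
    loop {
        buf.clear();
        // `read_until` returns Ok(0) once the end of the input is reached.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        lines += 1;
    }
    // The header line, if present, is not a data row.
    Ok(if has_header { lines.saturating_sub(1) } else { lines })
}
```

A real CSV reader would still need to handle quoted newlines, so this only shows how much work can be skipped when no column values are required.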

Describe the solution you'd like
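We can push the responsibility for producing batches with the correct number of rows (but no columns) down into the individual readers / other parts of the plan. For an empty projection they should produce RecordBatches that contain no columns but still report the right number of rows.
To enable this, we should remove the line projection.insert(0); from projection push down, which currently forces the first column into the projection.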
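
The idea can be sketched roughly as follows. This is not DataFusion or Arrow code: `Batch`, `Table`, `scan` and `count_rows` are hypothetical stand-ins, written only to show that a reader can honor an empty projection as long as each batch carries its row count explicitly.

```rust
/// Hypothetical stand-in for a RecordBatch: the row count is stored
/// explicitly, so a batch with zero columns still knows its length.
#[derive(Debug, PartialEq)]
pub struct Batch {
    pub num_rows: usize,
    pub columns: Vec<Vec<i64>>,
}

/// Hypothetical columnar table. `num_rows` plays the role of file metadata
/// (e.g. a row count stored in a file footer).
pub struct Table {
    pub num_rows: usize,
    pub columns: Vec<Vec<i64>>,
}

impl Table {
    /// Scan in batches of `batch_size` rows, materializing only the
    /// projected columns. With an empty projection no column data is touched.
    pub fn scan(&self, projection: &[usize], batch_size: usize) -> Vec<Batch> {
        assert!(batch_size > 0, "batch_size must be positive");
        let mut batches = Vec::new();
        let mut start = 0;
        while start < self.num_rows {
            let end = (start + batch_size).min(self.num_rows);
            let columns = projection
                .iter()
                .map(|&i| self.columns[i][start..end].to_vec())
                .collect();
            batches.push(Batch { num_rows: end - start, columns });
            start = end;
        }
        batches
    }
}

/// COUNT(1) only needs the row count of each batch, not any column values.
pub fn count_rows(batches: &[Batch]) -> usize {
    batches.iter().map(|b| b.num_rows).sum()
}
```

With this shape, `table.scan(&[], batch_size)` yields batches whose `columns` are empty while `count_rows` still returns the table's total row count, which is what the `projection=[]` plan above would rely on.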

Describe alternatives you've considered

Additional context
Some queries in the ClickBench benchmark (https://benchmark.clickhouse.com/) show this performance issue. For example, a plain COUNT still scans the WatchID column:

| logical_plan  | Projection: #COUNT(UInt8(1))                                                                                                       |
|               |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]                                                                                |
|               |     TableScan: hits projection=[WatchID]                                                                                           |

Metadata

Labels: enhancement, performance
