Is your feature request related to a problem or challenge?
This is a follow on to #7036
As @bmmeijers says in #7036, datafusion can make much better plans if you tell it about the sort order of files.
It is possible now to specify the order of a parquet file
$ datafusion-cli
DataFusion CLI v29.0.0
❯ create external table cpu(time timestamp) stored as parquet location 'cpu.parquet' with order (time desc);
0 rows in set. Query took 0.001 seconds.
❯ select * from cpu;
+---------------------+
| time |
+---------------------+
| 2022-09-30T12:55:00 |
+---------------------+
1 row in set. Query took 0.003 seconds.
❯ explain select * from cpu order by time desc;
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: cpu.time DESC NULLS FIRST |
| | TableScan: cpu projection=[time] |
| physical_plan | ParquetExec: file_groups={1 group: [[Users/alamb/Downloads/cpu.parquet]]}, projection=[time], output_ordering=[time@0 DESC] |
| | |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.001 seconds.
However, it is not possible to specify the time without also specifying all of the schema, which is redundant given the schema is stored in the parquet files:
❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time desc);
Error during planning: Provide a schema before specifying the order while creating a table.
Even though DataFusion can infer the schema automatically
❯ create external table cpu stored as parquet location 'cpu.parquet';
0 rows in set. Query took 0.002 seconds.
❯ select * from cpu;
+-----+---------------------+
| v | time |
+-----+---------------------+
| 1.0 | 2023-03-01T00:00:00 |
| 2.0 | 2023-03-02T00:00:00 |
+-----+---------------------+
2 rows in set. Query took 0.002 seconds.
Describe the solution you'd like
I would like to be able to specify the sort order for parquet files without also specifying the schema
Given this parquet file: cpu.zip
I would like this to work and produce a table both columns v and time ordered by time:
❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time);
Error during planning: Provide a schema before specifying the order while creating a table.
Describe alternatives you've considered
No response
Additional context
No response
Is your feature request related to a problem or challenge?
This is a follow on to #7036
As @bmmeijers says in #7036, datafusion can make much better plans if you tell it about the sort order of files.
It is possible now to specify the order of a parquet file
However, it is not possible to specify the time without also specifying all of the schema, which is redundant given the schema is stored in the parquet files:
Even though DataFusion can infer the schema automatically
Describe the solution you'd like
I would like to be able to specify the sort order for parquet files without also specifying the schema
Given this parquet file: cpu.zip
I would like this to work and produce a table both columns
vandtimeordered bytime:Describe alternatives you've considered
No response
Additional context
No response