feat: 2061 create external table ddl table partition cols #2099
alamb merged 12 commits into apache:master
Conversation
```rust
};

/// CSV file read option
#[derive(Copy, Clone)]
```
Avro and JSON read options do not use the `Copy` trait. Removing it allows the use of `Vec<String>`.
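For context, a minimal sketch of why the derive has to change (the struct body below is pieced together from the diff fragments and is an assumption, not the exact PR code): `Vec<String>` does not implement `Copy`, so once the options struct gains a `table_partition_cols: Vec<String>` field it can derive `Clone` but no longer `Copy`.

```rust
// Sketch only: a Vec<String> field rules out deriving Copy.
#[derive(Clone)]
pub struct CsvReadOptions<'a> {
    pub has_header: bool,
    pub delimiter: u8,
    pub schema_infer_max_records: usize,
    pub file_extension: &'a str,
    /// Hive-style partition columns added by this PR.
    pub table_partition_cols: Vec<String>, // not Copy, so the derive drops Copy
}
```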
```diff
  schema_infer_max_records: 1000,
  delimiter: b',',
- file_extension: ".csv",
+ file_extension: DEFAULT_CSV_EXTENSION,
```
Replace with the same constant.
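Presumably the constant is defined once and shared; a hedged guess at its shape (the exact module it lives in is not shown here):

```rust
// Assumed definition: one shared constant instead of a repeated ".csv" literal.
pub const DEFAULT_CSV_EXTENSION: &str = ".csv";
```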
```rust
ListingOptions {
    format: Arc::new(file_format),
    collect_stat: true,
```
This `true` aligns with the old code in datafusion/src/logical_plan/builder.rs.
```rust
    }
}

fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
```
Similar to parse_columns. I didn't reuse parse_columns because partitions only have an identifier/name; there is no type or other options the way a column has.
sorry missed this comment yesterday 👍
Force-pushed from 28f6a8b to 9492d17
```rust
    table_partition_cols: vec![],
};
let listing_options = options
    .parquet_pruning(parquet_pruning)
```
This is for the test below to use. Disabling parquet pruning has one use case, from #723.
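A hedged usage sketch of that builder, based only on the diff above (`ParquetReadOptions::default()` and the variable setup are assumptions):

```rust
// Sketch: opting out of parquet pruning via the new options builder.
let target_partitions = 8; // illustrative value
let options = ParquetReadOptions::default().parquet_pruning(false);
let listing_options = options.to_listing_options(target_partitions);
```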
…Table-DDL-table_partition_cols
# Conflicts:
#	ballista/rust/client/src/context.rs
#	datafusion/src/execution/context.rs
Will merge master or rebase, because of the whole folder re-org in #2081.
…Table-DDL-table_partition_cols
```diff
  pub schema_infer_max_records: usize,
  /// File extension; only files with this extension are selected for data input.
- /// Defaults to ".csv".
+ /// Defaults to DEFAULT_CSV_EXTENSION.
```
```rust
}

/// Helper to convert these user facing options to `ListingTable` options
pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
```
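Stitching the fragments above together, a hedged sketch of what this helper could look like (builder method names such as `with_has_header` are assumptions, and `CsvFormat`, `ListingOptions`, and `Arc` are taken to be in scope):

```rust
pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
    // Carry the user-facing CSV options over to the file format...
    let file_format = CsvFormat::default()
        .with_has_header(self.has_header)
        .with_delimiter(self.delimiter)
        .with_schema_infer_max_rec(Some(self.schema_infer_max_records));
    // ...and to the listing options, including the new partition columns.
    ListingOptions {
        format: Arc::new(file_format),
        collect_stat: true,
        file_extension: self.file_extension.to_owned(),
        target_partitions,
        table_partition_cols: self.table_partition_cols.clone(),
    }
}
```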
```rust
    target_partitions: usize,
    table_name: impl Into<String>,
) -> Result<Self> {
    // TODO remove hard coded enable_pruning
```
```rust
        return Ok(partitions);
    }

    loop {
```
Given this code to parse a comma-separated list is duplicated in parse_columns below, perhaps we could refactor it into a common function to reduce the duplication -- not needed for this PR though.
Yes, it may be possible. Or maybe we can expose this as a public method in the sqlparser crate?
(Agree, not needed for this PR.)
This is a copy of the equivalent implementation in sqlparser.
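For reference, a hedged sketch of what this comma-separated identifier loop might look like (helper names such as `self.parser` and `self.expected` are assumptions about the surrounding DDL parser, and `Token` is assumed to come from sqlparser's tokenizer):

```rust
fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
    let mut partitions: Vec<String> = vec![];
    // Accept a missing or empty column list: nothing at all, or `()`.
    if !self.parser.consume_token(&Token::LParen)
        || self.parser.consume_token(&Token::RParen)
    {
        return Ok(partitions);
    }

    loop {
        if let Token::Word(_) = self.parser.peek_token() {
            // Partition columns carry only a name -- no type, no options.
            let identifier = self.parser.parse_identifier()?;
            partitions.push(identifier.to_string());
        } else {
            return self.expected("partition name", self.parser.peek_token());
        }
        let comma = self.parser.consume_token(&Token::Comma);
        if self.parser.consume_token(&Token::RParen) {
            break;
        } else if !comma {
            return self.expected(
                "',' or ')' after partition definition",
                self.parser.peek_token(),
            );
        }
    }
    Ok(partitions)
}
```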
```sql
LOCATION '/path/to/aggregate_test_100.csv';
```

If data sources are already partitioned in Hive style, `PARTITIONED BY` can be used for partition pruning.
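For example, a hedged illustration of the documented syntax (table name, columns, and path are made up; check the PR's tests for the exact grammar accepted):

```sql
CREATE EXTERNAL TABLE events (
    c1 INT,
    c2 VARCHAR
)
STORED AS CSV
PARTITIONED BY (year, month)
LOCATION '/mnt/data/events';
```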
Force-pushed from b97bf56 to e8993eb
Thanks again @jychen7
Which issue does this PR close?
Closes #2061
Rationale for this change
Support partition pruning using the `CreateExternalTable` DDL.
What changes are included in this PR?
Changes to `read_parquet` and `register_parquet`. It may be easier to review by commits, since the first commit adds the functionality and the other commits just modify a few dependent files/tests that use `read_parquet` and `register_parquet` with the default `ParquetReadOptions`.
Are there any user-facing changes?
No change if the user is using SQL.
However, there is a change in the `context.read_parquet` and `context.register_parquet` params: they now receive a `ParquetReadOptions` for `table_partition_cols`. This is more aligned with the other file formats, since CSV, Avro, and JSON already have their own ReadOptions. A sketch of the new call shape follows.
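A hedged sketch of the resulting call shape (the field name follows the PR description above; `SessionContext` usage and the async signature are assumptions, not the exact merged API):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Assumed field name; the other options keep their defaults.
    let options = ParquetReadOptions {
        table_partition_cols: vec!["year".to_owned()],
        ..Default::default()
    };
    ctx.register_parquet("events", "/mnt/data/events", options)
        .await?;
    Ok(())
}
```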
Manual Test in datafusion-cli
create tmp sample file
run in datafusion-cli
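The original snippets were not captured here; a hedged sketch of what the manual test could look like, assuming a Hive-style layout such as /tmp/events/year=2022/data.csv was created in the first step (table name, column, and paths are illustrative):

```sql
CREATE EXTERNAL TABLE events
STORED AS CSV
PARTITIONED BY (year)
LOCATION '/tmp/events';

-- A filter on the partition column should prune non-matching directories.
SELECT * FROM events WHERE year = '2022';
```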