feat: 2061 create external table ddl table partition cols #2099
alamb merged 12 commits into apache:master
Conversation
```rust
};

/// CSV file read option
#[derive(Copy, Clone)]
```
Avro and JSON read options do not use the `Copy` trait. Removing it allows the use of `Vec<String>`.
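For context, a minimal sketch of why the derive has to change (the struct body below is pieced together from the diff fragments and is an assumption, not the exact PR code): `Vec<String>` does not implement `Copy`, so once the options struct gains a `table_partition_cols: Vec<String>` field it can derive `Clone` but no longer `Copy`.

```rust
// Sketch only: a Vec<String> field rules out deriving Copy.
#[derive(Clone)]
pub struct CsvReadOptions<'a> {
    pub has_header: bool,
    pub delimiter: u8,
    pub schema_infer_max_records: usize,
    pub file_extension: &'a str,
    /// Hive-style partition columns added by this PR.
    pub table_partition_cols: Vec<String>, // not Copy, so the derive drops Copy
}
```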
```diff
  schema_infer_max_records: 1000,
  delimiter: b',',
- file_extension: ".csv",
+ file_extension: DEFAULT_CSV_EXTENSION,
```
Replace with the same constant.
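Presumably the constant is defined once and shared; a hedged guess at its shape (the exact module it lives in is not shown here):

```rust
// Assumed definition: one shared constant instead of a repeated ".csv" literal.
pub const DEFAULT_CSV_EXTENSION: &str = ".csv";
```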
```rust
ListingOptions {
    format: Arc::new(file_format),
    collect_stat: true,
```
This `true` aligns with the old code in datafusion/src/logical_plan/builder.rs.
```rust
    }
}

fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
```
Similar to parse_columns. I didn't reuse parse_columns because partitions only have an identifier/name; there is no type or other options the way a column has.
sorry missed this comment yesterday 👍
Force-pushed from 28f6a8b to 9492d17
```rust
    table_partition_cols: vec![],
};
let listing_options = options
    .parquet_pruning(parquet_pruning)
```
This is for the test below to use. Disabling parquet pruning has one use case, from #723.
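A hedged usage sketch of that builder, based only on the diff above (`ParquetReadOptions::default()` and the variable setup are assumptions):

```rust
// Sketch: opting out of parquet pruning via the new options builder.
let target_partitions = 8; // illustrative value
let options = ParquetReadOptions::default().parquet_pruning(false);
let listing_options = options.to_listing_options(target_partitions);
```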
…Table-DDL-table_partition_cols
# Conflicts:
#	ballista/rust/client/src/context.rs
#	datafusion/src/execution/context.rs
Will merge master or rebase, because of the whole folder re-org in #2081.
…Table-DDL-table_partition_cols
```diff
  pub schema_infer_max_records: usize,
  /// File extension; only files with this extension are selected for data input.
- /// Defaults to ".csv".
+ /// Defaults to DEFAULT_CSV_EXTENSION.
```
```rust
}

/// Helper to convert these user facing options to `ListingTable` options
pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
```
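Stitching the fragments above together, a hedged sketch of what this helper could look like (builder method names such as `with_has_header` are assumptions, and `CsvFormat`, `ListingOptions`, and `Arc` are taken to be in scope):

```rust
pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
    // Carry the user-facing CSV options over to the file format...
    let file_format = CsvFormat::default()
        .with_has_header(self.has_header)
        .with_delimiter(self.delimiter)
        .with_schema_infer_max_rec(Some(self.schema_infer_max_records));
    // ...and to the listing options, including the new partition columns.
    ListingOptions {
        format: Arc::new(file_format),
        collect_stat: true,
        file_extension: self.file_extension.to_owned(),
        target_partitions,
        table_partition_cols: self.table_partition_cols.clone(),
    }
}
```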
```rust
    target_partitions: usize,
    table_name: impl Into<String>,
) -> Result<Self> {
    // TODO remove hard coded enable_pruning
```
```rust
        return Ok(partitions);
    }

    loop {
```
Given this code to parse a comma-separated list is duplicated in parse_columns below, perhaps we could refactor it into a common function to reduce the duplication -- not needed for this PR though.
Yes, it may be possible. Or maybe we can expose this as a public method in the sqlparser crate?
(Agree, not needed for this PR.)
This is a copy of the equivalent implementation in sqlparser.
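For reference, a hedged sketch of what this comma-separated identifier loop might look like (helper names such as `self.parser` and `self.expected` are assumptions about the surrounding DDL parser, and `Token` is assumed to come from sqlparser's tokenizer):

```rust
fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
    let mut partitions: Vec<String> = vec![];
    // Accept a missing or empty column list: nothing at all, or `()`.
    if !self.parser.consume_token(&Token::LParen)
        || self.parser.consume_token(&Token::RParen)
    {
        return Ok(partitions);
    }

    loop {
        if let Token::Word(_) = self.parser.peek_token() {
            // Partition columns carry only a name -- no type, no options.
            let identifier = self.parser.parse_identifier()?;
            partitions.push(identifier.to_string());
        } else {
            return self.expected("partition name", self.parser.peek_token());
        }
        let comma = self.parser.consume_token(&Token::Comma);
        if self.parser.consume_token(&Token::RParen) {
            break;
        } else if !comma {
            return self.expected(
                "',' or ')' after partition definition",
                self.parser.peek_token(),
            );
        }
    }
    Ok(partitions)
}
```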
```sql
LOCATION '/path/to/aggregate_test_100.csv';
```

If data sources are already partitioned in Hive style, `PARTITIONED BY` can be used for partition pruning.
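For example, a hedged illustration of the documented syntax (table name, columns, and path are made up; check the PR's tests for the exact grammar accepted):

```sql
CREATE EXTERNAL TABLE events (
    c1 INT,
    c2 VARCHAR
)
STORED AS CSV
PARTITIONED BY (year, month)
LOCATION '/mnt/data/events';
```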
Force-pushed from b97bf56 to e8993eb
Thanks again @jychen7
Which issue does this PR close?
Closes #2061
Rationale for this change
Support partition pruning using the `CreateExternalTable` DDL.
What changes are included in this PR?
Changes to `read_parquet` and `register_parquet`. It may be easier to review by commits, since the first commit adds the functionality and the other commits just modify a few dependent files/tests that use `read_parquet` and `register_parquet` with the default `ParquetReadOptions`.
Are there any user-facing changes?
No change if the user is using SQL.
However, there is a change in the `context.read_parquet` and `context.register_parquet` params: they now receive a `ParquetReadOptions` for `table_partition_cols`. This is more aligned with the other file formats, since CSV, Avro, and JSON already have their own ReadOptions. A sketch of the new call shape follows.
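A hedged sketch of the resulting call shape (the field name follows the PR description above; `SessionContext` usage and the async signature are assumptions, not the exact merged API):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Assumed field name; the other options keep their defaults.
    let options = ParquetReadOptions {
        table_partition_cols: vec!["year".to_owned()],
        ..Default::default()
    };
    ctx.register_parquet("events", "/mnt/data/events", options)
        .await?;
    Ok(())
}
```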
Manual Test in datafusion-cli
create tmp sample file
run in datafusion-cli
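The original snippets were not captured here; a hedged sketch of what the manual test could look like, assuming a Hive-style layout such as /tmp/events/year=2022/data.csv was created in the first step (table name, column, and paths are illustrative):

```sql
CREATE EXTERNAL TABLE events
STORED AS CSV
PARTITIONED BY (year)
LOCATION '/tmp/events';

-- A filter on the partition column should prune non-matching directories.
SELECT * FROM events WHERE year = '2022';
```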