DuckDB plugin #1419
Codecov Report
@@           Coverage Diff           @@
##           master    #1419   +/-   ##
=======================================
  Coverage   69.32%   69.32%
=======================================
  Files         305      305
  Lines       28671    28671
  Branches     2718     2718
=======================================
  Hits        19877    19877
  Misses       8276     8276
  Partials      518      518
|
@samhita-alla can we support accepting StructuredDataset as an input? |
|
@kumare3,
import pandas as pd
from flytekit import kwtypes, task, workflow
from flytekit.types.structured.structured_dataset import StructuredDataset
from flytekitplugins.duckdb import DuckDBQuery

# The input name (pandas_df) is the table name the query refers to.
duckdb_task = DuckDBQuery(
    name="duckdb_sd_df",
    query="SELECT * FROM pandas_df WHERE i = 2",
    inputs=kwtypes(pandas_df=StructuredDataset),
)

@task
def get_pandas_df() -> StructuredDataset:
    return StructuredDataset(
        dataframe=pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})
    )

@workflow
def pandas_wf(pandas_df: StructuredDataset) -> pd.DataFrame:
    return duckdb_task(pandas_df=pandas_df)

assert isinstance(pandas_wf(pandas_df=get_pandas_df()), pd.DataFrame)

Let me know if you're looking for something different. |
Signed-off-by: Samhita Alla <aallasamhita@gmail.com>
Signed-off-by: Samhita Alla <samhitaalla@Samhitas-MacBook-Pro.local> Signed-off-by: Samhita Alla <aallasamhita@gmail.com>
0b75b43 to 1aeb074
1aeb074 to d95dce2
cosmicBboy left a comment
Looking great! Added some comments for docstrings.
|
@cosmicBboy, thanks for reviewing the PR! Can you look through it again? |
|
The DuckDB API reference is blank. I think we need to update the https://github.com/flyteorg/flytekit/blob/master/doc-requirements.in file with |
|
@samhita-alla we'll also need to invest a bit in enabling warnings as errors in the Sphinx build process |
…nd errors Signed-off-by: Samhita Alla <aallasamhita@gmail.com>
|
@cosmicBboy, fixed the docs and added a GitHub action to show warnings as errors. |
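For readers who want to reproduce the "warnings as errors" behaviour locally, here is a minimal sketch. It is not the GitHub action added in this PR: `strict_sphinx_command` and `build_docs` are hypothetical helper names, and the `docs/source` and `docs/build/html` paths are placeholders. It relies only on the `sphinx-build` options `-W` (turn warnings into errors) and `--keep-going` (report every warning before exiting).

```python
import subprocess
from typing import List


def strict_sphinx_command(source_dir: str, build_dir: str) -> List[str]:
    """Return a sphinx-build invocation that fails the build on any warning."""
    # -W turns warnings into errors; --keep-going lists all of them before exiting.
    return ["sphinx-build", "-W", "--keep-going", "-b", "html", source_dir, build_dir]


def build_docs(source_dir: str = "docs/source", build_dir: str = "docs/build/html") -> int:
    """Run the strict build and return its exit code (requires Sphinx to be installed)."""
    # The default paths are placeholders, not the repository's actual layout.
    return subprocess.run(strict_sphinx_command(source_dir, build_dir)).returncode


command = strict_sphinx_command("docs/source", "docs/build/html")
print(" ".join(command))
```

A CI job could call `build_docs()` and fail when the return value is non-zero.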
    name: str,
    query: Union[str, List[str]],
    inputs: Optional[Dict[str, Union[StructuredDataset, list]]] = None,
    **kwargs,
Could we add an output schema type here, as in the Snowflake task? If output_schema_type is None, we won't generate an output dataset.
Also, the type should change to StructuredDataset, because FlyteSchema is already deprecated.
@pingsutw, this task isn't helpful if there's no output dataset. The DuckDBQuery task runs some queries and returns the output of a SELECT statement. Hence, it must return the query output, and in this case, it's StructuredDataset. Also, I can definitely add output_schema_type. But the output has to always be a StructuredDataset. So is it necessary? I'm already hard-coding the output type in the initialization. Let me know what you think.
I see. Do we support INSERT or other operations? If not, I think we don't need a schema type for now.
Yeah, we do. So you can give a bunch of queries to a single DuckDBQuery task. But the last one needs to be a SELECT query because after say, you insert the data, you need to retrieve the data, right? Else, it's of no use. I'm using the non-persistent offering by DuckDB. So all the data will be available only within the query. Does that make sense?
Makes sense. Thanks for the explanation.
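The behaviour described in this thread (several queries in one task, the last one a SELECT, and all data living only in a non-persistent in-memory database) can be sketched as below. This is an illustration, not the plugin's implementation: it uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies, and `run_queries` is a hypothetical helper name.

```python
import sqlite3
from typing import Any, List, Tuple


def run_queries(queries: List[str]) -> List[Tuple[Any, ...]]:
    """Run queries in order on a fresh in-memory database; return the last query's rows.

    Stand-in sketch: sqlite3 replaces DuckDB here, and this helper is hypothetical.
    """
    if not queries:
        raise ValueError("at least one query is required")
    # Simplified check: only recognises statements that literally start with SELECT.
    if not queries[-1].lstrip().upper().startswith("SELECT"):
        raise ValueError("the last query must be a SELECT")

    # ":memory:" creates a database that disappears when the connection closes.
    con = sqlite3.connect(":memory:")
    try:
        for query in queries[:-1]:
            con.execute(query)
        return con.execute(queries[-1]).fetchall()
    finally:
        con.close()


rows = run_queries(
    [
        "CREATE TABLE items (i INTEGER, j TEXT)",
        "INSERT INTO items VALUES (1, 'one'), (2, 'two'), (3, 'three')",
        "SELECT * FROM items WHERE i = 2",
    ]
)
print(rows)  # [(2, 'two')]
```

Because every call opens a new in-memory database, a table filled in one call is not visible to the next, which is why the INSERT and the SELECT have to travel in the same list.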
|
Thanks, Kevin! Will merge this PR after @cosmicBboy approves as well. |
|
@kumare3, let me know if this PR looks good to you. |
* DuckDB integration
* add sd test and fix import
* fix lint error
* fix lint error
* list to List
* lint
* incorporated suggestions
* add duckdb to requirements and add gh action to detect doc warnings and errors
* gh action: python 3.9
* docs python 3.8 to 3.9
---------
Signed-off-by: Samhita Alla <aallasamhita@gmail.com>
Signed-off-by: Samhita Alla <samhitaalla@Samhitas-MacBook-Pro.local>
Co-authored-by: Kevin Su <pingsutw@apache.org>
* Create non-root user after apt-get (#1519)
  * Create non-root user after apt-get
  * Create user after pip install
* Add root pyflyte reference to docs (#1520)
* DuckDB plugin (#1419)
  * DuckDB integration
  * add sd test and fix import
  * fix lint error
  * fix lint error
  * list to List
  * lint
  * incorporated suggestions
  * add duckdb to requirements and add gh action to detect doc warnings and errors
  * gh action: python 3.9
  * docs python 3.8 to 3.9
* add string as a valid input (#1527)
  * add string as a valid input
  * isort
  * tests
  * Lint
* Add back attempt to use existing serialization settings when running (#1529)
* update configuration docs, fix some docstrings (#1530)
  * update configuration docs, fix some docstrings
  * update copy
  * add config init command
* Revert "Make flytekit comply with PEP-561 (#1516)" (#1532)
  This reverts commit b3ad158.
* Failed to initialize FlyteInvalidInputException (#1534)
* cherry pick pin fsspec commit
* Set flytekit<1.3.0 in duckdb tests
* Fix flyteidl==1.2.9 in doc-requirements.txt
* No duckdb documentation
* Linting
---------
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Samhita Alla <aallasamhita@gmail.com>
Signed-off-by: Samhita Alla <samhitaalla@Samhitas-MacBook-Pro.local>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: eduardo apolinario <eapolinario@users.noreply.github.com>
Co-authored-by: Eduardo Apolinario <653394+eapolinario@users.noreply.github.com>
Co-authored-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Co-authored-by: Kevin Su <pingsutw@apache.org>
Co-authored-by: Samhita Alla <aallasamhita@gmail.com>
Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

Signed-off-by: Samhita Alla aallasamhita@gmail.com
TL;DR
This PR adds a DuckDBQuery task plugin that runs queries using DuckDB as the DBMS.
Type
Are all requirements met?
Complete description
Capturing the crucial assumptions I made while building the task plugin:
* The DuckDBQuery task parameter a user must pass an argument to is query; a user can optionally add inputs.
* query can include a set of queries that are run sequentially. The last query needs to be a SELECT query.
* inputs can include a structured dataset or a list of parameters to be sent to the queries.
* output is a pyarrow table. It can be converted to any type compatible with structured datasets.
* The database is :memory:, i.e., the data is always stored in an in-memory, non-persistent database. It can be set to a file, but it is difficult to make the file accessible to different DuckDBQuery pods, and a persistent file only makes sense if it can be shared and reused across them.
Example:
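The assumption that inputs can carry a list of parameters sent to the queries can be sketched with positional `?` placeholders. This is a sketch under stated assumptions, not the plugin's code: sqlite3 from the standard library stands in for DuckDB so the snippet runs without extra dependencies, `run_with_params` is a hypothetical helper, and the convention of one parameter list per query is chosen for illustration only.

```python
import sqlite3
from typing import Any, List, Sequence, Tuple


def run_with_params(
    queries: List[str], params: List[Sequence[Any]]
) -> List[Tuple[Any, ...]]:
    """Run each query with its own positional parameters; return the last query's rows.

    Stand-in sketch: sqlite3 replaces DuckDB, and the pairing of one parameter
    list per query is an assumption made for this example.
    """
    if len(queries) != len(params):
        raise ValueError("expected one parameter list per query")
    con = sqlite3.connect(":memory:")  # in-memory, non-persistent database
    try:
        cursor = None
        for query, args in zip(queries, params):
            # Each "?" in the query is bound to the next value in args.
            cursor = con.execute(query, tuple(args))
        return cursor.fetchall() if cursor is not None else []
    finally:
        con.close()


result = run_with_params(
    [
        "CREATE TABLE items (i INTEGER, j TEXT)",
        "INSERT INTO items VALUES (?, ?)",
        "SELECT j FROM items WHERE i = ?",
    ],
    [[], [2, "two"], [2]],
)
print(result)  # [('two',)]
```

Binding values through placeholders rather than string formatting keeps the query text fixed while the inputs vary between runs.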
Tracking Issue
Fixes flyteorg/flyte#3246, flyteorg/flyte#2865
Follow-up issue
NA
OR
https://github.com/flyteorg/flyte/issues/