[SPARK-43009][SQL] Parameterized sql() with Any constants #40623

MaxGekk wants to merge 15 commits.

Conversation

The PR title changed during review: `sql() with literal args` → `sql() with constants` → `sql() with Any constants`.
| * "DATE'2023-03-21'". The fragments of string values belonged to SQL comments are skipped | ||
| * while parsing. | ||
| * A map of parameter names to Java/Scala objects that can be converted to SQL literal | ||
| * expressions. See <a href="https://spark.apache.org/docs/latest/sql-ref-datatypes.html"> |
Looking at the doc, I think we should update it to include the new Java datetime API like `LocalDate`, and put them at the beginning to promote them.
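For instance, a minimal sketch of what binding one of those new datetime types could look like in user code (the table and column names here are hypothetical, not from the PR):

```scala
import java.time.LocalDate

// A java.time.LocalDate argument is wrapped by lit() into a DATE literal,
// so it can be bound as a named parameter without string formatting or escaping.
spark.sql(
  "SELECT * FROM events WHERE event_date = :d", // "events" is a hypothetical table
  args = Map("d" -> LocalDate.of(2023, 3, 21)))
```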
```diff
- // belonged to SQL comments are skipped while parsing.
- map<string, string> args = 2;
+ // (Optional) A map of parameter names to literal expressions.
+ map<string, Expression.Literal> args = 2;
```
cc @grundprinzip are these protocol changes ok?
No, this is an incompatible change.
This would be valid:

```diff
- map<string, Expression.Literal> args = 2;
+ map<string, string> args = 2;
+ map<string, Expression.Literal> expr_args = 3;
```
So it might be:

```diff
- map<string, Expression.Literal> args = 2;
+ oneof args {
+   map<string, string> args = 2;
+   map<string, Expression.Literal> expr_args = 3;
+ }
```
Please run:

```
buf breaking --against "https://github.com/apache/spark/archive/master.zip#strip_components=1,subdir=connector/connect/common/src/main"
```
grundprinzip left a comment:
The Spark Connect changes are breaking.
@grundprinzip Spark Connect is not released yet, so I think we can still change it? This PR should go to Spark 3.4. cc @xinrong-meng
If it's guaranteed to be included in 3.4, it's not a breaking change.
Thanks, merging to master/3.4!
It has conflicts with 3.4. @MaxGekk, can you create a backport PR? Thanks!
@cloud-fan I am working on the backport ...
In the PR, I propose to change the API of parameterized SQL, and replace the type of argument values from `string` to `Any` in Scala/Java/Python and `Expression.Literal` in the protobuf API. The language APIs can accept `Any` objects from which it is possible to construct literal expressions.

```scala
def sql(sqlText: String, args: Map[String, Any]): DataFrame
```

Values of the `args` map are wrapped by the `lit()` function, which leaves `Column` as is and creates a literal from other Java/Scala objects (for more details see the `Scala` tab at https://spark.apache.org/docs/latest/sql-ref-datatypes.html).

```python
def sql(self, sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) -> DataFrame:
```

Similarly to the Scala/Java `sql`, Python's `sql()` accepts Python objects as values of the `args` dictionary (see more details about acceptable Python objects at https://spark.apache.org/docs/latest/sql-ref-datatypes.html). `sql()` converts dictionary values to `Column` literal expressions via `lit()`.

```proto
message SqlCommand {
  // (Required) SQL Query.
  string sql = 1;

  // (Optional) A map of parameter names to literal expressions.
  map<string, Expression.Literal> args = 2;
}
```

For example:

```scala
scala> val sqlText = """SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name"""
sqlText: String = SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name

scala> sql(sqlText, args = Map("player_name" -> lit("E'Twaun Moore"))).show(false)
+-------------+
|s            |
+-------------+
|E'Twaun Moore|
+-------------+
```

The current implementation of the parameterized `sql()` requires arguments as string values that are parsed to SQL literal expressions, which causes the following issues:
1. SQL comments are skipped while parsing, so some fragments of the input might be skipped. For example, `'Europe -- Amsterdam'`: in this case, `-- Amsterdam` is excluded from the input.
2. Special chars in string values must be escaped, for instance `'E\'Twaun Moore'`.

No user-facing changes, since the parameterized SQL feature apache#38864 hasn't been released yet.

Tested by running the affected tests:

```
$ build/sbt "test:testOnly *ParametersSuite"
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_sql_with_args'
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
```

Closes apache#40623 from MaxGekk/parameterized-sql-any.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 156a12e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
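A side note on the `lit()` wrapping described above: a tiny sketch (my own illustration under the merged API, not taken from the PR) showing that plain values and literal `Column`s are interchangeable as arguments:

```scala
import org.apache.spark.sql.functions.lit

// A plain Scala value is wrapped by lit() internally...
spark.sql("SELECT :x + 1 AS result", args = Map("x" -> 41))

// ...so passing an explicit literal Column is equivalent.
spark.sql("SELECT :x + 1 AS result", args = Map("x" -> lit(41)))
```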
Here is the backport to branch-3.4: #40666
### What changes were proposed in this pull request?

In the PR, I propose to change the API of parameterized SQL, and replace the type of argument values from `string` to `Any` in Scala/Java/Python and `Expression.Literal` in the protobuf API. The language APIs can accept `Any` objects from which it is possible to construct literal expressions. This is a backport of #40623.

#### Scala/Java:
```scala
def sql(sqlText: String, args: Map[String, Any]): DataFrame
```
Values of the `args` map are wrapped by the `lit()` function, which leaves `Column` as is and creates a literal from other Java/Scala objects (for more details see the `Scala` tab at https://spark.apache.org/docs/latest/sql-ref-datatypes.html).

#### Python:
```python
def sql(self, sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) -> DataFrame:
```
Similarly to the Scala/Java `sql`, Python's `sql()` accepts Python objects as values of the `args` dictionary (see more details about acceptable Python objects at https://spark.apache.org/docs/latest/sql-ref-datatypes.html). `sql()` converts dictionary values to `Column` literal expressions via `lit()`.

#### Protobuf:
```proto
message SqlCommand {
  // (Required) SQL Query.
  string sql = 1;

  // (Optional) A map of parameter names to literal expressions.
  map<string, Expression.Literal> args = 2;
}
```

For example:
```scala
scala> val sqlText = """SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name"""
sqlText: String = SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name

scala> sql(sqlText, args = Map("player_name" -> lit("E'Twaun Moore"))).show(false)
+-------------+
|s            |
+-------------+
|E'Twaun Moore|
+-------------+
```

### Why are the changes needed?

The current implementation of the parameterized `sql()` requires arguments as string values that are parsed to SQL literal expressions, which causes the following issues:
1. SQL comments are skipped while parsing, so some fragments of the input might be skipped. For example, `'Europe -- Amsterdam'`: in this case, `-- Amsterdam` is excluded from the input.
2. Special chars in string values must be escaped, for instance `'E\'Twaun Moore'`.

### Does this PR introduce _any_ user-facing change?

No, since the parameterized SQL feature #38864 hasn't been released yet.

### How was this patch tested?

By running the affected tests:
```
$ build/sbt "test:testOnly *ParametersSuite"
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_sql_with_args'
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
```

Authored-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 156a12e)

Closes #40666 from MaxGekk/parameterized-sql-any-3.4-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR aims to support `Parameterized SQL queries` in the `sql` API.

### Why are the changes needed?

For feature parity, we had better support this GA feature.
- apache/spark#38864 (Since Spark 3.4.0)
- apache/spark#40623 (Since Spark 3.4.0)
- apache/spark#41568 (Since Spark 3.5.0)
- apache/spark#48965 (GA Since Spark 4.0.0)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #103 from dongjoon-hyun/SPARK-51986.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to change the API of parameterized SQL, and replace the type of argument values from `string` to `Any` in Scala/Java/Python and `Expression.Literal` in the protobuf API. The language APIs can accept `Any` objects from which it is possible to construct literal expressions.

#### Scala/Java:

```scala
def sql(sqlText: String, args: Map[String, Any]): DataFrame
```
Values of the `args` map are wrapped by the `lit()` function, which leaves `Column` as is and creates a literal from other Java/Scala objects (for more details see the `Scala` tab at https://spark.apache.org/docs/latest/sql-ref-datatypes.html).

#### Python:

```python
def sql(self, sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) -> DataFrame:
```
Similarly to the Scala/Java `sql`, Python's `sql()` accepts Python objects as values of the `args` dictionary (see more details about acceptable Python objects at https://spark.apache.org/docs/latest/sql-ref-datatypes.html). `sql()` converts dictionary values to `Column` literal expressions via `lit()`.

#### Protobuf:

```proto
message SqlCommand {
  // (Required) SQL Query.
  string sql = 1;

  // (Optional) A map of parameter names to literal expressions.
  map<string, Expression.Literal> args = 2;
}
```
For example:

```scala
scala> val sqlText = """SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name"""
sqlText: String = SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name

scala> sql(sqlText, args = Map("player_name" -> lit("E'Twaun Moore"))).show(false)
+-------------+
|s            |
+-------------+
|E'Twaun Moore|
+-------------+
```
### Why are the changes needed?
The current implementation of the parameterized `sql()` requires arguments as string values that are parsed to SQL literal expressions, which causes the following issues (see the sketch after this list):
1. SQL comments are skipped while parsing, so some fragments of the input might be skipped. For example, `'Europe -- Amsterdam'`: in this case, `-- Amsterdam` is excluded from the input.
2. Special chars in string values must be escaped, for instance `'E\'Twaun Moore'`.
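To make issue 1 concrete, here is a small sketch (my own illustration, not from the PR) of how the new API avoids the comment-skipping problem, since the value is bound as a literal instead of being parsed as SQL text:

```scala
// The "--" inside the argument survives: the value is converted to a
// literal by lit() rather than being run through the SQL parser.
spark.sql("SELECT :city AS city", args = Map("city" -> "Europe -- Amsterdam"))
// The result contains the full string "Europe -- Amsterdam".
```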
### Does this PR introduce _any_ user-facing change?

No, since the parameterized SQL feature #38864 hasn't been released yet.
### How was this patch tested?
By running the affected tests:

```
$ build/sbt "test:testOnly *ParametersSuite"
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_sql_with_args'
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
```