Skip to content

Add NULL in list simplifications #8688

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

SELECT .. WHERE NULL IN (1,2,3) and SELECT ... WHERE x in (NULL, 2, 3) are always NULL (and thus will filter out all rows). However, DataFusion will still try and evaluate a predicate:

DataFusion CLI v34.0.0
❯ create table t(x int) as values (1), (2), (3);
0 rows in set. Query took 0.003 seconds.

Note there is a FilterExec with a non trivial expression in both of the following queries:

❯ explain select x from t where x IN (null, 2, 3);
+---------------+---------------------------------------------------------------+
| plan_type     | plan                                                          |
+---------------+---------------------------------------------------------------+
| logical_plan  | Filter: t.x = Int32(NULL) OR t.x = Int32(2) OR t.x = Int32(3) |
|               |   TableScan: t projection=[x]                                 |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                   |
|               |   FilterExec: x@0 = NULL OR x@0 = 2 OR x@0 = 3                |
|               |     MemoryExec: partitions=1, partition_sizes=[1]             |
|               |                                                               |
+---------------+---------------------------------------------------------------+
2 rows in set. Query took 0.002 seconds.

❯ explain select x from t where null IN (x, 2, 3);
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: Int64(NULL) IN ([CAST(t.x AS Int64), Int64(2), Int64(3)])                                                                                                                                                                                                                                                                                                                                   |
|               |   TableScan: t projection=[x]                                                                                                                                                                                                                                                                                                                                                                       |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                         |
|               |   FilterExec: NULL IN ([CastExpr { expr: Column { name: "x", index: 0 }, cast_type: Int64, cast_options: CastOptions { safe: false, format_options: FormatOptions { safe: true, null: "", date_format: None, datetime_format: None, timestamp_format: None, timestamp_tz_format: None, time_format: None, duration_format: Pretty } } }, Literal { value: Int64(2) }, Literal { value: Int64(3) }]) |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                                                                                                                                                                                                                                                                                                                                   |
|               |                                                                                                                                                                                                                                                                                                                                                                                                     |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

In both cases the predicate could have been reduced to a single NULL

Describe the solution you'd like

I would like to extend the ExprSimplifier rules to handle the case of NULL IN (...) and when the InList contains NULL

Here are some similar rules
https://github.com/apache/arrow-datafusion/blob/cc3042a6343457036770267f921bb3b6e726956c/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L474-L549

Describe alternatives you've considered

No response

Additional context

These types of expressions are sometimes generated programmatically during rewrites in IOx.

It also came up with discussions with @yahoNanJing on #8669

I think this would be a good first issue as the patterns exist already and the need is well defined.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions