Skip to content

Improve performance of ClickBench Q21 by removing the cast #16591

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

When looking into a purported regression in #16575 (comment) I noticed something else about clickbench 21:

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

About 30% of the overall execution time is going to CASTing!

Image

Describe the solution you'd like

Reproduce like this:

cd benchmarks
./bench.sh data clickbench_partitioned
datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), COUNT(*) AS c FROM 'data/hits_partitioned' WHERE \"URL\" LIKE '%google%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"

you can run the explain plan like this

(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks$ datafusion-cli -c "EXPLAIN SELECT \"SearchPhrase\", MIN(\"URL\"), COUNT(*) AS c FROM 'data/hits_partitioned' WHERE \"URL\" LIKE '%google%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"
DataFusion CLI v48.0.0
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │  SortPreservingMergeExec  │ |
|               | │    --------------------   │ |
|               | │      c DESClimit: 10      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       SortExec(TopK)      │ |
|               | │    --------------------   │ |
|               | │          c@2 DESC         │ |
|               | │                           │ |
|               | │         limit: 10         │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       ProjectionExec      │ |
|               | │    --------------------   │ |
|               | │       SearchPhrase:       │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │     c: count(Int64(1))    │ |
|               | │                           │ |
|               | │ min(data/hits_partitioned │ |
|               | │           .URL):          │ |
|               | │ min(data/hits_partitioned │ |
|               | │           .URL)           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │ min(data/hits_partitioned │ |
|               | │      .URL), count(1)      │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │           mode:           │ |
|               | │      FinalPartitioned     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          16 -> 16         │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │ Hash([SearchPhrase@0], 16)│ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │ min(data/hits_partitioned │ |
|               | │      .URL), count(1)      │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │       mode: Partial       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │ CAST(URL AS Utf8View) LIKE│ |
|               | │        %google% AND       │ |
|               | │        SearchPhrase       │ |
|               | │             !=            │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         files: 115        │ |
|               | │      format: parquet      │ |
|               | │                           │ |
|               | │         predicate:        │ |
|               | │ CAST(URL AS Utf8View) LIKE│ | <--- this CAST is the problem, for some reason the LIKE is not evaluated natively on the query
|               | │        %google% AND       │ |
|               | │        SearchPhrase       │ |
|               | │             !=            │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.058 seconds.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions