[feature](nereids) Support to get partition related table from mv and check the query operator#28064

Merged
morrySnow merged 3 commits into apache:master from seawinde:mv_incre_get_related_table
Dec 6, 2023

Conversation

Member

@seawinde seawinde commented Dec 6, 2023

Proposed changes

Function 1
Check whether the SELECT query plan contains certain clauses of the following statement:

 SELECT
 [hint_statement, ...]
 [ALL | DISTINCT | DISTINCTROW | ALL EXCEPT ( col_name1 [, col_name2, col_name3, ...] )]
 select_expr [, select_expr ...]
 [FROM table_references
   [PARTITION partition_list]
   [TABLET tabletid_list]
   [TABLESAMPLE sample_value [ROWS | PERCENT]
     [REPEATABLE pos_seek]]]
 [WHERE where_condition]
 [GROUP BY [GROUPING SETS | ROLLUP | CUBE] {col_name | expr | position}]
 [HAVING where_condition]
 [ORDER BY {col_name | expr | position}
   [ASC | DESC], ...]
 [LIMIT {[offset,] row_count | row_count OFFSET offset}]
 [INTO OUTFILE 'file_name']

 If analyzedPlan contains any of the following clauses:

 [PARTITION partition_list]
 [TABLET tabletid_list] or
 [TABLESAMPLE sample_value [ROWS | PERCENT]
   [REPEATABLE pos_seek]]

 this method returns true.
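
The check described in Function 1 can be pictured as a recursive walk over the plan tree. The following is only a minimal sketch under stated assumptions: `PlanSketch`, `Node`, `Scan`, and `containsRestrictedScan` are hypothetical names invented for illustration and are not the Nereids classes or the method added by this PR.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified plan model; NOT the Nereids plan classes.
final class PlanSketch {
    // A generic plan node with zero or more children.
    static class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // A table scan that records which optional scan clauses the query used.
    static final class Scan extends Node {
        final List<String> partitions; // non-empty when PARTITION (...) was specified
        final List<Long> tabletIds;    // non-empty when TABLET (...) was specified
        final boolean sampled;         // true when TABLESAMPLE was specified

        Scan(String table, List<String> partitions, List<Long> tabletIds, boolean sampled) {
            super(table);
            this.partitions = partitions;
            this.tabletIds = tabletIds;
            this.sampled = sampled;
        }
    }

    // Returns true if any scan in the tree uses PARTITION, TABLET, or TABLESAMPLE.
    static boolean containsRestrictedScan(Node plan) {
        if (plan instanceof Scan) {
            Scan scan = (Scan) plan;
            if (!scan.partitions.isEmpty() || !scan.tabletIds.isEmpty() || scan.sampled) {
                return true;
            }
        }
        for (Node child : plan.children) {
            if (containsRestrictedScan(child)) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch a plan for `select * from orders TABLET(110)` would be reported as restricted, while a plain scan of `orders` would not.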

Function 2

Get the related base table info for the column that the materialized view plan references.
The input plan should be a rewritten plan in which subqueries have already been eliminated.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

Member Author

seawinde commented Dec 6, 2023

run buildall

* @param materializedViewPlan this should be rewritten or analyzed plan, should not be physical plan.
* @param column ref column name.
*/
public static RelatedTableInfo getRelatedTableInfo(String column, Plan materializedViewPlan) {
Contributor

The return value may be empty; would Optional be more appropriate?

Member Author

OK, I will fix it.

*/
public static final class RelatedTableInfo {
private BaseTableInfo tableInfo;
private boolean pctPossible;
Contributor

What does this mean?

Member Author

pct stands for partition change tracking; the flag indicates whether a partition-incremental build is possible.
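
To make the two review points above concrete, here is a rough sketch of how an `Optional` return and the `pctPossible` flag might fit together. Everything in it is an assumption for illustration: `RelatedTableSketch`, the `lineage` map standing in for plan analysis, the `tableName` field, and `chooseRefresh` (including its fallback to a full refresh) are hypothetical and do not reflect the actual code in this PR.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; NOT the classes or signatures introduced by the PR.
final class RelatedTableSketch {
    // Simplified stand-in for the RelatedTableInfo discussed in the review.
    static final class RelatedTableInfo {
        final String tableName;    // stand-in for the base table info
        final boolean pctPossible; // partition change tracking is possible

        RelatedTableInfo(String tableName, boolean pctPossible) {
            this.tableName = tableName;
            this.pctPossible = pctPossible;
        }
    }

    // The lineage map replaces real plan analysis: MV column name -> related base table.
    // Returning Optional makes the "no related table" case explicit instead of returning null.
    static Optional<RelatedTableInfo> getRelatedTableInfo(
            String column, Map<String, RelatedTableInfo> lineage) {
        return Optional.ofNullable(lineage.get(column));
    }

    // Assumed policy: use an incremental build only when a related table exists
    // and pctPossible is true; otherwise fall back to a full refresh.
    static String chooseRefresh(Optional<RelatedTableInfo> info) {
        return info.filter(i -> i.pctPossible)
                .map(i -> "INCREMENTAL:" + i.tableName)
                .orElse("FULL");
    }
}
```

With an `Optional` return, a caller has to handle the empty case explicitly, which is the concern raised in the first review comment.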

Member Author

seawinde commented Dec 6, 2023

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit b4712abb787eb97c74ccd5b01d1a22c078d5e1d7, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4695	4414	4445	4414
q2	382	173	158	158
q3	1473	1270	1217	1217
q4	1124	952	926	926
q5	3221	3219	3235	3219
q6	249	129	131	129
q7	1011	497	492	492
q8	2217	2234	2191	2191
q9	6698	6728	6820	6728
q10	3214	3253	3268	3253
q11	325	194	199	194
q12	359	220	216	216
q13	4545	3824	3817	3817
q14	247	214	218	214
q15	574	533	533	533
q16	452	401	394	394
q17	1019	591	587	587
q18	7833	8139	7613	7613
q19	1539	1415	1435	1415
q20	556	307	316	307
q21	3085	2710	2734	2710
q22	361	291	309	291
Total cold run time: 45179 ms
Total hot run time: 41018 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4355	4400	4407	4400
q2	269	167	174	167
q3	3538	3527	3517	3517
q4	2390	2407	2384	2384
q5	5794	5758	5759	5758
q6	242	122	121	121
q7	2380	1858	1873	1858
q8	3527	3531	3545	3531
q9	9049	9020	9045	9020
q10	3907	3999	4008	3999
q11	506	388	392	388
q12	775	591	598	591
q13	4300	3575	3577	3575
q14	280	260	268	260
q15	574	525	516	516
q16	496	463	477	463
q17	1884	1868	1868	1868
q18	8711	8235	8236	8235
q19	1737	1743	1770	1743
q20	2274	1945	1951	1945
q21	6565	6209	6188	6188
q22	519	427	430	427
Total cold run time: 64072 ms
Total hot run time: 60954 ms

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.59 seconds
stream load tsv: 572 seconds loaded 74807831229 Bytes, about 124 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 34 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.6 seconds inserted 10000000 Rows, about 349K ops/s
storage size: 17163698001 Bytes

Contributor

@zddr zddr left a comment

LGTM

Contributor

github-actions bot commented Dec 6, 2023

PR approved by anyone and no changes requested.

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit d0defbe37438b2ae55cee0e617fc48a2d7f93c9e, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4710	4421	4447	4421
q2	364	150	158	150
q3	1468	1277	1270	1270
q4	1111	911	908	908
q5	3152	3180	3139	3139
q6	249	127	123	123
q7	970	493	493	493
q8	2190	2244	2201	2201
q9	6713	6668	6649	6649
q10	3227	3269	3281	3269
q11	323	208	209	208
q12	356	214	209	209
q13	4558	3816	3922	3816
q14	278	215	212	212
q15	571	524	524	524
q16	441	388	379	379
q17	1019	589	589	589
q18	7818	7541	7385	7385
q19	1516	1377	1404	1377
q20	504	337	310	310
q21	3115	2677	2720	2677
q22	360	292	295	292
Total cold run time: 45013 ms
Total hot run time: 40601 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4384	4374	4369	4369
q2	267	160	174	160
q3	3531	3521	3513	3513
q4	2374	2363	2354	2354
q5	5737	5752	5752	5752
q6	237	121	121	121
q7	2386	1877	1849	1849
q8	3512	3514	3522	3514
q9	9058	9048	8986	8986
q10	3909	3975	3985	3975
q11	505	375	369	369
q12	763	587	606	587
q13	4284	3551	3575	3551
q14	296	251	255	251
q15	576	528	517	517
q16	506	448	451	448
q17	1862	1867	1839	1839
q18	8722	8235	8360	8235
q19	1732	1753	1760	1753
q20	2253	1957	1931	1931
q21	6492	6158	6134	6134
q22	498	405	430	405
Total cold run time: 63884 ms
Total hot run time: 60613 ms

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Dec 6, 2023
Contributor

github-actions bot commented Dec 6, 2023

PR approved by at least one committer and no changes requested.

@morrySnow morrySnow merged commit ffd7023 into apache:master Dec 6, 2023
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
… check the query operator (apache#28064)
morrySnow pushed a commit that referenced this pull request Dec 2, 2024
…LESAMPLE or tablet and so on (#43030)

Related PR: #28064

The materialized view is defined as follows:

        CREATE MATERIALIZED VIEW mv1
        BUILD IMMEDIATE REFRESH AUTO ON MANUAL
        DISTRIBUTED BY RANDOM BUCKETS 2
        PROPERTIES ('replication_num' = '1')
        AS
        select * from orders

For the following queries, rewriting with the materialized view above
should fail, to guarantee data correctness:

select * from orders TABLET(110);
select * from orders index query_index_test;
select * from orders TABLESAMPLE(20 percent);
select * from orders_partition PARTITION (day_2);

Previously, these queries were rewritten by the materialized view successfully
and returned wrong results. This PR fixes that.
github-actions bot pushed a commit that referenced this pull request Dec 2, 2024
…LESAMPLE or tablet and so on (#43030)
seawinde added a commit to seawinde/doris that referenced this pull request Dec 6, 2024
…LESAMPLE or tablet and so on (apache#43030)