support page skipping when using vectorized Parquet reader #15211
lurnagao-dahua wants to merge 7 commits into apache:main
Hi team!
@lurnagao-dahua Do you have any benchmark numbers?
@wypoon I recall you also had a PR for this before. It didn't get merged then; would you mind sharing why, and maybe taking a look at this one?
I believe it's #10399, which @lurnagao-dahua has also commented on.
Benchmark and benchmark result:
Hi, I have added a simple benchmark, and the results indicate that it can improve performance. Could you please review it when you have free time?
…nto page-skipping-vectorized-read
Removed the getActualBatchSize method as it is no longer needed.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

Parquet Column Index is a feature introduced in Parquet 1.11 that allows very efficient filtering at the page level (some benchmark numbers can be found here), especially when the data is sorted. The feature is largely implemented in parquet-mr (via classes such as ColumnIndex and ColumnIndexFilter).
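To illustrate the idea of page-level filtering, here is a minimal sketch. It assumes a simplified model in which each page exposes its first row index, row count, and min/max values. The names `PageSkippingSketch`, `PageStats`, and `matchingRanges` are hypothetical and are not the parquet-mr `ColumnIndex` or `ColumnIndexFilter` API, nor the code in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of page-level filtering with min/max statistics.
// This is NOT the parquet-mr ColumnIndex / ColumnIndexFilter API.
final class PageSkippingSketch {

  /** Per-page statistics, roughly what a column index conceptually provides. */
  static final class PageStats {
    final long firstRowIndex; // position of the page's first row within the row group
    final long rowCount;      // number of rows in the page
    final int min;            // smallest value stored in the page
    final int max;            // largest value stored in the page

    PageStats(long firstRowIndex, long rowCount, int min, int max) {
      this.firstRowIndex = firstRowIndex;
      this.rowCount = rowCount;
      this.min = min;
      this.max = max;
    }
  }

  /**
   * Returns inclusive {from, to} row ranges covering the pages that may contain
   * rows with value == target. Pages whose [min, max] excludes the target are skipped.
   */
  static List<long[]> matchingRanges(List<PageStats> pages, int target) {
    List<long[]> ranges = new ArrayList<>();
    for (PageStats page : pages) {
      if (target < page.min || target > page.max) {
        continue; // the page cannot contain the target, so it never has to be read
      }
      long from = page.firstRowIndex;
      long to = from + page.rowCount - 1;
      int last = ranges.size() - 1;
      if (last >= 0 && ranges.get(last)[1] + 1 == from) {
        ranges.get(last)[1] = to; // merge with the adjacent preceding range
      } else {
        ranges.add(new long[] {from, to});
      }
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Sorted data: each page covers a disjoint value interval.
    List<PageStats> pages = Arrays.asList(
        new PageStats(0, 100, 1, 10),
        new PageStats(100, 100, 11, 20),
        new PageStats(200, 100, 21, 30));
    for (long[] range : matchingRanges(pages, 15)) {
      System.out.println(range[0] + ".." + range[1]);
    }
  }
}
```

When the data is sorted, the min/max intervals of the pages barely overlap, so an equality or range predicate typically selects only a few pages. That is why the benefit is largest for sorted data.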
The implementation of this feature was discussed in #193.
The implementation of the vectorized case is based on the implementation in Spark's Parquet reader (see spark-32753), which is in Spark 3.2.
In addition, PositionVectorReader supports position deletion based on readOrderToRowGroupPosMap (ParquetReadState.java#L67).
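The reason such a mapping is needed can be sketched as follows. Once pages are skipped, the i-th row that is actually read no longer sits at position i in the row group, so position deletes have to be checked against the real position. This is a minimal sketch under that assumption. The class `ReadOrderPositionSketch` and its methods are hypothetical, and the actual type and layout of readOrderToRowGroupPosMap in the PR may differ.

```java
import java.util.Set;

// Hypothetical illustration of mapping read order to row-group position
// after page skipping. Not the data structure used in the PR.
final class ReadOrderPositionSketch {

  /**
   * Expands inclusive {from, to} row ranges (the rows that survive page skipping)
   * into an array where index = read order and value = position in the row group.
   */
  static long[] readOrderToRowGroupPos(long[][] ranges) {
    long total = 0;
    for (long[] range : ranges) {
      total += range[1] - range[0] + 1;
    }
    long[] map = new long[Math.toIntExact(total)];
    int readOrder = 0;
    for (long[] range : ranges) {
      for (long pos = range[0]; pos <= range[1]; pos++) {
        map[readOrder++] = pos;
      }
    }
    return map;
  }

  /**
   * Checks whether the row at the given read order is position-deleted.
   * The file position is the row group's first row position in the file
   * plus the row's position within the row group.
   */
  static boolean isDeleted(
      long[] map, int readOrder, long rowGroupStartInFile, Set<Long> deletedFilePositions) {
    return deletedFilePositions.contains(rowGroupStartInFile + map[readOrder]);
  }
}
```

For example, with surviving ranges 0..1 and 5..6, the third row read (read order 2) is at row-group position 5, so a delete recorded for file position `rowGroupStartInFile + 5` applies to it rather than to `rowGroupStartInFile + 2`.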
I look forward to a review from anyone interested in this PR, and I welcome anyone willing to be a co-author to improve it together.