Skip to content

ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider#8944

Closed
seddonm1 wants to merge 4 commits intoapache:masterfrom
seddonm1:parquet-statistics
Closed

ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider#8944
seddonm1 wants to merge 4 commits intoapache:masterfrom
seddonm1:parquet-statistics

Conversation

@seddonm1
Copy link
Copy Markdown
Contributor

This is an implementation of Statistics for the Parquet datasource as this PR seems to have gone quiet: #8860

This PR exposes the ParquetMetaData produced by the file reader as even though initially we only consume row_count and total_byte_size there is column level metadata that may be very useful in future for potentially large performance gains.

@github-actions
Copy link
Copy Markdown

let schema = parquet_exec.schema();

let metadata = parquet_exec.metadata();
let (num_rows, total_byte_size) = metadata.row_groups().iter().fold(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be better as two "loops", than it can use .sum() as well instead of fold?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to two iterators

Copy link
Copy Markdown
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one style suggestion

let total_byte_size = metadata
.row_groups()
.iter()
.map(|rg| rg.total_byte_size() as usize)
Copy link
Copy Markdown
Contributor

@Dandandan Dandandan Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the as usize could move to below (Some(num_rows as usize)).

Then it could use .map(RowGroupMetaData::total_byte_size) as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with this .map(RowGroupMetaData::total_byte_size) convention. do you know what this is called so I can look up the Rust docs?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it is now is called "closure" and it could be replaced by a the name of the function instead. There is something on the clippy docs:
https://rust-lang.github.io/rust-clippy/master/#redundant_closure

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have updated the PR. I should have recognised this from Scala.

@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #8944 (a3bc7d1) into master (71e37e2) will decrease coverage by 0.00%.
The diff coverage is 88.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8944      +/-   ##
==========================================
- Coverage   83.26%   83.26%   -0.01%     
==========================================
  Files         195      196       +1     
  Lines       48066    48131      +65     
==========================================
+ Hits        40024    40078      +54     
- Misses       8042     8053      +11     
Impacted Files Coverage Δ
rust/parquet/src/file/metadata.rs 91.82% <ø> (+0.77%) ⬆️
rust/parquet/src/file/statistics.rs 93.80% <ø> (ø)
rust/datafusion/src/datasource/parquet.rs 95.62% <80.00%> (+0.30%) ⬆️
rust/datafusion/src/physical_plan/parquet.rs 79.56% <100.00%> (+0.91%) ⬆️
rust/parquet/src/arrow/arrow_reader.rs 90.69% <100.00%> (+0.10%) ⬆️
rust/datafusion/src/execution/context.rs 89.95% <0.00%> (-1.93%) ⬇️
rust/datafusion/tests/custom_sources.rs 75.00% <0.00%> (ø)
rust/parquet/src/schema/types.rs 90.19% <0.00%> (+0.26%) ⬆️
rust/datafusion/src/execution/dataframe_impl.rs 93.28% <0.00%> (+0.85%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6cedab0...a3bc7d1. Read the comment docs.

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Dec 17, 2020

FYI @XiaokunDing

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.26%. Comparing base (71e37e2) to head (a3bc7d1).

Files with missing lines Patch % Lines
rust/datafusion/src/datasource/parquet.rs 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8944      +/-   ##
==========================================
- Coverage   83.26%   83.26%   -0.01%     
==========================================
  Files         195      196       +1     
  Lines       48066    48131      +65     
==========================================
+ Hits        40024    40078      +54     
- Misses       8042     8053      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants