GH-43944: [C++][Parquet] Add support for arrow::ArrayStatistics: non zero-copy int based types#43945
Conversation
…: non zero-copy int based types Target types: * `UInt8` * `Int8` * `UInt16` * `Int16` * `UInt32` * `UInt64` * `Date32` * `Time32` * `Time64` * `Duration`
|
|
| array_statistics->is_min_exact = true; | ||
| array_statistics->is_max_exact = true; |
There was a problem hiding this comment.
Can you add correspond comment here? This might be a bit tricky
There was a problem hiding this comment.
Sure. We should document about the discussion at #43595 (comment) , right?
BTW, could you share the e-mail URL for #43595 (comment) ?
I guess no, I'll send a mail to maillist to make it sure
I couldn't find it at https://lists.apache.org/list.html?dev@parquet.apache.org .
Ah, I forgot to add a writer check here. I should have set true only when a writer is Apache Parquet C++. I'll fix it.
There was a problem hiding this comment.
Sorry, let me setup a discussion, generally if it's from Parquet C++, it will works. I'm a bit busy this morning preparing for my tour, I'll try to work it out this noon
There was a problem hiding this comment.
There was a problem hiding this comment.
I've added a comment that we can always use true for integer based min/max.
I didn't need if (::arrow::internal::StartsWith(ctx->reader->metadata()->created_by(), "parquet-cpp-arrow")) for this case based on your e-mail.
There was a problem hiding this comment.
FYI, I found the string and FLBA might being truncated, other types in public impl will not being truncated if exists
| auto array_data = | ||
| ::arrow::ArrayData::Make(field->type(), length, std::move(buffers), null_count); | ||
| auto array_statistics = std::make_shared<::arrow::ArrayStatistics>(); | ||
| array_statistics->null_count = null_count; |
There was a problem hiding this comment.
The null_count for some type ( nested ) would be a bit weird, FYI
There was a problem hiding this comment.
Thanks for the information.
Let's revisit it when we add support for arrow::ArrayStatistics of nested types.
| array_statistics->null_count = null_count; | ||
| auto statistics = metadata->statistics().get(); | ||
| if (statistics) { | ||
| if (statistics->HasDistinctCount()) { |
There was a problem hiding this comment.
Do we need a separate function for the stats conversion?
There was a problem hiding this comment.
I'll do it when I add more target types as the next pull request.
I'll know what is common pattern when I add more target types.
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 262d6f6. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…: non zero-copy int based types (apache#43945) ### Rationale for this change Statistics is useful for fast processing. Target types: * `UInt8` * `Int8` * `UInt16` * `Int16` * `UInt32` * `UInt64` * `Date32` * `Time32` * `Time64` * `Duration` ### What changes are included in this PR? Map `ColumnChunkMetaData` information to `arrow::ArrayStatistics`. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: apache#43944 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
| RETURN_NOT_OK( | ||
| TransferColumnData(record_reader_.get(), field_, descr_, ctx_->pool, &out_)); | ||
| RETURN_NOT_OK(TransferColumnData(record_reader_.get(), | ||
| input_->column_chunk_metadata(), field_, descr_, |
There was a problem hiding this comment.
The call to input_->column_chunk_metadata() fails here if the list of row_groups in input_ is empty, because input_ is not yet initialized properly at this point in that case
Via row_group_metadata() -> RowGroup(-1)
There was a problem hiding this comment.
Good catch!
Could you open a new issue for it?
Rationale for this change
Statistics is useful for fast processing.
Target types:
UInt8Int8UInt16Int16UInt32UInt64Date32Time32Time64DurationWhat changes are included in this PR?
Map
ColumnChunkMetaDatainformation toarrow::ArrayStatistics.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.