
DRAFT: [C++][Parquet] Use num_nulls from DataPageV2 to skip null handling #43955

Closed
pitrou wants to merge 1 commit into apache:main from pitrou:parquet-null-count

Conversation

@pitrou
Member

@pitrou pitrou commented Sep 4, 2024

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions

github-actions Bot commented Sep 4, 2024

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou
Member Author

pitrou commented Sep 4, 2024

@ursabot please benchmark

@ursabot

ursabot commented Sep 4, 2024

Benchmark runs are scheduled for commit c13fe56. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@github-actions github-actions Bot added the awaiting review label Sep 4, 2024
@pitrou
Member Author

pitrou commented Sep 4, 2024

@mapleFU I came up with this simple optimization. I'm not sure it will make a difference in practice...

max_size -= def_levels_bytes;
}

current_page_may_have_nulls_ = max_def_level_ > 0 || max_rep_level_ > 0;
Member


Maybe a better name, or a comment, for the `nulls_` flag? Treating `max_rep_level_ > 0` as "may have nulls" is weird.
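To make the naming question concrete, here is a minimal sketch of how such a flag could be derived once the page header's null count is taken into account. `PageInfo` and `PageMayHaveNulls` are hypothetical names for illustration and are not the Arrow API; the sketch assumes that a DataPageV2 `num_nulls` of zero means every level entry in the page carries a value.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified stand-in for a data page header (not Arrow's type).
struct PageInfo {
  // Set only when the page is a DataPageV2, which states its null count.
  std::optional<int32_t> num_nulls;
};

// Returns true if the decoder has to take the null-aware (spaced) path.
inline bool PageMayHaveNulls(const PageInfo& page, int16_t max_def_level,
                             int16_t max_rep_level) {
  // Required, non-repeated column: every slot carries a value.
  if (max_def_level == 0 && max_rep_level == 0) return false;
  // The header positively reports zero nulls (assumed DataPageV2 only).
  if (page.num_nulls.has_value() && *page.num_nulls == 0) return false;
  // No usable information: stay conservative.
  return true;
}
```

A name along the lines of "may need spacing" might describe the repeated-column case better, which seems to be the concern raised here.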

@github-actions github-actions Bot added awaiting committer review and removed awaiting review labels Sep 4, 2024
@mapleFU
Member

mapleFU commented Sep 4, 2024

I didn't go through this carefully, but I think this is generally OK. For page v1, would it also be OK to do this when num_nulls exists in the stats and is equal to 0?

(Besides, personally we're using column-index to predict, and in the Parquet community page-v2 is not enabled by default 🤔 Maybe I should pick up the filtering patch ( #39731 ) after my vacation this month... )
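The page-v1 question above could be sketched as a check on optional page statistics. `PageStats` and `StatsProveNoNulls` are hypothetical names, not Arrow types, and whether writer-provided statistics are trustworthy enough for this is an assumption the thread leaves open.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for page-level statistics; the count may be absent.
struct PageStats {
  std::optional<int64_t> null_count;
};

// True only when statistics exist and positively report zero nulls.
// Missing statistics or a missing null_count are treated as "unknown".
inline bool StatsProveNoNulls(const std::optional<PageStats>& stats) {
  return stats.has_value() && stats->null_count.has_value() &&
         *stats->null_count == 0;
}
```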

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit c13fe56.

There were 16 benchmark results indicating a performance regression:

The full Conbench report has more details.


void ReadValuesSpaced(int64_t values_to_read, int64_t null_count) override {
if (null_count == 0) {
ReadValuesDense(values_to_read);
Member


I've checked FLBARecordReader: its ReadDense uses an extra null_bitmap_builder_, sigh
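The fast path in the snippet above can be illustrated with a toy reader. This is a sketch only: `ReadSpaced` and its parameters are hypothetical and much simpler than Arrow's record readers. It assumes `valid` holds one flag per output slot and that `dense` holds exactly the non-null values in order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy illustration (not Arrow's RecordReader): appends decoded values to
// `out`, leaving a placeholder in every slot that `valid` marks as null.
inline void ReadSpaced(const std::vector<int32_t>& dense,
                       const std::vector<bool>& valid, int64_t null_count,
                       std::vector<int32_t>* out) {
  if (null_count == 0) {
    // Fast path: no gaps, so the values can be appended contiguously
    // without consulting the validity flags at all.
    out->insert(out->end(), dense.begin(), dense.end());
    return;
  }
  std::size_t next = 0;  // index of the next unread dense value
  for (bool is_valid : valid) {
    // 0 is an arbitrary placeholder for null slots in this sketch.
    out->push_back(is_valid ? dense[next++] : 0);
  }
}
```

When the null count is known to be zero before decoding, the per-slot loop is skipped entirely, which is the saving the PR is after.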

@mapleFU
Member

mapleFU commented Sep 10, 2024

After reading this part of the code, this idea LGTM.

@github-actions

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions github-actions Bot added the Status: stale-warning label Nov 18, 2025
@github-actions github-actions Bot closed this Feb 23, 2026

Labels

awaiting committer review, Component: C++, Component: Parquet, Status: stale-warning


3 participants