Add small column on empty projection#7833

Merged
Dandandan merged 7 commits into apache:main from ch-sc:add-small-column-on-empty-projection on Oct 18, 2023

Conversation

@ch-sc (Contributor) commented Oct 16, 2023

Which issue does this PR close?

Improves #3214.

Rationale for this change

If a projection is empty, we add the first column of the input schema since some parts of DataFusion still rely on at least having one column. Instead of selecting the first column from the input schema, these changes aim to select a column with a smaller memory size. The memory size is based on the data type.
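As a rough illustration of the idea (not the code in this PR), the selection can be sketched as "estimate a size per data type, then take the minimum". The `ColType` enum, the `size_in_bits` numbers, and the `smallest_column` helper below are hypothetical stand-ins; DataFusion itself works on Arrow's `DataType` and its own schema types.

```rust
/// Hypothetical, simplified stand-in for a column data type
/// (the real code operates on Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ColType {
    Boolean,
    Int32,
    Int64,
    Utf8,
}

/// Rough per-value size in bits. The 128 for `Utf8` is an assumed
/// placeholder for a variable-sized type, not a measured value.
fn size_in_bits(t: &ColType) -> usize {
    match t {
        ColType::Boolean => 1,
        ColType::Int32 => 32,
        ColType::Int64 => 64,
        ColType::Utf8 => 128,
    }
}

/// Returns the index of the column with the smallest estimated size,
/// or `None` for an empty schema. On ties the earliest column wins,
/// because `Iterator::min_by_key` returns the first minimum.
fn smallest_column(schema: &[(&str, ColType)]) -> Option<usize> {
    schema
        .iter()
        .enumerate()
        .min_by_key(|(_, (_, t))| size_in_bits(t))
        .map(|(i, _)| i)
}
```

With a schema of `name: Utf8, id: Int64, flag: Boolean, n: Int32`, this sketch would pick `flag` rather than the first column `name`.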

What changes are included in this PR?

Are these changes tested?

Basic unit tests for new logic are included. All tests that involve query planning and empty projections execute this code.

Are there any user-facing changes?

@github-actions (bot) added the labels optimizer (Optimizer rules), core (Core DataFusion crate), and sqllogictest (SQL Logic Tests (.slt)) on Oct 16, 2023
Comment thread datafusion/optimizer/src/push_down_projection.rs Outdated
Comment thread datafusion/optimizer/src/push_down_projection.rs Outdated
Comment thread datafusion/sqllogictest/test_files/avro.slt
// Get the projection exprs from columns in the order of the schema
/// Accumulate the memory size of a data type measured in bits.
///
/// Nested types are traversed and increment `nesting` on every level.
Contributor

Can we add a comment saying that variable-sized types are estimated using some heuristics?

Contributor Author

Makes sense. Added a comment about variable-sized types. Feel free to rephrase if you think something is missing.
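To make the heuristic concrete, here is a minimal sketch of a recursive size estimate in the spirit of the `nested_size(data_type, nesting)` function discussed in this thread. The `SketchType` enum and all constants are assumptions for illustration: variable-sized types get a fixed guess, and `nesting` is incremented once per nested type that is traversed. The merged code may differ in detail.

```rust
/// Hypothetical, simplified type tree (not Arrow's `DataType`).
enum SketchType {
    Boolean,
    Int64,
    Utf8,
    LargeUtf8,
    List(Box<SketchType>),
    Struct(Vec<SketchType>),
}

/// Accumulates an estimated size in bits and increments `nesting`
/// once for every nested type encountered during the traversal.
fn nested_size(data_type: &SketchType, nesting: &mut usize) -> usize {
    use SketchType::*;
    match data_type {
        Boolean => 1,
        Int64 => 64,
        // Variable-sized types: fixed guesses (assumed values) chosen to be
        // larger than most primitive widths, since the real size depends
        // on the data and is unknown at planning time.
        Utf8 => 128,
        LargeUtf8 => 256,
        // A list is estimated by its item type, one level deeper.
        List(item) => {
            *nesting += 1;
            nested_size(item, nesting)
        }
        // A struct is the sum of its fields, one level deeper.
        Struct(fields) => {
            *nesting += 1;
            fields
                .iter()
                .map(|f| nested_size(f, nesting))
                .sum::<usize>()
        }
    }
}
```

Returning the nesting count alongside the size gives a caller a second criterion, for example to prefer a flat column over a nested one when the estimated sizes are equal.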

LargeList(f) => nested_size(f.data_type(), nesting),
Struct(fields) => fields
    .iter()
    .map(|f| nested_size(f.data_type(), nesting))
Contributor

In principle we could project a sub-field from a struct instead of the entire struct (all columns).

Contributor Author

@ch-sc commented Oct 18, 2023

Good idea, I will play around with it. Though it sounds like a rare edge case to me where no other "smaller" type would be present in the schema!?

Contributor

Yeah indeed :)
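For reference, the sub-field idea from this thread could look roughly like the sketch below: instead of summing a struct's fields, search for its smallest leaf and return the path to it. This is only an illustration of the suggestion, not something this PR implements; the `Field` enum, the `smallest_leaf` helper, and the bit sizes are all hypothetical.

```rust
/// Hypothetical field tree; names and bit sizes are illustrative only.
enum Field {
    Leaf { name: &'static str, bits: usize },
    Struct { name: &'static str, children: Vec<Field> },
}

/// Returns the dotted path and estimated size of the smallest leaf
/// reachable from `field`, or `None` if a struct has no leaves at all.
fn smallest_leaf(field: &Field) -> Option<(String, usize)> {
    match field {
        Field::Leaf { name, bits } => Some((name.to_string(), *bits)),
        Field::Struct { name, children } => children
            .iter()
            // Recurse into every child and keep only those with a leaf.
            .filter_map(smallest_leaf)
            // Pick the child path with the smallest estimated size.
            .min_by_key(|(_, bits)| *bits)
            // Prefix the winning path with this struct's name.
            .map(|(path, bits)| (format!("{}.{}", name, path), bits)),
    }
}
```

For a struct `address { street: 128 bits, zip: 32 bits, geo { valid: 1 bit } }`, the sketch returns the path `address.geo.valid`.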

Contributor

@Dandandan left a comment

Awesome, @ch-sc! I left a few comments.

This will yield some nice performance improvements for SELECT COUNT(*) FROM [source] queries, even without solving #3214.

@Dandandan
Contributor

The change seems uncontroversial and has some good tests, so merging seems fine.

Thank you @ch-sc 😊
