Add small column on empty projection #7833
Merged
Dandandan merged 7 commits into apache:main from …all-column-on-empty-projection on Oct 18, 2023
Dandandan
reviewed
Oct 17, 2023
// Get the projection exprs from columns in the order of the schema
/// Accumulate the memory size of a data type measured in bits.
///
/// Nested types are traversed and increment `nesting` on every level.
Contributor
Can we add a comment saying that variable-sized types are estimated using some heuristics?
Contributor
Author
Makes sense. Added a comment about variable sized types. Feel free to rephrase if you think something is missing.
Dandandan
reviewed
Oct 17, 2023
LargeList(f) => nested_size(f.data_type(), nesting),
Struct(fields) => fields
    .iter()
    .map(|f| nested_size(f.data_type(), nesting))
Contributor
In principle we could project a sub-field from a struct instead of the entire struct (all columns).
Contributor
Author
Good idea, I will play around with it. It sounds like a rare edge case to me, though: would there really be no other "smaller" type present in the schema?
Dandandan
approved these changes
Oct 18, 2023
Contributor
Change seems non-controversial and has some good tests, so merging seems fine. Thank you @ch-sc 😊
Which issue does this PR close?
Improves #3214.
Rationale for this change
If a projection is empty, we add the first column of the input schema, since some parts of DataFusion still rely on having at least one column. Instead of selecting the first column of the input schema, these changes select a column with a smaller memory size. The memory size is estimated from the column's data type.
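To illustrate the idea, here is a minimal, self-contained sketch of picking the "smallest" column by data type. It is not the code from this PR: the `DataType` enum is a hypothetical, simplified stand-in for Arrow's, the 128-bit estimate for `Utf8` is an assumed heuristic for variable-sized types, and `smallest_column` and its tie-breaking (less nesting first, then fewer bits) are assumptions made for the example. Only the name `nested_size` and the `nesting` counter mirror the snippets quoted in the review.

```rust
/// Hypothetical, simplified stand-in for Arrow's `DataType` (not the real enum).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Boolean,
    Int32,
    Int64,
    Utf8,
    List(Box<DataType>),
    Struct(Vec<DataType>),
}

/// Estimate the memory size of a data type in bits.
/// Nested types are traversed and increment `nesting` on every level.
fn nested_size(data_type: &DataType, nesting: &mut usize) -> usize {
    match data_type {
        DataType::Boolean => 1,
        DataType::Int32 => 32,
        DataType::Int64 => 64,
        // Assumption: variable-sized types have no fixed width,
        // so a fixed heuristic estimate is used here.
        DataType::Utf8 => 128,
        DataType::List(inner) => {
            *nesting += 1;
            nested_size(inner, nesting)
        }
        DataType::Struct(fields) => {
            *nesting += 1;
            // A struct is estimated as the sum of its fields.
            fields.iter().map(|f| nested_size(f, nesting)).sum()
        }
    }
}

/// Return the index of the column with the smallest estimate.
/// Assumption: columns are ranked by nesting depth first, then by size in bits;
/// on ties, the first such column wins.
fn smallest_column(schema: &[(&str, DataType)]) -> Option<usize> {
    schema
        .iter()
        .enumerate()
        .min_by_key(|(_, (_, data_type))| {
            let mut nesting = 0;
            let size = nested_size(data_type, &mut nesting);
            (nesting, size)
        })
        .map(|(index, _)| index)
}
```

With a schema such as `name: Utf8, id: Int64, flag: Boolean`, this sketch would select `flag` rather than the first column `name`, which is the behavior the rationale above describes.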
What changes are included in this PR?
Are these changes tested?
Basic unit tests for new logic are included. All tests that involve query planning and empty projections execute this code.
Are there any user-facing changes?