Support columns_sorted in row_filters#20497
Conversation
|
@alamb @Ted-Jiang please take a look when you are available. |
|
I think @adriangb also looked at using the sorted mentadata recently One big question I have is why are we proposing this change? Do we have any evidence this help performance (like benchmarks?) |
|
Do you know if any of the benchmark queries evaluate filters on sorted cols? If not, I'll make a new one. |
|
I think there are no optimizations for evaluating predicates on filtered arrays. It would be cool to be able to support binary search for comparison predicates. Locate the relevant rows once, then operate only on that range: |
If the columns is sorted, won't row group / page pruning basically be doing this already? |
You're right. Then, if a column is sorted it's a reason enough to use load page indices and prune. Due how data is distributed this could prune a lot right away. |
|
I'll work on adding a query to the existing bench to capture the benefits and send the results later. |
|
@sdf-jkl could you help me with some napkin math on how this optimization works? Is the idea that applying a row selection when a page index is present is more efficient? I'm not sure if that means we should filter columns that have a page index first or last, and how that would weigh against e.g. the size of the column or the selectivity of the filter 🤔 |
|
Sorry, I think I got things mixed up while working on this. We consider a column Given that, this column is usually a strong candidate for row group/page pruning. So we prune. After pruning, the remaining work goes to This should make the incremental benefit of using a predicate on this column early in Late Materialization likely marginal in many workloads, given most of the pruning value was already captured earlier. |
So the point is that these columns (sorted columns) were likely well pruned by row group / page min/max stats -> they're unlikely to be selective for row pruning -> they should be evaluated last? |
|
They are unlikely to be highly selective for row pruning, but we can't reliably assume they are always less selective than other predicates. My implementation here did the opposite and prioritized them in the evaluation order, which is a mistake. At this point, I think the rule itself might be unnecessary, and we could consider closing the issue. I can clean up the docs and the function placeholder in This is the current doc saying we should prioritized sorted columns: datafusion/datafusion/datasource-parquet/src/row_filter.rs Lines 45 to 60 in 9b7d092 |
|
I agree updating the docs and removing this un-implemented heuristic makes sense. |
| The sorted dataset is automatically generated from the ClickBench partitioned dataset. You can configure the memory used during the sorting process with the `DATAFUSION_MEMORY_GB` environment variable. The default memory limit is 12GB. | ||
| ```bash | ||
| ./bench.sh data data_sorted_clickbench | ||
| ./bench.sh data clickbench_sorted |
There was a problem hiding this comment.
I’m assuming the new version is correct 👍
There was a problem hiding this comment.
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#3476. ## Rationale for this change Improving predicate ordering for predicate pushdown <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? Building on changes from apache#3477 and apache#7528 - Implement the `columns_sorted` function - Change `should_enable_page_index` to use index when choose to reorder predicates in config <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes, unit tests <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? No <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Which issue does this PR close?
columns_sortedin row_filters #3476.Rationale for this change
Improving predicate ordering for predicate pushdown
What changes are included in this PR?
Building on changes from #3477 and #7528
columns_sortedfunctionshould_enable_page_indexto use index when choose to reorder predicates in configAre these changes tested?
Yes, unit tests
Are there any user-facing changes?
No