Skip to content

Add exact reverse scan support to PushdownSort with limit-after-reverse fix#47

Merged
zhuqi-lucas merged 0 commit intobranch-52from
exact-reverse-scan
Apr 23, 2026
Merged

Add exact reverse scan support to PushdownSort with limit-after-reverse fix#47
zhuqi-lucas merged 0 commit intobranch-52from
exact-reverse-scan

Conversation

@zhuqi-lucas
Copy link
Copy Markdown
Collaborator

@zhuqi-lucas zhuqi-lucas commented Apr 9, 2026

Summary

Adds Exact reverse scan support to the parquet sort pushdown. When enabled, the Sort/TopK operator above a FileScanExec for ORDER BY ... DESC LIMIT N queries can be removed entirely and the limit pushed down to the scan, while still producing globally sorted output.

This builds on top of the existing Inexact reverse scan (which only reverses row group order — TopK must remain to re-sort rows within each RG).

Design

Default (Inexact, backward compatible):
  SortExec: TopK(fetch=N), expr=[ts DESC]
    DataSourceExec: ..., reverse_row_groups=true   <- rows within RG still ASC

With enable_exact_reverse_scan=true:
  DataSourceExec: ..., limit=N, scan_direction=Reversed   <- Sort removed, limit pushed

The ReversedRowGroupStream wraps the parquet stream:

  • Buffers all batches for one row group
  • Reverses batch order, then reverses rows within each batch (via arrow::compute::take)
  • Memory: O(largest row group)

Limit is applied inside ReversedRowGroupStream (after reversal), not at the parquet reader level — applying it before reversal would read the first N rows in forward order, giving wrong results.

Changes

Core

  • datafusion/common/src/config.rs — new enable_exact_reverse_scan: bool session config (default false)
  • datafusion/datasource-parquet/src/source.rs:
    • ParquetSource::with_exact_reverse(true) programmatic API
    • try_reverse_output() returns Exact when enabled, Inexact otherwise
    • Plan display: scan_direction=Reversed (Exact) vs reverse_row_groups=true (Inexact)
  • datafusion/datasource-parquet/src/opener.rs:
    • ReversedRowGroupStream (per-RG buffer + reverse + limit application)
    • Skip parquet reader limit when reverse_rows=true
    • compute_selected_rows_per_rg helper: correctly computes per-RG output row counts when RowSelection is present (page pruning), preventing RG boundary mis-detection
  • datafusion/physical-optimizer/src/pushdown_sort.rs — Exact path removes Sort, pushes fetch via push_fetch_into_plan

Tests

  • 4 unit tests for compute_selected_rows_per_rg (no skip, spanning skips, all-skipped, short selection error)
  • 4 integration tests for exact reverse scan:
    • test_exact_reverse_scan_multi_rg_produces_global_desc: Inexact [7,8,9,4,5,6,1,2,3] vs Exact [9..1]
    • test_exact_reverse_scan_applies_limit_after_reversal: limit=4 yields [9,8,7,6] not [4,3,2,1]
    • test_exact_reverse_scan_with_row_selection_across_rgs: regression test for RowSelection + exact reverse
    • test_exact_reverse_scan_with_row_selection_and_limit: combined case
  • 2 source unit tests (with_exact_reverse returns Exact, default returns Inexact)
  • 4 snapshot tests in pushdown_sort.rs (Sort removed, fetch pushdown, through projection, no-fetch case)
  • 8 end-to-end SLT tests in sort_pushdown.slt (LIMIT, OFFSET+LIMIT, ASC unchanged, toggle on/off, results verified)

Other

  • datafusion/proto-common/src/from_proto/mod.rs and datafusion/proto/src/logical_plan/file_formats.rs — proto deserialization defaults enable_exact_reverse_scan = false
  • docs/source/user-guide/configs.md — auto-generated config doc
  • datafusion/sqllogictest/test_files/information_schema.slt — SHOW ALL output

Backward Compatibility

Fully backward compatible: enable_exact_reverse_scan defaults to false. All existing Inexact behavior is preserved.

Based on

Approach inspired by apache#18817.

@github-actions github-actions Bot removed the core label Apr 9, 2026
@zhuqi-lucas zhuqi-lucas force-pushed the exact-reverse-scan branch 5 times, most recently from 8b58b9f to 03cd09a Compare April 15, 2026 03:48
@github-actions github-actions Bot added the proto label Apr 15, 2026
@zhuqi-lucas zhuqi-lucas force-pushed the exact-reverse-scan branch 3 times, most recently from 64ceaad to 79e89b0 Compare April 15, 2026 06:10
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 15, 2026
@zhuqi-lucas zhuqi-lucas force-pushed the exact-reverse-scan branch 2 times, most recently from 1fe3ede to fd2650a Compare April 15, 2026 06:53
@zhuqi-lucas zhuqi-lucas changed the title Add exact reverse scan with per-RG buffering and tests Add exact reverse scan support to PushdownSort with limit-after-reverse fix Apr 15, 2026
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review April 15, 2026 07:20
Copilot AI review requested due to automatic review settings April 15, 2026 07:20
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an Exact reverse scan mode for Parquet sort pushdown so ORDER BY ... DESC LIMIT N can eliminate the physical Sort/TopK and push LIMIT into the scan while still producing globally sorted output.

Changes:

  • Introduces enable_exact_reverse_scan Parquet/session config and wires it into ParquetSource reverse-scan pushdown behavior (Exact vs Inexact).
  • Implements per-row-group buffering + row reversal in the Parquet opener path, and applies limit after reversal for correctness.
  • Updates physical optimizer sort pushdown to remove SortExec on Exact and push fetch down; adds unit/snapshot + SLT coverage and config docs.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
docs/source/user-guide/configs.md Documents the new enable_exact_reverse_scan config option.
datafusion/sqllogictest/test_files/sort_pushdown.slt Adds end-to-end SLT cases validating exact reverse behavior, limit/offset handling, and toggle on/off.
datafusion/sqllogictest/test_files/information_schema.slt Updates SHOW ALL expectations to include the new config key.
datafusion/proto/src/logical_plan/file_formats.rs Defaults new option during protobuf → options conversion.
datafusion/proto-common/src/from_proto/mod.rs Defaults new option during protobuf → common options conversion.
datafusion/physical-optimizer/src/pushdown_sort.rs On exact pushdown, removes SortExec and pushes fetch down into the plan tree.
datafusion/datasource-parquet/src/source.rs Adds with_exact_reverse, tracks exact/inexact reverse behavior, and updates plan display.
datafusion/datasource-parquet/src/opener.rs Adds reverse_rows support; wraps stream with ReversedRowGroupStream to reverse rows within RG and apply limit post-reversal.
datafusion/core/tests/physical_optimizer/test_utils.rs Adds test helper to build Parquet exec with exact reverse enabled.
datafusion/core/tests/physical_optimizer/pushdown_sort.rs Adds optimizer snapshot tests covering exact reverse sort removal and fetch pushdown (including through projections).
datafusion/common/src/file_options/parquet_writer.rs Threads the new option through ParquetOptions test scaffolding / writer-option destructuring.
datafusion/common/src/config.rs Adds the enable_exact_reverse_scan Parquet config definition and docs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread datafusion/physical-optimizer/src/pushdown_sort.rs
Comment thread datafusion/datasource-parquet/src/opener.rs Outdated
Comment thread datafusion/proto-common/src/from_proto/mod.rs
Comment thread datafusion/proto/src/logical_plan/file_formats.rs
zhuqi-lucas added a commit that referenced this pull request Apr 16, 2026
Addresses Copilot review comment on PR #47: when `row_selection` is present
(e.g. from page pruning via pushdown_filters), the parquet stream emits only
the selected rows, so seeding `rg_row_counts` from `RowGroupMetaData::num_rows()`
caused ReversedRowGroupStream to mis-detect row-group boundaries and silently
mix batches from multiple row groups, producing wrong ordering.

Fix: new `compute_selected_rows_per_rg` helper walks the RowSelection in
lock-step with the row groups and computes the actual output row count per RG.

Tests added:
- 4 unit tests for compute_selected_rows_per_rg (no skip, spanning skips,
  all-skipped, short selection error)
- test_exact_reverse_scan_multi_rg_produces_global_desc: verifies Inexact
  yields [7,8,9,4,5,6,1,2,3] while Exact yields [9..1] (globally DESC)
- test_exact_reverse_scan_applies_limit_after_reversal: verifies limit=4
  over [1..9] yields [9,8,7,6] (top of forward order, not first N pre-reverse)
- test_exact_reverse_scan_with_row_selection_across_rgs: regression test
  for the row_selection bug — 3 RGs with per-RG selections yield the
  expected [10,9,8,7,6,5,4,3]
- test_exact_reverse_scan_with_row_selection_and_limit: combined case
zhuqi-lucas added a commit that referenced this pull request Apr 16, 2026
Addresses Copilot review comment on PR #47: when `row_selection` is present
(e.g. from page pruning via pushdown_filters), the parquet stream emits only
the selected rows, so seeding `rg_row_counts` from `RowGroupMetaData::num_rows()`
caused ReversedRowGroupStream to mis-detect row-group boundaries and silently
mix batches from multiple row groups, producing wrong ordering.

Fix: new `compute_selected_rows_per_rg` helper walks the RowSelection in
lock-step with the row groups and computes the actual output row count per RG.

Tests added:
- 4 unit tests for compute_selected_rows_per_rg (no skip, spanning skips,
  all-skipped, short selection error)
- test_exact_reverse_scan_multi_rg_produces_global_desc: verifies Inexact
  yields [7,8,9,4,5,6,1,2,3] while Exact yields [9..1] (globally DESC)
- test_exact_reverse_scan_applies_limit_after_reversal: verifies limit=4
  over [1..9] yields [9,8,7,6] (top of forward order, not first N pre-reverse)
- test_exact_reverse_scan_with_row_selection_across_rgs: regression test
  for the row_selection bug — 3 RGs with per-RG selections yield the
  expected [10,9,8,7,6,5,4,3]
- test_exact_reverse_scan_with_row_selection_and_limit: combined case
// pushdown_filters), the stream emits only the selected rows, so
// `RowGroupMetaData::num_rows()` would over-count and cause
// ReversedRowGroupStream to misdetect row-group boundaries.
let rg_row_counts: Vec<usize> = if reverse_rows {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When rg_row_counts contains a leading or middle 0 (an RG where every row is skipped by RowSelection — reachable via pushdown_filters + page pruning), the stream’s rows_remaining_in_rg sits at 0. The first batch of the next RG arrives, saturating_sub(num_rows) stays 0, and flush_buffer() fires immediately — attributing that batch to the empty RG, advancing current_rg, and splitting the real RG’s batches across two flush cycles. Result: wrong batch ordering within the real RG.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round-trip data flow on a remote executor:
• PushdownSort removes the Sort operator locally (Exact path), leaving only DataSourceExec with reverse_row_groups=true, reverse_rows=true.
• try_from_data_source_exec serializes only FileScanConfig + predicate + table_parquet_options — not the reverse_row_groups/reverse_rows/exact_reverse fields on ParquetSource.
• ParquetOptions → protobuf::ParquetOptions doesn’t include enable_exact_reverse_scan at all, and the reverse path hardcodes false.
• Deserialized plan: DataSourceExec with all reverse flags = false, no Sort above it. Reads files forward → ASC output returned for a DESC query.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest PR.

Copy link
Copy Markdown
Collaborator

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc, the pr won't be contributed to upstream?

@zhuqi-lucas
Copy link
Copy Markdown
Collaborator Author

iirc, the pr won't be contributed to upstream?

I can try to contribute to upstream as an option which is default false.

Copy link
Copy Markdown
Collaborator

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we verify these comments are real risk, could you please also add some slt tests for them?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still a correctness issue in the exact reverse path here.

rg_row_counts only accounts for row_selection, but the stream can also have a row_filter applied. In arrow-rs, RowFilter is applied after row-group selection / row selection, so the actual number of rows emitted for a row group can be smaller than the count computed here.

On top of that, ReversedRowGroupStream infers row-group boundaries by accumulating emitted batch row counts, but the parquet reader can emit batches that span row-group boundaries. That means the boundary detector can drift and silently buffer batches from multiple row groups together.

As a result, queries like WHERE ... ORDER BY ... DESC LIMIT ... on the exact reverse path may return rows in the wrong order without failing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xudong963 Fixed in latest PR:

I redesigned the implementation here:

Replaced ReversedRowGroupStream with per-RG independent reading (modeled after our internal ReverseParquetSource). Each RG gets its own ParquetRecordBatchStreamBuilder with .with_row_groups(vec![rg]), so RowFilter is applied independently per-RG. No more rg_row_counts boundary detection. Memory stays O(largest RG). Added SLT test with pushdown_filters=true + WHERE predicate confirming correct results.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the physical-plan roundtrip is losing execution state here.

On decode we reconstruct the parquet source from ParquetSource::new(...).with_table_parquet_options(...), but the reverse-scan runtime flags (reverse_row_groups / reverse_rows) are not restored. On encode, those fields are not serialized either.

That looks dangerous for any path that ships physical plans through protobuf / RemoteExec: the optimizer may already have removed SortExec, but after deserialization the scan may no longer perform reverse scanning at execution time.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added reverse_row_groups and reverse_rows fields to ParquetScanExecNode proto message. Serialized on encode, restored on decode. Remote executors now correctly preserve reverse scan state after plan
roundtrip.

Arc::new(self.clone().with_reverse_row_groups(true))
};

if self.exact_reverse {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think returning Exact here may make the plan properties inconsistent.

On the exact-reverse path, the scan now produces the reversed order, but the downstream FileScanConfig metadata still keeps the original output_ordering rather than reversing it. That means after SortExec is removed, the plan can still advertise the original ordering even though the actual output direction has changed.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output_ordering reflects the physical sort order of the data files, which doesn't change, we just read it in reverse. The Exact result tells the optimizer that the output satisfies the requested reversed ordering, and PushdownSort updates the plan properties accordingly when it removes the Sort. So FileScanConfig.output_ordering staying as-is is correct here.

@zhuqi-lucas
Copy link
Copy Markdown
Collaborator Author

If we verify these comments are real risk, could you please also add some slt tests for them?

Thanks for review, i will look back this PR after urgent issues done.

@zhuqi-lucas zhuqi-lucas force-pushed the exact-reverse-scan branch 7 times, most recently from e7b8083 to f0004c4 Compare April 23, 2026 07:40
@zhuqi-lucas zhuqi-lucas merged commit f0004c4 into branch-52 Apr 23, 2026
57 of 59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants