Skip to content

Incorrect answers due to an incorrectly removed Sort before AggregateExec #20244

@alamb

Description

@alamb

Describe the bug

While debugging the DataFusion 52 upgrade, I found a wrong results bug with pre-sorted data that was introduced in 52

To Reproduce

CREATE TABLE agg_src(x INT, y INT, v INT) AS VALUES
(1, 1, 10),
(1, 2, 20),
(1, 3, 30),
(2, 1, 40),
(2, 2, 50),
(2, 3, 60);

-- Create an ordered table:
COPY (SELECT * FROM agg_src ORDER BY x, y) TO 'foo.parquet';

Then run

CREATE EXTERNAL TABLE agg_src_sorted(x INT, y INT, v INT) STORED AS PARQUET LOCATION 'foo.parquet' WITH ORDER (x ASC, y ASC);

set datafusion.execution.target_partitions = 1;

-- This query orders by an expresson of y that breaks the ordering 
SELECT
  x,
  CAST(y AS BIGINT) % 2,
  SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;

With Datafusion 52, you get the wrong answer:

andrewlamb@Andrews-MacBook-Pro-3 ~ % ~/Software/datafusion-cli/datafusion-cli-52.1.0
> SELECT
  x,
  CAST(y AS BIGINT) % 2,
  SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;
+---+-----------------------------+-----------------------+
| x | agg_src_sorted.y % Int64(2) | sum(agg_src_sorted.v) |
+---+-----------------------------+-----------------------+
| 1 | 1                           | 40                    |
| 1 | 0                           | 20                    | <---- the second column is 1 then 0, rather than 0 then 1
| 2 | 1                           | 100                   |
| 2 | 0                           | 50                    |
+---+-----------------------------+-----------------------+
4 row(s) fetched.
Elapsed 0.006 seconds.

On datafusion 51

andrewlamb@Andrews-MacBook-Pro-3 ~ % ~/Software/datafusion-cli/datafusion-cli-51.0.0

You get the expected answer

> SELECT
  x,
  CAST(y AS BIGINT) % 2,
  SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;
+---+-----------------------------+-----------------------+
| x | agg_src_sorted.y % Int64(2) | sum(agg_src_sorted.v) |
+---+-----------------------------+-----------------------+
| 1 | 0                           | 20                    | <---- this row is in the correct sopt
| 1 | 1                           | 40                    |
| 2 | 0                           | 50                    |
| 2 | 1                           | 100                   |
+---+-----------------------------+-----------------------+
4 row(s) fetched.
Elapsed 0.002 seconds.

### Expected behavior

_No response_

### Additional context

_No response_

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingregressionSomething that used to work no longer does

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions