Describe the bug
While debugging the DataFusion 52 upgrade, I found a wrong results bug with pre-sorted data that was introduced in 52
To Reproduce
CREATE TABLE agg_src(x INT, y INT, v INT) AS VALUES
(1, 1, 10),
(1, 2, 20),
(1, 3, 30),
(2, 1, 40),
(2, 2, 50),
(2, 3, 60);
-- Create an ordered table:
COPY (SELECT * FROM agg_src ORDER BY x, y) TO 'foo.parquet';
Then run
CREATE EXTERNAL TABLE agg_src_sorted(x INT, y INT, v INT) STORED AS PARQUET LOCATION 'foo.parquet' WITH ORDER (x ASC, y ASC);
set datafusion.execution.target_partitions = 1;
-- This query orders by an expresson of y that breaks the ordering
SELECT
x,
CAST(y AS BIGINT) % 2,
SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;
With Datafusion 52, you get the wrong answer:
andrewlamb@Andrews-MacBook-Pro-3 ~ % ~/Software/datafusion-cli/datafusion-cli-52.1.0
> SELECT
x,
CAST(y AS BIGINT) % 2,
SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;
+---+-----------------------------+-----------------------+
| x | agg_src_sorted.y % Int64(2) | sum(agg_src_sorted.v) |
+---+-----------------------------+-----------------------+
| 1 | 1 | 40 |
| 1 | 0 | 20 | <---- the second column is 1 then 0, rather than 0 then 1
| 2 | 1 | 100 |
| 2 | 0 | 50 |
+---+-----------------------------+-----------------------+
4 row(s) fetched.
Elapsed 0.006 seconds.
On datafusion 51
andrewlamb@Andrews-MacBook-Pro-3 ~ % ~/Software/datafusion-cli/datafusion-cli-51.0.0
You get the expected answer
> SELECT
x,
CAST(y AS BIGINT) % 2,
SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;
+---+-----------------------------+-----------------------+
| x | agg_src_sorted.y % Int64(2) | sum(agg_src_sorted.v) |
+---+-----------------------------+-----------------------+
| 1 | 0 | 20 | <---- this row is in the correct sopt
| 1 | 1 | 40 |
| 2 | 0 | 50 |
| 2 | 1 | 100 |
+---+-----------------------------+-----------------------+
4 row(s) fetched.
Elapsed 0.002 seconds.
### Expected behavior
_No response_
### Additional context
_No response_
Describe the bug
While debugging the DataFusion 52 upgrade, I found a wrong results bug with pre-sorted data that was introduced in 52
To Reproduce
Then run
With Datafusion 52, you get the wrong answer:
On datafusion 51
You get the expected answer