Skip to content

Performance regression on timestemp range join. #9755

@my-vegetable-has-exploded

Description

Describe the bug

For query in #8393, datafusion-cli v34 take 10s to finish the query. But it take 40s for datafusion-cli v36 to finish.

To Reproduce

Mostly same with #8393.

produce data.

CREATE
OR REPLACE TABLE pricing AS
SELECT
    t,
    RANDOM() as v
FROM
    range(
        '2022-01-01' :: TIMESTAMP,
        '2023-01-01' :: TIMESTAMP,
        INTERVAL 1 DAY
    ) ts(t);

COPY pricing to 'pricing.parquet' (format 'parquet');

CREATE
OR REPLACE TABLE timestamps AS
SELECT
    t
FROM
    range(
        '2022-01-01' :: TIMESTAMP,
        '2023-01-01' :: TIMESTAMP,
        INTERVAL 10 SECOND
    ) ts(t);

COPY timestamps to 'timestamps.parquet' (format 'parquet');

run query.

EXPLAIN ANALYZE WITH pricing_state AS (
    SELECT
        t as valid_from,
        COALESCE(
            LEAD(t, 1) OVER (
                ORDER BY
                    t
            ),
            '2077-12-31'
        ) as valid_to,
        v
    FROM
        'pricing.parquet'
)

SELECT
    t.t,
    p.v
FROM
	'timestamps.parquet' t
    LEFT JOIN pricing_state p ON t.t BETWEEN p.valid_from
    AND p.valid_to;

Expected behavior

No response

Additional context

And flamegraph is quite different.

v36
图片

v34
图片

see https://gist.github.com/my-vegetable-has-exploded/ba16c59c96c81fa20f52b56f254ea8be for more information.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions