DataFrame::cache errors with Plan("Mismatch between schema and batches") but query works when not cached #8476

@tv42

Describe the bug

DataFrame::cache returns an error for a query that succeeds when it is executed without caching the results first.

I would have expected caching to have no effect on success/failure.
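One possible reading of the error message, offered as a guess rather than a diagnosis: the cached path may materialize the results into an in-memory table that checks every batch against the schema declared by the plan, while the uncached path returns the batches without that check. The toy model below sketches that difference using only the standard library. Every type and function in it (DataType, Batch, MemTable, collect, cache_then_collect) is a hypothetical stand-in and not DataFusion's actual code, and the specific Int64-versus-Null disagreement is an assumption chosen for illustration.

```rust
// Toy model only: these types are hypothetical stand-ins, not DataFusion's.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Null,
    Int64,
}

/// A batch that records the column type it was actually produced with.
#[derive(Debug)]
struct Batch {
    column_type: DataType,
}

/// Hypothetical in-memory table that validates its batches on construction.
#[derive(Debug)]
struct MemTable {
    batches: Vec<Batch>,
}

impl MemTable {
    fn try_new(declared: DataType, batches: Vec<Batch>) -> Result<Self, String> {
        // Reject any batch whose actual type differs from the declared schema.
        if batches.iter().any(|b| b.column_type != declared) {
            return Err("Mismatch between schema and batches".to_string());
        }
        Ok(MemTable { batches })
    }
}

/// Uncached path: batches are handed back as-is, with no schema check.
fn collect(batches: Vec<Batch>) -> Vec<Batch> {
    batches
}

/// Cached path: batches are first materialized into the validating table.
fn cache_then_collect(declared: DataType, batches: Vec<Batch>) -> Result<Vec<Batch>, String> {
    let table = MemTable::try_new(declared, batches)?;
    Ok(table.batches)
}

fn main() {
    // Assumption for illustration: the plan declares Int64, but execution
    // of the CASE expression yields a Null-typed column.
    let declared = DataType::Int64;

    let uncached = collect(vec![Batch { column_type: DataType::Null }]);
    println!("uncached: {} batch(es)", uncached.len());

    let cached = cache_then_collect(declared, vec![Batch { column_type: DataType::Null }]);
    println!("cached: {:?}", cached);
}
```

If something like this is what happens, the underlying disagreement between the declared and the produced type would exist in both paths, and caching would only be the step that happens to surface it.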

To Reproduce

use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let sql = "SELECT CASE WHEN true THEN NULL ELSE 1 END;";
    let ctx = SessionContext::new();
    let plan = ctx.state().create_logical_plan(sql).await?;
    let df = ctx.execute_logical_plan(plan).await?;
    // Comment out the next line to make the error go away.
    let df = df.cache().await?;
    let batches = df.collect().await?;
    let display = datafusion::arrow::util::pretty::pretty_format_batches(&batches).unwrap();
    println!("{}", display);
    Ok(())
}

Expected behavior

The behavior with and without let df = df.cache().await? should be functionally the same; caching should only change performance and memory use.

Additional context

No response

Labels: bug
