Fix DataFrame::cache errors with Plan("Mismatch between schema and batches") #8510
alamb merged 4 commits into apache:main
alamb left a comment
Thank you for this contribution @Asura7969 -- I think we can do this slightly more efficiently / consistently, but the overall idea is 👍
    /// ```
    pub async fn cache(self) -> Result<DataFrame> {
        let context = SessionContext::new_with_state(self.session_state.clone());
        // type cast
Rather than explicitly running (only) coercion, which is also run as part of execution, I suggest you use the schema that comes directly from the output stream.
Something like (copied from Self::collect): https://docs.rs/datafusion/latest/src/datafusion/dataframe/mod.rs.html#756-760
let task_ctx = Arc::new(self.task_ctx());
let plan = self.create_physical_plan().await?;
let schema = plan.schema();
collect(plan, task_ctx).await
Thanks for your advice
    // The schema is consistent with the output
    let physical_plan = self.clone().create_physical_plan().await?;
    let mem_table =
        MemTable::try_new(physical_plan.schema(), self.collect_partitioned().await?)?;
FWIW this now runs the planner twice: once in create_physical_plan() and again inside Self::collect_partitioned. We could make it more efficient by calling collect_partitioned() directly on the physical plan rather than Self::collect_partitioned, which will replan (and recollect).
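To make the plan-once idea concrete, here is a minimal toy model using only the standard library. It is a sketch, not DataFusion code: `Schema`, `Batch`, `PhysicalPlan`, `Planner`, and this `MemTable` are hypothetical stand-ins whose names merely mirror the ones discussed above, and the column types are made up for illustration. It shows the schema and the batches both coming from the same physical plan, with a counter confirming the planner runs a single time.

```rust
use std::cell::Cell;

// Toy stand-ins (NOT DataFusion types): a schema is just a list of column type names.
#[derive(Clone, Debug, PartialEq)]
struct Schema(Vec<&'static str>);

struct Batch {
    schema: Schema,
    rows: usize,
}

struct PhysicalPlan {
    schema: Schema,
}

impl PhysicalPlan {
    fn schema(&self) -> Schema {
        self.schema.clone()
    }

    // Executing the plan yields partitions of batches carrying the plan's output schema.
    fn collect_partitioned(&self) -> Vec<Vec<Batch>> {
        vec![vec![Batch { schema: self.schema.clone(), rows: 3 }]]
    }
}

// Hypothetical planner that counts how often planning happens.
struct Planner {
    runs: Cell<u32>,
}

impl Planner {
    fn create_physical_plan(&self) -> PhysicalPlan {
        self.runs.set(self.runs.get() + 1);
        PhysicalPlan { schema: Schema(vec!["Int64", "Utf8"]) }
    }
}

struct MemTable {
    schema: Schema,
    partitions: Vec<Vec<Batch>>,
}

impl MemTable {
    // Rejects batches whose schema differs from the declared table schema.
    fn try_new(schema: Schema, partitions: Vec<Vec<Batch>>) -> Result<Self, String> {
        let consistent = partitions.iter().flatten().all(|b| b.schema == schema);
        if consistent {
            Ok(MemTable { schema, partitions })
        } else {
            Err("Mismatch between schema and batches".to_string())
        }
    }
}

// Plan once, then take both the schema and the batches from that single plan.
fn cache(planner: &Planner) -> Result<MemTable, String> {
    let plan = planner.create_physical_plan();
    let schema = plan.schema();
    let partitions = plan.collect_partitioned();
    MemTable::try_new(schema, partitions)
}

fn main() {
    let planner = Planner { runs: Cell::new(0) };
    let table = cache(&planner).expect("schema and batches come from the same plan");
    let rows: usize = table.partitions.iter().flatten().map(|b| b.rows).sum();
    println!(
        "schema {:?}, {} row(s), planner ran {} time(s)",
        table.schema,
        rows,
        planner.runs.get()
    );
}
```

Because the table schema is read from the plan that produces the batches, the consistency check in `try_new` cannot fail for this path, and no second planning pass is needed.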
Which issue does this PR close?
Closes #8476.
Rationale for this change
#8476 (comment)
What changes are included in this PR?
Are these changes tested?
test_cache_mismatch
Are there any user-facing changes?