Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #5157
There are many places in the code that use fallible functions on DFSchema to check if a column exists:
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column
For example, there is code that looks like this (it calls is_ok() or is_err() and totally discards the error, along with its string):
input_schema.field_from_column(col).is_ok()
This is problematic because these functions return a DataFusionError that not only has an allocated String but has also often gone through a lot of effort to construct a nice error message. You can see them appearing in the trace on #5157.
As part of making the optimizer faster (related to #5157), we need to avoid these string allocations.
Thus I propose:
- Add new functions for checking that return a bool rather than an error
- Replace the use of is_ok() / is_err() on the fallible lookups, such as field_from_column ("Find the field with the given qualified column"), with the new functions
For example,
impl DFSchema {
    // existing function that returns Result
    pub fn field_from_column(&self, column: &Column) -> Result<&DFField> {...}

    // new function that returns bool <---- Add this new function
    pub fn has_column(&self, column: &Column) -> bool {...}
}
And then replace code that has the pattern
input_schema.field_from_column(col).is_ok()
With
input_schema.has_column(col)
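To make the difference concrete, here is a minimal sketch of the idea. It does not use the real datafusion_common types: Column, Field and Schema below are simplified, hypothetical stand-ins, the error is a plain String rather than a DataFusionError, and the private find helper is only an assumption about how the lookup could be shared. The point is that the bool-returning check never builds an error message on a miss.

```rust
// Simplified, hypothetical stand-ins for Column / DFField / DFSchema.
// These are NOT the real datafusion_common types.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub relation: Option<String>,
    pub name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub qualifier: Option<String>,
    pub name: String,
}

pub struct Schema {
    pub fields: Vec<Field>,
}

impl Schema {
    /// Shared lookup (hypothetical helper): no allocation on a hit or a miss.
    fn find(&self, column: &Column) -> Option<&Field> {
        self.fields.iter().find(|f| {
            f.name == column.name
                && match &column.relation {
                    // Qualified column: the qualifier must match too.
                    Some(r) => f.qualifier.as_deref() == Some(r.as_str()),
                    // Unqualified column: match on the name alone.
                    None => true,
                }
        })
    }

    /// Fallible lookup: allocates and formats an error String on a miss.
    pub fn field_from_column(&self, column: &Column) -> Result<&Field, String> {
        self.find(column).ok_or_else(|| {
            let valid: Vec<&str> = self.fields.iter().map(|f| f.name.as_str()).collect();
            format!(
                "No field named '{}'. Valid fields are {}",
                column.name,
                valid.join(", ")
            )
        })
    }

    /// Existence check: returns a bool and never builds an error message.
    pub fn has_column(&self, column: &Column) -> bool {
        self.find(column).is_some()
    }
}
```

With this shape, schema.has_column(&col) gives the same answer as schema.field_from_column(&col).is_ok(), but a missing column costs only the scan over the fields rather than a formatted error that is immediately thrown away.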
Describe the solution you'd like
Ideally someone would do this transition one function on DFSchema at a time (not one giant PR please 🙏 )
Describe alternatives you've considered
There are more involved proposals for larger changes to DFSchema, but simply avoiding the error construction in these checks might help a lot.
Additional context
I think this is a good first exercise, as the desire is well spelled out and it is a software engineering exercise rather than one that requires deep DataFusion expertise.