Make a faster way to check column existence in optimizer (not is_err()) #5309

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #5157

There are many places in the code that use fallible functions on DFSchema to check if a column exists:
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column

For example, there is code like the following, which calls is_ok() or is_err() and completely discards the error along with its message string:

input_schema.field_from_column(col).is_ok()

This is problematic because these functions return a DataFusionError that not only contains an allocated String but has often gone through a lot of effort to construct a nice error message. You can see them in the trace on #5157.

As part of making the optimizer faster (related to #5157), we need to avoid these string allocations.

Thus I propose:

  1. Add new functions for checking that return a bool rather than an error
  2. Replace the use of is_ok() / is_err() with calls to the new functions

Find the field with the given qualified column

For example,

impl DFSchema {
  // existing function that returns Result
  pub fn field_from_column(&self, column: &Column) -> Result<&DFField> {...}

  // new function that returns bool  <---- Add this new function
  pub fn has_column(&self, column: &Column) -> bool {...}
}

And then replace the code that has the pattern

input_schema.field_from_column(col).is_ok()

With

input_schema.has_column(col)
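
To make the difference concrete, here is a minimal, self-contained sketch. It does not use the real DataFusion types: Column, DFField, SchemaError, DFSchema::new, and the matches helper below are simplified, hypothetical stand-ins, and the rule that an unqualified column matches any field of the same name is an assumption made for the example. The point is only that the Result-returning lookup formats and allocates a String on a miss, while a bool-returning check can answer the same question with no allocation.

```rust
// Hypothetical stand-ins for DataFusion's Column, DFField, and DFSchema.
// The real types carry more information (data types, nullability, ...).
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub relation: Option<String>,
    pub name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct DFField {
    pub qualifier: Option<String>,
    pub name: String,
}

// Simplified error type: like the real error, it owns an allocated String.
#[derive(Debug)]
pub struct SchemaError(pub String);

pub struct DFSchema {
    fields: Vec<DFField>,
}

impl DFSchema {
    pub fn new(fields: Vec<DFField>) -> Self {
        Self { fields }
    }

    // Assumed matching rule for this sketch: names must be equal, and a
    // qualified column must also match the field's qualifier.
    fn matches(field: &DFField, column: &Column) -> bool {
        field.name == column.name
            && match &column.relation {
                Some(relation) => field.qualifier.as_ref() == Some(relation),
                None => true,
            }
    }

    // Fallible lookup: on a miss it clones every field name and formats
    // a message, even if the caller only goes on to call is_ok().
    pub fn field_from_column(&self, column: &Column) -> Result<&DFField, SchemaError> {
        self.fields
            .iter()
            .find(|f| Self::matches(f, column))
            .ok_or_else(|| {
                let valid: Vec<String> =
                    self.fields.iter().map(|f| f.name.clone()).collect();
                SchemaError(format!(
                    "No field named '{}'. Valid fields are {}",
                    column.name,
                    valid.join(", ")
                ))
            })
    }

    // Existence check: same matching logic, but no error value is built,
    // so a miss costs only the scan over the fields.
    pub fn has_column(&self, column: &Column) -> bool {
        self.fields.iter().any(|f| Self::matches(f, column))
    }
}
```

In this sketch has_column(col) always agrees with field_from_column(col).is_ok(), so call sites can be switched over mechanically; only the cost of the miss path changes.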

Describe the solution you'd like
Ideally someone would do this transition one function on DFSchema at a time (not one giant PR please 🙏 )

Describe alternatives you've considered
There are more involved proposals for larger changes to DFSchema but simply avoiding this check might help a lot

Additional context
I think this is a good first exercise, as the desired change is well spelled out and it is a software engineering exercise rather than one that requires deep DataFusion expertise
