Refresh PyVortex to be more Arrow focussed #7505

Draft
gatesn wants to merge 4 commits into develop from ngates/pyvortex-refresh

Conversation

gatesn commented Apr 16, 2026

No description provided.

Signed-off-by: Nicholas Gates <nick@nickgates.com>
gatesn added the changelog/feature (A new feature) label Apr 17, 2026, with ChatGPT Codex Connector
Comment on lines +39 to +49
let coerced_dtypes = if node
    .as_opt::<Binary>()
    .is_some_and(|operator| operator.is_comparison())
{
    match comparison_literal_coercions(&node, &child_dtypes) {
        Some(coerced_dtypes) => coerced_dtypes,
        None => node.scalar_fn().coerce_args(&child_dtypes)?,
    }
} else {
    node.scalar_fn().coerce_args(&child_dtypes)?
};
Contributor

Unrelated, but can we use this in DF to deal with Decimal casting?

Comment thread vortex-array/src/expr/plan.rs
})
}

/// Plan an expression against a Vortex dtype.
Contributor

Worth explaining what "planning" is.

/// The [`DType`] returned by the scan, after applying the projection.
pub fn dtype(&self) -> VortexResult<DType> {
    self.projection.return_dtype(self.layout_reader.dtype())
    plan_expression(self.projection.clone(), self.layout_reader.dtype())?
Contributor

OK, after reading through this whole PR, isn't this a significant behavior change? We used to talk about Vortex expressions as very strict and "physical", but now we implicitly cast and adapt them.

Contributor

Hmmm, we do need some way of planning the query and executing expressions whose types don't necessarily match. Vortex expressions shouldn't do anything implicitly; instead, we have this planning stage that inserts all the additional operations. However, there's still a limit on the auto-coercions you'd want to do, and we should limit them to only the infallible ones.
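
To make the "planning inserts explicit operations, but only infallible ones" idea concrete, here is a minimal, self-contained sketch. The `DType`, `Expr`, `widens_losslessly`, and `plan_eq` names are hypothetical toy types invented for illustration, not the Vortex API, and the set of allowed widenings is an assumption.

```rust
// Toy types for illustration only; these are NOT the Vortex `DType`/expression types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DType { I32, I64, F64, Utf8, Bool }

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(DType),
    Cast(Box<Expr>, DType),
    Eq(Box<Expr>, Box<Expr>),
}

/// The dtype an expression evaluates to.
fn dtype_of(expr: &Expr) -> DType {
    match expr {
        Expr::Column(dtype) | Expr::Cast(_, dtype) => *dtype,
        Expr::Eq(..) => DType::Bool,
    }
}

/// True only when every value of `from` is exactly representable in `to`,
/// so the cast can never fail or lose information (assumed rule set).
fn widens_losslessly(from: DType, to: DType) -> bool {
    matches!((from, to), (DType::I32, DType::I64) | (DType::I32, DType::F64))
}

/// Plan an equality: the expression itself stays strict, and the planner
/// makes any coercion explicit by inserting a `Cast` node. Anything that
/// would need a fallible cast is rejected instead of silently adapted.
fn plan_eq(lhs: Expr, rhs: Expr) -> Result<Expr, String> {
    let (l, r) = (dtype_of(&lhs), dtype_of(&rhs));
    let (lhs, rhs) = if l == r {
        (lhs, rhs)
    } else if widens_losslessly(l, r) {
        (Expr::Cast(Box::new(lhs), r), rhs)
    } else if widens_losslessly(r, l) {
        (lhs, Expr::Cast(Box::new(rhs), l))
    } else {
        return Err(format!("no infallible coercion between {l:?} and {r:?}"));
    };
    Ok(Expr::Eq(Box::new(lhs), Box::new(rhs)))
}
```

In this sketch an `I32 == I64` comparison is planned with a cast on the left side, while `I64 == F64` is rejected because not every `i64` is exactly representable as an `f64`.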

Contributor Author

Using Vortex directly from Python feels much more like the front-end of a query engine.

Honestly, I'm not sure where to draw the line.

If you use Vortex from inside a query engine, you have almost certainly already done physical planning. If you use Vortex directly, you probably haven't. It's a shame that Arrow Datasets also don't do physical planning, else we could use that as our "engine interface".

Contributor

I just wanted to point out that this is a significant conceptual change which is bigger than what the title hints at.

One way to draw the line is to have a logical expression layer that handles this, which as far as I can tell is what pyarrow does: the expression is "bound" to a specific schema during execution (e.g. during `dataset.to_table(..)`).

Comment on lines +73 to +74
Use `filter_policy="pushdown"` to raise when a PyArrow expression cannot be pushed into Vortex. Use
`filter_policy="fallback"` to read the rows and apply the PyArrow filter after the scan.
Contributor

What is the default?

Comment thread docs/user-guide/pyarrow.md
Comment on lines +80 to +89
let lhs_literal = node.child(0).is::<Literal>();
let rhs_literal = node.child(1).is::<Literal>();
if lhs_literal == rhs_literal {
    return None;
}

let literal_idx = usize::from(rhs_literal);
let context_idx = usize::from(lhs_literal);
let literal_dtype = &child_dtypes[literal_idx];
let context_dtype = &child_dtypes[context_idx];
Contributor

nit: cute, but a `match` would have been more readable
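
For reference, a sketch of what the `match` form could look like. The helper name `literal_and_context_idx` is hypothetical, and it takes the two booleans directly rather than the `node` from the diff; it returns the same `(literal_idx, context_idx)` pairs as the `usize::from` version and `None` when both or neither side is a literal.

```rust
/// Returns `(literal_idx, context_idx)` when exactly one side of a binary
/// comparison is a literal, and `None` otherwise.
/// Hypothetical helper; mirrors the index arithmetic in the quoted diff.
fn literal_and_context_idx(lhs_literal: bool, rhs_literal: bool) -> Option<(usize, usize)> {
    match (lhs_literal, rhs_literal) {
        // Literal on the left: the right child supplies the context dtype.
        (true, false) => Some((0, 1)),
        // Literal on the right: the left child supplies the context dtype.
        (false, true) => Some((1, 0)),
        // Both or neither are literals: nothing to coerce.
        _ => None,
    }
}
```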

Signed-off-by: Nicholas Gates <nick@nickgates.com>
codspeed-hq bot commented Apr 17, 2026

Merging this PR will degrade performance by 18.85%

❌ 1 regressed benchmark
✅ 1162 untouched benchmarks
⏩ 1457 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | `old_alp_prim_test_between[f64, 32768]` | 284.8 µs | 351 µs | -18.85% |

Comparing ngates/pyvortex-refresh (7527931) with develop (4135209)

Footnotes

  1. 1457 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

gatesn added 2 commits April 17, 2026 20:36
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
4 participants