feat(7181): add cursor slicing #7798
tustvold
left a comment
This is getting closer, but I think there is still an issue with the way this handles the null_threshold.
assert_eq!(a, b);
assert_eq!(a.cmp(&b), Ordering::Equal);

// 2 > NULL
// 2 > NULL
// i32::MIN > NULL
Fixed the null mask to work properly. Explicit test cases pushed. Let me know if it's correct this time.
Self {
    values: values.slice(offset, length),
    offset: 0,
    null_threshold: null_threshold.checked_sub(offset).unwrap_or(0),
I would expect this logic to depend on the null ordering. In particular I would expect if nulls are first, to decrement by offset, and otherwise by self.len - offset - length or something...
Calculation is different, and a bit more explicit. Lmk if ok.
…lls_first and nulls_last slicing
fn slice(&self, offset: usize, length: usize) -> Self {
    let FieldCursor {
        values,
        offset: _,
This seems at odds with the behaviour of RowCursor, which takes the current offset into account
We can remove the data slicing of the underlying FieldCursor.values. That slicing is a zero-copy of the underlying ScalarBuffer or GenericByteArray.
Would you prefer a switch to using FieldCursor offsets in the same way as the RowCursor?
I don't see an issue with slicing the underlying values. My observation is that the following will behave differently between RowCursor and FieldCursor:
cursor.advance();
cursor.slice(1, 2);
In the case of RowCursor it will produce a slice that is offset by 2 from the start, whereas FieldCursor will produce one that is only offset by 1? I think...
It should just be a case of changing this method to use self.offset + offset instead of just offset
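The difference described above can be reproduced with a toy cursor. This is only a sketch: `ToyCursor`, `slice_ignoring_offset`, `slice_from_current` and `demo` are hypothetical names for illustration, not the DataFusion types, and a `Vec<i32>` copy stands in for the zero-copy buffer slice.

```rust
/// Hypothetical toy cursor (not the DataFusion type): a buffer plus a read position.
#[derive(Debug, Clone)]
struct ToyCursor {
    values: Vec<i32>,
    offset: usize,
}

impl ToyCursor {
    /// Moves the read position forward by one element.
    fn advance(&mut self) {
        self.offset += 1;
    }

    /// Value the cursor currently points at.
    fn current(&self) -> i32 {
        self.values[self.offset]
    }

    /// Slices relative to the start of the buffer, ignoring `self.offset`.
    fn slice_ignoring_offset(&self, offset: usize, length: usize) -> Self {
        Self {
            values: self.values[offset..offset + length].to_vec(),
            offset: 0,
        }
    }

    /// Slices relative to the current position, i.e. from `self.offset + offset`.
    fn slice_from_current(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self {
            values: self.values[start..start + length].to_vec(),
            offset: 0,
        }
    }
}

/// Advance once, then slice(1, 2) both ways; returns the first value of each slice.
fn demo() -> (i32, i32) {
    let mut cursor = ToyCursor { values: vec![10, 20, 30, 40, 50], offset: 0 };
    cursor.advance();
    (
        cursor.slice_ignoring_offset(1, 2).current(),
        cursor.slice_from_current(1, 2).current(),
    )
}
```

In this sketch the first variant starts at index 1 of the buffer and the second at index 2, which is the discrepancy the comment points at.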
let shorter_len = self.values.len().saturating_sub(offset + length + 1);
null_threshold.saturating_sub(offset.saturating_sub(shorter_len))
Now that I think about this more, I am unsure why null_threshold.saturating_sub(offset) is incorrect
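One way to check this is to compare null masks directly. The sketch below assumes, and this is an assumption rather than something stated in the thread, that `null_threshold` is the index of the first non-null value when nulls sort first and the index of the first null when nulls sort last. The helpers `is_null`, `sliced_threshold` and `masks` are hypothetical, and the clamp to `length` is an extra that only keeps the threshold within the window.

```rust
/// Assumed semantics (hypothetical): with nulls first, every index below the
/// threshold is null; with nulls last, every index at or above it is null.
fn is_null(idx: usize, null_threshold: usize, nulls_first: bool) -> bool {
    (idx < null_threshold) == nulls_first
}

/// Threshold for the window `[offset, offset + length)`: shift by `offset`,
/// then clamp to the window length.
fn sliced_threshold(null_threshold: usize, offset: usize, length: usize) -> usize {
    null_threshold.saturating_sub(offset).min(length)
}

/// Null mask of the window computed two ways: from the parent threshold
/// (first element) and from the shifted threshold (second element).
fn masks(
    len: usize,
    null_threshold: usize,
    nulls_first: bool,
    offset: usize,
    length: usize,
) -> (Vec<bool>, Vec<bool>) {
    assert!(offset + length <= len, "window must fit inside the parent");
    let expected: Vec<bool> = (offset..offset + length)
        .map(|i| is_null(i, null_threshold, nulls_first))
        .collect();
    let shifted = sliced_threshold(null_threshold, offset, length);
    let actual: Vec<bool> = (0..length)
        .map(|i| is_null(i, shifted, nulls_first))
        .collect();
    (expected, actual)
}
```

Under that assumption the two masks agree for both null orderings, because the threshold is an index measured from the front of the values in either case, so shifting the window start by `offset` shifts the threshold by the same amount.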
    t
}

fn slice(&self, offset: usize, length: usize) -> Self {
What would happen if this method was simply
Self {
    values: self.values.slice(0, self.offset + offset + length),
    offset: self.offset + offset,
    null_threshold: self.null_threshold,
}
Or equivalently (I think)
Self {
    values: self.values.slice(offset + self.offset, length),
    offset: 0,
    null_threshold: self.null_threshold.saturating_sub(offset + self.offset),
}
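A small model can test whether the two forms are interchangeable. This is a sketch, not the real `FieldCursor`: `ToyFieldCursor`, its method names and `demo` are hypothetical, a `Vec<i32>` copy replaces the zero-copy slice, and the null check `(i < null_threshold) == nulls_first` is an assumed reading of the threshold.

```rust
/// Hypothetical stand-in for a field cursor over already-sorted values.
#[derive(Debug, Clone)]
struct ToyFieldCursor {
    values: Vec<i32>,
    offset: usize,
    null_threshold: usize,
    nulls_first: bool,
}

impl ToyFieldCursor {
    /// Form 1: keep the prefix, carry the offset forward, leave the threshold alone.
    fn slice_keep_offset(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self {
            values: self.values[..start + length].to_vec(),
            offset: start,
            null_threshold: self.null_threshold,
            nulls_first: self.nulls_first,
        }
    }

    /// Form 2: drop the prefix, reset the offset, shift the threshold.
    fn slice_reset_offset(&self, offset: usize, length: usize) -> Self {
        let start = self.offset + offset;
        Self {
            values: self.values[start..start + length].to_vec(),
            offset: 0,
            null_threshold: self.null_threshold.saturating_sub(start),
            nulls_first: self.nulls_first,
        }
    }

    /// Everything the cursor would still yield; `None` marks a null slot.
    fn remaining(&self) -> Vec<Option<i32>> {
        (self.offset..self.values.len())
            .map(|i| {
                let is_null = (i < self.null_threshold) == self.nulls_first;
                if is_null { None } else { Some(self.values[i]) }
            })
            .collect()
    }
}

/// Advance once (offset = 1), then slice(1, 3) with each form.
fn demo(nulls_first: bool) -> (Vec<Option<i32>>, Vec<Option<i32>>) {
    let cursor = ToyFieldCursor {
        values: vec![1, 2, 3, 4, 5, 6],
        offset: 1,
        null_threshold: 3,
        nulls_first,
    };
    (
        cursor.slice_keep_offset(1, 3).remaining(),
        cursor.slice_reset_offset(1, 3).remaining(),
    )
}
```

In this model both forms yield the same remaining sequence for either null ordering, which supports the "equivalently" above as long as `self.offset` is folded into the start of the slice.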
The RowCursor slicing does not slice the underlying rows, therefore it is self.offset + offset.
Whereas the FieldCursor does slice the underlying data, and therefore the offset is reset to 0.
But the logic below simply ignores the value of self.offset?
We changed the abstractions and are now separating the Cursor from the CursorValues. After this PR merges, we will add slicing to the CursorValues.
Which issue does this PR close?
Adds cursor slicing as a prerequisite for the cascading merge.
Part of #7181
Rationale for this change
The need for a sliced cursor is described here, in its later use of partially yielded record batches.
What changes are included in this PR?
slice() in Cursor interface
num_rows() in Cursor interface. Used here and later in the cascaded merge.
Are these changes tested?
yes
Primitive cursor slicing is unit tested here.
Row cursor slicing is tested/used in the cascading merge.
Are there any user-facing changes?
No. Cursor interface is crate private.