GH-43349: [R] Fix altrep string columns from readr#43351
GH-43349: [R] Fix altrep string columns from readr#43351jonkeane merged 6 commits intoapache:mainfrom
Conversation
|
|
| for (R_xlen_t i = 0; i < n; i++, ++p_x) { | ||
| SEXP s = *p_x; | ||
| if (s != NA_STRING) { | ||
| if (s != NA_STRING && ALTREP(s)) { |
There was a problem hiding this comment.
This does not actually work yet. With this
arrow/r/tests/testthat/test-utf.R
Lines 19 to 38 in c3ebdf5
Before #43173 this line was:
if (s != NA_STRING && !IS_UTF8(s) && !IS_ASCII(s)) {
I suspect what's going on is that we caught vroom altrep vectors with this if and called SET_STRING_ELT so we didn't get the No Set_elt found for ALTSTRING error. But which ALTREP(s) we detect the non-utf strings here too and attempt to SET_STRING_ELT when we shouldn't.
There was a problem hiding this comment.
I guess I don't understand what's happening in here. L144 says that this should no longer be altrep by the time we get here.
There was a problem hiding this comment.
I also think that STRING_PTR_RO() should materialize an ALTREP character vector, otherwise you can't iterate over the CHARSXP pointers.
Also, ALTREP(s) does not seem to make sense to me, that's a CHARSXP (not a STRSXP), so it cannot be ALTREP, unless I am misremembering something.
Are you trying to catch the non-UTF-8 strings here? Rf_translateCharUTF8() does not do anything (returns the same const char *) if the string is UTF-8, so you could always call it, and only call SET_STRING_ELT() if the returned pointer is different?
There was a problem hiding this comment.
I also think that STRING_PTR_RO() should materialize an ALTREP character vector, otherwise you can't iterate over the CHARSXP pointers.
Hmm, so maybe the actual problem here is that that isn't working on the ALTREP vectors coming from vroom? Here is the error we started seeing when we changed this condition from s != NA_STRING && !IS_UTF8(s) && !IS_ASCII(s) to s != NA_STRING:
Error in Table__from_dots(dots, schema, option_use_threads()): No Set_elt found for ALTSTRING class [class: vroom_chr, pkg: vroom]
Also, ALTREP(s) does not seem to make sense to me, that's a CHARSXP (not a STRSXP), so it cannot be ALTREP, unless I am misremembering something.
Yeah, I agree that ALTREP(s) here is not right.
Are you trying to catch the non-UTF-8 strings here? Rf_translateCharUTF8() does not do anything (returns the same const char *) if the string is UTF-8, so you could always call it,
I will admit I'm not 100% certain what is going on here, but I catching non-UTF-8 strings here is what I thought it was doing.
and only call SET_STRING_ELT() if the returned pointer is different?
Oh interesting. Forgive my C++ naïveness, but would this be something like (i.e. does != check if the pointer is different>):
SEXP new_s = Rf_mkCharCE(Rf_translateCharUTF8(s);
if (new_s != s) {
SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(new_s), CE_UTF8));
}
There was a problem hiding this comment.
Yeah, that C++ looks correct, except you don't need to translate twice:
SEXP new_s = Rf_mkCharCE(Rf_translateCharUTF8(s);
if (new_s != s) {
SET_STRING_ELT(x, i, Rf_mkCharCE(new_s));
}However, I think the issue is that if x is ALTREP, then it does not matter if it is materialized of not, it is still ALTREP, so you still cannot change it with SET_STRING_ELT.
To avoid this, you'd need to call Rf_duplicate() on x. That should create a non-altrep copy of it.
You can always call Rf_duplicate(), or you can call it when you encounter a non-utf8 string. Note that you need to PROTECT() the result of Rf_duplicate()
There was a problem hiding this comment.
Aaah thanks for those pointers! That looked like it worked, I'm going to trigger a benchmark run to see if there are any unintended consequences there (Hopefully no, but even if so, we might need to accept them!)
| lubridate, | ||
| pillar, | ||
| pkgload, | ||
| readr, |
There was a problem hiding this comment.
Not sure the test is worth adding to suggests, though we will get rid of that annoying xref error if we do.
|
@ursabot please benchmark |
|
Benchmark runs are scheduled for commit 2a0da1e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
| // ensure that x is not actually altrep first | ||
| if (ALTREP(x)) { | ||
| x = PROTECT(Rf_duplicate(x)); | ||
| UNPROTECT(1); |
There was a problem hiding this comment.
I'm pretty sure this isn't right, doesn't the UNPROTECT need to go after you're done using the thing you PROTECTed?
I'm also not sure how the C-level PROTECT stuff interacts with unwind_protect.
There was a problem hiding this comment.
Hmmm, I'm surprised I don't get any stack imbalance issues with it. I do get them if I don't have UNPROTECT or I call it unconditionally. But anyway I have a slightly different version incoming.
|
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 2a0da1e. There were 38 benchmark results with an error:
There were 11 benchmark results indicating a performance regression:
The full Conbench report has more details. |
|
@github-actions crossbow submit -g r |
|
Revision: 206e94d Submitted crossbow builds: ursacomputing/crossbow @ actions-6de7c6d005 |
| // ensure that x is not actually altrep first | ||
| bool was_altrep = ALTREP(x); | ||
| if (was_altrep) { | ||
| x = PROTECT(Rf_duplicate(x)); |
There was a problem hiding this comment.
Add a comment about why we have to duplicate?
There was a problem hiding this comment.
Yeah, I'll expand what I have there
| @@ -152,6 +157,9 @@ inline SEXP utf8_strings(SEXP x) { | |||
| SET_STRING_ELT(x, i, Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8)); | |||
There was a problem hiding this comment.
Did we want to check whether Rf_translateCharUTF8() actually modified anything? Or do we trust that SET_STRING_ELT is a no-op in that case? I would imagine that in most cases, we already have ascii/utf-8 strings, so this whole function should be basically free. That should be easily verified by microbenchmarking.
There was a problem hiding this comment.
Something like this from above?
SEXP new_s = Rf_translateCharUTF8(s);
if (new_s == s) {
SET_STRING_ELT(x, i, Rf_mkCharCE(new_s, CE_UTF8));
}
Yeah, that's probably good. It'll also make this slightly more inline with what was there before (checking that not utf8)
There was a problem hiding this comment.
I've tried this, but after getting it to work with a bit of type faffing, doing the translate first we get errors with our utf string tests again.
### Rationale for this change To resolve the reverse dependency issue with `parquetize` ### What changes are included in this PR? One step towards resolving the issue ### Are these changes tested? yes ### Are there any user-facing changes? no * GitHub Issue: #43349 Authored-by: Jonathan Keane <jkeane@gmail.com> Signed-off-by: Jonathan Keane <jkeane@gmail.com>
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 187197c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 28 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
To resolve the reverse dependency issue with
parquetizeWhat changes are included in this PR?
One step towards resolving the issue
Are these changes tested?
yes
Are there any user-facing changes?
no