-
Notifications
You must be signed in to change notification settings - Fork 4.1k
GH-42143: [R] Sanitize R metadata #41969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
67ce55e
b46db7e
3f0773f
df7075f
7f0271c
860333b
f1c4cdd
07155c8
d69c316
e64b85f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -30,7 +30,7 @@ | |||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| out <- serialize(x, NULL, ascii = TRUE) | ||||||||||||
| out <- serialize(safe_r_metadata(x, on_save = TRUE), NULL, ascii = TRUE) | ||||||||||||
|
|
||||||||||||
| # if the metadata is over 100 kB, compress | ||||||||||||
| if (option_compress_metadata() && object.size(out) > 100000) { | ||||||||||||
|
|
@@ -44,23 +44,110 @@ | |||||||||||
| } | ||||||||||||
|
|
||||||||||||
| .deserialize_arrow_r_metadata <- function(x) { | ||||||||||||
| tryCatch( | ||||||||||||
| expr = { | ||||||||||||
| out <- unserialize(charToRaw(x)) | ||||||||||||
|
|
||||||||||||
| # if this is still raw, try decompressing | ||||||||||||
| if (is.raw(out)) { | ||||||||||||
| out <- unserialize(memDecompress(out, type = "gzip")) | ||||||||||||
| } | ||||||||||||
| out | ||||||||||||
| }, | ||||||||||||
| tryCatch(unserialize_r_metadata(x), | ||||||||||||
| error = function(e) { | ||||||||||||
| if (getOption("arrow.debug", FALSE)) { | ||||||||||||
| print(conditionMessage(e)) | ||||||||||||
| } | ||||||||||||
| warning("Invalid metadata$r", call. = FALSE) | ||||||||||||
| NULL | ||||||||||||
| } | ||||||||||||
| ) | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| unserialize_r_metadata <- function(x) { | ||||||||||||
| # Check that this is ASCII serialized data (as in, what we wrote) | ||||||||||||
| if (!identical(substr(unclass(x), 1, 1), "A")) { | ||||||||||||
| stop("Invalid serialized data") | ||||||||||||
| } | ||||||||||||
| out <- safe_unserialize(charToRaw(x)) | ||||||||||||
| # If it's still raw, decompress and unserialize again | ||||||||||||
| if (is.raw(out)) { | ||||||||||||
| decompressed <- memDecompress(out, type = "gzip") | ||||||||||||
| if (!identical(rawToChar(decompressed[1]), "A")) { | ||||||||||||
| stop("Invalid serialized compressed data") | ||||||||||||
| } | ||||||||||||
| out <- safe_unserialize(decompressed) | ||||||||||||
| } | ||||||||||||
| if (!is.list(out)) { | ||||||||||||
| stop("Invalid serialized data: must be a list") | ||||||||||||
| } | ||||||||||||
| safe_r_metadata(out) | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| safe_unserialize <- function(x) { | ||||||||||||
| # By capturing the data in a list, we can inspect it for promises without | ||||||||||||
| # triggering their evaluation. | ||||||||||||
|
Comment on lines
79
to
80
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure this is more accurate--I can |
||||||||||||
| out <- list(unserialize(x)) | ||||||||||||
| if (typeof(out[[1]]) == "promise") { | ||||||||||||
| stop("Serialized data contains a promise object") | ||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Up for other suggestions, but it would be good to make it clear that Serialized data containing a promise is problematic.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently it doesn't matter because this error gets swallowed in https://github.com/apache/arrow/pull/41969/files#diff-659e9fa6b66e5a72b4e3f9ac79ffddf08f92d9ea3d7aa45bd8c73b9a022fa2e5R52 and in the end the user sees an opaque "Invalid metadata$r" warning. This is a holdover from how we're currently doing the deserialization, any errors are just trapped if it fails to deserialize and we return |
||||||||||||
| } | ||||||||||||
| out[[1]] | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| safe_r_metadata <- function(metadata, on_save = FALSE) { | ||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, can do. The meaning is described in a comment lower in the code but I can clarify up top too.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done in 1057b78 |
||||||||||||
| # This function recurses through the metadata list and checks that all | ||||||||||||
| # elements are of types that are allowed in R metadata. | ||||||||||||
| # If it finds an element that is not allowed, it removes it. | ||||||||||||
| # | ||||||||||||
| # This function is used both when saving and loading metadata. | ||||||||||||
| # @param on_save: If TRUE, the function will not warn if it removes elements: | ||||||||||||
| # we're just cleaning up the metadata for saving. If FALSE, it means we're | ||||||||||||
| # loading the metadata, and we'll warn if we find invalid elements. | ||||||||||||
| # | ||||||||||||
| # When loading metadata, you can optionally keep the invalid elements by | ||||||||||||
| # setting `options(arrow.unsafe_metadata = TRUE)`. It will still check | ||||||||||||
| # for invalid elements and warn if any are found, though. | ||||||||||||
|
|
||||||||||||
| # This variable will be used to store the types of elements that were removed, | ||||||||||||
| # if any, so we can give an informative warning if needed. | ||||||||||||
| types_removed <- c() | ||||||||||||
|
|
||||||||||||
| # Internal function that we'll recursively apply, | ||||||||||||
| # and mutate the `types_removed` variable outside of it. | ||||||||||||
| check_r_metadata_types_recursive <- function(x) { | ||||||||||||
| allowed_types <- c("character", "double", "integer", "logical", "complex", "list", "NULL") | ||||||||||||
| if (is.list(x)) { | ||||||||||||
| types <- map_chr(x, typeof) | ||||||||||||
| x[types == "list"] <- map(x[types == "list"], check_r_metadata_types_recursive) | ||||||||||||
| ok <- types %in% allowed_types | ||||||||||||
| if (!all(ok)) { | ||||||||||||
| # Record the invalid types, then remove the offending elements | ||||||||||||
| types_removed <<- c(types_removed, setdiff(types, allowed_types)) | ||||||||||||
| x <- x[ok] | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
| x | ||||||||||||
| } | ||||||||||||
| new <- check_r_metadata_types_recursive(metadata) | ||||||||||||
|
|
||||||||||||
| # On save: don't warn, just save the filtered metadata | ||||||||||||
| if (on_save) { | ||||||||||||
| return(new) | ||||||||||||
| } | ||||||||||||
| # On load: warn if any elements were removed | ||||||||||||
| if (length(types_removed)) { | ||||||||||||
| types_msg <- paste("Type:", oxford_paste(unique(types_removed))) | ||||||||||||
| if (getOption("arrow.unsafe_metadata", FALSE)) { | ||||||||||||
| # We've opted-in to unsafe metadata, so warn but return the original metadata | ||||||||||||
| rlang::warn( | ||||||||||||
| "R metadata may have unsafe or invalid elements", | ||||||||||||
| body = c("i" = types_msg) | ||||||||||||
| ) | ||||||||||||
| new <- metadata | ||||||||||||
| } else { | ||||||||||||
| rlang::warn( | ||||||||||||
| "Potentially unsafe or invalid elements have been discarded from R metadata.", | ||||||||||||
| body = c( | ||||||||||||
| "i" = types_msg, | ||||||||||||
| ">" = "If you trust the source, you can set `options(arrow.unsafe_metadata = TRUE)` to preserve them." | ||||||||||||
| ) | ||||||||||||
| ) | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
| new | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| #' @importFrom rlang trace_back | ||||||||||||
| apply_arrow_r_metadata <- function(x, r_metadata) { | ||||||||||||
| if (is.null(r_metadata)) { | ||||||||||||
|
|
||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not strictly necessary, but would asserting that this is
ARROWbe a bit more obvious?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually about how
base::serialize()works, signifying that it is ASCII:https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AAAAAH Maybe a comment
X for binary serialization and A for ASCII serializationthere?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the comment on the line above not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe it is. Though I did read it when reviewing and thought we were testing that the string started with
ARROWso it wasn't when I was reading it last night. Not a huge deal either way, I think if someone needs to know this, they would poke at it more