type = function()#168
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
|
Oh, I like the look of this. I think the current approach of internally converting to |
This comment was marked as outdated.
This comment was marked as outdated.
|
@grantmcdermott I think this is ready for a "real" (initial) review. Two major benefits of this approach to keep in mind:
|
grantmcdermott
left a comment
There was a problem hiding this comment.
This is fantastic @vincentarelbundock. I really like the direction and overall design. High level comments are:
- I think we need to move the
type_funlogic upstream so that it can better handle the extremities of the confidence intervals. I have some pseudo code for this that I'll try to fomalize and share. - I'd prefer to stick to pure base R for the pre-shipped function factories.
| yymax = idata[[ii]]$ymax | ||
|
|
||
|
|
||
| if (is.function(type_fun)) { |
There was a problem hiding this comment.
This would require some additional thought, but can we move this code logic higher up in the function?
Reason: At present, you are only extracting ymin, ymax etc. after the plot window has already been drawn (where, in turn, the axes ranges are only based on the raw values of x and y). This is leading to some confidence bands being clipped, since they extend beyond the raw variable ranges. (See your lm and loess examples here.)
One way to overcome this issue, is to get these function type factories to accept an addition by argument and then use this argument (a) split the data sample upstream and (b) run the prediction functions on the relevant split data samples. That's essentially what we do now for the "histogram" and "density" types. I'll try to share a concrete example shortly.
There was a problem hiding this comment.
I don't have a very strong view on this, and understand that clipping is annoying. But the counter argument is: (1) ylim argument is easy to use, and (2) adding a by argument to the type_ functions will significantly increase complexity and barriers to entry for users. One cool thing about the current design is that it is trivial for people to write their own functions. Again, don't have a strong view; just something to consider as you decide.
There was a problem hiding this comment.
Here's the sort of thing I could envisage. In brief, we keep the existing type function factories the same, but internally we pass them to an intermediate function (here: called type_fun_by but we can obviously rename it) that correctly maps it across any grouped data. This would allow you to keep the user-facing simplicity of the main type_x functions, and recycle the additional job of correctly mapping across groups to this same intermediate function for all them.
Proof of concept using a slightly simplified version of your type_spline function factory (here, adapted so that it always produces output rather than constructs a function).
# helper function
ci = function(estimate, std.error, conf.level, df, backtransform = identity) {
crit = stats::qt(1 - (1-conf.level)/2, df)
out = list(
estimate = backtransform(estimate),
conf.low = backtransform(estimate - crit * std.error),
conf.high = backtransform(estimate + crit * std.error)
)
return(out)
}
# simplified type_spline function
type_spline = function(x, y, conf.level = 0.95, k = -1, fx = FALSE, bs = "tp", ...) {
se = if (isFALSE(conf.level)) FALSE else TRUE
conf.level = if (isFALSE(conf.level)) 0.95 else conf.level
dat = data.frame(x, y)
fit = mgcv::gam(y ~ s(x, k = k, fx = fx, bs = bs), data = dat)
nd = data.frame(x = seq(min(x), max(x), length.out = 100))
p = stats::predict(fit, newdata = nd, se.fit = TRUE)
p = ci(
estimate = p$fit,
std.error = p$se.fit,
conf.level = conf.level,
df = fit$df.residual)
out = list(x = nd$x,
y = p$estimate,
ymin = p$conf.low,
ymax = p$conf.high)
return(out)
}
# NEW: intermediate mapping function
type_fun_by = function(type_fun, x, y, by = NULL, new_data_length = 100, ...) {
new_data = data.frame(x = seq(min(x), max(x), length.out = new_data_length))
dots = list(...)
if (is.null(by)) {
arg_list = list(x = x, y = y, newdata = new_data)
arg_list = modifyList(arg_list, dots, keep.null = TRUE)
do.call(type_fun, args = arg_list)
} else {
uby = sort(unique(by))
split_data = lapply(list(x = x, y = y), function(l) split(l, by))
ret = lapply(
seq_along(uby),
function(i, ...) {
x = split_data[["x"]][[i]]
y = split_data[["y"]][[i]]
minx = min(x, na.rm = TRUE)
maxx = max(x, na.rm = TRUE)
idx = new_data[["x"]] > minx & new_data[["x"]] < maxx
new_data = new_data[idx, , drop = FALSE]
new_data = rbind(data.frame(x = minx), new_data)
arg_list = list(x = x, y = y, newdata = new_data)
arg_list = modifyList(arg_list, dots, keep.null = TRUE)
fun_ret = do.call(type_fun, args = arg_list)
if (!is.function(fun_ret)) fun_ret[["by"]] = rep(uby[i], length(fun_ret[["x"]]))
return(fun_ret)
},
)
if (is.function(ret[[1]])) {
return(ret[[1]])
} else {
return(do.call(function(...) Map("c", ...), ret))
}
}
}
## Test
# direct use
a1 = type_spline(x = airquality$Day, y = airquality$Temp)
# via intermediate function with no "by" mapping (should be the same as a1)
a2 = type_fun_by(type_fun = type_spline, x = airquality$Day, y = airquality$Temp)
# via intermediate function with by mapping (will map type function across different groups
# and add an extra by column)
a3 = type_fun_by(type_fun = type_spline, x = airquality$Day, y = airquality$Temp, by = airquality$Month)
lapply(a1, head)
#> $x
#> [1] 1.000000 1.303030 1.606061 1.909091 2.212121 2.515152
#>
#> $y
#> 1 2 3 4 5 6
#> 80.65279 80.66687 80.68089 80.69482 80.70861 80.72214
#>
#> $ymin
#> 1 2 3 4 5 6
#> 75.90010 76.21427 76.50726 76.77525 77.01541 77.22605
#>
#> $ymax
#> 1 2 3 4 5 6
#> 85.40549 85.11946 84.85452 84.61439 84.40180 84.21823lapply(a2, head)
#> $x
#> [1] 1.000000 1.303030 1.606061 1.909091 2.212121 2.515152
#>
#> $y
#> 1 2 3 4 5 6
#> 80.65279 80.66687 80.68089 80.69482 80.70861 80.72214
#>
#> $ymin
#> 1 2 3 4 5 6
#> 75.90010 76.21427 76.50726 76.77525 77.01541 77.22605
#>
#> $ymax
#> 1 2 3 4 5 6
#> 85.40549 85.11946 84.85452 84.61439 84.40180 84.21823lapply(a3, head)
#> $x
#> [1] 1.000000 1.303030 1.606061 1.909091 2.212121 2.515152
#>
#> $y
#> 1 2 3 4 5 6
#> 70.31894 69.77182 69.22524 68.67976 68.13604 67.59610
#>
#> $ymin
#> 1 2 3 4 5 6
#> 62.04877 62.37023 62.58531 62.66354 62.58437 62.34635
#>
#> $ymax
#> 1 2 3 4 5 6
#> 78.58911 77.17341 75.86518 74.69598 73.68771 72.84584
#>
#> $by
#> [1] 5 5 5 5 5 5Created on 2024-07-21 with reprex v2.1.0
There was a problem hiding this comment.
That makes a lot of sense. Let's put this on ice for now. I'm doing some refactoring experiments. Let's see if I manage to simplify things a little before deciding on an implementation of the type functions.
| #' @examples | ||
| #' plt(Sepal.Length ~ Petal.Length, type = type_spline(), data = iris) | ||
| #' @export | ||
| type_spline = function(x, y, conf_level = 0.95, k = -1, fx = FALSE, bs = "tp", ...) { |
There was a problem hiding this comment.
See my comment above about passing a dedicated by argument to this and other function type factories.
|
FYI, I intend to close this and start over based on #198 If/when that dataframe split-apply-combine mechanism is merged, I have some good ideas about how to handle the by/facets gracefully, even in user-supplied type functions. |



In this comment @grantmcdermott says he’d like to support user-supplied functions as a
type. This is a WIP proof of concept, just to see what people would like in a “real” implementation.The idea is simple:
typeaccepts a user-supplied function which takes:x,y. The user-supplied function returns a list withx,y, and also atypevalue.I feel like this would be pretty cool for users. But most important, it could serve as a very easy platform to add more geoms for fitting functions.