Skip to content

type = function()#168

Closed
vincentarelbundock wants to merge 10 commits intograntmcdermott:mainfrom
vincentarelbundock:typefun
Closed

type = function()#168
vincentarelbundock wants to merge 10 commits intograntmcdermott:mainfrom
vincentarelbundock:typefun

Conversation

@vincentarelbundock
Copy link
Copy Markdown
Collaborator

@vincentarelbundock vincentarelbundock commented Jul 20, 2024

In this comment @grantmcdermott says he’d like to support user-supplied functions as a type. This is a WIP proof of concept, just to see what people would like in a “real” implementation.

The idea is simple: type accepts a user-supplied function which takes: x, y. The user-supplied function returns a list with x, y, and also a type value.

I feel like this would be pretty cool for users. But most important, it could serve as a very easy platform to add more geoms for fitting functions.

library(tinyplot)

loess <- function(x, y, ...) {
    dat <- data.frame(x, y)
    fit <- stats::loess(y ~ x, data = dat)
    nd <- data.frame(x = seq(min(x), max(x), length.out = 1000))
    y <- predict(fit, newdata = nd)
    out <- list(y = y, x = nd$x, type = "l")
    return(out)
}

# default
with(iris, plt(Sepal.Length ~ Petal.Length, type = loess))

# by
with(iris, plt(Sepal.Length ~ Petal.Length | Species, type = loess))

# facet
with(iris, plt(Sepal.Length ~ Petal.Length, type = loess, facet = Species))

# Define a new function to draw points instead of lines
loess <- function(x, y, ...) {
    dat <- data.frame(x, y)
    fit <- stats::loess(y ~ x, data = dat)
    nd <- data.frame(x = sort(x))
    y <- predict(fit, newdata = nd)
    out <- list(y = y, x = nd$x, type = "p")
    return(out)
}

with(iris, plt(Sepal.Length ~ Petal.Length, type = loess))

@vincentarelbundock vincentarelbundock marked this pull request as draft July 20, 2024 02:46
@vincentarelbundock

This comment was marked as outdated.

@grantmcdermott
Copy link
Copy Markdown
Owner

grantmcdermott commented Jul 20, 2024

Oh, I like the look of this. I think the current approach of internally converting to type = "l" (instead of type = "p") might prove a little too tricky in the end. But definitely keen to keep iterating.

@vincentarelbundock

This comment was marked as outdated.

@vincentarelbundock
Copy link
Copy Markdown
Collaborator Author

Added documentation and a few “type factories”:

library(tinyplot)

plt(am ~ mpg, type = type_glm(family = binomial), data = mtcars)

plt(wt ~ mpg | cyl, type = type_lm(), data = mtcars)

plt(hp ~ mpg, type = type_spline(), data = mtcars)

@vincentarelbundock vincentarelbundock marked this pull request as ready for review July 20, 2024 12:26
@vincentarelbundock vincentarelbundock changed the title [WIP] type = function() type = function() Jul 20, 2024
@vincentarelbundock
Copy link
Copy Markdown
Collaborator Author

@grantmcdermott I think this is ready for a "real" (initial) review.

Two major benefits of this approach to keep in mind:

  1. Code simplification: This allows us to extract a lot of code out of the tinyplot() function, which is starting to get a bit unwieldy.
  2. Arguments: Each new type can have its own set of arguments, much like ggplot2::geom_*(). And those are documented in a nice self-contained spot: ?type_spline. For instance, I might argue that jitter would be better as one of those type factories, since it would allow us to implement a bunch of different arguments like vertical and horizontal noise without polluting the main interface. plt(y ~ x, type = type_jitter(vnoise = .1, hnoise = .1))

Copy link
Copy Markdown
Owner

@grantmcdermott grantmcdermott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic @vincentarelbundock. I really like the direction and overall design. High level comments are:

  1. I think we need to move the type_fun logic upstream so that it can better handle the extremities of the confidence intervals. I have some pseudo code for this that I'll try to fomalize and share.
  2. I'd prefer to stick to pure base R for the pre-shipped function factories.

Comment thread R/tinyplot.R
yymax = idata[[ii]]$ymax


if (is.function(type_fun)) {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would require some additional thought, but can we move this code logic higher up in the function?

Reason: At present, you are only extracting ymin, ymax etc. after the plot window has already been drawn (where, in turn, the axes ranges are only based on the raw values of x and y). This is leading to some confidence bands being clipped, since they extend beyond the raw variable ranges. (See your lm and loess examples here.)

One way to overcome this issue, is to get these function type factories to accept an addition by argument and then use this argument (a) split the data sample upstream and (b) run the prediction functions on the relevant split data samples. That's essentially what we do now for the "histogram" and "density" types. I'll try to share a concrete example shortly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a very strong view on this, and understand that clipping is annoying. But the counter argument is: (1) ylim argument is easy to use, and (2) adding a by argument to the type_ functions will significantly increase complexity and barriers to entry for users. One cool thing about the current design is that it is trivial for people to write their own functions. Again, don't have a strong view; just something to consider as you decide.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the sort of thing I could envisage. In brief, we keep the existing type function factories the same, but internally we pass them to an intermediate function (here: called type_fun_by but we can obviously rename it) that correctly maps it across any grouped data. This would allow you to keep the user-facing simplicity of the main type_x functions, and recycle the additional job of correctly mapping across groups to this same intermediate function for all them.

Proof of concept using a slightly simplified version of your type_spline function factory (here, adapted so that it always produces output rather than constructs a function).

# helper function
ci = function(estimate, std.error, conf.level, df, backtransform = identity) {
  crit = stats::qt(1 - (1-conf.level)/2, df)
  out = list(
    estimate = backtransform(estimate),
    conf.low = backtransform(estimate - crit * std.error),
    conf.high = backtransform(estimate + crit * std.error)
  )
  return(out)
}

# simplified type_spline function
type_spline = function(x, y, conf.level = 0.95, k = -1, fx = FALSE, bs = "tp", ...) {
  se = if (isFALSE(conf.level)) FALSE else TRUE
  conf.level = if (isFALSE(conf.level)) 0.95 else conf.level
  dat = data.frame(x, y)
  fit = mgcv::gam(y ~ s(x, k = k, fx = fx, bs = bs), data = dat)
  nd = data.frame(x = seq(min(x), max(x), length.out = 100))
  p = stats::predict(fit, newdata = nd, se.fit = TRUE)
  p = ci(
    estimate = p$fit,
    std.error = p$se.fit,
    conf.level = conf.level, 
    df = fit$df.residual)
  out = list(x = nd$x,
             y = p$estimate,
             ymin = p$conf.low,
             ymax = p$conf.high)
  return(out)
}

# NEW: intermediate mapping function
type_fun_by = function(type_fun, x, y, by = NULL, new_data_length = 100, ...) {
  new_data = data.frame(x = seq(min(x), max(x), length.out = new_data_length))
  dots = list(...)
  if (is.null(by)) {
    arg_list = list(x = x, y = y, newdata = new_data)
    arg_list = modifyList(arg_list, dots, keep.null = TRUE)
    do.call(type_fun, args = arg_list)
  } else {
    uby = sort(unique(by))
    split_data = lapply(list(x = x, y = y), function(l) split(l, by))
    ret = lapply(
      seq_along(uby),
      function(i, ...) {
        x = split_data[["x"]][[i]]
        y = split_data[["y"]][[i]]
        minx = min(x, na.rm = TRUE)
        maxx = max(x, na.rm = TRUE)
        idx = new_data[["x"]] > minx & new_data[["x"]] < maxx
        new_data = new_data[idx, , drop = FALSE]
        new_data = rbind(data.frame(x = minx), new_data)
        arg_list = list(x = x, y = y, newdata = new_data)
        arg_list = modifyList(arg_list, dots, keep.null = TRUE)
        fun_ret = do.call(type_fun, args = arg_list)
        if (!is.function(fun_ret)) fun_ret[["by"]] = rep(uby[i], length(fun_ret[["x"]]))
        return(fun_ret)
      },
    )
    if (is.function(ret[[1]])) {
      return(ret[[1]])
    } else {
      return(do.call(function(...) Map("c", ...), ret))
    }
  }
}

## Test

# direct use
a1 = type_spline(x = airquality$Day, y = airquality$Temp)
# via intermediate function with no "by" mapping (should be the same as a1)
a2 = type_fun_by(type_fun = type_spline, x = airquality$Day, y = airquality$Temp)
# via intermediate function with by mapping (will map type function across different groups
# and add an extra by column)
a3 = type_fun_by(type_fun = type_spline, x = airquality$Day, y = airquality$Temp, by = airquality$Month)

lapply(a1, head)
#> $x
#> [1] 1.000000 1.303030 1.606061 1.909091 2.212121 2.515152
#> 
#> $y
#>        1        2        3        4        5        6 
#> 80.65279 80.66687 80.68089 80.69482 80.70861 80.72214 
#> 
#> $ymin
#>        1        2        3        4        5        6 
#> 75.90010 76.21427 76.50726 76.77525 77.01541 77.22605 
#> 
#> $ymax
#>        1        2        3        4        5        6 
#> 85.40549 85.11946 84.85452 84.61439 84.40180 84.21823
lapply(a2, head)
#> $x
#> [1] 1.000000 1.303030 1.606061 1.909091 2.212121 2.515152
#> 
#> $y
#>        1        2        3        4        5        6 
#> 80.65279 80.66687 80.68089 80.69482 80.70861 80.72214 
#> 
#> $ymin
#>        1        2        3        4        5        6 
#> 75.90010 76.21427 76.50726 76.77525 77.01541 77.22605 
#> 
#> $ymax
#>        1        2        3        4        5        6 
#> 85.40549 85.11946 84.85452 84.61439 84.40180 84.21823
lapply(a3, head)
#> $x
#> [1] 1.000000 1.303030 1.606061 1.909091 2.212121 2.515152
#> 
#> $y
#>        1        2        3        4        5        6 
#> 70.31894 69.77182 69.22524 68.67976 68.13604 67.59610 
#> 
#> $ymin
#>        1        2        3        4        5        6 
#> 62.04877 62.37023 62.58531 62.66354 62.58437 62.34635 
#> 
#> $ymax
#>        1        2        3        4        5        6 
#> 78.58911 77.17341 75.86518 74.69598 73.68771 72.84584 
#> 
#> $by
#> [1] 5 5 5 5 5 5

Created on 2024-07-21 with reprex v2.1.0

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot of sense. Let's put this on ice for now. I'm doing some refactoring experiments. Let's see if I manage to simplify things a little before deciding on an implementation of the type functions.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sgtm thanks

Comment thread R/type_factories.R Outdated
#' @examples
#' plt(Sepal.Length ~ Petal.Length, type = type_spline(), data = iris)
#' @export
type_spline = function(x, y, conf_level = 0.95, k = -1, fx = FALSE, bs = "tp", ...) {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above about passing a dedicated by argument to this and other function type factories.

Comment thread README.md
Comment thread R/type_factories.R Outdated
Comment thread R/type_factories.R Outdated
Comment thread DESCRIPTION Outdated
Comment thread R/type_factories.R Outdated
Comment thread R/type_factories.R Outdated
Comment thread R/type_factories.R Outdated
Comment thread R/type_factories.R Outdated
@vincentarelbundock vincentarelbundock marked this pull request as draft July 22, 2024 04:24
@vincentarelbundock
Copy link
Copy Markdown
Collaborator Author

FYI, I intend to close this and start over based on #198

If/when that dataframe split-apply-combine mechanism is merged, I have some good ideas about how to handle the by/facets gracefully, even in user-supplied type functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants