Expose df in pool()#562
Conversation
|
In accord to our previous conservation #560 with @danielinteractive and @luwidmer |
|
Hi @luwidmer , what are your thoughts on this one? 😃 |
|
Thank you for flagging this again @danielinteractive, I was out of office. Will take a look over the next days @munoztd0 |
|
It would be good to see a reference for the statement "median fallback is the standard pragmatic choice". Such a decision would need to be thoroughly documented and also highlighted in the methods vignette. |
luwidmer
left a comment
There was a problem hiding this comment.
- I agree with @tobiasmuetze here RE the statement "median fallback is the standard pragmatic choice". I would also like to see references for this added to the documentation.
In addition:
- The old code threw a clear error when
dfsvaried. Now it silently proceeds withmedian(dfs). This can silently introduce unexpected behavior in case a user relied on this error. From a software engineering standpoint I don't think this is desirable. - This PR introduces test failures, which would need to be addressed.
- If the new median
dfis indeed desirable, the behavior there should have tests as well. pool_internal.jackknife(),pool_internal.bootstrap(), andpool_internal.bmlmi()all return the parametric_ci() list without$df. Onlypool_internal.rubin()now appends it. This inconsistency means downstream code cannot reliably access$dfwithout first checking which method was used. Ifdfshould be exposed, one should consider to do this as consistently as possible (and/oras_data_frame_internal()should be updated to include it).as.data.frame.pool()won't surface the newdf. Theas_data_frame_internal()function extractsest,se,ci,pvaluebut notdf. If the goal is to exposedfto downstream callers, it should appear in the data frame representation too, which is the primary user-facing output.
Thanks @tobiasmuetze - good point, I don't think there is any literature reference for this statement, because it was just my personal judgement call when I ran into this problem 😄 If we think that it is not a good idea to add this I can double check again first if we still need this feature for our outputs. (I remember one case where we later removed one feature use...) |
|
@luwidmer and @tobiasmuetze We discussed with @danielinteractive the pros and cons and decided to drop the multiple df values cases from this PR (so I reverted this part) Then I made sure that to expose df across all methods for consistency and updated the tests accordingly. One remaining issue: the print.md snapshot shows df = because sysdata.rda predates this PR and the stored pool objects lack the $df field. Re-running data-raw/create_print_test_data.R fixes the df column but also chnages point estimates due to the RNG (I beleieve?) see below:
Unrelated change that would pollute the diff.. Should we accept the snapshot for now and regenerate sysdata.rda in a separate dedicated PR, or is there a preferred way to handle this? |

pool() now exposes df (Barnard-Rubin pooled degrees of freedom) across all pooling methods and in as.data.frame.pool(). For methods without a d.f. concept (jackknife, bootstrap, bmlmi), df = NA_real_.
The constant-d.f. assertion is remains unchanged