Question

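Last time I asked how to calculate the average score per measurement occasion (week) for a variable (procras) that was measured repeatedly for multiple respondents. My (simplified) dataset in long format looks like the following example (here two students and 5 time points, no grouping variable):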

studentID  week   procras
   1        0     1.4
   1        6     1.2
   1        16    1.6
   1        28    NA
   1        40    3.8
   2        0     1.4
   2        6     1.8
   2        16    2.0
   2        28    2.5
   2        40    2.8

Using dplyr, I get the average score per measurement occasion:

mean_data <- group_by(DataRlong, week) %>% summarise(procras = mean(procras, na.rm = TRUE))

which looks like this:

Source: local data frame [5 x 2]
        occ  procras
      (dbl)    (dbl)
    1     0 1.993141
    2     6 2.124020
    3    16 2.251548
    4    28 2.469658
    5    40 2.617903

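Using ggplot2 I can now plot the average change over time, and by simply adjusting the group_by() call in dplyr I can also get the means per subgroup (for example, the average score per occasion for men and for women). Now I would like to add a column to the mean_data table that contains the length of the 95% CI around the average score for each occasion.

http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/ explains how to obtain and plot CIs, but that approach seems to become problematic as soon as I want to do this for arbitrary subgroups, right? So is there a way to have dplyr include the CI (based on group size, etc.) in mean_data automatically? After that, plotting the new values as CIs in the graphs should hopefully be fairly easy. Thank you.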

Answer 1

You can do it manually using mutate and a few extra functions in summarise:

library(dplyr)
mtcars %>%
  group_by(vs) %>%
  summarise(mean.mpg = mean(mpg, na.rm = TRUE),
            sd.mpg = sd(mpg, na.rm = TRUE),
            n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
         upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)

#> Source: local data frame [2 x 7]
#> 
#>      vs mean.mpg   sd.mpg n.mpg    se.mpg lower.ci.mpg upper.ci.mpg
#>   (dbl)    (dbl)    (dbl) (int)     (dbl)        (dbl)        (dbl)
#> 1     0 16.61667 3.860699    18 0.9099756     14.69679     18.53655
#> 2     1 24.55714 5.378978    14 1.4375924     21.45141     27.66287
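As a sketch of how the same pattern might be applied to the data in the question: the example rows contain a missing value, and n() counts rows including NA, so the version below counts non-missing values instead. The column names (mean_procras, n_obs, and so on) are illustrative choices, and the data frame is rebuilt from the ten example rows only, so the numbers will not match the asker's full dataset.

```r
library(dplyr)

# The ten example rows from the question (week 28 is missing for student 1)
DataRlong <- data.frame(
  studentID = rep(1:2, each = 5),
  week      = rep(c(0, 6, 16, 28, 40), times = 2),
  procras   = c(1.4, 1.2, 1.6, NA, 3.8, 1.4, 1.8, 2.0, 2.5, 2.8)
)

mean_data <- DataRlong %>%
  group_by(week) %>%
  summarise(mean_procras = mean(procras, na.rm = TRUE),
            sd_procras   = sd(procras, na.rm = TRUE),
            n_obs        = sum(!is.na(procras))) %>%  # non-missing values only
  mutate(se       = sd_procras / sqrt(n_obs),
         # a t-based interval is undefined for a single observation
         t_crit   = ifelse(n_obs > 1, qt(0.975, df = pmax(n_obs - 1, 1)), NA_real_),
         lower_ci = mean_procras - t_crit * se,
         upper_ci = mean_procras + t_crit * se)

mean_data
```

Groups with only one non-missing value get NA limits rather than a misleading interval. Adding further variables to group_by() (for example a gender column, if the data has one) would give the limits per subgroup without other changes.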

Answer 2

I use the ci command from the gmodels package:

library(gmodels)
your_db %>%
  group_by(grouping_variable1, grouping_variable2, ...) %>%
  # ci() returns: estimate, lower CI limit, upper CI limit, standard error
  summarise(mean  = ci(variable_of_interest)[1],
            lowCI = ci(variable_of_interest)[2],
            hiCI  = ci(variable_of_interest)[3],
            se    = ci(variable_of_interest)[4])
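For readers who want to try the per-subgroup idea without installing gmodels, here is a minimal sketch that swaps in base R's t.test() for gmodels::ci(); the t-based interval it reports is the same kind of interval computed manually in Answer 1. The result names (ci_by_vs, ci_by_vs_am) are made up for this illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# One grouping variable
ci_by_vs <- mtcars %>%
  group_by(vs) %>%
  summarise(mean  = mean(mpg),
            lowCI = t.test(mpg)$conf.int[1],   # lower 95% limit
            hiCI  = t.test(mpg)$conf.int[2],   # upper 95% limit
            .groups = "drop")

# Two grouping variables: the summarise() call itself does not change
ci_by_vs_am <- mtcars %>%
  group_by(vs, am) %>%
  summarise(mean  = mean(mpg),
            lowCI = t.test(mpg)$conf.int[1],
            hiCI  = t.test(mpg)$conf.int[2],
            .groups = "drop")

ci_by_vs
ci_by_vs_am
```

Calling t.test() twice per group is wasteful but keeps the sketch close to the shape of the gmodels example above.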

Answer 3

If you want to use the versatility of the boot package, I found this blog post useful (the code below is inspired by it):

library(dplyr)
library(tidyr)
library(purrr)
library(boot)

set.seed(321)
mtcars %>%
  group_by(vs) %>%
  nest() %>% 
  mutate(boot_res = map(data,
                        ~ boot(data = .$mpg,
                               statistic = function(x, i) mean(x[i]),
                               R = 1000)),
         boot_res_ci = map(boot_res, boot.ci, type = "perc"),
         mean = map(boot_res_ci, ~ .$t0),
         lower_ci = map(boot_res_ci, ~ .$percent[[4]]),
         upper_ci = map(boot_res_ci, ~ .$percent[[5]]),
         n =  map(data, nrow)) %>% 
  select(-data, -boot_res, -boot_res_ci) %>% 
  unnest(cols = c(n, mean, lower_ci, upper_ci)) %>% 
  ungroup()
#> # A tibble: 2 x 5
#>      vs  mean lower_ci upper_ci     n
#>   <dbl> <dbl>    <dbl>    <dbl> <int>
#> 1     0  16.6     15.0     18.3    18
#> 2     1  24.6     22.1     27.3    14

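Created on 2020-01-22 by the reprex package (v0.3.0)

Some explanation of the code:

When nesting with nest(), a list-column (called data by default) is created, containing 2 data frames, which are the 2 subsets of the whole mtcars grouped by vs (which has 2 unique values, 0 and 1). Then, using mutate() and map(), we create the list-column boot_res by applying the function boot() from the boot package to the list-column data. Next, the boot_res_ci list-column is created by applying the boot.ci() function to the boot_res list-column, and so on. With select(), we drop the list-columns that are no longer needed, and then unnest and ungroup the final result.

Unfortunately the code is not easy to navigate, but it serves the purpose of providing another example.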
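To make the intermediate objects less opaque, the following sketch inspects two of them in isolation: the nested data frame, and the boot.ci() result for a single group. The object names (nested, group_sizes, mpg_vs0, b, ci) are chosen for this illustration only.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(boot)

# 1) What nest() produces: one row per vs value, with a list-column `data`
nested <- mtcars %>%
  group_by(vs) %>%
  nest()
group_sizes <- map_int(nested$data, nrow)  # rows in each nested data frame

# 2) What boot.ci() returns for a single group
set.seed(321)
mpg_vs0 <- mtcars$mpg[mtcars$vs == 0]
b  <- boot(data = mpg_vs0, statistic = function(x, i) mean(x[i]), R = 1000)
ci <- boot.ci(b, type = "perc")

# ci$percent is a 1 x 5 matrix: the confidence level, two order-statistic
# positions, and then the lower and upper limits (elements 4 and 5)
dim(ci$percent)
lower <- ci$percent[[4]]
upper <- ci$percent[[5]]
```

This is why the pipeline above extracts .$percent[[4]] and .$percent[[5]] as the lower and upper limits, and .$t0 as the observed mean.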

Using `broom::tidy()`

I just realized that the broom package implements a method for handling boot() output, as shown here. This makes the code less verbose and the output more complete, including the bias and the standard error of the statistic (the mean):

library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(boot)

set.seed(321)
mtcars %>%
  group_by(vs) %>%
  nest() %>% 
  mutate(boot_res = map(data,
                        ~ boot(data = .$mpg,
                               statistic = function(x, i) mean(x[i]),
                               R = 1000)),
         boot_tidy = map(boot_res, tidy, conf.int = TRUE, conf.method = "perc"),
         n = map(data, nrow)) %>% 
  select(-data, -boot_res) %>% 
  unnest(cols = -vs) %>% 
  ungroup()
#> # A tibble: 2 x 7
#>      vs statistic    bias std.error conf.low conf.high     n
#>   <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <int>
#> 1     0      16.6 -0.0115     0.843     15.0      18.3    18
#> 2     1      24.6 -0.0382     1.36      22.1      27.3    14

`data.table` concise syntax

Note, however, that I have since moved to a more concise syntax by using the data.table package instead of dplyr. [The data.table code block is missing from this copy.]

Created on 2020-01-23 by the reprex package (v0.3.0)
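As an illustration of what a data.table version of the bootstrap summary might look like, here is a minimal sketch. It is not the answerer's original code: the helper boot_mean_ci() and the object names dt and res are hypothetical, introduced only for this example.

```r
library(data.table)
library(boot)

# Hypothetical helper: percentile bootstrap CI of the mean for one vector,
# returned as a list so that data.table turns each element into a column
boot_mean_ci <- function(x, R = 1000) {
  b  <- boot(data = x, statistic = function(x, i) mean(x[i]), R = R)
  ci <- boot.ci(b, type = "perc")
  list(mean     = b$t0,
       lower_ci = ci$percent[[4]],
       upper_ci = ci$percent[[5]],
       n        = length(x))
}

set.seed(321)
dt  <- as.data.table(mtcars)
res <- dt[, boot_mean_ci(mpg), by = vs]  # one row per value of vs
res
```

The grouping and the summary both live in a single dt[i, j, by] call, which is where the conciseness comes from; no nesting or unnesting step is needed.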

Using multiple variables with data.table

[The code block for this section is missing from this copy.]
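One way to bootstrap several variables at once with data.table is to reshape to long format and group by both the original grouping column and the variable name. The sketch below assumes that approach; it is not the answerer's original code, and boot_mean_ci(), vars, long, and res are hypothetical names introduced here.

```r
library(data.table)
library(boot)

# Hypothetical helper: percentile bootstrap CI of the mean for one vector
boot_mean_ci <- function(x, R = 1000) {
  b  <- boot(data = x, statistic = function(x, i) mean(x[i]), R = R)
  ci <- boot.ci(b, type = "perc")
  list(mean     = b$t0,
       lower_ci = ci$percent[[4]],
       upper_ci = ci$percent[[5]],
       n        = length(x))
}

set.seed(321)
dt   <- as.data.table(mtcars)
vars <- c("mpg", "hp", "wt")  # variables to summarise (an arbitrary choice)

# Long format: one row per car and variable
long <- melt(dt, id.vars = "vs", measure.vars = vars,
             variable.name = "variable", value.name = "value")

# One bootstrap per combination of vs and variable
res <- long[, boot_mean_ci(value), by = .(vs, variable)]
res
```

Adding or removing entries in vars changes the number of output rows without touching the rest of the code.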

Answer 4

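Update tidyr 1.0.0

All the solutions given by @Valentin are viable, but I would like to point out a newer alternative that may be more readable for some of you. It replaces all the summarise solutions with a relatively new tidyr 1.0.0 function called unnest_wider. With it, you can simplify the code to the following:

library(dplyr)
library(tidyr)
library(purrr)
library(DescTools)  # provides MeanCI()
mtcars %>% 
  nest(data = -"vs") %>%
  mutate(ci = map(data, ~ MeanCI(.x$mpg, method = "boot", R = 1000))) %>% 
  unnest_wider(ci)

which gives: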

# A tibble: 2 x 5
     vs data                mean lwr.ci upr.ci
  <dbl> <list>             <dbl>  <dbl>  <dbl>
1     0 <tibble [18 × 10]>  16.6   14.7   18.5
2     1 <tibble [14 × 10]>  24.6   22.0   27.1

Computing the confidence intervals without bootstrapping is even simpler:

mtcars %>% 
  nest(data = -"vs") %>%
  mutate(ci = map(data, ~ MeanCI(.x$mpg))) %>% 
  unnest_wider(ci)
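To show what unnest_wider() is doing here without depending on DescTools, the sketch below uses a small hand-written helper that returns a named numeric vector, which is the shape unnest_wider() spreads into columns. The helper mean_ci() is hypothetical and computes the classic t-based interval; its element names simply mirror the column names shown in the output above.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper: classic t-based CI of the mean as a named vector
mean_ci <- function(x, conf.level = 0.95) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  se <- sd(x) / sqrt(n)
  tc <- qt(1 - (1 - conf.level) / 2, df = n - 1)
  c(mean = mean(x), lwr.ci = mean(x) - tc * se, upr.ci = mean(x) + tc * se)
}

res <- mtcars %>%
  nest(data = -vs) %>%
  mutate(ci = map(data, ~ mean_ci(.x$mpg))) %>%  # list-column of named vectors
  unnest_wider(ci) %>%                           # one column per vector name
  select(-data)
res
```

Each name in the returned vector becomes its own column, so any function that returns a named vector of fixed length can be plugged in the same way.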
