R dplyr's group_by: consider empty groups as well

Time: 2019-03-19 18:47:09

Tags: r group-by dplyr

Let's consider the following data frame:

set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

The contingency table looks as follows:

cont_tab <- table(data$col1, data$col2, dnn = c("col1", "col2"))

cont_tab

    col2
col1 A B C
   A 4 0 0
   B 1 3 0
   C 1 0 3

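As you can see, some pairs never occur: (A,B), (A,C), (B,C), (C,B). The final goal of my analysis is to list all pairs (9 in this case) and show a statistic for each of them. When using dplyr::group_by(), I ran into a limitation: dplyr::group_by() only considers existing pairs (pairs that occur at least once):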

data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1))

# A tibble: 5 x 3
# Groups:   col1 [?]
  col1  col2   stat
  <fct> <fct> <dbl>
1 A     A      58.1
2 B     A     -16.4
3 B     B      17.0
4 C     A     -12.9
5 C     C     -41.9

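The output I have in mind has 9 rows (4 of which have stat equal to 0). Is this possible in dplyr?

EDIT: Sorry for being too vague at first. The real problem is more complex than counting how many times a particular pair occurs. I have added new data to make the real problem more apparent.

5 answers:

Answer 0: (score: 4)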

It is much easier to get the same result as table by using count from dplyr and spread from tidyr:

library(dplyr)
library(tidyr)
count(data, col1, col2) %>%
   spread(col2, n, fill = 0)
# A tibble: 3 x 4
# Groups:   col1 [3]
#  col1      A     B     C
#  <fct> <dbl> <dbl> <dbl>
#1 A         4     0     0
#2 B         1     3     0
#3 C         1     0     3

Note: the group_by/summarise step is replaced here by count.

As @divibisan suggested, if the OP needs the long format, add gather at the end.

Update

With the updated data in the OP's post:

data %>%
   group_by(col1, col2) %>%
   summarize(stat = n()) %>%
   spread(col2, stat, fill = 0) %>%
   gather(col2, stat, A:C)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2   stat
#  <fct> <chr> <dbl>
#1 A     A         4
#2 B     A         1
#3 C     A         1
#4 A     B         0
#5 B     B         3
#6 C     B         0
#7 A     C         0
#8 B     C         0
#9 C     C         3

Answer 1: (score: 3)

This is also doable without dplyr:

as.data.frame(table(data$col1, data$col2, dnn = c("col1", "col2")))
#  col1 col2 Freq
#1    A    A    4
#2    B    A    1
#3    C    A    1
#4    A    B    0
#5    B    B    3
#6    C    B    0
#7    A    C    0
#8    B    C    0
#9    C    C    3
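
The table() call above only counts rows. For the statistic from the question's edit (sum(val2) - sum(val1)), a base R sketch along the same lines could use xtabs(), which sums its left-hand side within each cell and reports 0 for empty cells. The names res_base and stat are chosen here for illustration and do not come from the original answer.

```r
# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# xtabs() sums val2 - val1 within each (col1, col2) cell; cells without
# rows get 0. Note that sum(val2 - val1) equals sum(val2) - sum(val1).
res_base <- as.data.frame(xtabs(val2 - val1 ~ col1 + col2, data = data),
                          responseName = "stat")
res_base
```

The result has one row per combination of factor levels, so all 9 pairs are listed.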

Answer 2: (score: 2)

You can use tidyr::complete:

library(tidyverse)
data %>%
  group_by(col1, col2) %>%
  summarize(stat = n()) %>%
  # additions below
  ungroup %>%
  complete(col1, col2, fill = list(stat = 0))
# # A tibble: 9 x 3
#   col1  col2   stat
#   <chr> <chr> <dbl>
# 1 A     A         4
# 2 A     B         0
# 3 A     C         0
# 4 B     A         1
# 5 B     B         3
# 6 B     C         0
# 7 C     A         1
# 8 C     B         0
# 9 C     C         3

You can also use count for the first part, which gives the same output as the code above.
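
The count-based variant of the complete() approach is not preserved in this copy of the answer. A minimal sketch of what it might look like follows, together with the same idea applied to the statistic from the question's edit. The object names counts and stats are illustrative assumptions.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# count() replaces group_by() + summarize(n()); its result is ungrouped,
# so complete() can add the missing (col1, col2) pairs directly.
counts <- data %>%
  count(col1, col2) %>%
  complete(col1, col2, fill = list(n = 0))

# The same pattern for the statistic from the question:
# missing pairs are added and their stat is filled with 0.
stats <- data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1)) %>%
  ungroup() %>%
  complete(col1, col2, fill = list(stat = 0))
```

Because col1 and col2 are factors, complete() expands over all factor levels, which is what produces the 9 rows.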

Answer 3: (score: 1)

Another tidyverse possibility, using tidyr::complete():

data %>% 
 group_by_all() %>%
 add_count() %>%
 complete(col1, col2, fill = list(n = 0)) %>%
 distinct()

  col1  col2      n
  <fct> <fct> <dbl>
1 A     A         4
2 A     B         0
3 A     C         0
4 B     A         1
5 B     B         3
6 B     C         0
7 C     A         1
8 C     B         0
9 C     C         3

Or using tidyr::expand():

data %>% 
 count(col1, col2) %>%
 right_join(data %>%
            expand(col1, col2), by = c("col1" = "col1",
                                       "col2" = "col2")) %>%
 replace_na(list(n = 0))

Or using tidyr::crossing():

data %>%
 count(col1, col2) %>%
 right_join(crossing(col1 = unique(data$col1), 
                     col2 = unique(data$col2)), by = c("col1" = "col1",
                                                       "col2" = "col2")) %>%
 replace_na(list(n = 0))

Answer 4: (score: 0)

Here is a workaround; I hope it works for you. Merge the summary table with a table of all combinations, then replace the NAs with 0.

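data %>%
  group_by(col1, col2) %>%
  summarize(stat = n()) %>%
  # build the grid from the factor levels only; expand.grid(data) would also cross val1 and val2
  merge(expand.grid(col1 = levels(data$col1), col2 = levels(data$col2)),
        by = c("col1", "col2"), all = TRUE) %>%
  replace_na(list(stat = 0))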
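
Since the question is specifically about group_by() dropping empty groups, one further option may be worth sketching. Assuming dplyr 0.8.0 or later, group_by() accepts a .drop argument, and with .drop = FALSE it keeps groups for every combination of factor levels, including those with no rows. This is not taken from any of the answers above, and the name res is illustrative.

```r
library(dplyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# .drop = FALSE keeps empty factor-level combinations as zero-row groups
# (assumes dplyr >= 0.8.0). sum() over an empty vector is 0, so the
# statistic for those groups comes out as 0 without any extra filling.
res <- data %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarize(stat = sum(val2) - sum(val1)) %>%
  ungroup()
res
```

This only works when the grouping columns are factors, as they are here. For character columns, the complete() or expand() approaches are still needed.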