Adding aggregated counts as additional data frame rows

Time: 2015-07-04 18:19:06

Tags: r dataframe aggregate rbind split-apply-combine

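I have a data frame containing the letters of the English alphabet and their frequencies. It would also be nice to know the frequencies of the vowels and the consonants, and the total number of occurrences. Since I want to plot all of this information, I need it in a single data frame.

So I often find myself in a situation like this: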

df <- data.frame(letter = letters, freq = sample(1:100, length(letters)))

df_vowels <- data.frame(letter = "vowels", freq = sum(df[df$letter %in% c("a", "e", "i", "o", "u"), ]$freq))
df_consonants <- data.frame(letter = "consonants", freq = sum(df[!df$letter %in% c("a", "e", "i", "o", "u"), ]$freq))
df_totals <- data.frame(letter = "totals", freq = sum(df$freq))

df <- rbind(df, df_vowels, df_consonants, df_totals)

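Am I doing this the right way, or is there a more elegant solution?

It looks like my description was quite confusing:

Basically, I want to add new categories (rows) to a data frame. In this very simple example, the new rows are just summaries of the data.

(For time series plots, I am using the aggregate function.)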


2 answers:

Answer 0: (score: 2)

Edit: for the third version of your question, here is a fairly elegant answer:

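library(dplyr)
library(ggplot2)

df <- data.frame(letter = letters, freq = sample(1:100, length(letters)),
                 stringsAsFactors = FALSE)

df <- df %>% group_by(letter) %>% summarize(freq = sum(freq))

df.tots <- df %>% group_by(is_vowel = letter %in% c('a','e','i','o','u')) %>%
                  summarize(freq = sum(freq))

# Now we just rbind your three summary rows onto the df, then pipe it into your ggplot.
# Each row is built as a data.frame so that freq stays numeric; rbind-ing a
# character vector such as c('vowels', 312) would coerce the whole column to character.
df %>%
  rbind(data.frame(letter = 'vowels',     freq = df.tots$freq[df.tots$is_vowel]))  %>%
  rbind(data.frame(letter = 'consonants', freq = df.tots$freq[!df.tots$is_vowel])) %>%
  rbind(data.frame(letter = 'total',      freq = sum(df.tots$freq)))               %>%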
  ggplot( ... your_ggplot_command_goes_here ...)

  #qplot(data=..., x=letter, y=freq, stat='identity', geom='histogram')
  # To keep your x-axis in order, i.e. our summary rows at bottom,
  # you have to explicitly set order of factor levels:
  # df$letter = factor(df$letter, levels=df$letter)

Voila!

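Notes:

  1. We need data.frame(..., stringsAsFactors=FALSE) so that we can later append the rows 'vowels', 'consonants' and 'total', because those values would not be among the factor levels of 'letter'. (Since R 4.0.0, stringsAsFactors=FALSE is the default.)
  2. Note that dplyr's group_by(is_vowel = ...) lets us insert a new column (as mutate would) and split on that expression (group_by) at the same time, all in one compact line. Neat. I never knew you could do that.
  3. You should be able to get bind_rows to work in the end; I couldn't.
  4. Edit: second version. You said you wanted to aggregate, so we assume each letter has more than one record in df. You seem to just be splitting your df into vowels and consonants and then combining them again, so I don't see that any new columns are necessary other than is_vowel. One way is to use dplyr: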

    require(dplyr)
    #  I don't see why you don't just overwrite df here with df2, the df of totals...
    df2 <- df %>% group_by(letter) %>% summarize(freq = sum(freq))
    #    letter     freq
    # 1       a      150
    # 2       b       33
    # 3       c       54
    # 4       d      258
    # 5       e      285
    # 6       f      300
    # 7       g      198
    # 8       h       27
    # 9       i       36
    # 10      j      189
    # ..    ...      ...
    
    # Now add a logical column, so we can split on it when aggregating
    # df or df2 ....
    df$is_vowel <- df$letter %in% c('a','e','i','o','u')
    
    # Then your total vowels are:
    df %>% filter(is_vowel) %>% summarize(freq = sum(freq))
    #      freq
    #       312
    # ... and total consonants ...
    df %>% filter(!is_vowel) %>% summarize(freq = sum(freq))
    #      freq
    #      1011
    

    Here is another way, if you want to avoid dplyr:

    split(df, df$letter %in% c("a", "e", "i", "o", "u") )
    

    By the way, you can form the list (or set) of consonants more easily by subtracting the vowels from all the letters:

    setdiff(letters, c("a", "e", "i", "o", "u"))
    # "b" "c" "d" "f" "g" "h" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t" "v" "w" "x" "y" "z"
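
Regarding note 3 on bind_rows: the following is a minimal sketch of how it might be made to work, not the answerer's own code. It assumes dplyr is installed; the names vowels, groups, total and df_all, as well as the fixed seed, are introduced here purely for illustration. The idea is to build every summary row as a data frame with the same column names, since bind_rows matches columns by name and would otherwise refuse to mix types.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
vowels <- c("a", "e", "i", "o", "u")
df <- data.frame(letter = letters, freq = sample(1:100, length(letters)),
                 stringsAsFactors = FALSE)

# One summary row per group, labelled in the same column as the letters
groups <- df %>%
  group_by(letter = ifelse(letter %in% vowels, "vowels", "consonants")) %>%
  summarize(freq = sum(freq))

# The grand total as a one-row data frame with matching column names
total <- data.frame(letter = "total", freq = sum(df$freq))

# bind_rows matches columns by name, so freq stays an integer column
df_all <- bind_rows(df, groups, total)
tail(df_all, 3)
```

With this approach the summary rows come out in the order consonants, vowels, total, because summarize sorts the groups alphabetically.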
    

Answer 1: (score: 2)

You could try

 v2 <- with(df, tapply(freq, c('consonants', 'vowels')[letter %in% 
              v1+1L], FUN=sum))

 df1 <- rbind(df, data.frame(letter=c(names(v2),"Total"), 
            freq=c(v2, sum(v2)), stringsAsFactors=FALSE))
 library(ggplot2)
 ggplot(df1, aes(x=letter, y=freq)) +
                  geom_bar(stat='identity')

Data

set.seed(24)
df <- data.frame(letter= sample(letters,200, replace=TRUE),
 freq = sample(1:100, 200, replace=TRUE), stringsAsFactors=FALSE)
v1 <- c("a", "e", "i", "o", "u")
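
The grouping expression inside the tapply call is compact, so here is a small sketch that unpacks it. The toy vectors letter and freq and the names idx and labels are hypothetical and exist only for this illustration; they are not the data above.

```r
v1 <- c("a", "e", "i", "o", "u")
letter <- c("a", "b", "e", "z")   # hypothetical toy input

# %in% binds tighter than +, so this reads as (letter %in% v1) + 1L:
# FALSE becomes 1 and TRUE becomes 2
idx <- letter %in% v1 + 1L

# Use the 1/2 index to pick a label for every element
labels <- c("consonants", "vowels")[idx]
labels

# tapply then sums freq within each label
freq <- c(10L, 20L, 30L, 40L)     # hypothetical frequencies
tapply(freq, labels, FUN = sum)
```

The result of tapply is a named vector, which is why the answer can use names(v2) to fill the letter column of the new rows.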