Question

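I have some data (below) to which I would like to iteratively add columns, each holding the sum of an existing column within some grouping variable, and I want to name each new column by pasting "_tot" onto the current name. I think some combination of dplyr and lapply is the way to solve it, but I can't get the structure right.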

set.seed(1234)
data <- data.frame(
    biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
    region = sample(c("mideast","americas"), 50, replace = TRUE),
    june = sample(1:50, 50, replace=TRUE),
    july = sample(100:150, 50, replace=TRUE)
    )

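So, what I want to do is group this data by "region" and then, for each of the month columns, add a new column holding the sum of that month's values within the region (the real data frame has many more periods than the two below).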

Basically, I want to apply this function

library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))

to every month, without having to specify "june" or "july". My initial take:

testfun <- function(df, col) {
    name <- paste(col, "_tot", sep="")
    data2 <- df %>% group_by(region) %>% summarise(name=sum(col))
    return(data2)
}

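But this doesn't work, because I have to specify the column on which to call the function. Of course, removing the "col" argument from the function doesn't work either.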

Any ideas on how to pass this kind of argument?

Answer 1

Here is a possible way to solve the problem with dplyr (first, since that is what you tried), followed by data.table and base R solutions:

dplyr:

cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)

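This assumes that every column except the first two holds monthly data. Explanation, line by line:

1. We use as.name and lapply to build a list of the column names we want to mutate, as symbols.
2. We give the list of symbols from 1 the new names we want (i.e. month_tot).
3. We use mutate_each_q (called mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions we created in 1 and 2.

Here is the (sample) result: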

Source: local data frame [50 x 6]
Groups: region

        biz   region june july june_tot july_tot
1  shipping  mideast   17  124      780     3339
2     telco americas   11  101      465     2901
3     telco  mideast   27  131      780     3339
4      tech americas   24  135      465     2901
... rows omitted
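
mutate_each_q and mutate_each_ belong to early dplyr releases and, as far as I know, are deprecated or removed in later ones. The following is a minimal sketch of the same idea with across(), assuming dplyr 1.0.0 or newer; it is not part of the original answer, and the name `result` is only illustrative.

```r
library(dplyr)

# Same toy data as in the question
set.seed(1234)
data <- data.frame(
    biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
    region = sample(c("mideast", "americas"), 50, replace = TRUE),
    june = sample(1:50, 50, replace = TRUE),
    july = sample(100:150, 50, replace = TRUE)
)

# across() skips the grouping column (region) on its own; -biz drops the
# other non-month column, and .names builds "<column>_tot" for each new column
result <- data %>%
    group_by(region) %>%
    mutate(across(-biz, sum, .names = "{.col}_tot")) %>%
    ungroup()

head(result)
```

As in the original code, this treats every column other than biz and region as a month column.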

data.table:

library(data.table)
new.names <- paste0(tail(names(data), 2L), "_tot")  # Make new names
data.table(data)[,
  (new.names):=lapply(.SD, sum),    # `lapply` `sum` to the selected columns (those in .SD), and assign to `new.names` columns
  by=region, .SDcols=-1             # group by `region`, and exclude first column from `.SD` (note `region` is excluded as well by reason of being in `by`)
][]                                 # extra `[]` just to force printing

The logic here is similar, except that we use the special .SD object, which represents every column of the data.table that we are not grouping by.
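
If relying on column positions feels fragile, the month columns can also be selected by name. This is a small sketch along the same lines, not part of the original answer; the names `dt` and `month.cols` are introduced here for illustration only.

```r
library(data.table)

# Same toy data as in the question
set.seed(1234)
data <- data.frame(
    biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
    region = sample(c("mideast", "americas"), 50, replace = TRUE),
    june = sample(1:50, 50, replace = TRUE),
    july = sample(100:150, 50, replace = TRUE)
)

dt <- as.data.table(data)

# Everything that is not an identifier column is treated as a month column
month.cols <- setdiff(names(dt), c("biz", "region"))
new.names <- paste0(month.cols, "_tot")

# .SDcols restricts .SD to the month columns; := adds the totals by reference
dt[, (new.names) := lapply(.SD, sum), by = region, .SDcols = month.cols]

head(dt)
```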

base:

do.call(
  cbind, 
  list(
    data, 
    setNames(
      lapply(data[-(1:2)], function(x) ave(x, data$region, FUN=sum)),
      paste0(names(data[-(1:2)]), "_tot")
) ) )

We use ave to compute the sum for each region, lapply to apply ave to each column, and do.call(cbind, ...) to rebuild the final data frame.
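
To see what ave contributes, here is a tiny illustration on made-up vectors (`x`, `g` and `totals` are hypothetical names, not from the data above). ave returns a vector as long as its input, with each element replaced by the result of FUN for its group.

```r
x <- c(1, 2, 3, 4)
g <- c("a", "b", "a", "b")

# Group "a" sums to 1 + 3 = 4, group "b" to 2 + 4 = 6,
# and each position receives its own group's total
totals <- ave(x, g, FUN = sum)
print(totals)  # 4 6 4 6
```

Because the result lines up with the input row by row, it can be bound straight onto the original data frame as a new column.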

Answer 2

Try:

> for(i in 3:4) print(tapply(data[[i]], data$region, sum))
americas  mideast 
     563      768 
americas  mideast 
    2538     3802

If needed, you can collect all of the output in a list.
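
One way to do that, sketched here with lapply in place of the for loop (the name `totals` is illustrative):

```r
# Same toy data as in the question
set.seed(1234)
data <- data.frame(
    biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
    region = sample(c("mideast", "americas"), 50, replace = TRUE),
    june = sample(1:50, 50, replace = TRUE),
    july = sample(100:150, 50, replace = TRUE)
)

# One list element per month column; each element holds the per-region sums
totals <- lapply(data[3:4], function(x) tapply(x, data$region, sum))

str(totals)
```

The list is named after the month columns, so totals$june and totals$july can be looked up directly.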

Answer 3

Reshaping the data works well.

require(tidyr)
# wide to long
d2 <- gather(data = data,key = month,value = monthval,-c(biz,region))

# get totals and rename month
month_tots <- aggregate(x = list(total = d2$monthval),by = list(region = d2$region,month = d2$month),sum)
month_tots$month <- paste0(month_tots$month,'_tot')

# long to wide
month_tots <- spread(data = month_tots,key = month,value = total)

# recombine
merge(data,month_tots,by = 'region',all.x = T)
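
For readers on a newer tidyr, the same reshape can be sketched with pivot_longer and pivot_wider, which the tidyr documentation recommends over gather and spread. This assumes tidyr 1.0.0 or newer and is not part of the original answer; `month_wide` and `result` are illustrative names.

```r
library(tidyr)

# Same toy data as in the question
set.seed(1234)
data <- data.frame(
    biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
    region = sample(c("mideast", "americas"), 50, replace = TRUE),
    june = sample(1:50, 50, replace = TRUE),
    july = sample(100:150, 50, replace = TRUE)
)

# wide to long: one row per (biz, region, month) observation
d2 <- pivot_longer(data, cols = -c(biz, region),
                   names_to = "month", values_to = "monthval")

# totals per region and month, then rename the months
month_tots <- aggregate(monthval ~ region + month, data = d2, FUN = sum)
month_tots$month <- paste0(month_tots$month, "_tot")

# long to wide: one row per region, one column per "<month>_tot"
month_wide <- pivot_wider(month_tots, names_from = month, values_from = monthval)

# recombine with the original rows
result <- merge(data, month_wide, by = "region", all.x = TRUE)
head(result)
```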

Iteratively creating columns based on a grouping variable
