Question

I have a dataset similar to the following, but with many more columns and rows:

a<-c("Fred","John","Mindy","Mike","Sally","Fred","Alex","Sam")
b<-c("M","M","F","M","F","M","M","F")
c<-c(40,35,25,50,25,40,35,40)
d<-c(9,7,8,10,10,9,5,8)
df<-data.frame(a,b,c,d)
colnames(df)<-c("Name", "Gender", "Age", "Score")

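I need to create a function that lets me sum the scores for a chosen subset of the data. However, the chosen subsets may involve a different number of variables each time. One subset could be Name == "Fred" and another could be Gender == "M" & Age == 40. In my actual dataset, up to 20 columns can be used to define a subset, so I need this to be as general as possible.

I tried using an sapply command that included eval(parse(text = ...)), but it takes a long time even on a sample of only 20,000 or so records. I'm sure there is a faster way, and I'd appreciate any help in finding it.

Thanks in advance!

Sparky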

Answer 1

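There are several ways to represent these two variables. One is as two separate objects; another is as two elements of a list.

However, using a named list is probably the simplest: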

# df is a function for the F distribution.  Avoid using "df" as a variable name
DF <- df

example1 <- list(Name = c("Fred"))  # c() not needed, used for emphasis
example2 <- list(Gender = c("M"), Age=c(40, 50))

## notice that the key portion is `DF[[nm]] %in% ll[[nm]]`

subByNmList <- function(ll, DF, colsToSum=c("Score")) {
    ret <- vector("list", length(ll))
    names(ret) <- names(ll)
    for (nm in names(ll))
        ret[[nm]] <- colSums(DF[DF[[nm]] %in% ll[[nm]] , colsToSum, drop=FALSE])

    # optional
    if (length(ret) == 1)
        return(unlist(ret, use.names=FALSE))

    return(ret)
}

subByNmList(example1, DF)
subByNmList(example2, DF)
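
Note that subByNmList sums the matching rows for each list element separately. If the conditions should instead be combined with AND, as in Gender == "M" & Age == 40 from the question, a minimal sketch might look like the following. The helper name sumByAllConds is hypothetical (it is not part of the original answer), the data frame simply repeats the question's sample data, and the sketch assumes every name in the list is a column of DF.

```r
# Hypothetical helper (not from the original answer): keep only the rows that
# satisfy every condition in the named list, then sum the requested columns.
sumByAllConds <- function(ll, DF, colsToSum = c("Score")) {
    # one logical vector per condition: TRUE where the column value is allowed
    masks <- lapply(names(ll), function(nm) DF[[nm]] %in% ll[[nm]])
    # combine with AND; starting from all-TRUE also covers an empty list
    keep <- Reduce(`&`, masks, rep(TRUE, nrow(DF)))
    colSums(DF[keep, colsToSum, drop = FALSE])
}

# Same sample data as in the question
DF <- data.frame(
    Name   = c("Fred", "John", "Mindy", "Mike", "Sally", "Fred", "Alex", "Sam"),
    Gender = c("M", "M", "F", "M", "F", "M", "M", "F"),
    Age    = c(40, 35, 25, 50, 25, 40, 35, 40),
    Score  = c(9, 7, 8, 10, 10, 9, 5, 8)
)

sumByAllConds(list(Name = "Fred"), DF)
sumByAllConds(list(Gender = "M", Age = 40), DF)
```

Because each condition is a vectorized %in% test, no eval(parse(...)) is needed, and the same call works for any number of columns in the list.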

Answer 2

lapply( subset( df, Gender == "M" & Age == 40, select=Score), sum)
#$Score
#[1] 18

I could have written:

sum( subset( df, Gender == "M" & Age == 40, select=Score) )

But that would not generalize well.
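
One plausible reading of "generalize" here (an interpretation, not something the answer states) is the case where more than one column is selected. The sketch below reuses the question's sample data and adds Age to the selection purely for illustration: lapply() returns one sum per column, while sum() collapses the whole subset into a single number.

```r
# Same sample data as in the question
df <- data.frame(
    Name   = c("Fred", "John", "Mindy", "Mike", "Sally", "Fred", "Alex", "Sam"),
    Gender = c("M", "M", "F", "M", "F", "M", "M", "F"),
    Age    = c(40, 35, 25, 50, 25, 40, 35, 40),
    Score  = c(9, 7, 8, 10, 10, 9, 5, 8)
)

# Select two numeric columns instead of one (illustrative choice)
sel <- subset(df, Gender == "M" & Age == 40, select = c(Age, Score))

per_column  <- lapply(sel, sum)  # list with one sum per selected column
grand_total <- sum(sel)          # all selected values added together

per_column
grand_total
```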

Dynamic subsetting in R