R: error when subsetting a data frame and then using sapply

Time: 2014-12-16 00:05:38

Tags: r subset sapply

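I am trying to run a regression (lm) on groups of data (counties) in a data frame. First, though, I want to filter that data frame (dat) to exclude groups with too few data points. As long as I don't subset the data frame first, everything works: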

tmp1 <- with(dat, 
    by(dat, County,
        function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp1, function(x) summary(x)$adj.r.squared)

I get back what I expect:

  

 Barrow Carroll Cherokee Clayton    Cobb  Dekalb Douglas
0.00000     NaN  0.61952 0.69591 0.48092 0.61292 0.39335

However, when I subset the data frame first:

dat.counties <- aggregate(dat[,"County"], by=list(County), FUN=length)
good.counties <- as.matrix(subset(dat.counties, x > 20, select=Group.1))
dat.temp <- dat["County" %in% good.counties,]

and then run the same code:

tmp2 <- with(dat, 
by(dat, County,
    function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)

I get the following error: "$ operator is invalid for atomic vectors". If I then run summary(tmp2), I see the following:

         Length Class  Mode
Barrow    0     -none- NULL
Carroll   0     -none- NULL
Cherokee 12     lm     list
Clayton  12     lm     list

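sapply is evidently choking on the objects of class -none-. But those are the very groups I excluded above! How are they still showing up in my new data frame?!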

Thanks for any enlightenment.

1 Answer:

Answer 0: (score: 1)

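Some parts of the code are unclear. Perhaps you attached the dataset with attach. Also, as @BrodieG commented, the second call uses the wrong data frame, dat, instead of dat.temp. As for the error, it is probably because the unused levels of the factor column County were not dropped after subsetting. You can try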

dat.temp1 <- droplevels(dat.temp)
tmp2 <- with(dat.temp1, 
      by(dat.temp1, County,
      function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
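As a minimal illustration of why the excluded groups can still appear, the sketch below uses a small made-up factor (the names f and sub are hypothetical, not from the question). Subsetting a factor keeps its full set of levels, so a removed group remains as a level with zero observations until droplevels is applied.

```r
# Toy factor; "b" plays the role of a county that gets filtered out
f <- factor(c("a", "a", "b", "c"))
sub <- f[f != "b"]

levels(sub)               # "b" is still listed as a level
table(sub)                # ... with a count of 0
levels(droplevels(sub))   # only the levels that are actually present
```

Grouping functions such as by iterate over the levels, not over the values present, which is how empty groups end up in the result.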

Here is an example that reproduces the error:

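set.seed(24)
d <- data.frame(
 state = rep(c('NY', 'CA','MD', 'ND'), c(10,10,6,7)),
 year = sample(1:10,33,replace=TRUE),
 response= rnorm(33),
 stringsAsFactors = TRUE  # needed in R >= 4.0.0, where strings are no longer converted to factors by default
)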

 tmp1 <- with(d, by(d, state, function(x) lm(formula=response~year, data=x)))
 sapply(tmp1, function(x) summary(x)$adj.r.squared)
 #       CA          MD          ND          NY 
 # 0.03701114 -0.04988296 -0.07817515 -0.11850038 

d.states <- aggregate(d[,"state"], by=list(d[,'state']), FUN=length)
good.states <- as.matrix(subset(d.states, x > 6, select=Group.1))
d.sub <-  d[d$state %in% good.states[,1],]

tmp2 <- with(d.sub, 
    by(d.sub, state,
      function(x) lm(formula = response~year, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#Error in summary(x)$adj.r.squared : 
# $ operator is invalid for atomic vectors
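The wording of the error can be traced to what summary does with a NULL element. The sketch below is only an illustration (the names s and msg are made up): summary(NULL) returns a small atomic summary object, not a list, so extracting a component with $ fails.

```r
# summary() of NULL is an atomic object with Length/Class/Mode entries
s <- summary(NULL)
is.atomic(s)

# Capture the error raised by `$` instead of stopping the script
msg <- tryCatch(s$adj.r.squared, error = function(e) conditionMessage(e))
msg
```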

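If you look at the second element, it is NULL, because MD is still a level of the factor even though no MD rows remain:

tmp2[2]
#$MD
#NULL

Dropping the unused levels before calling by fixes this: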

d.sub1 <- droplevels(d.sub)
tmp2 <- with(d.sub1, 
      by(d.sub1, state,
          function(x) lm(formula = response~year, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#       CA          ND          NY 
# 0.03701114 -0.07817515 -0.11850038
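An alternative that avoids the empty groups without calling droplevels is sketched below. It is not from the original answer, and the data and the names counts, keep, fits and adj are made up for illustration. It counts rows per group with table and relies on the drop = TRUE argument of split to omit factor levels that have no rows.

```r
set.seed(1)
d <- data.frame(
  state = factor(rep(c("NY", "CA", "MD"), c(10, 10, 4))),
  year = sample(1:10, 24, replace = TRUE),
  response = rnorm(24)
)

# Keep only the groups with more than 6 rows
counts <- table(d$state)
keep <- names(counts)[counts > 6]
d.sub <- d[d$state %in% keep, ]

# drop = TRUE leaves out levels with no rows, so lm never sees an empty group
fits <- lapply(split(d.sub, d.sub$state, drop = TRUE),
               function(x) lm(response ~ year, data = x))
adj <- sapply(fits, function(x) summary(x)$adj.r.squared)
names(adj)
```

Another option, if the by result already exists, is to remove the NULL entries with Filter(Negate(is.null), tmp2) before calling sapply.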