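Computing RMSE with ddply

Time: 2015-08-04 12:04:05

Tags: r statistics dataframe plyr

I am using ddply to compute the RMSE, along with other summary statistics, for each (id, criteria) combination in a large data frame. The structure of the data frame is: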

'data.frame':   107955 obs. of  11 variables:
 $ date         : Factor w/ 1077 levels "2012-08-17","2012-08-18",..: 487 488 489 490 491 492 493 494 495 496 ...
 $ value        : num  
 $ mean         : num  
 $ accuracy     : num  
 $ id           : int  
 $ criteria     : Factor w/ 5 levels 

I tried the following:

ddply(foo, .(id, criteria), summarize, mean=mean(accuracy, na.rm=T), median=median(accuracy, na.rm=T), rmse=sqrt(sum((mean - value)^2 , na.rm = TRUE ) / nrow(foo)))

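nrow(foo) gives the number of rows in the whole data frame, not the number of rows in the (id, criteria) slice.

I also tried nrow(.(id, criteria)), which is clearly wrong.

Sample data: http://pastebin.com/8m0vD5Bq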

ddply(foo, .(id, criteria), summarize, mean=mean(accuracy, na.rm=T), median=median(accuracy, na.rm=T), rmse=sqrt(sum((mean - value)^2 , na.rm = TRUE ) / n()))

   id criteria   mean median   rmse
1  49        g 123.00  123.0 101.00
2  49        h 115.25   72.0  80.31
3  49        I 196.00  110.0 173.75
4  50        f 191.75  204.5 168.59
5  50        g 649.00  275.0 634.92
6  51        d 180.00  180.0 160.00
7  51        e 378.67  137.5 359.19
8  51        f 247.00  247.0 227.08
9  52        a 109.00  107.0  74.18
10 52        b  76.33   45.0  46.31
11 52        d  98.67  100.0  64.56

Computing the RMSE by hand for id = 50 and criteria = 'g':

 sub_foo <- foo[foo$id == 50 & foo$criteria=='g',]

R> sub_foo
         date value mean accuracy id criteria
23 2014-01-08     2   37     1850 50        g
24 2014-01-09    12   33      275 50        g
25 2014-01-10    19   48      253 50        g
26 2014-01-11    35   35      100 50        g
27 2014-01-12     3   23      767 50        g

R> sqrt(sum((sub_foo$mean -sub_foo$value)^2 , na.rm = TRUE ) / nrow(sub_foo))
[1] 24.11

The expected RMSE is 24.11, not the 634.92 that I get from ddply, which is wrong.

Edit: adding the dput() output of the data frame
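A quick check on the five rows above hints at where 634.92 might come from. The sketch below assumes that summarize evaluates its arguments in order, so that the new `mean` summary (the group mean of accuracy) replaces the `mean` column before the rmse expression runs. The name `mean_col` is hypothetical and only used here to keep the two quantities apart.

```r
# Rows for id = 50, criteria = 'g', copied from sub_foo above.
value    <- c(2, 12, 19, 35, 3)
mean_col <- c(37, 33, 48, 35, 23)   # the per-row `mean` column (hypothetical name)
accuracy <- c(1850, 275, 253, 100, 767)

# RMSE using the per-row `mean` column: the intended calculation.
rmse_intended <- sqrt(sum((mean_col - value)^2) / length(value))

# RMSE if `mean` is instead the single group mean of accuracy (649).
rmse_masked <- sqrt(sum((mean(accuracy) - value)^2) / length(value))

print(round(c(intended = rmse_intended, masked = rmse_masked), 2))
```

The first value reproduces the hand calculation (24.11) and the second reproduces the ddply output (634.92). So the discrepancy for this group seems to come from the name clash rather than from the divisor alone.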

R>dput(foo)
structure(list(date = structure(1:36, .Label = c("2013-12-17", 
"2013-12-18", "2013-12-19", "2013-12-20", "2013-12-21", "2013-12-22", 
"2013-12-23", "2013-12-24", "2013-12-25", "2013-12-26", "2013-12-27", 
"2013-12-28", "2013-12-29", "2013-12-30", "2013-12-31", "2014-01-01", 
"2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06", 
"2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-11", 
"2014-01-12", "2014-01-13", "2014-01-14", "2014-01-15", "2014-01-16", 
"2014-01-17", "2014-01-18", "2014-01-19", "2014-01-20", "2014-01-21"
), class = "factor"), value = c(33L, 30L, 42L, 15L, 36L, 44L, 
31L, 30L, 42L, 20L, 25L, 9L, 25L, 17L, 3L, 39L, 14L, 26L, 14L, 
41L, 23L, 16L, 2L, 12L, 19L, 35L, 3L, 22L, 8L, 50L, 48L, 41L, 
30L, 40L, 6L, 15L), mean = c(33L, 36L, 45L, 25L, 6L, 20L, 34L, 
30L, 36L, 36L, 19L, 49L, 11L, 32L, 40L, 34L, 47L, 41L, 45L, 15L, 
25L, 48L, 37L, 33L, 48L, 35L, 23L, 27L, 24L, 28L, 42L, 7L, 14L, 
37L, 31L, 19L), accuracy = c(100L, 120L, 107L, 167L, 17L, 45L, 
110L, 100L, 86L, 180L, 76L, 544L, 44L, 188L, 1333L, 87L, 336L, 
158L, 321L, 37L, 109L, 300L, 1850L, 275L, 253L, 100L, 767L, 123L, 
300L, 56L, 88L, 17L, 47L, 93L, 517L, 127L), id = c(52L, 52L, 
52L, 52L, 52L, 52L, 52L, 52L, 52L, 51L, 51L, 51L, 51L, 51L, 51L, 
51L, 51L, 51L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 49L, 
49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L), criteria = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 
8L, 8L, 8L), .Label = c("a", "b", "d", "e", "f", "g", "h", "I"
), class = "factor")), .Names = c("date", "value", "mean", "accuracy", 
"id", "criteria"), class = "data.frame", row.names = c(NA, -36L
))

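1 answer:

Answer 0 (score: 0)

The solution that worked for me was to use a custom function instead of summarize. Inside the function I can call nrow() to get the number of rows in the current slice.

Solution: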

metrics <- ddply(foo, c("id", "criteria"), function(df) {
  data.frame(
    mean   = mean(df$accuracy, na.rm = TRUE),
    median = median(df$accuracy, na.rm = TRUE),
    # nrow(df) is the row count of the current (id, criteria) slice
    rmse   = sqrt(sum((df$mean - df$value)^2, na.rm = TRUE) / nrow(df))
  )
})
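As a variation on the custom-function approach, the following is a minimal sketch that keeps summarise and gets the per-slice row count from length(value). It uses only the id = 50 rows from the dput() output in the question. The names `foo_small`, `metrics_small`, `mean_acc` and `median_acc` are hypothetical, and the summary columns are renamed on the assumption that a summary called `mean` would otherwise mask the `mean` column.

```r
library(plyr)

# Rows for id 50 copied from the dput() output in the question.
foo_small <- data.frame(
  value    = c(14, 41, 23, 16, 2, 12, 19, 35, 3),
  mean     = c(45, 15, 25, 48, 37, 33, 48, 35, 23),
  accuracy = c(321, 37, 109, 300, 1850, 275, 253, 100, 767),
  id       = 50L,
  criteria = c("f", "f", "f", "f", "g", "g", "g", "g", "g")
)

# mean_acc and median_acc are hypothetical names chosen so that the
# summary columns do not collide with the existing `mean` column.
metrics_small <- ddply(
  foo_small, c("id", "criteria"), summarise,
  mean_acc   = mean(accuracy, na.rm = TRUE),
  median_acc = median(accuracy, na.rm = TRUE),
  # length(value) counts the rows of the current slice only
  rmse       = sqrt(sum((mean - value)^2, na.rm = TRUE) / length(value))
)
print(metrics_small)
```

For the (50, g) slice this should give an rmse of about 24.11, in line with the manual calculation in the question.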

Thanks for the pointers.