Why do aggregate and lapply produce different results?

Asked: 2016-10-12 18:02:15

Tags: r aggregate xts lapply zoo

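I am trying to estimate, for each day, the mean number of seconds between consecutive non-NA observations of A and of B in the following sample data: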

dput(tt2)
structure(c(1371.25, NA, 1373.95, NA, NA, 1373, NA, 1373.95, 
1373.9, NA, NA, 1374, 1374.15, NA, 1374, 1373.85, 1372.55, 1374.05, 
1374.15, 1374.75, NA, NA, 1375.9, 1374.05, NA, NA, NA, NA, NA, 
NA, NA, 1375, NA, NA, NA, NA, NA, 1376.35, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 1376.25, NA, 1378, 1376.5, NA, NA, NA, 1378, 
1378, NA, NA, 1378.8, 231.9, 231.85, NA, 231.9, 231.85, 231.9, 
231.8, 231.9, 232.6, 231.95, 232.35, 232, 232.1, 232.05, 232.05, 
232.05, 231.5, 231.3, NA, NA, 231.1, 231.1, 231.1, 231, 231, 
230.95, 230.6, 230.6, 230.7, 230.6, 231, NA, 231, 231, 231.45, 
231.65, 231.4, 231.7, 231.3, 231.25, 231.25, 231.4, 231.4, 231.85, 
231.75, 231.5, 231.55, 231.35, NA, 231.5, 231.5, NA, 231.5, 231.25, 
231.15, 231, 231, 231, 231.05, NA), .Dim = c(60L, 2L), .indexCLASS = c("POSIXct", 
"POSIXt"), tclass = c("POSIXct", "POSIXt"), .indexTZ = "Asia/Calcutta", tzone = "Asia/Calcutta", index = structure(c(1459482300, 
1459483766.38983, 1459485231.77966, 1459486697.16949, 1459488162.55932, 
1459489627.94915, 1459491093.33898, 1459492558.72881, 1459494025.11864, 
1459495490.50847, 1459496955.89831, 1459498421.28814, 1459499887.67797, 
1459501353.0678, 1459502818.45763, 1459504283.84746, 1459505749.23729, 
1459507214.62712, 1459508680.01695, 1459510145.40678, 1459511610.79661, 
1459513076.18644, 1459514541.57627, 1459516007.9661, 1459517474.35593, 
1459518939.74576, 1459520405.13559, 1459521870.52542, 1459523335.91525, 
1459524804.30508, 1459526269.69492, 1459527735.08475, 1459529200.47458, 
1459530667.86441, 1459532134.25424, 1459533600.64407, 1459535066.0339, 
1459536531.42373, 1459537996.81356, 1459539702.20339, 1459541167.59322, 
1459542634.98305, 1459544100.37288, 1459545565.76271, 1459547031.15254, 
1459548496.54237, 1459549961.9322, 1459551429.32203, 1459552894.71186, 
1459554360.10169, 1459555829.49153, 1459557294.88136, 1459558760.27119, 
1459560225.66102, 1459561691.05085, 1459563160.44068, 1459564625.83051, 
1459566091.22034, 1459567557.61017, 1459569028), tclass = c("POSIXct", 
"POSIXt"), tzone = "Asia/Calcutta"), .Dimnames = list(NULL, c("A", 
"B")), class = c("xts", "zoo"))

I can do this in two ways:

  

1

 fun.time=function(x) mean(diff(as.numeric(time(na.omit(x)))))
my.df.time<-do.call(rbind, lapply(split(tt2, "days"), FUN=function (x) {do.call(cbind, lapply(x, fun.time))})) 

my.df.time
            A        B
[1,] 3029.006 1648.939
[2,] 5416.096 1632.957
  

2

df.time<-do.call(cbind, lapply(as.list(tt2), function(x) {
  times <- time(na.omit(x))
  aggregate(zoo(as.numeric(times), times), as.Date, function(x) mean(diff(x)))
}))

df.time
                  A        B
2016-04-01 4152.630 1637.730
2016-04-02 3299.627 1675.446

Can you explain why the values in columns A and B differ between the two approaches?

1 Answer:

Answer 0: (score: 2)

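The difference is that as.Date computes the date in UTC, whereas split(tt2, "days") splits at midnight in the index's local time zone (Asia/Calcutta, which is UTC+5:30). Observations that fall between 00:00 and 05:30 local time therefore belong to the previous day under as.Date but to the current day under split: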

> tail(data.frame(tt2, utcDate=as.Date(index(tt2))), 12)
                          A      B    utcDate
2016-04-02 04:51:34 1376.25     NA 2016-04-01
2016-04-02 05:16:00      NA 231.50 2016-04-01
2016-04-02 05:40:29 1378.00 231.50 2016-04-02
2016-04-02 06:04:54 1376.50     NA 2016-04-02
2016-04-02 06:29:20      NA 231.50 2016-04-02
2016-04-02 06:53:45      NA 231.25 2016-04-02
2016-04-02 07:18:11      NA 231.15 2016-04-02
2016-04-02 07:42:40 1378.00 231.00 2016-04-02
2016-04-02 08:07:05 1378.00 231.00 2016-04-02
2016-04-02 08:31:31      NA 231.00 2016-04-02
2016-04-02 08:55:57      NA 231.05 2016-04-02
2016-04-02 09:20:28 1378.80     NA 2016-04-02
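
The same effect can be reproduced on a single timestamp with base R only. This is a minimal sketch; the timestamp is taken from the table above, and the `tz` argument is passed explicitly in both calls so the result does not depend on the default of `as.Date` for POSIXct input:

```r
# One observation from the table above, expressed in the index's time zone
t_obs <- as.POSIXct("2016-04-02 05:16:00", tz = "Asia/Calcutta")

# Date of that instant in UTC (05:16 local time is 23:46 UTC on the previous day)
utc_date <- as.Date(t_obs, tz = "UTC")

# Date of the same instant in the local time zone
local_date <- as.Date(t_obs, tz = "Asia/Calcutta")

print(utc_date)    # 2016-04-01
print(local_date)  # 2016-04-02
```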

Which one is correct depends on what you want. A more concise way to do this with the tools in xts is apply.daily:

meanTimeDiff <- function(x) {
  mean(diff(.index(na.omit(x))))
}
apply.daily(tt2, function(x) sapply(x, meanTimeDiff))
#                            A        B
# 2016-04-01 23:54:26 3029.006 1648.939
# 2016-04-02 09:20:28 5416.096 1632.957
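
If you prefer to keep the aggregate approach but want it to group by local days, one option is to pass the time zone to as.Date inside the grouping function. The following is a sketch on a hypothetical toy series (`idx`, `z`, and the two grouping helpers are illustrative names, not part of the question), assuming the zoo package is installed:

```r
library(zoo)

# Hypothetical toy data: 20 observations, 30 minutes apart,
# starting at 22:00 local time so they straddle local midnight
idx <- as.POSIXct("2016-04-01 22:00:00", tz = "Asia/Calcutta") +
  seq(0, by = 1800, length.out = 20)
z <- zoo(as.numeric(idx), idx)

# Grouping functions: date of each timestamp in local time vs. in UTC
by_local_day <- function(t) as.Date(t, tz = "Asia/Calcutta")
by_utc_day   <- function(t) as.Date(t, tz = "UTC")

# Number of observations assigned to each day under each grouping
n_local <- aggregate(z, by_local_day, length)
n_utc   <- aggregate(z, by_utc_day, length)

# Mean seconds between observations per local day
mean_gap_local <- aggregate(z, by_local_day, function(v) mean(diff(v)))

print(n_local)  # 4 observations on 2016-04-01, 16 on 2016-04-02
print(n_utc)    # 15 observations on 2016-04-01, 5 on 2016-04-02
```

The two groupings put the day boundary 5.5 hours apart, which is why the per-day means in the question differ between the two methods.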