Mean difference in days within two grouping variables

Time: 2018-06-18 14:58:32

Tags: r date dplyr


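I'm working with some data that has id, month and date columns. I'd like the mean difference between dates for each id and each month (so two grouping variables). I've read this post and tried to adapt the answer (which groups by id only, not by month), without luck.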

My data looks like this:

test <-structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"), 
                       month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                                1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 
                                3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
                                3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
                                4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
                                4, 4, 4, 4, 4, 4), 
                       date = structure(c(17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 
                                                                     17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 
                                                                     17555, 17579, 17579, 17579, 17579, 17579, 17579, 17579, 17579, 
                                                                     17579, 17579, 17579, 17579, 17618, 17618, 17618, 17618, 17618, 
                                                                     17618, 17618, 17618, 17618, 17618, 17618, 17621, 17621, 17621, 
                                                                      17621, 17621, 17621, 17621, 17621, 17621, 17621, 17621, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649), class = "Date")),class="data.frame",row.names = c(NA,-98L))

It looks like this (sorry for the dput(), but it is the least painful way to share a data sample):

head(test)
  id month       date
1  1     1 2018-01-24
2  1     1 2018-01-24
3  1     1 2018-01-24
4  1     1 2018-01-24
5  1     1 2018-01-24
6  1     1 2018-01-24

So I tried this:

library(dplyr)
test %>%
  group_by(id, month) %>%
  arrange(date) %>%
  summarize(avg = as.numeric(mean(diff(date)))) %>%
  data.frame()

The result is:

> result
  id month       avg
1  1     1 0.0000000
2  1     2 0.0000000
3  1     3 0.1428571
4  1     4 0.0000000

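However, looking at the data, March is a problem: the March dates are the 28th and the 31st, their difference is 3, so the mean of the differences should be 3 (there is only one gap).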

> table(test[which(test$month==3),]$date)

2018-03-28 2018-03-31 
        11         11 


What am I doing wrong?
Thanks in advance.

1 answer:

Answer 0 (score: 3)

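The result you get is correct: diff(date) computes the differences between all consecutive pairs of dates in the data (within each group, after sorting by date). In March you have 2018-03-28 eleven times and 2018-03-31 eleven times. So in March, diff(date) is 0 ten times, then 3 once, then 0 ten more times. That is 21 differences summing to 3, so the mean is 3/21 = 0.143.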
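The arithmetic can be checked in base R on a vector that mimics the March dates. This is only a minimal sketch; the name `march` is introduced here for illustration and is not part of the question's code.

```r
# Hypothetical stand-in for the March dates: 11 copies of each day
march <- as.Date(rep(c("2018-03-28", "2018-03-31"), each = 11))

d <- as.numeric(diff(march))  # 21 consecutive differences, in days
table(d)                      # 20 zeros and a single 3
mean(d)                       # 3/21, about 0.143
```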

Maybe you want to reduce the data to the distinct combinations of (id, month, date) first:

test %>%
  distinct(id, month, date) %>%
  group_by(id,month)%>%
  arrange(date) %>%
  summarize(avg = as.numeric(mean(diff(date)))) %>%
  data.frame()

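Note that this outputs 3 for March but NaN for the other months, because you are asking for diff on a vector of length 1, which gives a vector of length 0. Alternatively, you can use the following, which returns 0 for a month with a single distinct date: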

test %>%
  distinct(id, month, date) %>%
  group_by(id,month)%>%
  arrange(date) %>%
  summarize(avg = as.numeric(max(date)-min(date)) / max(1, n()-1))
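As far as I can tell, this works because the consecutive differences of sorted distinct dates telescope: they sum to max(date) - min(date), and there are n() - 1 of them. The max(1, n() - 1) only guards the single-date case against division by zero. A small base R sketch of that reasoning, where the vectors `several` and `single` and the helper `span_avg` are made up for illustration:

```r
# Made-up example vectors, not taken from the question's data
several <- as.Date(c("2018-03-01", "2018-03-28", "2018-03-31"))
single  <- as.Date("2018-01-24")

# Hypothetical helper mirroring the summarize() expression above,
# written for a plain vector (length(x) plays the role of n())
span_avg <- function(x) as.numeric(max(x) - min(x)) / max(1, length(x) - 1)

mean(as.numeric(diff(several)))  # 15: mean of the gaps 27 and 3
span_avg(several)                # 15: same value, 30 days / 2 gaps
mean(as.numeric(diff(single)))   # NaN: diff() of one value is empty
span_avg(single)                 # 0
```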