Question

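I'm working with some data that has id, month, and date columns. I want the mean difference in days for each id and each month (so two grouping variables). I've read this post and tried to adapt the answer (which groups by ID only, not by month), without luck.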

My data looks like this:

test <-structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"), 
                       month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                                1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 
                                3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
                                3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
                                4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
                                4, 4, 4, 4, 4, 4), 
                       date = structure(c(17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 
                                                                     17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 
                                                                     17555, 17579, 17579, 17579, 17579, 17579, 17579, 17579, 17579, 
                                                                     17579, 17579, 17579, 17579, 17618, 17618, 17618, 17618, 17618, 
                                                                     17618, 17618, 17618, 17618, 17618, 17618, 17621, 17621, 17621, 
                                                                      17621, 17621, 17621, 17621, 17621, 17621, 17621, 17621, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649), class = "Date")),class="data.frame",row.names = c(NA,-98L))

which looks like this (sorry for the dput(), but it is the least painful way to share example data):

head(test)
  id month       date
1  1     1 2018-01-24
2  1     1 2018-01-24
3  1     1 2018-01-24
4  1     1 2018-01-24
5  1     1 2018-01-24
6  1     1 2018-01-24

So I tried this:

library(dplyr)

test %>%
  group_by(id, month) %>%
  arrange(date) %>%
  summarize(avg = as.numeric(mean(diff(date)))) %>%
  data.frame()

The result is:

> result
  id month       avg
1  1     1 0.0000000
2  1     2 0.0000000
3  1     3 0.1428571
4  1     4 0.0000000

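However, looking at the data, there is a problem with March: the March dates are the 28th and the 31st, their difference is 3, so the mean of the differences should be 3 (there is only one gap).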

> table(test[which(test$month==3),]$date)

2018-03-28 2018-03-31 
        11         11

What am I doing wrong?
Thanks in advance.

Answer 1

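The result you got is correct: diff(date) computes the differences between all consecutive pairs of dates in your data (within each group, after sorting by date). In March you have 2018-03-28 eleven times and 2018-03-31 eleven times. So for March, diff(date) is ten 0s, then one 3, then ten 0s. The mean is therefore 3/21 = 0.143.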
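As a quick check, here is a minimal base R sketch that rebuilds only the March dates from the question. The names march and gaps are just for illustration and do not appear in the original code.

```r
# Rebuild only the March dates from the question: 11 copies of each date
march <- as.Date(c(rep("2018-03-28", 11), rep("2018-03-31", 11)))

# 22 sorted dates give 21 consecutive differences: ten 0s, one 3, ten 0s
gaps <- as.numeric(diff(sort(march)))
table(gaps)

# Mean over all 21 gaps, i.e. 3/21
mean(gaps)

# Mean over the gaps between the distinct dates only, i.e. the single gap of 3
distinct_gaps <- as.numeric(diff(sort(unique(march))))
mean(distinct_gaps)
```

The duplicated dates contribute twenty zero-length gaps, which is what pulls the mean down from 3 to roughly 0.143.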

Perhaps you want to reduce the data to the distinct combinations of (id, month, date) first:

test %>%
  distinct(id, month, date) %>%
  group_by(id,month)%>%
  arrange(date) %>%
  summarize(avg = as.numeric(mean(diff(date)))) %>%
  data.frame()

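Note that this outputs 3 for March but NaN for the other months, because you are asking for diff on a vector of length 1, which gives a vector of length 0, and the mean of an empty vector is NaN. To get 0 for those months instead, you can use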

test %>%
  distinct(id, month, date) %>%
  group_by(id,month)%>%
  arrange(date) %>%
  summarize(avg = as.numeric(max(date)-min(date)) / max(1, n()-1))
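This second version should agree with mean(diff(date)) whenever a group has at least two distinct dates, because the consecutive differences of sorted dates sum to max(date) - min(date) and there are n() - 1 of them. A small base R sketch of that reasoning, assuming a hypothetical helper avg_gap and some made-up dates in three (neither is part of the original data):

```r
# Hypothetical helper mirroring the summarize() expression above
avg_gap <- function(dates) {
  as.numeric(max(dates) - min(dates)) / max(1, length(dates) - 1)
}

single <- as.Date("2018-01-24")                                # one distinct date
two    <- as.Date(c("2018-03-28", "2018-03-31"))               # March in the question
three  <- as.Date(c("2018-05-01", "2018-05-03", "2018-05-10")) # made-up dates

# With a single date, diff() has length 0, so its mean is NaN,
# while the max/min form falls back to 0
nan_case  <- as.numeric(mean(diff(single)))
zero_case <- avg_gap(single)

# With two or more distinct dates, both forms give the same value
same_two   <- all.equal(as.numeric(mean(diff(two))), avg_gap(two))
same_three <- all.equal(as.numeric(mean(diff(three))), avg_gap(three))
```

Whether 0 or NaN is the better answer for a month with a single distinct date depends on what you do with the result afterwards, so treat the max(1, ...) guard as one possible choice.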

Mean difference in days within two groups
