I have the following dataset, in which each value in the `value` column is valid from its start date through its end date:
library(data.table)
test = data.table(company = c("A", "A", "B", "B"), person = c("a", "b", "b", "c"), value = c(2,3,5,5), start_date = c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06"), end_date = c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
   company person value start_date   end_date
1:       A      a     2 2015-01-01 2015-01-06
2:       A      b     3 2015-01-04 2015-01-07
3:       B      b     5 2015-01-02 2015-01-07
4:       B      c     5 2015-01-06 2015-01-07
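From this data I want to compute three things for each date: the mean value per company, the number of people per company, and the number of companies.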
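I tried the approach below. It works like a charm on my test sample, but it fails on the real dataset because it needs too much computing power. I know the cause is that the expanded dataset holds a separate row for every company, person, and date, but I do not know which R function would let me avoid that.
Attempted code: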
# Convert the date columns from character to Date
test$start_date = as.Date(as.character(test$start_date), format = "%Y-%m-%d")
test$end_date = as.Date(as.character(test$end_date), format = "%Y-%m-%d")
# Indexing per row: expand every row into one row per day of its interval
indxtest = test[, .(Date = seq(from = min(start_date), to = max(end_date), by = "day")), by = 1:nrow(test)]
# Row index used to join the expanded dates back to the original columns
test = test[, nrow := 1:nrow(test)]
test = merge(indxtest, test, by = "nrow", all.x = TRUE)
# Key (and sort) by date, then company
setkey(test, Date, company)
# Statistics per company and date, and per date
test = test[, mean_value := mean(value, na.rm = TRUE), by = c("company", "Date")]
test = test[, Number_people := .N, by = c("company", "Date")]
test = test[, number_companies := uniqueN(company), by = "Date"]
My current result looks like this:
    nrow       Date company person value start_date   end_date mean_value Number_people number_companies
 1:    1 2015-01-01       A      a     2 2015-01-01 2015-01-06        2.0             1                1
 2:    1 2015-01-02       A      a     2 2015-01-01 2015-01-06        2.0             1                2
 3:    3 2015-01-02       B      b     5 2015-01-02 2015-01-07        5.0             1                2
 4:    1 2015-01-03       A      a     2 2015-01-01 2015-01-06        2.0             1                2
 5:    3 2015-01-03       B      b     5 2015-01-02 2015-01-07        5.0             1                2
 6:    1 2015-01-04       A      a     2 2015-01-01 2015-01-06        2.5             2                2
 7:    2 2015-01-04       A      b     3 2015-01-04 2015-01-07        2.5             2                2
 8:    3 2015-01-04       B      b     5 2015-01-02 2015-01-07        5.0             1                2
 9:    1 2015-01-05       A      a     2 2015-01-01 2015-01-06        2.5             2                2
10:    2 2015-01-05       A      b     3 2015-01-04 2015-01-07        2.5             2                2
11:    3 2015-01-05       B      b     5 2015-01-02 2015-01-07        5.0             1                2
12:    1 2015-01-06       A      a     2 2015-01-01 2015-01-06        2.5             2                2
13:    2 2015-01-06       A      b     3 2015-01-04 2015-01-07        2.5             2                2
14:    3 2015-01-06       B      b     5 2015-01-02 2015-01-07        5.0             2                2
15:    4 2015-01-06       B      c     5 2015-01-06 2015-01-07        5.0             2                2
16:    2 2015-01-07       A      b     3 2015-01-04 2015-01-07        3.0             1                2
17:    3 2015-01-07       B      b     5 2015-01-02 2015-01-07        5.0             2                2
18:    4 2015-01-07       B      c     5 2015-01-06 2015-01-07        5.0             2                2
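As a rough gauge of why this approach gets expensive, the number of rows in the expanded table equals the total number of days covered by all intervals, counting both endpoints. The sketch below estimates that count without building the expanded table; the names `n_days` and `expanded_rows` are illustrative only.

```r
library(data.table)

test <- data.table(
  company    = c("A", "A", "B", "B"),
  person     = c("a", "b", "b", "c"),
  value      = c(2, 3, 5, 5),
  start_date = as.Date(c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06")),
  end_date   = as.Date(c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
)

# Days covered by each row, counting both the start and the end date.
test[, n_days := as.integer(end_date - start_date) + 1L]

# Total number of rows the per-day expansion would create.
expanded_rows <- sum(test$n_days)
print(expanded_rows)
```

On the sample this gives 6 + 4 + 6 + 2 = 18 rows. On real data with long intervals the same sum grows with the interval length of every row, which is where the memory goes.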
I could not find anything relevant here apart from the solution I came up with myself, so a pointer to a reference would also be very helpful.
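Answer 0 (score: 2)
You really have to avoid this kind of join, because it blows up on larger data. You could check whether this loop is fast enough (the number of dates is probably not large; I would expect no more than three or four thousand).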
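The loop itself did not survive in this copy of the answer. What follows is only a minimal sketch of one way to loop over dates, not the answerer's original code. The names `dates`, `per_date`, and `result` are illustrative, and the output has one row per company and date rather than one row per person.

```r
library(data.table)

test <- data.table(
  company    = c("A", "A", "B", "B"),
  person     = c("a", "b", "b", "c"),
  value      = c(2, 3, 5, 5),
  start_date = as.Date(c("2015-01-01", "2015-01-04", "2015-01-02", "2015-01-06")),
  end_date   = as.Date(c("2015-01-06", "2015-01-07", "2015-01-07", "2015-01-07"))
)

# All calendar days between the earliest start and the latest end.
dates <- seq(min(test$start_date), max(test$end_date), by = "day")

per_date <- vector("list", length(dates))
for (i in seq_along(dates)) {
  d <- dates[i]  # indexing (unlike `for (d in dates)`) keeps the Date class
  # Rows whose validity interval contains this day.
  active <- test[start_date <= d & end_date >= d]
  if (nrow(active) == 0L) next
  n_comp <- uniqueN(active$company)
  per_date[[i]] <- active[, .(Date = d,
                              mean_value = mean(value, na.rm = TRUE),
                              Number_people = uniqueN(person),
                              number_companies = n_comp),
                          by = company]
}

# Stack the per-date summaries; NULL entries (days with no active rows) are skipped.
result <- rbindlist(per_date)
setcolorder(result, c("Date", "company"))
print(result)
```

Each iteration touches only the rows active on that day, so memory use stays near the size of the input plus one small summary per date. If the per-person rows are needed afterwards, the summary can be looked up by company and date instead of being stored on every expanded row.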
Answer 1 (score: 0)
Here is a tidyverse solution:
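library(tidyverse)
# Expand each interval into one row per day (start_date and end_date must already be Date)
df = df %>% as_tibble() %>%
  transmute(Date = map2(start_date, end_date, seq, by = "day"), company, person, value) %>%
  unnest(Date)
# Mean value and number of people per company and date, joined back to the rows
df1 = df %>% group_by(Date, company) %>%
  summarize(mean_value = mean(value), Number_people = n_distinct(person)) %>%
  right_join(df, by = c("company", "Date"))
# Number of companies per date, joined back as well
df2 = df %>%
  group_by(Date) %>%
  summarize(companies = n_distinct(company)) %>%
  right_join(df1, by = "Date") %>%
  arrange(Date)
df2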
   Date       companies company mean_value Number_people person value
   <date>         <int> <chr>        <dbl>         <int> <chr>  <dbl>
 1 2015-01-01         1 A              2               1 a          2
 2 2015-01-02         2 A              2               1 a          2
 3 2015-01-02         2 B              5               1 b          5
 4 2015-01-03         2 A              2               1 a          2
 5 2015-01-03         2 B              5               1 b          5
 6 2015-01-04         2 A              2.5             2 a          2
 7 2015-01-04         2 A              2.5             2 b          3
 8 2015-01-04         2 B              5               1 b          5
 9 2015-01-05         2 A              2.5             2 a          2
10 2015-01-05         2 A              2.5             2 b          3
11 2015-01-05         2 B              5               1 b          5
12 2015-01-06         2 A              2.5             2 a          2
13 2015-01-06         2 A              2.5             2 b          3
14 2015-01-06         2 B              5               2 b          5
15 2015-01-06         2 B              5               2 c          5
16 2015-01-07         2 A              3               1 b          3
17 2015-01-07         2 B              5               2 b          5
18 2015-01-07         2 B              5               2 c          5