Complete a data frame with missing date ranges for multiple parameters

Asked: 2018-07-12 15:58:20

Tags: r

I have the following data frame:

Date_from <- c("2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13")
Date_to <- c("2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18")
y <- data.frame(Date_from,Date_to)
y$concentration <- c("1.5","2.5","1.5","3.5","1.5","2.5","1.5","3.5","3")
y$Parameter<-c("A","A","A","B","B","B","C","C","C")
y$Date_from <- as.Date(y$Date_from)
y$Date_to <- as.Date(y$Date_to)
y$concentration <- as.numeric(y$concentration)

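For EACH parameter, I need to check whether the date ranges in the data frame start on the first day of the year (2013-01-01) and end on the last day of the year (2013-12-31). If they do not, I need to add an extra row at the beginning and at the end of each parameter so that its date ranges cover the whole year. The result should look like this: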

Date_from    Date_to concentration Parameter
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           1.5         A
2013-05-10 2013-08-12           2.5         A
2013-08-13 2013-11-18           1.5         A
2013-11-19 2013-12-31            NA        NA
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           3.5         B
2013-05-10 2013-08-12           1.5         B
2013-08-13 2013-11-18           2.5         B
2013-11-19 2013-12-31            NA        NA
2013-01-01 2013-01-31            NA        NA
2013-02-01 2013-05-07           1.5         C
2013-05-10 2013-08-12           3.5         C
2013-08-13 2013-11-18           3.0         C
2013-11-19 2013-12-31            NA        NA

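Please note: for simplicity, the date ranges are identical for every parameter in this example.

Update: here is a sample of my original data and my code: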

sm<-read.csv("https://www.dropbox.com/s/tft6inwcrjqujgt/Test_data.csv?dl=1",sep=";",header=TRUE)
cleaned_sm<-sm[,c(4,5,11,14)] ## Keep only the needed columns (drops the obsolete ones)
colnames(cleaned_sm)<-c("Parameter","Concentration","Date_from","Date_to")
cleaned_sm$Date_from<-as.Date(cleaned_sm$Date_from, format ="%d.%m.%Y")     
cleaned_sm$Date_to<-as.Date(cleaned_sm$Date_to, format ="%d.%m.%Y") 
# Replace the comma decimal separator with a dot, because values with a comma are not parsed as numbers
cleaned_sm=lapply(cleaned_sm, function(x) gsub(",", ".", x))
cleaned_sm<-data.frame(cleaned_sm)
cleaned_sm$Concentration <- as.numeric(cleaned_sm$Concentration)
cleaned_sm$Date_from <- as.Date(cleaned_sm$Date_from)
cleaned_sm$Date_to <- as.Date(cleaned_sm$Date_to)

Code added based on @jasbner's answer:

cleaned_sm %>%
   group_by(Parameter) %>%
   do(add_row(.,
                 Date_from = ymd(max(Date_to))+1 ,
                 Date_to = ymd(paste(year(max(Date_to)),"1231")),
                 Parameter = .$Parameter[1])) %>%
   do(add_row(.,
                 Date_to = ymd(min(Date_from))-1, 
                 Date_from = ymd(paste(year(min(Date_from)),"0101")) ,
                 Parameter = .$Parameter[1],
                 .before = 0)) %>% 
   filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))

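2 answers:

Answer 0: (score: 1)

I gave it a try using dplyr and lubridate. It is hacked together, but I think it should work. Note that this does not look for any gaps in the middle of the date ranges. Basically, for each group you add one row before and one row after that group. Then, in cases where the date range already starts at the beginning of the year or ends at the end of the year, the added rows are filtered out.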

library(dplyr)
library(lubridate)
cleaned_sm %>%
  group_by(Parameter) %>%
  do(add_row(.,
             Date_from = ymd(max(.$Date_to))+1 ,
             Date_to = ymd(paste(year(max(.$Date_to)),"1231")),
             Parameter = .$Parameter[1])) %>%
  do(add_row(.,
             Date_to = ymd(min(.$Date_from))-1, 
             Date_from = ymd(paste(year(min(.$Date_from)),"0101")) ,
             Parameter = .$Parameter[1],
             .before = 0)) %>% 
  filter(!duplicated(Date_from,fromLast = T),!duplicated(Date_to))  

# A tibble: 15 x 4
# Groups: Parameter [3]
#    Date_from  Date_to    concentration Parameter
#    <date>     <date>             <dbl> <chr>    
#  1 2013-01-01 2013-01-31         NA    A        
#  2 2013-02-01 2013-05-07          1.50 A        
#  3 2013-05-10 2013-08-12          2.50 A        
#  4 2013-08-13 2013-11-18          1.50 A        
#  5 2013-11-19 2013-12-31         NA    A        
#  6 2013-01-01 2013-01-31         NA    B        
#  7 2013-02-01 2013-05-07          3.50 B        
#  8 2013-05-10 2013-08-12          1.50 B        
#  9 2013-08-13 2013-11-18          2.50 B        
# 10 2013-11-19 2013-12-31         NA    B        
# 11 2013-01-01 2013-01-31         NA    C        
# 12 2013-02-01 2013-05-07          1.50 C        
# 13 2013-05-10 2013-08-12          3.50 C        
# 14 2013-08-13 2013-11-18          3.00 C        
# 15 2013-11-19 2013-12-31         NA    C 
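
For comparison, the same padding idea can be sketched in base R without do(). This is only a sketch, not part of the answer: it assumes each parameter's ranges fall within a single calendar year, and like the code above it does not check for gaps in the middle. pad_year is a hypothetical helper name, and the sample data is rebuilt from the question so the snippet runs on its own.

```r
# Sample data rebuilt from the question
y <- data.frame(
  Date_from = as.Date(rep(c("2013-02-01", "2013-05-10", "2013-08-13"), 3)),
  Date_to   = as.Date(rep(c("2013-05-07", "2013-08-12", "2013-11-18"), 3)),
  concentration = c(1.5, 2.5, 1.5, 3.5, 1.5, 2.5, 1.5, 3.5, 3),
  Parameter = rep(c("A", "B", "C"), each = 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pad one parameter's rows to cover the whole calendar year.
# Assumes all rows of d belong to the same year.
pad_year <- function(d) {
  d <- d[order(d$Date_from), ]
  yr <- format(min(d$Date_from), "%Y")
  year_start <- as.Date(paste0(yr, "-01-01"))
  year_end   <- as.Date(paste0(yr, "-12-31"))

  # Template row: keeps Parameter, blanks the concentration
  blank <- d[1, ]
  blank$concentration <- NA_real_

  out <- d
  if (min(d$Date_from) > year_start) {   # leading row only if needed
    head_row <- blank
    head_row$Date_from <- year_start
    head_row$Date_to   <- min(d$Date_from) - 1
    out <- rbind(head_row, out)
  }
  if (max(d$Date_to) < year_end) {       # trailing row only if needed
    tail_row <- blank
    tail_row$Date_from <- max(d$Date_to) + 1
    tail_row$Date_to   <- year_end
    out <- rbind(out, tail_row)
  }
  out
}

# Apply the helper to each parameter and stack the pieces
result <- do.call(rbind, lapply(split(y, y$Parameter), pad_year))
rownames(result) <- NULL
result
```

Because the padding rows are added only when they are needed, no duplicate filter is required afterwards.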

Answer 1: (score: 0)

This seems to need a combination of different packages to attack it. I am using tidyr and data.table, and I also used lubridate.

library(dplyr)       # %>%, mutate, summarise, lag (needed by the code below)
library(tidyr)       # spread, gather
library(data.table)  # setkey, foverlaps

date.start <- seq.Date(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")
Date.Int <- data.frame(Date_from = date.start, Date_to = date.start)
y_wide <- y %>% spread(Parameter, concentration)
y_wide <- as.data.table(setkey(as.data.table(y_wide), Date_from, Date_to))
Date.Int <- as.data.table(setkey(as.data.table(Date.Int), Date_from, Date_to))
dats <- foverlaps(Date.Int, y_wide, nomatch = NA)
fin.dat <- dats %>%
  mutate(A = ifelse(is.na(A), -5, A),
         seqs = cumsum(!is.na(A) & A != lag(A, default = -5))) %>%
  group_by(seqs) %>%
  summarise(Date_from = first(i.Date_from), Date_to = last(i.Date_to),
            A = first(A), B = first(B), C = first(C)) %>%
  mutate(A = ifelse(A == -5, NA, A)) %>%
  ungroup() %>%
  # key column = parameter name, value column = concentration
  gather(Parameter, concentration, A:C) %>%
  mutate(Parameter = ifelse(is.na(concentration), NA, Parameter))

OK, so I created a vector of dates from start to finish (date.start); then I turned it into a data.frame, Date.Int, whose interval start and interval end columns hold the same dates. This is because foverlaps needs to compare two intervals (the identical start and end dates in Date.Int are now formally an interval). Then I took the data you provided and used spread to convert it from long format to wide format, and converted both to data.table. Setting the key on a data.table determines how it is arranged; when using foverlaps you have to key on the start date and the end date (in that order). foverlaps determines whether one interval falls within another date interval. If you print out dats, you will see a bunch of rows with NA, because those days do not fall within any interval. So now we have to group them somehow. I grouped by the "A" value in dats. The grouping variable is called seqs. Then I summarised the data, switched it from wide format back to long format, and replaced the appropriate NA values.
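
To make the single-day interval idea concrete, here is a minimal sketch of foverlaps on two of the ranges for parameter A. The names ranges, day_points, and hits are made up for illustration and do not come from the answer; the four test days are likewise arbitrary.

```r
library(data.table)

# Two of the original ranges for parameter A, keyed on start then end
ranges <- data.table(
  Date_from = as.Date(c("2013-02-01", "2013-05-10")),
  Date_to   = as.Date(c("2013-05-07", "2013-08-12")),
  A = c(1.5, 2.5)
)
setkey(ranges, Date_from, Date_to)

# Each day becomes an interval whose start and end are the same date
days <- as.Date(c("2013-01-15", "2013-03-01", "2013-05-08", "2013-06-01"))
day_points <- data.table(Date_from = days, Date_to = days)
setkey(day_points, Date_from, Date_to)

# Days outside every range are kept with NA because nomatch = NA
hits <- foverlaps(day_points, ranges, nomatch = NA)
hits[, .(day = i.Date_from, A)]
```

In this sketch, 2013-01-15 and 2013-05-08 fall outside both ranges and come back with NA in A, which is how both the missing start of the year and the two-day gap in May show up in dats.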