Messy date formats in a data frame

Time: 2018-08-12 07:03:00

Tags: r dataframe data-cleaning

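I've set myself a task that I can't solve: I have a data frame with the start and end dates of some projects. Some of the entries are wrong: they give the project's duration instead of its end date.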

start_date <- c("2017-05-04", "2016-04-01", "2013-12-12", "2011-05-11", "2010-04-10", "2009-01-01")
end_date <- c("2020-01-01", "2020-01-06", "3 years", "36 months", "2020-01-01", "2020-01-01")
df <- data.frame(start_date, end_date)

  start_date   end_date
1 2017-05-04 2020-01-01
2 2016-04-01 2020-01-06
3 2013-12-12    3 years
4 2011-05-11  36 months
5 2010-04-10 2020-01-01
6 2009-01-01 2020-01-01


How can I compute the end dates for those rows and convert the column to Date format? Also, both start_date and end_date are stored as factors.

1 answer:

Answer 0: (score: 6)

You can use as.Date on the end dates, and then lubridate::as.duration on the values that fail to parse (i.e. the NAs):

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
start_date <- c("2017-05-04", "2016-04-01", "2013-12-12", "2011-05-11", "2010-04-10", "2009-01-01")
end_date <- c("2020-01-01", "2020-01-06", "3 years", "36 months", "2020-01-01", "2020-01-01")
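# Keep end_date as character so the duration strings can be parsed later
df <- data.frame(start_date = as.Date(start_date), end_date, stringsAsFactors = FALSE)
# Entries that are not valid dates become NA
df$EndDate <- as.Date(df$end_date)

# For each NA, parse the text as a duration and add it to the start date
for (i in which(is.na(df$EndDate))) {
  df$EndDate[i] <- as.Date(df$start_date[i] + as.duration(df$end_date[i]))
}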
df
#>   start_date   end_date    EndDate
#> 1 2017-05-04 2020-01-01 2020-01-01
#> 2 2016-04-01 2020-01-06 2020-01-06
#> 3 2013-12-12    3 years 2016-12-11
#> 4 2011-05-11  36 months 2014-05-10
#> 5 2010-04-10 2020-01-01 2020-01-01
#> 6 2009-01-01 2020-01-01 2020-01-01
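
Note that rows 3 and 4 above land one day short of the anniversary (2016-12-11, 2014-05-10). A lubridate duration is a fixed number of seconds, and a year is counted as 365.25 days, so adding "3 years" does not always hit the same calendar day. If calendar arithmetic is what you want, a period may be a better fit. The following is a minimal sketch and not part of the original answer. It assumes lubridate::period() accepts the same strings that as.duration() parsed above, and the names df2 and is_dur are only illustrative. It also starts from factor columns, as described in the question, and fills in the missing values without a loop.

```r
library(lubridate)

start_date <- c("2017-05-04", "2016-04-01", "2013-12-12", "2011-05-11", "2010-04-10", "2009-01-01")
end_date <- c("2020-01-01", "2020-01-06", "3 years", "36 months", "2020-01-01", "2020-01-01")

# Build the frame with factor columns, as in the question
df2 <- data.frame(start_date, end_date, stringsAsFactors = TRUE)

# Convert the factors to character before parsing anything
df2$start_date <- as.Date(as.character(df2$start_date))
df2$end_date <- as.character(df2$end_date)

# An explicit format returns NA for non-dates, whatever the first element is
df2$EndDate <- as.Date(df2$end_date, format = "%Y-%m-%d")

# Rows where end_date held a span of time rather than a date
is_dur <- is.na(df2$EndDate)

# Assumption: period() parses strings such as "3 years" and "36 months"
df2$EndDate[is_dur] <- df2$start_date[is_dur] + period(df2$end_date[is_dur])
df2
```

With periods, rows 3 and 4 should come out as 2016-12-12 and 2014-05-11, i.e. the same day of the month as the start date. Passing format to as.Date is a defensive choice: without it, as.Date guesses the format from the first non-NA element and stops with an error if that element happens to be a string like "3 years".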