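Error when splitting one row into multiple rows with mutate

Time: 2018-06-12 00:55:53

Tags: r dplyr mutate

I have a df like df1 in which I want to break up each row so that the Hrs_Time_Worked column is in intervals of 4, as shown in df2.

I have been using the following code, but it throws an error: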

df2 = df1 %>%
 group_by(Row)%>%
 mutate(S=START_DATE_TIME,
        Hrs_Time_Worked=list((n<-c(rep(4,Hrs_Time_Worked%/%4),Hrs_Time_Worked%%4))[n!=0]))%>%
 unnest()%>%
 mutate(E=START_DATE_TIME+hours(cumsum(Hrs_Time_Worked)),
        S=E-hours(unlist(Hrs_Time_Worked)),
        START_DATE_TIME=(S),
        END_DATE_TIME=(E),
        S=NULL,E=NULL)
  

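Error in mutate_impl(.data, dots) : Evaluation error: invalid class "Period" object: periods must have integer values.

The following is required:

All categorical data must stay the same across the sub-rows (e.g., TIME_RPTG_CD stays constant on every sub-row)

If there is a remainder of less than four, the remaining amount should be listed on the last line (e.g., df2; row 3)

If a sub-row starts or ends on the next date, the date columns should be updated accordingly (e.g., df2; rows 2-3)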
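The chunking these requirements describe can be sketched in base R. This is only an illustration, not the asker's or any answerer's code: `split_hours` is a hypothetical helper name, and the sketch adds seconds to the start time instead of calling lubridate's `hours()`. The error message says periods must have integer values, so the assumption here is that fractional hours in the real data trigger it.

```r
# Hypothetical helper: split a number of hours into chunks of at most `size`,
# with any remainder placed last.
split_hours <- function(hrs, size = 4) {
  chunks <- c(rep(size, hrs %/% size), hrs %% size)
  chunks[chunks != 0]  # drop a zero remainder
}

start  <- as.POSIXct("2014-07-03 16:00:00", tz = "UTC")
chunks <- split_hours(10)

# Plain seconds arithmetic on POSIXct, so fractional hours are also fine
# (assumption: avoiding hours() sidesteps the integer-only Period check).
ends   <- start + cumsum(chunks) * 3600
starts <- ends - chunks * 3600

data.frame(START_DATE_TIME = starts,
           END_DATE_TIME   = ends,
           Hrs_Time_Worked = chunks)
```

Applied per row (for example inside a grouped or row-wise step), the categorical columns would simply be repeated for every chunk.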

df1 (current)

   Row EMPLID TIME_RPTG_CD START_DATE_TIME     END_DATE_TIME       Hrs_Time_Worked
       <chr>  <chr>        <dttm>              <dttm>                        <dbl>
     1 X00007 REG          2014-07-03 16:00:00 2014-07-04 02:00:00            10.0

df2 (desired)

Row EMPLID TIME_RPTG_CD START_DATE_TIME     END_DATE_TIME       Hrs_Time_Worked
    <chr>  <chr>        <dttm>              <dttm>                        <dbl>
1   X00007 REG          2014-07-03 16:00:00 2014-07-03 20:00:00            4.0
2   X00007 REG          2014-07-03 20:00:00 2014-07-04 24:00:00            4.0
3   X00007 REG          2014-07-04 24:00:00 2014-07-04 02:00:00            2.0

2 answers:

Answer 0: (score: 1)

One approach could be

library(dplyr)
library(tidyr)
library(lubridate)

df %>%
  rowwise() %>%
  mutate(START_DATE_TIME = paste(seq.POSIXt(START_DATE_TIME, END_DATE_TIME, by = "4 hour"), collapse = ",")) %>%
  separate_rows(START_DATE_TIME, sep = ",") %>%
  group_by(Row) %>%
  mutate(END_DATE_TIME   = ymd_hms(lead(START_DATE_TIME, order_by = Row, default = as.character(END_DATE_TIME))),
         START_DATE_TIME = ymd_hms(START_DATE_TIME),
         Hrs_Time_Worked = as.numeric(difftime(END_DATE_TIME, START_DATE_TIME, units = "hour"))) %>%
  filter(Hrs_Time_Worked > 0)

which gives

    Row EMPLID TIME_RPTG_CD START_DATE_TIME     END_DATE_TIME       Hrs_Time_Worked
1     1 X00007 REG          2014-07-03 16:00:00 2014-07-03 20:00:00            4.00
2     1 X00007 REG          2014-07-03 20:00:00 2014-07-04 00:00:00            4.00
3     1 X00007 REG          2014-07-04 00:00:00 2014-07-04 02:00:00            2.00


Sample data:

df <- structure(list(Row = 1L, EMPLID = "X00007", TIME_RPTG_CD = "REG", 
    START_DATE_TIME = structure(1404403200, tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), END_DATE_TIME = structure(1404439200, tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), Hrs_Time_Worked = 10), .Names = c("Row", "EMPLID", 
"TIME_RPTG_CD", "START_DATE_TIME", "END_DATE_TIME", "Hrs_Time_Worked"
), row.names = c(NA, -1L), class = "data.frame")

#  Row EMPLID TIME_RPTG_CD     START_DATE_TIME       END_DATE_TIME Hrs_Time_Worked
#1   1 X00007          REG 2014-07-03 16:00:00 2014-07-04 02:00:00              10

Answer 1: (score: 0)

Similar to @Prem's answer, but using list columns and unnest

df %>% 
  rowwise %>%
  mutate(START_DATE_TIME = list(seq.POSIXt(START_DATE_TIME, END_DATE_TIME, by = "4 hour")),
         END_DATE_TIME = list(c(tail(START_DATE_TIME,-1),END_DATE_TIME))) %>%
  unnest %>%
  # name the units argument: the third positional argument of difftime() is tz
  mutate(Hrs_Time_Worked = difftime(END_DATE_TIME, START_DATE_TIME, units = "hours"))

# # A tibble: 3 x 6
#     Row EMPLID TIME_RPTG_CD Hrs_Time_Worked START_DATE_TIME     END_DATE_TIME      
#   <int> <chr>  <chr>        <time>          <dttm>              <dttm>             
# 1     1 X00007 REG          4               2014-07-03 16:00:00 2014-07-03 20:00:00
# 2     1 X00007 REG          4               2014-07-03 20:00:00 2014-07-04 00:00:00
# 3     1 X00007 REG          2               2014-07-04 00:00:00 2014-07-04 02:00:00

Using map is more efficient than using rowwise, though I find it less readable. Here is how it can be done with map:

library(purrr)  # provides map2()

df %>% 
  # pair each start with its own end so this also works with more than one row
  mutate(START_DATE_TIME = map2(START_DATE_TIME, END_DATE_TIME, ~seq.POSIXt(.x, .y, by = "4 hour")),
         END_DATE_TIME = map2(END_DATE_TIME,START_DATE_TIME,~c(tail(.y,-1),.x))) %>%
  unnest %>%
  mutate(Hrs_Time_Worked = difftime(END_DATE_TIME, START_DATE_TIME, units = "hours"))

#   Row EMPLID TIME_RPTG_CD Hrs_Time_Worked     START_DATE_TIME       END_DATE_TIME
# 1   1 X00007          REG         4 hours 2014-07-03 16:00:00 2014-07-03 20:00:00
# 2   1 X00007          REG         4 hours 2014-07-03 20:00:00 2014-07-04 00:00:00
# 3   1 X00007          REG         2 hours 2014-07-04 00:00:00 2014-07-04 02:00:00

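In this case the output is not a tibble but a standard data.frame, which explains why the Hrs_Time_Worked column prints differently. Use as_tibble to get the same output as before, or use as.numeric in either solution to make the column a double.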
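As a minimal sketch of that last suggestion, using base R only: the two timestamps below are copied from the example output, and the variable names are illustrative.

```r
start <- as.POSIXct(c("2014-07-03 16:00:00", "2014-07-04 00:00:00"), tz = "UTC")
end   <- as.POSIXct(c("2014-07-03 20:00:00", "2014-07-04 02:00:00"), tz = "UTC")

# difftime() returns an object of class "difftime"; naming `units` fixes the unit
d <- difftime(end, start, units = "hours")

# as.numeric() drops the class and keeps the values as a plain double vector
hrs <- as.numeric(d, units = "hours")
```

Fixing the units explicitly matters because the default, "auto", picks a unit based on the size of the differences, so the same code could return minutes or days on other data.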