Find and replace cases

Time: 2014-12-21 07:28:26

Tags: r replace

I have a dataset with four columns: X1 (ID number), X2 (datetime), X3 (datetime), and X4 (duration).

 structure(list(X1 = c(549395L, 678018L, 706197L, 549395L, 775731L, 
 789858L, 845277L, 936749L, 845277L, 954953L), X2 = c("06/16/2014 10:45:24 AM", 
 "09/16/2014 10:02:46 AM", "02/12/2014 12:00:13 PM", "06/16/2014 10:45:24 AM", 
 "08/29/2014 8:42:34 AM", "02/26/2014 12:29:26 PM", "04/07/2014 1:49:04 PM", 
 "02/14/2014 12:02:29 PM", "05/18/2014 12:09:35 PM", "03/05/2014 9:47:11 AM"
 ), X3 = c("06/04/2014 11:10:03 AM", "09/16/2014 10:23:00 AM", 
 "02/12/2014 12:21:00 PM", "", "08/29/2014 8:51:03 AM", "02/26/2014 12:49:00 PM", 
 "04/07/2014 1:59:56 PM", "02/14/2014 12:08:00 PM", "", "03/05/2014 10:14:00 AM"
 ), X4 = c(8L, 21L, 10L, 72L, 39L, 14L, 41L, 31L, 43L, 24L)), .Names = c("X1", 
 "X2", "X3", "X4"), class = "data.frame", row.names = c(NA, -10L
 ))

   X1     X2                     X3                      X4
   549395 06/16/2014 10:45:24 AM 06/04/2014 11:10:03 AM  8
   678018 09/16/2014 10:02:46 AM 09/16/2014 10:23:00 AM 21
   706197 02/12/2014 12:00:13 PM 02/12/2014 12:21:00 PM 10
   549395 06/16/2014 10:45:24 AM                        72
   775731  08/29/2014 8:42:34 AM  08/29/2014 8:51:03 AM 39
   789858 02/26/2014 12:29:26 PM 02/26/2014 12:49:00 PM 14
   845277  04/07/2014 1:49:04 PM  04/07/2014 1:59:56 PM 41
   936749 02/14/2014 12:02:29 PM 02/14/2014 12:08:00 PM 31
   845277 05/18/2014 12:09:35 PM                        43
   954953  03/05/2014 9:47:11 AM 03/05/2014 10:14:00 AM 24

What I want to do is:

   First)  Find the X1 values (ID numbers) that have a missing value in their X3 (datetime) column; in this example, 549395.
   Second) Identify the other observations with the same ID number; in this example, obs 1 and obs 4.
   Third)  Compare the X2 date values of these matching observations (obs 1 and obs 4).
   Fourth) If the X2 date values match, replace the corresponding X4 with 0.

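In this case, X4 for obs 1 and obs 4 becomes 0, because obs 4 has a missing X3, its ID is 549395, and the X2 values for ID 549395 match (06/16/2014 ....).

X3 is also missing for obs 9, and its ID 845277 has two matching observations (obs 7 and obs 9), but the X2 values for this ID differ (04/07/2014 vs. 05/18/2014), so X4 should not be changed to 0.

The final dataset should look like this.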

   X1     X2                     X3                      X4
   549395 06/16/2014 10:45:24 AM 06/04/2014 11:10:03 AM  0
   678018 09/16/2014 10:02:46 AM 09/16/2014 10:23:00 AM 21
   706197 02/12/2014 12:00:13 PM 02/12/2014 12:21:00 PM 10
   549395 06/16/2014 10:45:24 AM                         0
   775731  08/29/2014 8:42:34 AM  08/29/2014 8:51:03 AM 39
   789858 02/26/2014 12:29:26 PM 02/26/2014 12:49:00 PM 14
   845277  04/07/2014 1:49:04 PM  04/07/2014 1:59:56 PM 41
   936749 02/14/2014 12:02:29 PM 02/14/2014 12:08:00 PM 31
   845277 05/18/2014 12:09:35 PM                        43
   954953  03/05/2014 9:47:11 AM 03/05/2014 10:14:00 AM 24

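Any help is appreciated. Thanks.

1 answer:

Answer 0 (score: 0)

In the dataset originally provided, NA is not a true missing value but the level "NA" of a factor column. You can check this with str: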

test2
str(test2)
#'data.frame':  10 obs. of  3 variables:
#$ X1: Factor w/ 8 levels "549395","678018",..: 1 2 3 1 4 5 6 7 6 8
#$ X2: Factor w/ 9 levels "02/12/2014 12:21:00 PM",..: 6 8 1 9 7 3 5 2 9 4
#$ X3: Factor w/ 10 levels "10","14","21",..: 10 3 1 9 6 2 7 5 8 4

 levels(test2$X2)
#[1] "02/12/2014 12:21:00 PM" "02/14/2014 12:08:00 PM" "02/26/2014 12:49:00 PM"
#[4] "03/05/2014 10:14:00 AM" "04/07/2014 1:59:56 PM"  "06/04/2014 11:10:03 AM"
#[7] "08/29/2014 8:51:03 AM"  "09/16/2014 10:23:00 AM" "NA"                    

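When reading the dataset with read.table/read.csv, you can specify stringsAsFactors=FALSE and use na.strings='NA'. For the current dataset, first convert the X3 column to numeric, then create a logical index by checking which rows of X2 have the 'NA' level: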
test2$X3 <- as.numeric(levels(test2$X3))[test2$X3]
test2$X3[test2$X1 %in% test2$X1[test2$X2 == 'NA']] <- 0

test2$X3
#[1]  0 21 10  0 39 14  0 31  0 24
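
The conversion through levels() above matters because calling as.numeric() directly on a factor returns the internal integer codes, not the numbers shown. A minimal sketch with a small made-up factor (the name f and its values are only for illustration):

```r
# Made-up factor whose labels look like numbers
f <- factor(c("8", "21", "10", "72"))

# Direct conversion returns the level codes (levels sort as text: "10" "21" "72" "8")
codes <- as.numeric(f)

# Indexing the numeric levels by the factor recovers the displayed values
values <- as.numeric(levels(f))[f]

codes   # 4 2 1 3
values  # 8 21 10 72
```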

If the NAs are real NA values, you could try (as commented by @David Arenburg):

test2$X3[test2$X1 %in% test2$X1[is.na(test2$X2)]] <- 0   
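
As a rough sketch of how the read.csv options mentioned earlier lead to real NA values, the example below reads a tiny inline CSV. The three rows, and the names csv_text and demo, are hypothetical stand-ins for the real file:

```r
# Hypothetical three-row CSV standing in for the real file
csv_text <- "X1,X2,X3
549395,06/04/2014 11:10:03 AM,8
678018,09/16/2014 10:23:00 AM,21
549395,NA,72"

# Keep strings as character and turn the text "NA" into a real missing value
demo <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "NA")
str(demo)

# With real NAs, is.na() finds the IDs whose X2 is missing
demo$X3[demo$X1 %in% demo$X1[is.na(demo$X2)]] <- 0
demo$X3  # 0 21 0
```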

Update

This may help with the updated question. Note that in the new dataset you have '' (an empty string) instead of NA.

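 # TRUE for rows whose ID has an empty X3 somewhere and whose X2 values
 # are all repeated within that ID
 indx <- with(test2, as.logical(ave(X2, X1, FUN = function(x)
             all(duplicated(x) | duplicated(x, fromLast = TRUE))))) &
         test2$X1 %in% test2$X1[test2$X3 == '']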

  test2$X4[indx] <- 0
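
The duplicated(x) | duplicated(x, fromLast = TRUE) idiom marks every element that occurs more than once, including its first occurrence, so wrapping it in all() asks whether every value in a group is repeated. A small sketch, where the helper name all_repeated is made up for illustration:

```r
# Hypothetical helper: TRUE only if every element of x occurs at least twice
all_repeated <- function(x) all(duplicated(x) | duplicated(x, fromLast = TRUE))

same   <- all_repeated(c("06/16/2014", "06/16/2014"))  # TRUE: both values repeat
differ <- all_repeated(c("04/07/2014", "05/18/2014"))  # FALSE: the values differ
single <- all_repeated("09/16/2014")                   # FALSE: nothing to match
```

This is why an ID that appears only once is never set to 0, even if its X3 is empty.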

Or you could use dplyr:

 library(dplyr)
 test2 %>% 
       group_by(X1) %>%
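       # Zero X4 when the ID has an empty X3 and every X2 in the group is repeated;
       # any() & all() keeps the result independent of row order within the ID
       mutate(X4 = replace(X4, any(X3 == '') &
          all(duplicated(X2) | duplicated(X2, fromLast = TRUE)), 0))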
 #       X1        X2              X3 X4
 #1  549395 6/16/2014  6/4/2014 11:10  0
 #2  678018 9/16/2014 9/16/2014 10:23 21
 #3  706197 2/12/2014 2/12/2014 12:21 10
 #4  549395 6/16/2014                  0
 #5  775731 8/29/2014  8/29/2014 8:51 39
 #6  789858 2/26/2014 2/26/2014 12:49 14
 #7  845277  4/7/2014  4/7/2014 13:59 41
 #8  936749 2/14/2014 2/14/2014 12:08 31
 #9  845277 5/18/2014                 43
 #10 954953  3/5/2014  3/5/2014 10:14 24

Or use data.table:

 library(data.table)
 # Same rule per ID; 0L keeps X4 an integer column
 setDT(test2)[, X4 := replace(X4, any(X3 == '') &
      all(duplicated(X2) | duplicated(X2, fromLast = TRUE)), 0L), by = X1]
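
For checking any of the approaches against the question's data, the four steps can also be written out literally in base R. This is only a sketch: it rebuilds the data from the question's dput output, and the names ids_missing, same_x2, and hit are made up for illustration.

```r
# Data rebuilt from the dput output in the question
test2 <- data.frame(
  X1 = c(549395L, 678018L, 706197L, 549395L, 775731L,
         789858L, 845277L, 936749L, 845277L, 954953L),
  X2 = c("06/16/2014 10:45:24 AM", "09/16/2014 10:02:46 AM",
         "02/12/2014 12:00:13 PM", "06/16/2014 10:45:24 AM",
         "08/29/2014 8:42:34 AM", "02/26/2014 12:29:26 PM",
         "04/07/2014 1:49:04 PM", "02/14/2014 12:02:29 PM",
         "05/18/2014 12:09:35 PM", "03/05/2014 9:47:11 AM"),
  X3 = c("06/04/2014 11:10:03 AM", "09/16/2014 10:23:00 AM",
         "02/12/2014 12:21:00 PM", "", "08/29/2014 8:51:03 AM",
         "02/26/2014 12:49:00 PM", "04/07/2014 1:59:56 PM",
         "02/14/2014 12:08:00 PM", "", "03/05/2014 10:14:00 AM"),
  X4 = c(8L, 21L, 10L, 72L, 39L, 14L, 41L, 31L, 43L, 24L),
  stringsAsFactors = FALSE
)

# Step 1: IDs with at least one empty X3
ids_missing <- unique(test2$X1[test2$X3 == ""])

# Steps 2 and 3: per ID, more than one row and a single distinct X2 value
same_x2 <- tapply(test2$X2, test2$X1,
                  function(x) length(x) > 1 && length(unique(x)) == 1)

# Step 4: set X4 to 0 for rows whose ID passes both checks
hit <- test2$X1 %in% ids_missing &
  as.character(test2$X1) %in% names(same_x2)[same_x2]
test2$X4[hit] <- 0L

test2$X4  # 0 21 10 0 39 14 41 31 43 24
```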