Replace all duplicates with NA

Time: 2019-01-16 17:27:18

Tags: r dplyr duplicates time-series na

My question is similar to "replace duplicate values with NA in time series data using dplyr", but it also has to work for other time series, such as the following:

box_num      date       x         y
6-WQ      2018-11-18   20.2       8
6-WQ      2018-11-25   500.75     7.2
6-WQ      2018-12-2    500.75     23
25-LR     2018-11-18   374.95     4.3
25-LR     2018-11-25   0.134      9.3
25-LR     2018-12-2    0.134      4
73-IU     2018-12-2     225.54    0.7562
73-IU     2018-12-9     28        0.7562
73-IU     2018-12-16    225.54    52.8

library(dplyr)
df %>%
  group_by(box_num) %>%
  mutate_at(vars(x:y), funs(replace(., duplicated(.), NA)))

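The code above identifies duplicates and replaces them with NA, but the underlying problem is that in the next step I want to replace all NAs with a linear trend, since this is a time series. For box_num 6-WQ there is a large jump right after 20.2, which suggests the repeated value is an imputed one, so I want both imputed values replaced with NA. In the other case, box_num 73-IU, the imputed value was entered again a week later, and I want those imputed values replaced with NA as well.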
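To make the issue concrete, here is a minimal sketch using the x values of box 6-WQ. The variable names (`first_only`, `both`, `both_masked`) are illustrative and not part of the original post. It shows that `duplicated()` flags only the second and later occurrences, and that combining it with `fromLast = TRUE` is one possible way to flag every occurrence of a repeated value.

```r
# x values of box_num 6-WQ from the table above
x <- c(20.2, 500.75, 500.75)

# duplicated() marks only the repeats, so the first 500.75 survives
first_only <- replace(x, duplicated(x), NA)
print(first_only)   # 20.2 500.75 NA

# Scanning from both ends marks every occurrence of a repeated value
both <- duplicated(x) | duplicated(x, fromLast = TRUE)
both_masked <- replace(x, both, NA)
print(both_masked)  # 20.2 NA NA
```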

Expected output:
box_num      date       x         y
6-WQ      2018-11-18   20.2       8
6-WQ      2018-11-25   NA         7.2
6-WQ      2018-12-2    NA         23
25-LR     2018-11-18   374.95     4.3
25-LR     2018-11-25   NA         9.3
25-LR     2018-12-2    NA         4
73-IU     2018-12-2    NA         NA
73-IU     2018-12-9    28         NA
73-IU     2018-12-16   NA         52.8

2 answers:

Answer 0: (score: 0)

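# Replace every value that occurs more than once in x (over the whole column) with NA
foo = function(x){
    # ave(x, x, FUN = length) returns, for each element, how often its value occurs
    replace(x, ave(x, x, FUN = length) > 1, NA)
}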

myCols = c("x", "y")
df1[myCols] = lapply(df1[myCols], foo)
df1
#  box_num       date      x    y
#1    6-WQ 2018-11-18  20.20  8.0
#2    6-WQ 2018-11-25     NA  7.2
#3    6-WQ  2018-12-2     NA 23.0
#4   25-LR 2018-11-18 374.95  4.3
#5   25-LR 2018-11-25     NA  9.3
#6   25-LR  2018-12-2     NA  4.0
#7   73-IU  2018-12-2     NA   NA
#8   73-IU  2018-12-9  28.00   NA
#9   73-IU 2018-12-16     NA 52.8

#DATA (run this first to create df1)
df1 = structure(list(box_num = c("6-WQ", "6-WQ", "6-WQ", "25-LR", "25-LR", 
"25-LR", "73-IU", "73-IU", "73-IU"), date = c("2018-11-18", "2018-11-25", 
"2018-12-2", "2018-11-18", "2018-11-25", "2018-12-2", "2018-12-2", 
"2018-12-9", "2018-12-16"), x = c(20.2, 500.75, 500.75, 374.95, 
0.134, 0.134, 225.54, 28, 225.54), y = c(8, 7.2, 23, 4.3, 9.3, 
4, 0.7562, 0.7562, 52.8)), class = "data.frame", row.names = c(NA, 
-9L))

Answer 1: (score: 0)

With tidyverse, you can do the following:

df %>%
 group_by(box_num) %>%
 mutate_at(vars(x:y), funs(ifelse(. %in% subset(rle(sort(.))$values, rle(sort(.))$lengths > 1), NA, .)))

  box_num date           x     y
  <fct>   <fct>      <dbl> <dbl>
1 6-WQ    2018-11-18  20.2  8.00
2 6-WQ    2018-11-25  NA    7.20
3 6-WQ    2018-12-2   NA   23.0 
4 25-LR   2018-11-18 375.   4.30
5 25-LR   2018-11-25  NA    9.30
6 25-LR   2018-12-2   NA    4.00
7 73-IU   2018-12-2   NA   NA   
8 73-IU   2018-12-9   28.0 NA   
9 73-IU   2018-12-16  NA   52.8 

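First, it sorts the values in "x" and "y" and computes the run lengths of equal values. Second, it creates a subset of the values whose run length is greater than 1. Finally, it checks whether each value in "x" and "y" is in that subset and, if so, returns NA.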
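The three steps can be sketched on a single group in base R. This is only an illustration, using the x values of box 73-IU; the names `runs`, `dup_values` and `result` are made up for the example and do not appear in the answer's code.

```r
# x values of box_num 73-IU from the question
x <- c(225.54, 28, 225.54)

# 1. Sort so that equal values become adjacent, then compute run lengths
runs <- rle(sort(x))

# 2. Keep only the values whose run length is greater than 1
dup_values <- runs$values[runs$lengths > 1]

# 3. Replace every element found among the repeated values with NA
result <- ifelse(x %in% dup_values, NA, x)
print(result)   # NA 28 NA
```

Sorting matters because `rle()` only counts consecutive equal values; without it, the two 225.54 entries would be treated as separate runs of length 1.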