Conditional calculations and row flags within groups in R

Time: 2018-07-20 14:24:32

Tags: r

I am currently working in R. I have three columns in which I need to identify duplicates.

This is the data frame I am working with:

df1 <- data.frame(ID_NUMBER = c(990,50000,52000,764000,764000,764000,1420000,1420000,1470000,1470000,2176000,2176000,2401000,2401000,2667000,2667000,3519000,3721000,3721000,4654000,4654000,4685000),
                  CalNumber = c(0,1126.61,1152.24,26900.12,26900.2,26910,50673.98,50674.31,52161.18,52161.73,77743.17,77743.7,85593.97,85594.42,94854.76,94855,124033.46,130973.56,130973.59,162935.73,162936.85,163446.89),
                  Date = c('8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013'))

   ID_NUMBER   CalNumber   Date
         990   0           8/8/2013 0:00
       50000   1126.61     8/16/2008 0:00
       52000   1152.24     8/8/2013 0:00
      764000   26900.12    8/8/2013 0:00
      764000   26900.2     8/16/2008 0:00
      764000   26910       8/16/2008 0:00
     1420000   50673.98    8/16/2008 0:00
     1420000   50674.31    8/8/2013 0:00
     1470000   52161.18    8/16/2008 0:00
     1470000   52161.73    8/8/2013 0:00
     2176000   77743.17    8/16/2008 0:00
     2176000   77743.7     8/8/2013 0:00
     2401000   85593.97    8/16/2008 0:00
     2401000   85594.42    8/8/2013 0:00
     2667000   94854.76    8/16/2008 0:00
     2667000   94855       8/8/2013 0:00
     3519000   124033.46   8/8/2013 0:00
     3721000   130973.56   8/8/2013 0:00
     3721000   130973.59   8/16/2008 0:00
     4654000   162935.73   8/16/2008 0:00
     4654000   162936.85   8/8/2013 0:00
     4685000   163446.89   8/8/2013 0:00

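Duplicates are identified as follows: if an ID_NUMBER is not unique, take the difference in CalNumber between consecutive records within that ID_NUMBER group. If the delta to the next record is less than or equal to 1, the records are considered duplicates. The record that takes precedence is the one with the max date in the group. That record will be the primary, and the other one will be marked as secondary.

My final result set will have two new flags: isNew and isPrimary. If no duplicates exist, the record is considered a new, first-time record, so isNew will be "Y" and isPrimary will be "Y" for non-duplicate records. I hope the result example below explains my point better. I am new to R, so I don't know where to begin. Any suggestions or pointers would be appreciated.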

   ID_NUMBER   CalNumber   Date        CalcDiff   IsNew   isPrimary
         990   0           8/8/2013    --         Y       Y
       50000   1126.61     8/16/2008   --         Y       Y
       52000   1152.24     8/8/2013    --         Y       Y
      764000   26900.12    8/8/2013    --         N       Y
      764000   26900.2     8/16/2008   .08        N       N
      764000   26910       8/16/2008   9.8        Y       Y
     1420000   50673.98    8/16/2008   --         N       N
     1420000   50674.31    8/8/2013    .33        N       Y
     1470000   52161.18    8/16/2008   --         N       N
     1470000   52161.73    8/8/2013    .55        N       Y
     2176000   77743.17    8/16/2008   --         N       Y
     2176000   77743.7     8/8/2013    .53        N       N
     2401000   85593.97    8/16/2008   --         N       N
     2401000   85594.42    8/8/2013    .45        N       Y
     2667000   94854.76    8/16/2008   --         N       N
     2667000   94855       8/8/2013    .24        N       Y
     3519000   124033.46   8/8/2013    --         Y       Y
     3721000   130973.56   8/8/2013    --         N       Y
     3721000   130973.59   8/16/2008   .03        N       N
     4654000   162935.73   8/16/2008   --         Y       Y
     4654000   162936.85   8/8/2013    1.12       Y       Y
     4685000   163446.89   8/8/2013    --         Y       Y

1 answer:

Answer 0 (score: 2)

This solution requires dplyr and magrittr (for the compound assignment pipe, %<>%). First, I define the data frame:

df <- data.frame(ID_NUMBER = c(990,50000,52000,764000,764000,764000,1420000,1420000,1470000,1470000,2176000,2176000,2401000,2401000,2667000,2667000,3519000,3721000,3721000,4654000,4654000,4685000), 
             CalNumber = c(0,1126.61,1152.24,26900.12,26900.2,26910,50673.98,50674.31,52161.18,52161.73,77743.17,77743.7,85593.97,85594.42,94854.76,94855,124033.46,130973.56,130973.59,162935.73,162936.85,163446.89),
             Date = c('8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013'))
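As a brief aside for readers who have not met the compound assignment pipe: `x %<>% f()` pipes `x` into `f()` and assigns the result back to `x`, so it behaves like `x <- x %>% f()`. A minimal sketch, assuming magrittr is installed; the vectors `x` and `y` are illustrative and not part of the original data:

```r
library(magrittr)

x <- c(4, 9, 16)
x %<>% sqrt()        # pipe x into sqrt() and overwrite x with the result

y <- c(4, 9, 16)
y <- y %>% sqrt()    # the longer, explicit form of the same update

identical(x, y)      # both forms should leave the same values behind
```

If overwriting the input in place feels too implicit, the explicit `df <- df %>% ...` form works just as well in the pipeline below.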

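Here I convert your Date column to an actual date. Then I group by ID_NUMBER and calculate the difference between adjacent rows. Next, I use case_when to apply your conditions and determine IsNew. Finally, I group again, this time by ID_NUMBER and IsNew, and check for the most recent date.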

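library(dplyr)     # mutate, group_by, case_when, lead
library(magrittr)  # %<>% compound assignment pipe

df %<>% 
  # Parse the month/day/year strings into Date objects
  mutate(Date = as.Date(Date, "%m/%d/%Y")) %>% 
  group_by(ID_NUMBER) %>% 
  # Difference from the previous row within each ID (NA for the first row)
  mutate(CalcDiff = c(NA, diff(CalNumber))) %>% 
  mutate(IsNew = case_when(
    # The first row of a group has no difference of its own,
    # so it is judged by the difference on the row after it
    n() > 1 & is.na(CalcDiff) & lead(CalcDiff)[1] <= 1 ~ "N",
    n() > 1 & is.na(CalcDiff) & lead(CalcDiff)[1] > 1 ~ "Y",
    # Later rows: a difference of at most 1 marks a duplicate
    n() > 1 & CalcDiff <= 1 ~ "N",
    n() > 1 & CalcDiff > 1 ~ "Y",
    # IDs that occur only once are always new
    TRUE ~ "Y"
  )) %>% 
  group_by(ID_NUMBER, IsNew) %>% 
  # Among the duplicates of an ID, only the most recent date is primary
  mutate(IsPrimary = case_when(
    Date == max(Date) & IsNew == "N" ~ "Y",
    Date != max(Date) & IsNew == "N" ~ "N",
    TRUE ~ "Y"
  ))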

Result:

# A tibble: 22 x 6
# Groups:   ID_NUMBER, IsNew [14]
# ID_NUMBER CalNumber Date       CalcDiff IsNew IsPrimary
# <dbl>     <dbl> <date>        <dbl> <chr> <chr>    
# 1       990         0  2013-08-08  NA      Y     Y        
# 2      50000     1127. 2008-08-16  NA      Y     Y        
# 3      52000     1152. 2013-08-08  NA      Y     Y        
# 4     764000    26900. 2013-08-08  NA      N     Y        
# 5     764000    26900. 2008-08-16   0.08   N     N        
# 6     764000    26910  2008-08-16   9.80   Y     Y        
# 7    1420000    50674. 2008-08-16  NA      N     N        
# 8    1420000    50674. 2013-08-08   0.330  N     Y        
# 9    1470000    52161. 2008-08-16  NA      N     N        
# 10   1470000    52162. 2013-08-08   0.55   N     Y        
# 11   2176000    77743. 2008-08-16  NA      N     N        
# 12   2176000    77744. 2013-08-08   0.530  N     Y        
# 13   2401000    85594. 2008-08-16  NA      N     N        
# 14   2401000    85594. 2013-08-08   0.450  N     Y        
# 15   2667000    94855. 2008-08-16  NA      N     N        
# 16   2667000    94855  2013-08-08   0.24   N     Y        
# 17   3519000   124033. 2013-08-08  NA      Y     Y        
# 18   3721000   130974. 2013-08-08  NA      N     Y        
# 19   3721000   130974. 2008-08-16   0.0300 N     N        
# 20   4654000   162936. 2008-08-16  NA      Y     Y        
# 21   4654000   162937. 2013-08-08   1.12   Y     Y        
# 22   4685000   163447. 2013-08-08  NA      Y     Y
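
To sanity-check the flags without dplyr, the same rules can be re-expressed in base R on a small subset. This is only a sketch under my reading of the rules above, not a replacement for the pipeline: the data frame `d`, the helper `is_dup`, and the grouping key `key` are hypothetical names introduced here, and only two ID groups from the example are used.

```r
# Two ID groups taken from the example data
d <- data.frame(
  ID_NUMBER = c(764000, 764000, 764000, 4654000, 4654000),
  CalNumber = c(26900.12, 26900.2, 26910, 162935.73, 162936.85),
  Date = as.Date(c("8/8/2013", "8/16/2008", "8/16/2008", "8/16/2008", "8/8/2013"),
                 format = "%m/%d/%Y")
)

# Difference from the previous row within each ID (NA on the first row)
d$CalcDiff <- ave(d$CalNumber, d$ID_NUMBER, FUN = function(x) c(NA, diff(x)))

# Hypothetical helper: TRUE where a row counts as a duplicate.
# The first row of a group is judged by the difference on the second row.
is_dup <- function(x) {
  dx <- c(NA, diff(x))
  first_is_dup <- length(x) > 1 && dx[2] <= 1
  ifelse(is.na(dx), first_is_dup, dx <= 1)
}
# ave() stores the logical result as 0/1 because its input is numeric
d$IsNew <- ifelse(ave(d$CalNumber, d$ID_NUMBER, FUN = is_dup) == 1, "N", "Y")

# Within each (ID_NUMBER, IsNew) group, only the latest date stays primary
key <- paste(d$ID_NUMBER, d$IsNew)
max_date <- ave(as.numeric(d$Date), key, FUN = max)
d$IsPrimary <- ifelse(d$IsNew == "N" & as.numeric(d$Date) != max_date, "N", "Y")

print(d)
```

On these five rows the sketch should give IsNew = N, N, Y, Y, Y and IsPrimary = Y, N, Y, Y, Y, which matches rows 4 to 6 and 20 to 21 of the tibble above.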