R: compare dates with the next row

Asked: 2018-07-07 21:17:23

Tags: r dataframe lag lead

I have this data frame in R:

  raw_payment_id from_bank_account        amount posted_at 
           <int> <chr>                     <dbl> <date>    
1         620691 SK660900000000062087       20.0 2018-02-25
2         618433 SK660900000000062087       10.0 2018-02-27
3         623157 SK660900000000062087       10.0 2018-03-02
4         628236 SK300900000000506871      812.  2018-03-06
5         627899 SK300900000000506871      812.  2018-03-07
6         628966 SK660900000000062087       10.0 2018-03-09

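My goal is to determine whether a payment from the same account with the same amount was posted within 3 days. If so, both payments should be flagged with 1. The result would look like this: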

  raw_payment_id from_bank_account        amount posted_at     test 
           <int> <chr>                     <dbl> <date>        <int> 
1         620691 SK660900000000062087       20.0 2018-02-25    0
2         618433 SK660900000000062087       10.0 2018-02-27    1
3         623157 SK660900000000062087       10.0 2018-03-02    1
4         628236 SK300900000000506871      812.  2018-03-06    1
5         627899 SK300900000000506871      812.  2018-03-07    1
6         628966 SK660900000000062087       10.0 2018-03-09    0

I can't find a way to do this. My attempts with lag/lead failed because a bank account may have only one payment.

2 answers:

Answer 0 (score: 1)

library(dplyr)


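df %>% 
  # compare only payments from the same account with the same amount
  group_by(from_bank_account, amount) %>% 
  # flag a payment if the previous or next one in its group is at most 3 days away;
  # lag()/lead() give NA at the group edges, which case_when() treats as no match.
  # This assumes the rows are already sorted by posted_at.
  mutate(var = case_when(abs(as.Date(posted_at) - as.Date(lag(posted_at))) < 4 ~ 1, 
                         abs(as.Date(posted_at) - as.Date(lead(posted_at))) < 4 ~ 1,
                         TRUE ~ 0)) %>% 
  ungroup()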

  raw_payment_id from_bank_account    amount posted_at    var
           <int> <fct>                 <dbl> <fct>      <dbl>
1         620691 SK660900000000062087    20. 2018-02-25    0.
2         618433 SK660900000000062087    10. 2018-02-27    1.
3         623157 SK660900000000062087    10. 2018-03-02    1.
4         628236 SK300900000000506871   812. 2018-03-06    1.
5         627899 SK300900000000506871   812. 2018-03-07    1.
6         628966 SK660900000000062087    10. 2018-03-09    0.
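
The lag/lead approach above only compares neighbouring rows, so it depends on each group being ordered by date. As a hedged sketch (not the answerer's code), the version below rebuilds the question's sample data in a deliberately shuffled order, sorts within each group first, and then applies the same idea. The names `flagged`, `gap_prev` and `gap_next` are illustrative, and `posted_at` is assumed to be a `Date` column.

```r
library(dplyr)

# Sample data from the question, shuffled on purpose to show why sorting matters
df <- data.frame(
  raw_payment_id = c(628966L, 620691L, 623157L, 628236L, 618433L, 627899L),
  from_bank_account = c("SK660900000000062087", "SK660900000000062087",
                        "SK660900000000062087", "SK300900000000506871",
                        "SK660900000000062087", "SK300900000000506871"),
  amount = c(10, 20, 10, 812, 10, 812),
  posted_at = as.Date(c("2018-03-09", "2018-02-25", "2018-03-02",
                        "2018-03-06", "2018-02-27", "2018-03-07")),
  stringsAsFactors = FALSE
)

flagged <- df %>%
  group_by(from_bank_account, amount) %>%
  # order each (account, amount) group by date so lag()/lead() see true neighbours
  arrange(posted_at, .by_group = TRUE) %>%
  mutate(
    gap_prev = as.numeric(posted_at - lag(posted_at)),   # days since previous payment
    gap_next = as.numeric(lead(posted_at) - posted_at),  # days until next payment
    # NA gaps (first/last row of a group) count as "no close neighbour"
    var = as.integer(coalesce(gap_prev <= 3, FALSE) | coalesce(gap_next <= 3, FALSE))
  ) %>%
  ungroup() %>%
  select(-gap_prev, -gap_next)

print(flagged)
```

Because the rows are sorted inside each group, checking only the previous and next payment is enough: if any payment in the group lies within 3 days, the nearest neighbour does too.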

Answer 1 (score: 0)

library(dplyr)

# Within each account, count how many transactions share the same amount
tmp <- mydat %>% 
  group_by(from_bank_account, amount) %>% 
  mutate(number_of_dupes = n()) %>% 
  filter(number_of_dupes > 1) %>% # only keep duplicates
  ungroup()

# remove duplicates that are more than 3 days apart
tmp$dup <- 0

for (i in seq_len(nrow(tmp))) {
  # rows (including row i itself) with the same account and the same amount,
  # posted within 3 days of row i
  near <- abs(difftime(tmp$posted_at[i], tmp$posted_at, units = "days")) < 4 &
    tmp$from_bank_account == tmp$from_bank_account[i] &
    tmp$amount == tmp$amount[i]

  # more than one match means another payment besides row i itself
  if (sum(near) > 1) {
    tmp$dup[i] <- 1
  }
}
tmp <- tmp[tmp$dup == 1, ]

mydat$flag_duplicate <- ifelse(mydat$raw_payment_id %in% tmp$raw_payment_id, 1, 0)

  raw_payment_id    from_bank_account amount  posted_at flag_duplicate
1         620691 SK660900000000062087     20 2018-02-25              0
2         618433 SK660900000000062087     10 2018-02-27              1
3         623157 SK660900000000062087     10 2018-03-02              1
4         628236 SK300900000000506871    812 2018-03-06              1
5         627899 SK300900000000506871    812 2018-03-07              1
6         628966 SK660900000000062087     10 2018-03-09              0
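
The loop above checks every row against every other row. The same pairwise comparison can be written without a loop in base R using `outer()`. This is only a sketch under assumptions: `posted_at` is taken to be a `Date` column, the helper names `days`, `same_group` and `close_in_time` are illustrative, and the n-by-n matrices make it suitable for small data only.

```r
# Sample data from the question
mydat <- data.frame(
  raw_payment_id = c(620691L, 618433L, 623157L, 628236L, 627899L, 628966L),
  from_bank_account = c("SK660900000000062087", "SK660900000000062087",
                        "SK660900000000062087", "SK300900000000506871",
                        "SK300900000000506871", "SK660900000000062087"),
  amount = c(20, 10, 10, 812, 812, 10),
  posted_at = as.Date(c("2018-02-25", "2018-02-27", "2018-03-02",
                        "2018-03-06", "2018-03-07", "2018-03-09")),
  stringsAsFactors = FALSE
)

days <- as.numeric(mydat$posted_at)  # dates as day counts

# TRUE where rows i and j share both account and amount
same_group <- outer(mydat$from_bank_account, mydat$from_bank_account, "==") &
  outer(mydat$amount, mydat$amount, "==")
diag(same_group) <- FALSE            # a payment is not its own duplicate

# TRUE where rows i and j were posted at most 3 days apart
close_in_time <- abs(outer(days, days, "-")) <= 3

# flag a row if at least one other row matches on both criteria
mydat$flag_duplicate <- as.integer(rowSums(same_group & close_in_time) > 0)

print(mydat)
```

Unlike the lag/lead approach, this does not depend on row order, at the cost of memory that grows quadratically with the number of rows.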