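Approximate matching

Asked: 2017-05-05 10:02:26

Tags: r

I am quite new to R and have been wondering whether there is a function or package for approximate (dateTime) matching. The function intersect() returns the list of exact matches, but I am interested in approximate matches.

E.g. I have two arrays of dateTime values, and I want the list of values that occur in both arrays with a maximum difference of 2 seconds.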

arrayA<-c("2000-12-31 10:00:00","2000-12-31 12:00:00")
arrayB<-c("2000-12-31 10:00:00","2000-12-31 12:00:01")
arrayA<-strptime(arrayA, "%Y-%m-%d %H:%M:%S", tz="UTC")
arrayB<-strptime(arrayB, "%Y-%m-%d %H:%M:%S", tz="UTC")

intersect(arrayA,arrayB) #returns "2000-12-31 10:00:00 UTC" 

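intersect() only returns values that are exactly equal, but I would like to get both "2000-12-31 10:00:00 UTC" and "2000-12-31 12:00:00 UTC".

So basically my question is whether you can specify a tolerance within which intersect counts two values as a match. My question concerns dates, but numeric values could run into the same problem. My dataset is quite large, so matching manually with 2 nested for loops tends to take very long, whereas intersect is very fast.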

2 answers:

Answer 0 (score: 2)

The data.table package offers two approaches: the foverlaps() function and non-equi joins. Both approaches require adding helper columns to the data.

Create data

arrayA <- anytime::utctime(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                             "2000-12-31 12:00:05", "2000-12-31 12:00:10"), tz = "UTC")
arrayB <- anytime::utctime(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                             "2000-12-31 12:00:02", "2000-12-31 11:00:00"), tz = "UTC")

Note that both vectors are of class POSIXct, which is better suited here than the POSIXlt class created by strptime(). In addition, more timestamps have been added to test non-matches.

Prepare data

Data preparation is the same for both approaches:

# make data.tables
library(data.table)   # version 1.10.4 used here
A <- data.table(arrayA)
B <- data.table(arrayB)
# define tolerance = 2 * tol_half
tol_half <- 1L   # seconds
# add helper columns
A[, "copyA" := arrayA]
A
#                 arrayA               copyA
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:00
#3: 2000-12-31 12:00:05 2000-12-31 12:00:05
#4: 2000-12-31 12:00:10 2000-12-31 12:00:10
B[, `:=`(start = arrayB - tol_half, end = arrayB + tol_half)]
B
#                 arrayB               start                 end
#1: 2000-12-31 10:00:00 2000-12-31 09:59:59 2000-12-31 10:00:01
#2: 2000-12-31 12:00:01 2000-12-31 12:00:00 2000-12-31 12:00:02
#3: 2000-12-31 12:00:02 2000-12-31 12:00:01 2000-12-31 12:00:03
#4: 2000-12-31 11:00:00 2000-12-31 10:59:59 2000-12-31 11:00:01

start and end in B denote the tolerable time range that a value of arrayA must fall within to be considered a match. This is similar to what the match_fun function does dynamically in the fuzzyjoin solution.
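The window test itself can be illustrated without any join machinery. The following is a minimal base R sketch, not part of the original answer: it compares every value of one vector with every value of the other, so it needs memory proportional to the product of the two lengths and is only meant for small inputs. The names a, b, diff_sec, hits and matched are made up for this illustration.

```r
# Brute-force illustration of the tolerance window (base R only).
a <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                  "2000-12-31 12:00:05", "2000-12-31 12:00:10"), tz = "UTC")
b <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                  "2000-12-31 12:00:02", "2000-12-31 11:00:00"), tz = "UTC")
tol_half <- 1  # seconds, same meaning as in the answer

# Absolute difference in seconds between every pair (rows: a, columns: b).
diff_sec <- abs(outer(as.numeric(a), as.numeric(b), "-"))

# Index pairs whose difference lies inside the window.
hits <- which(diff_sec <= tol_half, arr.ind = TRUE)

matched <- data.frame(arrayA = a[hits[, "row"]],
                      arrayB = b[hits[, "col"]])
matched
```

For the sample vectors this keeps the pairs (10:00:00, 10:00:00) and (12:00:00, 12:00:01). The join-based approaches express the same condition but avoid building the full comparison matrix.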

foverlaps()

Use foverlaps() to search for overlapping time ranges in A and B:

# setting keys is required by foverlaps()
setkey(A, arrayA, copyA)
setkey(B, start, end)
# find overlaps
result <- foverlaps(B, A, nomatch = 0)[, c("copyA", "start", "end") := NULL][]
result
#                 arrayA              arrayB
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:01

Note that [, c("copyA", "start", "end") := NULL][] immediately removes the helper columns from the output of foverlaps().

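Non-equi join

With recent versions of data.table, non-equi joins are possible:

result <- A[B, .(arrayA, arrayB), on = c("copyA>=start", "copyA<=end"), nomatch = 0L]
result
#                 arrayA              arrayB
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:01

Note that, thanks to automatic indexing, non-equi joins do not require setting keys beforehand.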

Benchmark

To do: it would be interesting to compare foverlaps(), fuzzyjoin, and the non-equi join on a large use case.
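One way such a comparison could be set up is sketched below. This is only a harness under stated assumptions, not a measured result: the size n, the seed, and the 30-day span of the synthetic timestamps are arbitrary choices made for this illustration, and only the two data.table approaches are timed. Timings will depend on the machine and on the data.table version.

```r
library(data.table)

set.seed(42)   # arbitrary seed, for reproducibility only
n <- 1e4       # hypothetical size; increase to resemble your own data
origin <- as.POSIXct("2000-12-31 00:00:00", tz = "UTC")

# Synthetic timestamps: random whole seconds within a 30-day span.
A <- data.table(arrayA = origin + sample.int(86400L * 30L, n))
B <- data.table(arrayB = origin + sample.int(86400L * 30L, n))

tol_half <- 1L  # seconds
A[, copyA := arrayA]
B[, `:=`(start = arrayB - tol_half, end = arrayB + tol_half)]

# Non-equi join: no keys needed.
t_nej <- system.time(
  res_nej <- A[B, .(arrayA, arrayB),
               on = c("copyA>=start", "copyA<=end"), nomatch = 0L]
)

# foverlaps(): keys are set on copies so A and B stay unkeyed.
A2 <- copy(A); setkey(A2, arrayA, copyA)
B2 <- copy(B); setkey(B2, start, end)
t_fov <- system.time(
  res_fov <- foverlaps(B2, A2, nomatch = 0L)[, c("copyA", "start", "end") := NULL]
)

rbind(non_equi = t_nej, foverlaps = t_fov)
```

Both calls apply the same window condition, so the two results should contain the same number of matched pairs; checking that before reading the timings is a cheap sanity test.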

Answer 1 (score: 0)

library(lubridate)
library(fuzzyjoin)
arrayA<-c("2000-12-31 10:00:00","2000-12-31 12:00:00")
arrayB<-c("2000-12-31 10:00:00","2000-12-31 12:00:01")
arrayA <- strptime(arrayA, "%Y-%m-%d %H:%M:%S", tz = "UTC")
arrayB <- strptime(arrayB, "%Y-%m-%d %H:%M:%S", tz = "UTC")

# make data frames for join operations
A <- as.data.frame(arrayA)
B <- as.data.frame(arrayB)

# fuzzyjoin works by matching rows where a function applied
# to the column pairs is TRUE. Here the function is defined 
# inline, and uses lubridate durations.
fuzzy_join(A, B, 
           by=c("arrayA" = "arrayB"), 
           match_fun = function(x,y) {abs(x-y) <= duration(2, "seconds")})

# arrayA              arrayB
# 1 2000-12-31 10:00:00 2000-12-31 10:00:00
# 2 2000-12-31 12:00:00 2000-12-31 12:00:01
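The question notes that plain numeric values can run into the same problem. For that case fuzzyjoin also provides difference_inner_join(), which matches rows whose values differ by at most max_dist. The sketch below is an addition for illustration and uses made-up data frames X and Y with hypothetical column names x and y.

```r
library(fuzzyjoin)

# Made-up numeric data for illustration only.
X <- data.frame(x = c(10, 20, 30))
Y <- data.frame(y = c(10.5, 24, 29))

# Keep pairs whose absolute difference is at most 2.
res <- difference_inner_join(X, Y, by = c("x" = "y"), max_dist = 2)
res
```

Here 10 pairs with 10.5 and 30 pairs with 29, while 20 has no partner within the tolerance and is dropped by the inner join.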