R

Time: 2015-08-05 21:29:20

Tags: r fuzzy-comparison

I'm using R for string processing. I have a data frame with a column of strings, for example:

 df <- data.frame(textcol=c("I would like to find the position of this substring",
 "I would also like to find the position of thes substring",
 "No match here","No mention of this substrangy thing"))

 matchPattern <- "this substring"

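I'm looking for a function (parameterized by some distance measure, e.g. Jaro-Winkler) that takes my matchPattern, compares it against each row of the data frame's text column, and returns the exact position of the match within the matched string, i.e. 36 for the first element (unless I miscounted), (perhaps) 43 for the second, NA for the third, and 14 (?) for the fourth.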

1 answer:

Answer 0 (score: 3):

You can use aregexec:

## Get positions (-1 instead of NA)
positions <- aregexec(matchPattern, df$textcol, max.distance = 0.1)
unlist(positions)
# [1] 38 43 -1 15

## Extract matches
regmatches(df$textcol, positions)
# [[1]]
# [1] "this substring"
# 
# [[2]]
# [1] "thes substring"
# 
# [[3]]
# character(0)
# 
# [[4]]
# [1] "this substrang"
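
Since the question asked for NA rather than -1 where nothing matches, here is a minimal sketch of one way to post-process the aregexec result. The names `texts`, `pos` and `len` are just illustrative (not from the answer), and the example strings are copied from the question so the snippet runs on its own.

```r
# Example strings from the question, as a plain character vector
texts <- c("I would like to find the position of this substring",
           "I would also like to find the position of thes substring",
           "No match here",
           "No mention of this substrangy thing")
matchPattern <- "this substring"

positions <- aregexec(matchPattern, texts, max.distance = 0.1)

# Each list element holds the start of the whole match first;
# keep just that value as a plain integer
pos <- vapply(positions, function(p) as.integer(p[1]), integer(1))
pos[pos == -1L] <- NA_integer_   # aregexec reports "no match" as -1

# Length of each match (from the "match.length" attribute), NA where nothing matched
len <- vapply(positions,
              function(p) as.integer(attr(p, "match.length")[1]),
              integer(1))
len[is.na(pos)] <- NA_integer_

data.frame(pos = pos, len = len)
```

Taking only the first entry of each element assumes you care about the whole match; if the pattern had parenthesized groups, the remaining entries would be their positions.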

Edit

## A possibility for replacing matches, or maybe `regmatches<-`
res <- regmatches(df$textcol, positions)
res[lengths(res)==0] <- "XXXX"  # deal with 0 length matches somehow
df$out <- Vectorize(gsub)(unlist(res), "Censored", df$textcol)
df$out
# [1] "I would like to find the position of Censored"     
# [2] "I would also like to find the position of Censored"
# [3] "No match here"                                     
# [4] "No mention of Censoredy thing"
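
An alternative sketch for the replacement step, in case the "XXXX" placeholder or the regex interpretation inside gsub is a concern: splice the replacement in by position with substr instead. This is not the approach above, just a variation; `censor`, `start` and `len` are hypothetical names, and the strings are again copied from the question.

```r
texts <- c("I would like to find the position of this substring",
           "I would also like to find the position of thes substring",
           "No match here",
           "No mention of this substrangy thing")
matchPattern <- "this substring"

positions <- aregexec(matchPattern, texts, max.distance = 0.1)

# Start and length of each whole match (-1 means no match)
start <- vapply(positions, function(p) as.integer(p[1]), integer(1))
len   <- vapply(positions,
                function(p) as.integer(attr(p, "match.length")[1]),
                integer(1))

# Hypothetical helper: overwrite the matched span with `replacement`,
# leaving strings without a match untouched
censor <- function(x, start, len, replacement = "Censored") {
  hit <- start > 0
  x[hit] <- paste0(substr(x[hit], 1, start[hit] - 1),
                   replacement,
                   substr(x[hit], start[hit] + len[hit], nchar(x[hit])))
  x
}

out <- censor(texts, start, len)
out
```

Because the matched text is never used as a pattern here, special regex characters in a match can't change the result, and unmatched rows need no sentinel value.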