Match similar words and replace them with a common word

Time: 2018-08-17 12:45:31

Tags: r pattern-matching

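I have a vector of unique districts (dist) and a second vector (dist_plus) in which each district name carries some extra qualifier. My goal is to create a "result" in which every such similar district name is replaced by the matching unique district.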

dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")


result <- c("Bengaluru", "Bengaluru", "Andaman","Andaman","South 24 Parganas")

What is the simplest way to do this? Thanks.

5 answers:

Answer 0 (score: 2)

dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")

library(tidyverse)

# vectorised function to spot matches
f = function(x,y) grepl(x, y)
f = Vectorize(f)

# create a look up table of matches
expand.grid(dist_plus=dist_plus, dist=dist, stringsAsFactors = F) %>%
  filter(f(dist, dist_plus)) -> look_up

# join dist_plus values with their matches 
data.frame(dist_plus, stringsAsFactors = F) %>%
  left_join(look_up, by="dist_plus") %>%
  pull(dist)

#[1] "Bengaluru"         "Bengaluru"         "Andaman"           "Andaman"           "South 24 Parganas"

Answer 1 (score: 1)

You can use str_detect from stringr to compare similar words. For each element of dist_plus, str_detect checks which names in dist occur in it; where one does, the element is replaced by that name from dist, and lapply repeats this for every element of dist_plus.

library(stringr)
c(na.omit(unlist(lapply(dist_plus, function(x) ifelse(str_detect(x, dist),dist,NA)))))
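One thing to watch with na.omit() here: an element of dist_plus that matches nothing is dropped, so the output can end up shorter than dist_plus. Below is a minimal base-R sketch (not the answerer's code) that keeps one output per input and leaves NA where nothing matched. The helper name match_first and the extra "Mumbai" entry are made up for illustration, and fixed = TRUE is an assumption that the names should be matched literally rather than as regular expressions.

```r
dist <- c("Bengaluru", "Andaman", "South 24 Parganas")
# "Mumbai" is a made-up entry added only to show the no-match case
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman",
               "North Andaman", "South 24 Parganas", "Mumbai")

# hypothetical helper: first element of `keys` that occurs literally in `x`,
# or NA when none does
match_first <- function(x, keys) {
  hit <- vapply(keys, function(k) grepl(k, x, fixed = TRUE), logical(1))
  if (any(hit)) keys[which(hit)[1]] else NA_character_
}

# one result per element of dist_plus, so positions stay aligned
aligned <- vapply(dist_plus, match_first, character(1),
                  keys = dist, USE.NAMES = FALSE)
```

Because aligned has the same length as dist_plus, it can be stored next to it as a column in a data frame, and is.na(aligned) shows which names still need attention.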

Answer 2 (score: 1)

The best way is the one you understand well. There are many ways to do this; here is one that uses a for loop.

# create an empty result with NAs
# if your final result has any NAs it means something probably went wrong
result <- rep(NA, length(dist_plus))

# for each dist_plus check if it contains any of the dist
for (d in seq_along(dist_plus)) {
  # d is an integer and it will span from 1 to how many elements dist_plus has

  # traverse all elements of dist (sapply =~ for ()) and see if 
  # any element appears in your subsetted dist_plus[d]
  incl <- sapply(dist, FUN = function(x, y) grepl(x, y), y = dist_plus[d])

  # find which element is this (dist[incl]) and write it to your result
  result[d] <- dist[incl]
}

[1] "Bengaluru"         "Bengaluru"         "Andaman"           "Andaman"          
[5] "South 24 Parganas"
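The loop above assumes that exactly one element of dist is found in each dist_plus[d]. If the base names overlap, several can match at once. The sketch below shows one possible way to handle that, by keeping the longest match; this is an assumption about what is wanted, not part of the answer. The vectors keys and x are invented example data, and fixed = TRUE again assumes literal matching.

```r
# invented data: both "Andaman" and "South Andaman" occur in "South Andaman"
keys <- c("Andaman", "South Andaman", "Bengaluru")
x <- c("South Andaman", "North Andaman", "Bengaluru Urban")

longest_match <- character(length(x))
for (i in seq_along(x)) {
  # all keys that occur literally in x[i]
  hits <- keys[vapply(keys, grepl, logical(1), x = x[i], fixed = TRUE)]
  # keep the longest hit; NA when nothing matched
  longest_match[i] <- if (length(hits)) hits[which.max(nchar(hits))] else NA_character_
}
```

With this data, "South Andaman" maps to the more specific key "South Andaman", while "North Andaman" still falls back to "Andaman".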

Answer 3 (score: 1)

The following will do what you want.

inx <- lapply(dist, function(s) grep(s, dist_plus))

result2 <- character(length(dist_plus))
for(i in seq_along(inx)){
    result2[ inx[[i]] ] <- dist[i]
}

In the test below, result is the vector from the question.

identical(result, result2)
#[1] TRUE
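For comparison, here is a compact sketch of a different technique from the loop above: join the base names into a single alternation pattern and extract the matched part with regexpr() and regmatches(). It assumes the names contain no regular-expression metacharacters and that every element of dist_plus contains one of them, because regmatches() silently drops elements with no match. The name result3 is chosen only for this example.

```r
dist <- c("Bengaluru", "Andaman", "South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman",
               "North Andaman", "South 24 Parganas")

# one pattern that matches any of the base names
pattern <- paste(dist, collapse = "|")

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text
result3 <- regmatches(dist_plus, regexpr(pattern, dist_plus))
```

For the question's data this reproduces the result vector; with data that can contain unmatched names, check that length(result3) equals length(dist_plus) before relying on it.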

Answer 4 (score: 0)

Thanks, everyone, for the many ways to solve this. I also came up with a solution of my own.

library(plyr)

dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")
result <- c("Bengaluru", "Bengaluru", "Andaman","Andaman","South 24 Parganas")

r <- dist_plus

l_ply(dist, function(x){
  r[grepl(x, dist_plus)] <<- x
})

identical(r, result)
#[1] TRUE