How to remove duplicated words within a certain pattern from a string in R

Asked: 2017-01-30 16:32:31

Tags: r

My goal is to remove duplicated words, but only inside the parentheses, in a set of strings.

a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )

What I want to get looks like this:

a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'

To get this result, I used code like this:

a = gsub('\\|', " | ",  a)
a = gsub('\\(', "(  ",  a)
a = gsub('\\)', "  )",  a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))

However, it produced undesirable output.

a    
[1] "I (  have | has ) certain words word worded"                 
[2] "(  You | Youre ) can cans do this works worked"              
[3] "I (  am | are ) sure surely you know what when her should do"

Why does my code remove the parentheses located in the latter part of each string? What should I do to get the result I want?
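
A small sketch of what the attempt above does to the first string may help with the "why" part. After the three gsub calls, strsplit turns the whole string into one flat vector of tokens in which "(", "|", ")" and "" are ordinary elements, and unique() keeps only the first occurrence of every token, so the second pair of parentheses and the second "certain" are dropped. The variable names below are only for illustration.

```r
s <- "I (have|has|have) certain (words|word|worded|word) certain"

# Same padding steps as in the attempt above
s <- gsub("\\|", " | ", s)
s <- gsub("\\(", "(  ", s)
s <- gsub("\\)", "  )", s)

# One flat token vector for the whole string, not one per "(...)" group
tokens <- strsplit(s, " ")[[1]]

n_open_before <- sum(tokens == "(")          # both opening parentheses are tokens
n_open_after  <- sum(unique(tokens) == "(")  # unique() keeps only the first one

# unique() also removes the repeated "|", ")", "" and "certain" tokens
collapsed <- paste(unique(tokens), collapse = " ")
print(collapsed)
```

So the deduplication has to be applied to each parenthesised group separately rather than to the tokens of the whole string.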

3 answers:

Answer 0 (score: 5)

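We can use gsubfn. The idea is to select the characters inside the parentheses by matching an opening parenthesis, written \\( because the parenthesis is a metacharacter and has to be escaped, followed by one or more characters that are not a closing parenthesis, written [^)]+, and to capture those characters as a group. In the replacement, we split the captured group of characters (x) with strsplit, unlist the list output, take the unique elements, and paste them back together.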
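
As a minimal sketch of the idea described above, the following uses base R (gregexpr with regmatches) rather than gsubfn, so it is a swapped-in technique and not the answer's own call. The helper name dedupe_groups is hypothetical, and the pattern here matches the whole "(...)" group including the closing parenthesis.

```r
# Hypothetical helper (name chosen for illustration only)
dedupe_groups <- function(strings) {
  # Find every "(...)" group: "(", one or more non-")" characters, then ")"
  m <- gregexpr("\\([^)]+\\)", strings)
  regmatches(strings, m) <- lapply(regmatches(strings, m), function(groups) {
    vapply(groups, function(g) {
      inner <- substr(g, 2, nchar(g) - 1)                       # drop "(" and ")"
      words <- unique(strsplit(inner, "|", fixed = TRUE)[[1]])  # split on literal "|"
      paste0("(", paste(words, collapse = "|"), ")")            # rebuild the group
    }, character(1), USE.NAMES = FALSE)
  })
  strings
}

a <- c('I (have|has|have) certain (words|word|worded|word) certain',
       '(You|You|Youre) (can|cans|can) do this (works|works|worked)',
       'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)')

result <- dedupe_groups(a)
print(result)
```

Because deduplication happens inside each matched group, words outside the parentheses (such as the repeated "certain") are left untouched.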

Answer 1 (score: 2)

Go with the answer above; it is more straightforward. But you can also try:

library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace the | because it causes issues during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"                   
[2] "(You|Youre) (can|cans) do this (works|worked)"                      
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"

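The idea is to extract the words inside the parentheses as units, then remove the duplicates within each unit and replace the old units with the updated ones.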

Answer 2 (score: 1)

A longer but more elaborate attempt:

a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
       '(You|You|Youre) (can|cans|can) do this (works|works|worked)',
       'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
# strip leading and trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

# blank output
new_a <- c()
for (sentence in seq_along(a)) {
  # split on "(", ")" and spaces; each group becomes one token such as "have|has|have"
  split <- trim(unlist(strsplit(a[sentence], "[( )]")))
  newsentence <- c()
  for (i in split) {
    # split the token on "|" and keep each word once
    j1 <- as.character(unique(trim(unlist(strsplit(gsub('\\|', " ", i), " ")))))
    if (length(j1) == 0) {
      next  # empty token produced by adjacent delimiters
    } else if (length(j1) > 1) {
      # several distinct words: rebuild the "(w1|w2)" group
      newsentence <- c(newsentence, paste("(", paste(j1, collapse = "|"), ")", sep = ""))
    } else {
      # a single word is kept without parentheses
      newsentence <- c(newsentence, j1[1])
    }
  }
  newsentence <- paste(newsentence, collapse = " ")
  print(newsentence)
  new_a <- c(new_a, newsentence)
}
new_a
# [1] "I (have|has) certain (words|word|worded) certain"                 
# [2] "(You|Youre) (can|cans) do this (works|worked)"                    
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"