Using regular expressions in R to extract words of a specific length

Time: 2012-12-10 08:25:12

Tags: regex string r

I have some code (I got it here):

m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

x<- gsub("\\<[a-z]\\{4,10\\}\\>","",m)
x

I also tried other approaches, such as

m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

x<- gsub("[^(\\b.{4,10}\\b)]","",m)
x

I need to remove words whose length is less than 4 or greater than 10. Where am I going wrong?

6 answers:

Answer 0: (score: 11)

  gsub("\\b[a-zA-Z0-9]{4,10}\\b", "", m) 
 "! # is gr8. I  likewhatishappening ! The  of   is ! the aforementioned  is ! #Wow"

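Let's explain the regular expression terms:

  1. \b matches at a position called a "word boundary". The match is zero-length.
  2. [a-zA-Z0-9]: alphanumeric characters
  3. {4,10}: {min,max} number of repetitions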
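  4. If you want the negation of this, put the pattern between () and refer to the captured word as \\1 in the replacement, which puts each matched word back:

    gsub("(\\b[a-zA-Z0-9]{4,10}\\b)", "\\1", m) 

    "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

    It is funny to see that words with 4 letters exist in both regular expressions.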
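
Since the title asks about extracting words, here is a minimal sketch of the reverse operation: pulling out the 4 to 10 character words instead of deleting them. It relies on base R's gregexpr() and regmatches(). The names `pattern` and `kept` are only illustrative, and perl = TRUE is an assumption added here so that \b follows PCRE rules.

```r
m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

# Same pattern as above: a whole word made of 4 to 10 alphanumeric characters.
pattern <- "\\b[a-zA-Z0-9]{4,10}\\b"

# gregexpr() locates every match; regmatches() returns the matched text.
kept <- regmatches(m, gregexpr(pattern, m, perl = TRUE))[[1]]
print(kept)
#  [1] "Hello"     "London"    "really"    "here"      "alcomb"    "Mount"
#  [7] "Everest"   "excellent" "place"     "amazing"
```

If a single string is needed, paste(kept, collapse = " ") glues the words back together.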

Answer 1: (score: 1)

# starting string
m <- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

# remove punctuation (optional)
v <- gsub("[[:punct:]]", " ", m)

# split into distinct words
w <- strsplit( v , " " )

# calculate the length of each word
x <- nchar( w[[1]] )

# keep only words with length 4, 5, 6, 7, 8, 9, or 10
y <- w[[1]][ x %in% 4:10 ]

# string 'em back together
z <- paste( unlist( y ), collapse = " " )

# voila
z

Answer 2: (score: 1)

gsub(" [^ ]{1,3} | [^ ]{11,} "," ",m)
[1] "Hello! #London gr8. really here! alcomb Mount Everest excellent! aforementioned
     place amazing! #Wow"
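
Note that each match of the pattern above consumes the spaces on both sides of the word, so a word directly after a removed one (such as "aforementioned" after "the") is skipped, and words at the very start or end of the string are never candidates. A possible variant, assuming perl = TRUE is acceptable, uses lookarounds so that the spaces are only checked and not consumed. The names `pattern` and `cleaned` are illustrative.

```r
m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

# (?<![^ ]) : not preceded by a non-space (start of string or a space)
# (?![^ ])  : not followed by a non-space (end of string or a space)
pattern <- "(?<![^ ])(?:[^ ]{1,3}|[^ ]{11,})(?![^ ])"

cleaned <- gsub(pattern, "", m, perl = TRUE)

# Collapse the runs of spaces left behind and trim both ends.
cleaned <- trimws(gsub(" +", " ", cleaned))
print(cleaned)
# [1] "Hello! #London gr8. really here! alcomb Mount Everest excellent! place amazing! #Wow"
```

As in the pattern above, a "word" here is any run of non-space characters, so punctuation counts toward the length.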

Answer 3: (score: 1)

This might get you started:

m <- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")
y <- gsub("\\b[a-zA-Z0-9]{1,3}\\b", "", m) # replace words shorter than 4
y <- gsub("\\b[a-zA-Z0-9]{11,}\\b", "", y) # replace words longer than 10
y <- gsub("\\s+\\.\\s+ ", ". ", y) # replace stray dots, eg "Foo  .  Bar" -> "Foo. Bar"
y <- gsub("\\s+", " ", y) # replace multiple spaces with one space
y <- gsub("#\\b+", "", y) # remove leftover hash characters from hashtags
y <- gsub("^\\s+|\\s+$", "", y) # remove leading and trailing whitespaces
y
# [1] "Hello! London. really here! alcomb Mount Everest excellent! place amazing!"

Answer 4: (score: 1)

From the answers by Alaxender & agstudy:

x<- gsub("\\b[a-zA-Z0-9]{1,3}\\b|\\b[a-zA-Z0-9]{11,}\\b", "", m)

It works now!

Thanks a lot, guys!
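
A quick sanity check of the length limits, sketched on a made-up sentence (the `test` string is hypothetical, not from the question). It assumes the goal stated in the question: words of 4 to 10 characters stay, so the second alternative has to start at 11 characters. With {10,} a 10-letter word such as "strawberry" would be removed too. perl = TRUE is an added assumption.

```r
# Word lengths in order: 1, 3, 4, 10, 11
test <- "a abc abcd strawberry watermelons"

# Remove words of 1 to 3 characters or of 11 and more characters.
pattern <- "\\b[a-zA-Z0-9]{1,3}\\b|\\b[a-zA-Z0-9]{11,}\\b"
out <- gsub(pattern, "", test, perl = TRUE)

# Tidy the whitespace left behind by the removed words.
out <- trimws(gsub("\\s+", " ", out))
print(out)
# [1] "abcd strawberry"
```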

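Answer 5: (score: 0)

I am not familiar with R and do not know which character classes or other features it supports in regex patterns. Without them, the pattern would look like this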

[^A-Za-z0-9]([A-Za-z0-9]{1,3}|[A-Za-z0-9]{11,})[^A-Za-z0-9]
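
For reference, R's regular expressions do support POSIX bracket classes such as [[:alnum:]] as well as the \b word boundary, so the delimiter characters on both sides need not be matched explicitly. A minimal sketch of how the idea above might look in R (the names `pattern`, `out` and `words` are illustrative, and perl = TRUE is an assumption):

```r
m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

# A whole word of 1 to 3, or of 11 and more, alphanumeric characters.
pattern <- "\\b(?:[[:alnum:]]{1,3}|[[:alnum:]]{11,})\\b"

out <- gsub(pattern, "", m, perl = TRUE)
out <- trimws(gsub("\\s+", " ", out))  # collapse leftover whitespace
print(out)

# The alphanumeric words that survive all have 4 to 10 characters.
words <- regmatches(out, gregexpr("[[:alnum:]]+", out))[[1]]
print(nchar(words))
```

Punctuation that was attached to a removed word (for example the dot after "gr8") is left in place and would need a separate clean-up step.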