Remove words from a string

Time: 2017-11-17 10:21:50

Tags: r regex grep

I am trying to remove certain words from a data frame:

name    age words
James   34  hello, my name is James. 
John    30  hello, my name is John. Here is my favourite website https://stackoverflow.com
Jim 27  Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>
df<-structure(list(name = structure(c(1L, 3L, 2L), .Label = c("James", 
"Jim", "John"), class = "factor"), age = c(34L, 30L, 27L), message = structure(1:3, .Label = c("hello, my name is James. ", 
"hello, my name is John. Here is my favourite website https://stackoverflow.com", 
"Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>"
), class = "factor")), .Names = c("name", "age", "message"), class = "data.frame", row.names = c(NA, 
-3L))

I am trying to remove all words that contain a match for http or filter.

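I would like to iterate over each row, split the string on spaces, and then ask whether each word contains http or <filter> (or some other word). If it does, I want to replace that word with a space.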
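The split-and-check idea described above could be sketched roughly as follows. This is only an illustration, not code from the original post: the helper name remove_words_containing, its fragments argument, and the sample messages vector are all assumptions, and the fragments are matched literally rather than as regular expressions.

```r
# Hypothetical helper (not from the original post): drop every
# space-separated word that contains one of the given literal fragments.
remove_words_containing <- function(text, fragments) {
  vapply(as.character(text), function(s) {
    # split the string into words on single spaces
    words <- strsplit(s, " ", fixed = TRUE)[[1]]
    # flag each word that contains at least one fragment
    hit <- rep(FALSE, length(words))
    for (f in fragments) {
      hit <- hit | grepl(f, words, fixed = TRUE)
    }
    # re-join the words that were not flagged
    paste(words[!hit], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Assumed sample data, shortened from the question
messages <- c(
  "hello, my name is James. ",
  "Here is my favourite website https://stackoverflow.com",
  "this should be filtered out: <filter>"
)
cleaned <- remove_words_containing(messages, c("http", "<filter>"))
print(cleaned)
```

Splitting and re-joining can also drop a trailing space, which is one reason a single gsub call with a suitable pattern may be the simpler route.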

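There is a load of questions about removing words that exactly match another word or a list of words, but I could not find much on removing words that meet certain conditions (for example, containing http or www.).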

I have tried gsub, !grepl, and tm_map approaches (for example, this one), but I cannot get any of them to produce my expected output:

name    age words
James   34  hello, my name is James. 
John    30  hello, my name is John. Here is my favourite website 
Jim 27  Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: 

2 answers:

Answer 0: (score: 2)

We can use gsub:

gsub("\\s(https:\\S+|<filter>)", "", df$message)
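As a rough check of what this pattern does and does not remove, the sketch below runs it on a few made-up strings. The messages vector and the https? variant are assumptions added for illustration and are not part of the answer; note that the pattern as written requires a leading whitespace character and only matches links starting with https:.

```r
# Assumed sample strings (not from the original post)
messages <- c(
  "Here is my favourite website https://stackoverflow.com",
  "plain link http://example.com here",
  "should be filtered out: <filter>"
)

# Pattern from the answer: one whitespace character, then an https: link
# or the literal <filter>
original <- gsub("\\s(https:\\S+|<filter>)", "", messages)

# Variant (an assumption, not part of the answer): the "s" is optional,
# so plain http: links are removed as well
variant <- gsub("\\s(https?:\\S+|<filter>)", "", messages)

print(original)
print(variant)
```

With the original pattern the http: link in the second string is left in place, while the variant removes it.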

Answer 1: (score: 2)

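To remove any word containing http or <filter>, use the following PCRE regex (add the perl=TRUE argument to gsub):

(?:\s+|^)\S*(?<!\w)(?:https?|<filter>)(?!\w)\S*

See the regex demo.

Details

  • (?:\s+|^) - 1+ whitespaces or the start of the string
  • \S* - zero or more non-whitespace characters, as many as possible
  • (?<!\w) - no word character is allowed immediately to the left of the current position
  • (?:https?|<filter>) - http, https or <filter>
  • (?!\w) - no word character is allowed immediately to the right of the current position (after the words in the alternation group)
  • \S* - zero or more non-whitespace characters, as many as possible.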

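See the online R demo:

gsub("(?:\\s+|^)\\S*(?<!\\w)(?:https?|<filter>)(?!\\w)\\S*", "", df$message, perl=TRUE)

Result:

[1] "hello, my name is James. "
[2] "hello, my name is John. Here is my favourite website"
[3] "Hi! I'm another person whose name begins with a J! Here is something that should be filtered out:"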
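The effect of the two lookarounds is easiest to see on edge cases. The sketch below applies the same pattern to a few made-up strings; the samples vector is an assumption added for illustration and does not come from the answer.

```r
pattern <- "(?:\\s+|^)\\S*(?<!\\w)(?:https?|<filter>)(?!\\w)\\S*"

# Assumed sample strings (not from the original post)
samples <- c(
  "visit https://stackoverflow.com now",  # https as a whole word: chunk removed
  "see (http://example.com) please",      # http right after "(": chunk removed
  "the httpd daemon",                     # http followed by a word char: kept
  "myhttp proxy",                         # http preceded by a word char: kept
  "already filtered out",                 # no literal <filter>: kept
  "drop this: <filter>"                   # literal <filter>: removed
)

# perl = TRUE is needed because lookbehind/lookahead are PCRE features
cleaned <- gsub(pattern, "", samples, perl = TRUE)
print(cleaned)
```

Because the leading whitespace is part of the match, each removed chunk takes the space in front of it along, so no double spaces are left behind.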