Need help removing HTML tags, certain punctuation, and sentence-ending periods

Time: 2014-06-09 23:00:08

Tags: regex r

Suppose I have this test string:

test.string <- c("This is just a <test> string. I'm trying to see, if a FN will remove certain things like </HTML tags>, periods; but not the one in ASP.net, for example.")

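I'd like to:

  1. Remove anything contained within an HTML tag
  2. Remove certain punctuation (,:;)
  3. Remove periods that end a sentence

So the above should become: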

    c("This is just a string I'm trying to see if a FN will remove certain things like periods but not the one in ASP.net for example")
    

For #1, I tried the following:

    gsub("<.*?>", "", x, perl = FALSE)
    

This seems to work fine.

For #2, I think it's simply:

    gsub("[:@$%&*:,;^():]", "", x, perl = FALSE)
    

Which works.

For #3, I tried:

    gsub("+[:alpha:]?[.]+[:space:]", "", test.string, perl = FALSE)
    

But that doesn't work...

Any ideas on where I went wrong? I'm really bad at RegExp, so any help would be greatly appreciated!

2 Answers:

Answer 0: (score: 4)

Based on the input you provided and your rules for what to remove, the following should work.

gsub('\\s*<.*?>|[:;,]|(?<=[a-zA-Z])\\.(?=\\s|$)', '', test.string, perl = TRUE)

See the working demo.
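
The single pattern above combines three alternatives. As a rough sketch, the same rules can also be applied one step at a time, which may be easier to debug. The variable names `no_tags`, `no_punct`, `result`, and `expected` are illustrative and not part of the original answer.

```r
test.string <- "This is just a <test> string. I'm trying to see, if a FN will remove certain things like </HTML tags>, periods; but not the one in ASP.net, for example."

# Step 1: drop each tag together with any whitespace in front of it.
no_tags <- gsub("\\s*<.*?>", "", test.string, perl = TRUE)

# Step 2: drop the listed punctuation characters.
no_punct <- gsub("[:;,]", "", no_tags)

# Step 3: drop a period only when it follows a letter and is followed by
# whitespace or the end of the string, so the one in "ASP.net" is kept.
result <- gsub("(?<=[a-zA-Z])\\.(?=\\s|$)", "", no_punct, perl = TRUE)

# Compare against the output requested in the question.
expected <- "This is just a string I'm trying to see if a FN will remove certain things like periods but not the one in ASP.net for example"
identical(result, expected)
```

As for the third attempt in the question: POSIX classes such as `[:alpha:]` and `[:space:]` only work inside a bracket expression (`[[:alpha:]]`, `[[:space:]]`), and the leading `+` has nothing to repeat.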

Answer 1: (score: 1)

Try this:

test.string <- "There is a natural aristocracy among men. The grounds of this are virtue and talents. "

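# Inner call: drop a period between an alphanumeric character and " <Capital>".
# Outer call: drop any remaining period plus the whitespace after it.
gsub("\\.\\s*", "", gsub("([a-zA-Z0-9])\\. ([A-Z])", "\\1 \\2", test.string))
# "There is a natural aristocracy among men The grounds of this are virtue and talents"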
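
Note that the outer `gsub` removes every remaining period, including one inside a token such as `ASP.net`. One possible variant, sketched here on the assumption that the second step should only drop a period at the very end of the string, anchors that pattern with `$`. The helper name `strip_sentence_periods` is hypothetical and not part of the original answer.

```r
# Hypothetical helper: removes periods at sentence boundaries only.
strip_sentence_periods <- function(x) {
  # A period between an alphanumeric character and " <Capital>" ends a sentence.
  x <- gsub("([a-zA-Z0-9])\\. ([A-Z])", "\\1 \\2", x)
  # A period at the very end of the string, with optional trailing whitespace.
  gsub("\\.\\s*$", "", x)
}

strip_sentence_periods("There is a natural aristocracy among men. The grounds of this are virtue and talents. ")
strip_sentence_periods("Built with ASP.net. It works.")
```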