Question

Suppose I have this test string:

test.string <- c("This is just a <test> string. I'm trying to see, if a FN will remove certain things like </HTML tags>, periods; but not the one in ASP.net, for example.")

I would like to:

1. Remove anything contained within HTML tags
2. Remove certain punctuation (, : ;)
3. Remove periods that end a sentence

So the string above should become:

c("This is just a string I'm trying to see if a FN will remove certain things like periods but not the one in ASP.net for example")

For #1, I tried the following:

gsub("<.*?>", "", x, perl = FALSE)

This seems to work fine.

For #2, I think it is simply:

gsub("[:@$%&*:,;^():]", "", x, perl = FALSE)

This works as well.

For #3, I tried:

gsub("+[:alpha:]?[.]+[:space:]", "", test.string, perl = FALSE)

But that doesn't work...

Any ideas on where I went wrong? I'm really bad at regular expressions, so any help would be greatly appreciated!

Answer 1

Based on the input you provided and your rules for what to remove, the following should work.

gsub('\\s*<.*?>|[:;,]|(?<=[a-zA-Z])\\.(?=\\s|$)', '', test.string, perl = TRUE)
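The single pattern combines three alternatives, one per rule. As a rough sketch, the same cleanup can be written as three separate calls, which makes each rule easier to read and adjust. The intermediate names `no_tags`, `no_punct`, and `result` are only illustrative.

```r
# The asker's test string, repeated so the block runs on its own.
test.string <- "This is just a <test> string. I'm trying to see, if a FN will remove certain things like </HTML tags>, periods; but not the one in ASP.net, for example."

# Rule 1: drop each tag together with any whitespace in front of it.
no_tags <- gsub("\\s*<.*?>", "", test.string, perl = TRUE)

# Rule 2: drop colons, semicolons, and commas.
no_punct <- gsub("[:;,]", "", no_tags, perl = TRUE)

# Rule 3: drop a period only when a letter precedes it and
# whitespace or the end of the string follows it.
result <- gsub("(?<=[a-zA-Z])\\.(?=\\s|$)", "", no_punct, perl = TRUE)

print(result)
```

The lookbehind and lookahead in the last step need `perl = TRUE`. Because they only inspect the neighbouring characters without consuming them, the period in ASP.net (followed by a letter) is left alone.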

See the Working Demo.
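As for the attempt in the question: the leading `+` has nothing in front of it to repeat, and POSIX classes such as `[:alpha:]` and `[:space:]` are only valid inside a bracket expression, so they need a second pair of brackets. A minimal sketch of that idea with the classes written correctly, using a made-up sample string `x`:

```r
# Hypothetical sample input, not taken from the question.
x <- "This is a string. It mentions ASP.net, for example."

# POSIX classes need their own brackets inside a bracket expression:
# [[:alpha:]] and [[:space:]], not [:alpha:] and [:space:].
cleaned <- gsub("(?<=[[:alpha:]])[.](?=[[:space:]]|$)", "", x, perl = TRUE)

print(cleaned)
# "This is a string It mentions ASP.net, for example"
```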

Answer 2

Try this:

test.string <- "There is a natural aristocracy among men. The grounds of this are virtue and talents. "

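# The inner call turns ". " before a capital letter into a single space;
# the outer call then removes any remaining periods and trailing whitespace.
gsub("\\.\\s*", "", gsub("([a-zA-Z0-9])\\. ([A-Z])", "\\1 \\2", test.string))
# "There is a natural aristocracy among men The grounds of this are virtue and talents"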
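One caveat, sketched under my own assumptions: a pattern that strips every period, such as `\\.\\s*`, also removes the one inside ASP.net, which the question wants to keep. A possible variation removes a period only when whitespace or the end of the string follows it. The sample string `x` and the names `all_gone` and `kept` are made up for illustration.

```r
# Hypothetical sample input that mixes sentence-ending periods with "ASP.net".
x <- "We use ASP.net here. It works. "

# Stripping every period (plus trailing whitespace) also hits "ASP.net"
# and glues neighbouring sentences together.
all_gone <- gsub("\\.\\s*", "", x)
print(all_gone)
# "We use ASPnet hereIt works"

# Removing a period only when whitespace or the end of the string follows,
# and putting that whitespace back, leaves "ASP.net" intact.
kept <- trimws(gsub("\\.(\\s+|$)", "\\1", x))
print(kept)
# "We use ASP.net here It works"
```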
