1。一个有效的纯正则表达式解决方案（a.k.a.编辑＃2）

Question

是否可以删除所有标点符号，但保留表情符号，如

： - （

：）

：d

：P

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

Answer 1

1。一个有效的纯正则表达式解决方案（a.k.a.编辑＃2）

此任务可以使用正则表达式完成纯粹（非常感谢@Mike Samuel）

首先我们建立一个表情符号数据库：

(emots <- as.character(outer(c(":", ";", ":-", ";-"),
+                c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"

示例性输入文字：

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

一个辅助函数，用于转义某些特殊字符，以便它们可以在正则表达式模式中使用（使用stringi包）：

library(stringi)
escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

与表情符号匹配的正则表达式：

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

现在，正如@Mike Samuel在下面建议的那样，我们只匹配(emoticon)|punctuation （请注意，表情符号位于捕获组中），然后替换匹配项捕获组1的结果（如果它是表情符号，我们有替换= 这个表情符号，如果它是标点符号，我们有替换= 没有< / em>的）。这将起作用，因为ICU Regex中|的替换（stri_replace_all_regex中使用的正则表达式引擎）是greedy and left-biased：表情符号将比标点符号更早匹配。

stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1") ## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"

顺便说一句，如果你只想摆脱一组选定的字符，请举例如： [.,]代替[\\p{P}]以上。

2。正则表达式解决方案提示 - 我的第一次（不是明智的）尝试（a.k.a.原始答案）

我的第一个想法（在这里主要是为了＃34;历史原因＆＃34;）是通过使用look-aheads and look-behinds解决这个问题，但是 - 正如你所看到的那样 - 远非完美。

删除所有:和;后面没有)，(，D，X，8， [或]使用负面后卫：

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "") ## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"

现在我们可以添加一些老式表情符号（带鼻子，例如:-)，;-D等。

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "") ## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"

现在删除连字符（负面看后面并向前看）

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?!<[:;])[-](?![)P(DX8\\[\\]])", "") ## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) and the salesperson said Oh boy!"

等等。当然，首先你应该建立自己的表情符号数据库（保持原样）和标点符号（删除）。正则表达式将高度依赖于这两组，因此很难添加新的表情符号 - 它绝对不值得应用（并且可能扭曲你的大脑）。

3。第二次尝试（regex-dumb friendlier，a.k.a。编辑＃1）

另一方面，如果您对复杂的正则异常过敏，请尝试这样做。这种方法有一些＆＃34;教学效益＆＃34; - 我们完全了解以下每个步骤中的内容：

在text;
中找到所有表情符号
找到text;
中的所有标点字符
查找不是表情符号部分的标点符号的位置;

删除步骤3中的字符。

示例性输入文本 - 仅限1个字符串 - 广义案例留作练习;）

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

一个辅助函数，可以转义某些特殊字符，以便它们可以在正则表达式中使用：

escape_regex <- function(r) { library("stringi") stri_replace_all_regex(r, "\$|\$|\\[|\\]", "\\\\$0") }

与表情符号匹配的正则表达式：

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")")) ## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

找到所有表情符号的开始和结束位置（即找到第一个 OR 第二个 OR ...表情符号）：

where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text print(where_emots) ## start end ## [1,] 1 2 ## [2,] 4 5 ## [3,] 7 8 ## [4,] 10 11 ## [5,] 13 14 ## [6,] 16 17 ## [7,] 23 24 ## [8,] 64 65 ## [9,] 67 69

找到所有标点字符（此处\\p{P}是表示标点符号的Unicode character class）：

where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]] print(where_punct) ## start end ## [1,] 1 1 ## [2,] 2 2 ## [3,] 4 4 ## [4,] 7 7 ## [5,] 8 8 ## ... ## [26,] 72 72 ## [27,] 73 73 ## [28,] 99 99 ## [29,] 107 107

由于表情符号中出现了一些标点字符，我们不应将其分阶段删除：

which_punct_omit <- sapply(1:nrow(where_punct), function(i) { any(where_punct[i,1] >= where_emots[,1] & where_punct[i,2] <= where_emots[,2]) }) where_punct <- where_punct[!which_punct_omit,] # update where_punct print(where_punct) ## start end ## [1,] 27 27 ## [2,] 38 38 ## [3,] 39 39 ## [4,] 40 40 ## [5,] 46 46 ## [6,] 54 54 ## [7,] 58 58 ## [8,] 60 60 ## [9,] 71 71 ## [10,] 72 72 ## [11,] 73 73 ## [12,] 99 99 ## [13,] 107 107

每个标点符号肯定只包含1个字符，因此总是where_punct[,1]==where_punct[,2]。

现在是最后一部分。如您所见，where_punct[,1]包含要删除的字符的位置。恕我直言，最简单的方法（没有循环）是将字符串转换为UTF-32（每个字符== 1整数），删除不需要的元素，然后返回文本表示：

text_tmp <- stri_enc_toutf32(text)[[1]] print(text_tmp) # here - just ASCII codes... ## [1] 58 41 32 59 80 32 58 93 32 58.... text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!

结果是：

stri_enc_fromutf32(text_tmp) ## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"

你在这里。

Answer 2

这是一种不太复杂的方法，可能比@ gagolews的解决方案慢。它要求你喂它一个表情词典。您可以创建它或使用qdapDictionaries包中的那个。基本方法将表情符号转换为不会被误认为是其他任何内容的文本（我使用dat$Temp <-前缀来确保这一点）。然后使用qdap::strip删除标点符号，然后通过mgsub将占位符转换回表情符号：

library(qdap)
#reps <- emoticon
emos <- c(":-(", ":)", ":D", ":p", "X-(")
reps <- data.frame(seq_along(emos), emos)

reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))

查看它：

truncdf(left_just(dat[, 3, drop=F]), 50)

##   Temp                                              
## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
## 2 You know there is a problem when customer service 
## 3 ãããæããããéãããæãããInappropriate announce:-(         
## 4 AirAsia your direct debit Maybank payment gateways
## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
## 6 AirAsia Apart from the slight delay and shortage o

编辑：要求?和!按要求通过char.keep函数中的strip参数：

dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))

Answer 3

我将此功能作为qdap version > 2.0.0功能添加到sub_holder。基本上这个函数使用the response I gave above但减轻了编码负荷。 sub_holder函数采用文本向量和要分出的项目（例如表情符号）。它返回一个列表：

测试的载体，其中的物品为地方持有人提供了
一个函数（称为unhold），用于交换原始术语的持有者

以下是代码：

emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))

Answer 4

使用rex可能会使这类任务变得更简单一些。它会根据需要自动转义字符，如果放入or()函数，它将会是矢量的所有元素。带有全局参数的re_matches()将为您提供给定行的所有表情符号的列表。

x = structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

emots <- as.character(outer(c(":", ";", ":-", ";-"), c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))

library(rex)
re_matches(x$text,
  rex(
    capture(name = 'emoticons',
      or(emots)
    ),
  global = T)

#>[[1]]
#>  emoticon
#>1       :D
#>2       :D
#>
#>[[2]]
#>  emoticon
#>1     <NA>
#>
#>[[3]]
#>  emoticon
#>1      :-(
#>
#>[[4]]
#>  emoticon
#>1     <NA>
#>
#>[[5]]
#>  emoticon
#>1       :p
#>
#>[[6]]
#>  emoticon
#>1       :)

删除标点但保留表情符号？

4 个答案:

1。一个有效的纯正则表达式解决方案（a.k.a.编辑＃2）

2。正则表达式解决方案提示 - 我的第一次（不是明智的）尝试（a.k.a.原始答案）

3。第二次尝试（regex-dumb friendlier，a.k.a。编辑＃1）