是否可以删除所有标点符号,但保留表情符号,如
: - (
:)
:d
:P
structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label = c("ãããæããããéãããæãããInappropriate announce:-(",
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something you are working to fix?",
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)",
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D",
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...",
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L,
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54",
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -6L))
答案 0 :(得分:7)
此任务可以使用正则表达式完成纯粹(非常感谢@Mike Samuel)
首先我们建立一个表情符号数据库:
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
+ c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)" ";)" ":-)" ";-)" ":(" ";(" ":-(" ";-(" ":]" ";]" ":-]" ";-]" ":[" ";[" ":-[" ";-[" ":D" ";D" ":-D" ";-D"
## [21] ":o" ";o" ":-o" ";-o" ":O" ";O" ":-O" ";-O" ":P" ";P" ":-P" ";-P" ":p" ";p" ":-p" ";-p"
示例性输入文字:
text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"
一个辅助函数,用于转义某些特殊字符,以便它们可以在正则表达式模式中使用(使用stringi包):
library(stringi)
escape_regex <- function(r) {
stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
与表情符号匹配的正则表达式:
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
现在,正如@Mike Samuel在下面建议的那样,我们只匹配(emoticon)|punctuation
(请注意,表情符号位于捕获组中),然后替换匹配项
捕获组1的结果(如果它是表情符号,我们有替换= 这个表情符号,如果它是标点符号,我们有替换= 没有< / em>的)。这将起作用,因为ICU Regex中|
的替换(stri_replace_all_regex
中使用的正则表达式引擎)是greedy and left-biased:表情符号将比标点符号更早匹配。
stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"
顺便说一句,如果你只想摆脱一组选定的字符,请举例如: [.,]
代替[\\p{P}]
以上。
我的第一个想法(在这里主要是为了#34;历史原因&#34;)是通过使用look-aheads and look-behinds解决这个问题,但是 - 正如你所看到的那样 - 远非完美。
删除所有:
和;
后面没有)
,(
,D
,X
,8
, [
或]
使用负面后卫:
stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"
现在我们可以添加一些老式表情符号(带鼻子,例如:-)
,;-D
等。
stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"
现在删除连字符(负面看后面并向前看)
stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?!<[:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) and the salesperson said Oh boy!"
等等。当然,首先你应该建立自己的表情符号数据库(保持原样)和标点符号(删除)。正则表达式将高度依赖于这两组,因此很难添加新的表情符号 - 它绝对不值得应用(并且可能扭曲你的大脑)。
另一方面,如果您对复杂的正则异常过敏,请尝试这样做。这种方法有一些&#34;教学效益&#34; - 我们完全了解以下每个步骤中的内容:
text
; text
; 示例性输入文本 - 仅限1个字符串 - 广义案例留作练习;)
text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"
一个辅助函数,可以转义某些特殊字符,以便它们可以在正则表达式中使用:
escape_regex <- function(r) {
library("stringi")
stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
与表情符号匹配的正则表达式:
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
找到所有表情符号的开始和结束位置(即找到第一个 OR 第二个 OR ...表情符号):
where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
## start end
## [1,] 1 2
## [2,] 4 5
## [3,] 7 8
## [4,] 10 11
## [5,] 13 14
## [6,] 16 17
## [7,] 23 24
## [8,] 64 65
## [9,] 67 69
找到所有标点字符(此处\\p{P}
是表示标点符号的Unicode character class):
where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
## start end
## [1,] 1 1
## [2,] 2 2
## [3,] 4 4
## [4,] 7 7
## [5,] 8 8
## ...
## [26,] 72 72
## [27,] 73 73
## [28,] 99 99
## [29,] 107 107
由于表情符号中出现了一些标点字符,我们不应将其分阶段删除:
which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
any(where_punct[i,1] >= where_emots[,1] &
where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
## start end
## [1,] 27 27
## [2,] 38 38
## [3,] 39 39
## [4,] 40 40
## [5,] 46 46
## [6,] 54 54
## [7,] 58 58
## [8,] 60 60
## [9,] 71 71
## [10,] 72 72
## [11,] 73 73
## [12,] 99 99
## [13,] 107 107
每个标点符号肯定只包含1个字符,因此总是where_punct[,1]==where_punct[,2]
。
现在是最后一部分。如您所见,where_punct[,1]
包含要删除的字符的位置。恕我直言,最简单的方法(没有循环)是将字符串转换为UTF-32(每个字符== 1整数),删除不需要的元素,然后返回文本表示:
text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1] 58 41 32 59 80 32 58 93 32 58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!
结果是:
stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"
你在这里。
答案 1 :(得分:5)
这是一种不太复杂的方法,可能比@ gagolews的解决方案慢。它要求你喂它一个表情词典。您可以创建它或使用qdapDictionaries
包中的那个。基本方法将表情符号转换为不会被误认为是其他任何内容的文本(我使用dat$Temp <-
前缀来确保这一点)。然后使用qdap::strip
删除标点符号,然后通过mgsub
将占位符转换回表情符号:
library(qdap)
#reps <- emoticon
emos <- c(":-(", ":)", ":D", ":p", "X-(")
reps <- data.frame(seq_along(emos), emos)
reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]),
strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))
查看它:
truncdf(left_just(dat[, 3, drop=F]), 50)
## Temp
## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
## 2 You know there is a problem when customer service
## 3 ãããæããããéãããæãããInappropriate announce:-(
## 4 AirAsia your direct debit Maybank payment gateways
## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
## 6 AirAsia Apart from the slight delay and shortage o
编辑:要求?
和!
按要求通过char.keep
函数中的strip
参数:
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]),
strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
答案 2 :(得分:0)
我将此功能作为qdap version > 2.0.0
功能添加到sub_holder
。基本上这个函数使用the response I gave above但减轻了编码负荷。 sub_holder
函数采用文本向量和要分出的项目(例如表情符号)。它返回一个列表:
unhold
),用于交换原始术语的持有者以下是代码:
emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
答案 3 :(得分:0)
使用rex可能会使这类任务变得更简单一些。它会根据需要自动转义字符,如果放入or()
函数,它将会是矢量的所有元素。带有全局参数的re_matches()
将为您提供给定行的所有表情符号的列表。
x = structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label = c("ãããæããããéãããæãããInappropriate announce:-(",
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something you are working to fix?",
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)",
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D",
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...",
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L,
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54",
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -6L))
emots <- as.character(outer(c(":", ";", ":-", ";-"), c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))
library(rex)
re_matches(x$text,
rex(
capture(name = 'emoticons',
or(emots)
),
global = T)
#>[[1]]
#> emoticon
#>1 :D
#>2 :D
#>
#>[[2]]
#> emoticon
#>1 <NA>
#>
#>[[3]]
#> emoticon
#>1 :-(
#>
#>[[4]]
#> emoticon
#>1 <NA>
#>
#>[[5]]
#> emoticon
#>1 :p
#>
#>[[6]]
#> emoticon
#>1 :)