How do I remove strange characters using gsub in R?

Asked: 2016-08-08 11:57:34

Tags: r unicode utf-8

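I'm trying to clean up some text that I loaded into memory with readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters, such as: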

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they  kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the Unicode literals at the end: \u009f, \u0098, and so on.

I can't find the right command and regular expression to get rid of these. I've tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)

I've also tried specifying the Unicode characters directly, but I believe they are being interpreted as plain text:

gsub('\u009', '', text) # Unchanged

2 Answers:

Answer 0: (score: 5)

The easiest way to get rid of these characters is to convert the text from UTF-8 to ASCII:

combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
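
To make the effect of the sub argument concrete, here is a minimal sketch. The sample vector x is made up for illustration and is not the asker's data, and the exact behaviour of iconv can vary somewhat between platforms, so treat this as an illustration rather than a guarantee.

```r
# Hypothetical sample: plain ASCII text mixed with an accented letter and an emoji.
x <- c("caf\u00e9 \U0001F61C ok", "plain ascii")

# sub = "" replaces every byte that cannot be converted to ASCII with nothing,
# so the non-ASCII characters are dropped and the rest of the string is kept.
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
print(ascii_only)

# With the default sub = NA, an element that cannot be converted
# is typically returned as NA instead of being cleaned.
print(iconv(x, from = "UTF-8", to = "ASCII"))
```

Note that this drops accented letters as well as emoji, which may or may not be what you want for your text.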

Answer 1: (score: 2)

If you want to use a regular expression, you can keep only the characters you want by specifying a range of ASCII codes:

text = "The way I talk to my family......i would get my ass beat to 
DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"

gsub('[^\x20-\x7E]', '', text)

# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"

For reference, the ASCII code table is available at asciitable.com.

As you can see, I removed every character that falls outside the range x20 (SPACE) to x7E (~).
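
As a hedged variation on the same idea, the sketch below passes the escapes to the regex engine (doubled backslashes with perl = TRUE) rather than letting the R parser expand them. The sample strings are invented for illustration. It also shows a side effect of this range: tabs and newlines sit below x20, so they are removed too unless you add them back to the character class.

```r
# Hypothetical sample with valid UTF-8 content (an accented letter and an emoji).
x <- "caf\u00e9 \U0001F61C ok"

# Doubling the backslashes passes \x20 and \x7E to the regex engine;
# perl = TRUE selects PCRE, which understands \xhh escapes.
printable <- gsub("[^\\x20-\\x7E]", "", x, perl = TRUE)
print(printable)

# Tab (x09) and newline (x0A) are outside x20-x7E, so they are stripped as well.
y <- "a\tb\nc"
stripped <- gsub("[^\\x20-\\x7E]", "", y, perl = TRUE)
print(stripped)

# Adding \t and \n to the class keeps that whitespace.
kept <- gsub("[^\\x20-\\x7E\\t\\n]", "", y, perl = TRUE)
print(kept)
```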