Remove languages other than English from an R corpus or data frame

Time: 2018-03-17 15:27:17

Tags: r inner-join text-mining sentiment-analysis tm

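I am currently trying to do some text mining on 25,000 YouTube comments, which I collected using the tuber package. I am very new to coding, and with all the different information out there it can be a bit overwhelming at times. This is how I have cleaned the corpus I created so far: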

# Load the text-mining package
library(tm)

# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))

# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))

# Remove anything other than letters or whitespace
# (depending on the locale, [:alpha:] may also match non-English letters such as ä or ñ)
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]+", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))

# Add extra stopwords
myStopwords <- c(stopwords('english'),"im", "just", "one","youre", 
"hes","shes","its","were","theyre","ive","youve","weve","theyve","id")

# Remove stopwords from corpus
corpus <- tm_map(corpus, removeWords, myStopwords)

# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)

# Replace every run of characters other than a-z, A-Z, 0-9 or whitespace with a space.
# This only strips non-ASCII characters; it does not remove comments written in other languages.
corpus <- tm_map(corpus, content_transformer(function(s) {
  gsub(pattern = "[^a-zA-Z0-9\\s]+", replacement = " ", x = s, perl = TRUE)
}))

# Replace word elongations using the textclean package by Tyler Rinker
library(textclean)
corpus <- tm_map(corpus, content_transformer(replace_word_elongation))

# Creating data frame from corpus 
corpus_asdataframe<-data.frame(text = sapply(corpus, as.character),stringsAsFactors = FALSE)

# Due to pre-processing some rows are empty. Therefore, the empty rows should be removed.

# Remove empty rows and NAs from the data frame (drop = FALSE keeps it a data frame)
keep <- !is.na(corpus_asdataframe$text) & trimws(corpus_asdataframe$text) != ""
corpus_asdataframe <- corpus_asdataframe[keep, , drop = FALSE]

# Create corpus from the cleaned data frame
corpus <- Corpus(VectorSource(corpus_asdataframe$text))

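So the problem now is that my corpus contains many comments in Spanish or German, and I would like to exclude them. I thought I could maybe download an English dictionary and use an inner join to detect English words and remove all other languages. However, I am very new to coding (I study business administration and have never done anything related to computer science), so my skills are not sufficient to apply this idea to my corpus (or data frame). I really hope to find some help here. I would appreciate it very much! Thank you, and greetings from Germany!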

1 Answer:

Answer 0: (score: 0)

library("cld3")

dftest <- data.frame(
  id = 1:3,
  text = c(
    "Holla this is a spanish word",
    "English online here",
    "Bonjour, comment ça va?"
  )
)

# Keep only the rows whose text is detected as English
subset(dftest, detect_language(dftest$text) == "en")

##   id                         text
## 1  1 Holla this is a spanish word
## 2  2          English online here
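
Applied to the question, it is probably better to run the detection on the raw comments before any cleaning, since lowercasing and stopword removal leave the detector less text to work with. The following is only a sketch under assumptions: the rows in comments_final are made-up stand-ins for the asker's data, is_english is a hypothetical helper name, and it relies on detect_language returning NA when it cannot decide (unlike subset, plain `[` indexing does not drop NA rows by itself, hence the explicit check).

```r
# Stand-in for the asker's comments_final data frame (hypothetical rows)
comments_final <- data.frame(
  textOriginal = c("I really love this song",
                   "Das ist ein sehr gutes Lied",
                   ""),
  stringsAsFactors = FALSE
)

# TRUE for texts detected as English; undetermined (NA) counts as not English.
# The default detector, cld3::detect_language, is only looked up when it is used.
is_english <- function(text, detector = cld3::detect_language) {
  lang <- detector(text)
  !is.na(lang) & lang == "en"
}

# Only run the real detection when the cld3 package is installed
if (requireNamespace("cld3", quietly = TRUE)) {
  comments_en <- comments_final[is_english(comments_final$textOriginal), , drop = FALSE]
  # comments_en$textOriginal can now be passed to Corpus(VectorSource(...))
  # and cleaned with the same steps as in the question
}
```

Very short comments give a detector little to go on, so some misclassification should be expected on YouTube data.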

Credit: Ken Benoit, in: Find in a dfm non-english tokens and remove them
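
The dictionary idea from the question can also be sketched in base R. This is a rough illustration, not a recommended replacement for a language detector: english_words is a deliberately tiny, hypothetical word list (a real one would have to be downloaded and would be far larger), the example comments are made up, and the 0.5 threshold is an arbitrary assumption. Checking tokens with %in% is the same membership test an inner join against a dictionary table would perform.

```r
# Hypothetical, deliberately tiny English word list
english_words <- c("this", "is", "a", "the", "video", "great",
                   "i", "love", "song", "very", "good")

# Made-up example comments
comments <- data.frame(
  id = 1:3,
  text = c("this is a great video",
           "das ist ein sehr gutes video",
           "me encanta esta cancion"),
  stringsAsFactors = FALSE
)

# Share of each comment's tokens that appear in the word list
english_share <- function(text, dictionary) {
  tokens <- strsplit(tolower(text), "[^a-z]+")
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]          # drop empty tokens
    if (length(tok) == 0) return(0)  # empty comment: no English tokens
    mean(tok %in% dictionary)
  }, numeric(1))
}

comments$english_share <- english_share(comments$text, english_words)

# Keep comments in which at least half of the tokens are known English words
comments_en <- comments[comments$english_share >= 0.5, , drop = FALSE]
```

With this toy list only the first comment is kept. Words shared between languages (here "video") raise the score of non-English comments, which is why a threshold is used instead of requiring a single match.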