Bigram tokenization and the unigram tokenizer

Time: 2018-11-01 20:34:42

Tags: r

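I'm running into a problem with bigram tokenization: the results it shows are the same as the unigram tokenization results. The chart keeps showing single words instead of two-word pairs. I'm new to R, so I tried to follow this tutorial: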

https://rstudio-pubs-static.s3.amazonaws.com/40817_63c8586e26ea49d0a06bcba4e794e43d.html

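I used most of the code from that post, but I also tried some code from other sources I had been working with. I did read that changing the corpus to a VCorpus can fix this problem, so I tried changing corpus <- Corpus(review_source) to corpus <- VCorpus(review_source), but the result was the same. Any guidance would be greatly appreciated. Here is the code I'm currently using: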

library(tm)
library(RWeka)
library(ggplot2)
library(magrittr)  # provides %>%

## Load in the data
setwd("G:\\Customer Analytics\\AdHoc")
reviews <- read.csv("Reseller Ratings_clean.csv", stringsAsFactors = FALSE,
                    header = TRUE)

review_text <- paste(reviews$reviews)

#setting up source and corpus
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)

########################
# Data Cleaning - Applying Transformations
########################
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Replace special characters with a space (inside a bracket expression,
# characters such as $, *, ? and ) are matched literally)
corpus <- tm_map(corpus, toSpace, "[/@$:)*&!?_#-]")
corpus <- tm_map(corpus, content_transformer(tolower)) # conversion to lower case
corpus <- tm_map(corpus, removePunctuation) # removal of punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) # common stop words like for, very, and, of, are, etc.
corpus <- tm_map(corpus, removeWords, c("the", "will", "The", "also",
                                        "that", "and", "for", "in", "is", "it", "not", "to"))
corpus <- tm_map(corpus, removeNumbers) # removal of numbers
corpus <- tm_map(corpus, stripWhitespace) # removal of extra whitespace
corpus <- tm_map(corpus, stemDocument) # stemming removes common English word endings

#Making a document-term matrix
dtm <- DocumentTermMatrix(corpus)

################################
# The transpose of the document term matrix 
################################
tdm <- TermDocumentMatrix(corpus)

# Frequency
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)

# Plot Histogram
subset(wf, freq>500) %>%
ggplot(aes(word, freq)) +
geom_bar(stat="identity", fill="darkred", colour="darkgreen") +
theme(axis.text.x=element_text(angle=45, hjust=1))

# Create Wordcloud
library(wordcloud)
set.seed(100)
wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))

###############################
# N-gram tokenization of the Corpus 
###############################
OnegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# Use the corpus built above (no object named docs is created in this script)
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = OnegramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)

pl <- ggplot(subset(wof, freq > 500), aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="darkred", colour="blue")
pl + theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("Uni-Gram Frequency")

############################
# Bi-Gram Tokenization of the Corpus
############################

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# Use the corpus built above (no object named docs is created in this script)
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)

pl <- ggplot(subset(wof, freq >500) ,aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="darkgreen", colour="blue")
pl + theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("Bi-Gram Frequency")
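
For reference, here is a minimal sketch of bigram tokenization in base R, so the pairing step can be checked without RWeka or Java. The names `bigram_tokenizer`, `texts`, `vc` and `dtm_bi` and the two sample sentences are made up for illustration and are not from the tutorial. The second half only runs if tm is installed. It builds the corpus with `VCorpus()` because, as far as I understand the tm documentation, a corpus created with `Corpus()` on a `VectorSource` is a `SimpleCorpus` in recent tm versions, and `DocumentTermMatrix()` does not apply custom tokenizer functions to a `SimpleCorpus`.

```r
# Hypothetical helper: split a document into words and pair each word
# with the word that follows it.
bigram_tokenizer <- function(x) {
  # Collapse the document to one string, then split on runs of whitespace
  words <- unlist(strsplit(paste(as.character(x), collapse = " "), "\\s+"))
  words <- words[nzchar(words)]
  # A document with fewer than two words has no bigrams
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

print(bigram_tokenizer("fast shipping great price"))
# "fast shipping" "shipping great" "great price"

# Optional: plug the tokenizer into tm, using VCorpus rather than Corpus
if (requireNamespace("tm", quietly = TRUE)) {
  texts <- c("fast shipping great price", "great price slow shipping")
  vc <- tm::VCorpus(tm::VectorSource(texts))
  dtm_bi <- tm::DocumentTermMatrix(vc, control = list(tokenize = bigram_tokenizer))
  print(tm::Terms(dtm_bi))  # the terms should now be two-word pairs
}
```

If the terms printed by the last line are still single words, checking `class(corpus)` on the real corpus is a quick way to see whether it is a `SimpleCorpus` or a `VCorpus`.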

0 answers:

No answers