Bigram tokenizer and unigram tokenizer produce identical output

Asked: 2017-07-24 00:06:24

Tags: r token tm

I can't get my n-gram tokenizers to work. The unigram tokenizer seems to work fine, but as soon as I apply the bigram tokenizer to the corpus, it gives me back the same list of single words as the unigram tokenizer. The code is below.

##Loading the data may be part of the problem
blogs <- readLines("en_US.blogs.txt", 
               encoding = "UTF-8", skipNul=TRUE)
news <- readLines("en_US.news.txt", 
              encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines("en_US.twitter.txt", 
                 encoding = "UTF-8", skipNul=TRUE)

blogs_sample <- SampleData(blogs, 0.01)

writeLines(blogs_sample, "blogs_sample.txt")
news_sample <- SampleData(news, 0.01)
writeLines(news_sample, "news_sample.txt")
twitter_sample <- SampleData(twitter, 0.01)
writeLines(twitter_sample, "twitter_sample.txt")
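(`SampleData` is not defined in the post; a plausible sketch, assuming it simply keeps a random fraction `p` of the lines:)

```r
# Hypothetical stand-in for the undefined SampleData helper:
# keep each line with probability p (assumption, not the author's code)
SampleData <- function(lines, p) {
  set.seed(123)  # fixed seed so the sample is reproducible
  lines[rbinom(length(lines), 1, p) == 1]
}
```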

This may be the problem, because when I use DirSource from the tm package I'm not sure what the actual corpus looks like.
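One way to see what the corpus actually contains is to inspect it directly (a sketch using tm's own accessors; the index 1 is arbitrary):

```r
# Peek at the corpus: how many documents, and the start of the first one
length(corpus)
inspect(corpus[1])
substr(content(corpus[[1]]), 1, 200)
```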

corpus <- Corpus(DirSource("/Users/calvin.hutto/Desktop/R/Coursera Capstone/final/en_US/sample",
                           encoding = "UTF-8"),
                 readerControl = list(language = "en_US"))


library(RWeka)  # NGramTokenizer and Weka_control come from RWeka

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm_1 <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm_3 <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

So when I check the head of the bigram tdm and the unigram tdm, they both show the same list of single words.
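A quick way to compare what the two matrices actually tokenized (a sketch; `Terms()` is tm's accessor for a term-document matrix's vocabulary):

```r
# If the bigram tokenizer worked, tdm_2's terms should contain spaces
head(Terms(tdm_1))
head(Terms(tdm_2))
any(grepl(" ", Terms(tdm_2)))  # FALSE would confirm the bug: no bigrams at all
```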

Any help would be appreciated!

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: 

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.7-1   NLP_0.1-10

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10    digest_0.6.12   crayon_1.3.2    SnowballC_0.5.1 slam_0.1-40     bitops_1.0-6    R6_2.2.2       
 [8] magrittr_1.5    swirl_2.4.3     httr_1.2.1      stringi_1.1.5   testthat_1.0.2  tools_3.4.0     stringr_1.2.0  
[15] RCurl_1.95-4.8  yaml_2.1.14     parallel_3.4.0  compiler_3.4.0 

1 Answer:

Answer 0 (score: 0)

That looks really complicated. How about a simpler approach?

require(readtext)
require(quanteda)

mycorpus <- corpus(readtext("/Users/calvin.hutto/Desktop/R/Coursera Capstone/final/en_US/sample/*.txt"))
mydfm <- dfm(mycorpus, ngrams = 1:2, remove_punct = TRUE)
head(mydfm)

I can't show the output since I don't have your data, but this should work fine.
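To confirm the bigrams are actually there, something along these lines should work (a sketch; `topfeatures()` is quanteda's frequency helper, and bigrams are joined with `_` by default):

```r
# Most frequent features; bigram features contain an underscore
topfeatures(mydfm, 20)

# Keep only the bigram columns (glob pattern matching on feature names)
mydfm_2 <- dfm_select(mydfm, pattern = "*_*")
```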