dictionary function for bi_grams

Time: 2018-05-01 10:05:23

Tags: r tm

I have text like this:

library(dplyr)
glimpse(text)
chr [1:11] "Welcome to Wikipedia ! [bla] Discover Ekopedia, the practical 
encyclopedia about alternative life techniques. \"| __truncated__ ...

and these bi_grams:

glimpse(dict)
chr [1:34] "and i" "and the" "as a" "at the" "do not" "for the" "from the" 
"has been" "i am" "i dont" ...

My goal is to build a DocumentTermMatrix from text using the bi_grams in dict.

To achieve this, I preprocess text:

library(tm)
corpus <- VCorpus(VectorSource(text))
corpus_clean <- corpus %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>% 
  tm_map(stripWhitespace)

and then use the dictionary function:

dtm <- DocumentTermMatrix(corpus_clean, list(dictionary=dict))

The result looks like this:

dtm <-  as.data.frame(as.matrix(dtm))
glimpse(dtm)
Observations: 4
Variables: 34
$ and.i       <dbl> 0, 0, 0, 0
$ and.the     <dbl> 0, 0, 0, 0
$ as.a        <dbl> 0, 0, 0, 0
$ at.the      <dbl> 0, 0, 0, 0
$ do.not      <dbl> 0, 0, 0, 0
$ for.the     <dbl> 0, 0, 0, 0

Because of the . between the words of the bi_grams, the counts are all 0. Any idea how to use the dictionary function correctly for bi_grams?

dput(text)
c("Welcome to Wikipedia ! [bla] Discover Ekopedia, the practical encyclopedia about alternative life techniques. \n\n \n[bla] Discover Ekopedia, the practical encyclopedia about alternative life techniques.", 
"Including some appropriate mention of the Solomon article is not without some level of support .", 
"\"\n\nComment. I could not verify the claim.  (talk) \"", "\"\n Czech Republic is in Central Europe. The state of this article is part of the reason why people are making such confusions. Especially more ridiculous is that they you would replace the mention of North Caucasus in favor of \"\"north slope of Caucasus Mountains\"\" which isnt even a geographical area other than denoting the mountains in that region. Countries are located within continents, yet for some reason you refuse to allow this article to be denoted a continent. This single factor alone would have made a massive difference for readers. I'm tired of arguing with people who are essentially wiki-squatters refusing to nudge on a given article. 24.90.230.216  \"", 
"Thanks, Josette. I enjoyed meeting you, too. I was shocked by the decision, which does not begin to reflect consensus. Does just one Grand Poobah make it alone? Serves me right for stealing time from more pressing real-world duties to indulge in a fun hobby. I've learned my lesson and won't waste time like that again. I'll stick to fixing the little things I run across as I read articles for my own information.", 
"Paleontologists agree that organic remains must be buried quickly so they can be preserved long enough to be come fossilized.  However, the term fossilized is not a very precise term.  There are several factors and metamorphic mineral processes which occur to organic remains that result in what is typically called a fossil.  One major factor concerns what kind of organisms are to be fossilized  vertebrate, invertebrates, radiolarians, sponges, plants, pollen, foot prints, etc.  And multiple processes may include permineralization, recrystalization, carbonization, replacement, dissolving, diagenesis, etc.  Talking about fossilization is a complex issue, however quick burial is not questioned.\n\nThe major question is, how long does it take for these processes to work on organic reamins in the environment they are found in?  Experimental taphonomy has resulted in an assortment of remains becoming fossilized by various processes in the lab, which of course implies that given the right conditions, vast ages are not an issue.  The metamorphic processes are ongoing until an equilibrium is met between the chemical enviroument of the burial site and the minerals of the organic remains.  Flood catastrophic geologists do not expect that organic remains buried during the flood were completely fossilized within the one year period of the flood, but rather that there has been some 4000 years for the processes to have been working.  Much more work needs to be done on the taphonomy of organic remains.  Yet, how one interprets even those results will depend upon which world view you choose to believe with.", 
"Also I think Vegetable Basket needs it's own Wikipedia page.", 
"Bigfoot Reference \n\nThe magazine is better known as just the Engineering and Mining Journal, which you may have a difficult time finding, depending on where you live.  I ran across the article a few years ago while researching something else, and made a copy.  It is clearly derived from press accounts, and treats the incident as a joke.  My whole point in citing it was to show that the incident, whatever it was, was not (entirely) created 40+ years after the fact.  If you leave me your email, I will scan the page and email you a PDF.", 
"Also see this if you cant trust Murkoth Ramunni\nhttp://books.google.com/books?id=HHev0U1GfpEC&pg;=PA51&dq;=Thiyya+matrilineal&hl;=en&sa;=X&ei;=TlpPUd2aH8mWiQLgvIDgBA&ved;=0CDYQ6AEwAQ#v=onepage&q;=Thiyya%20matrilineal&f;=false", 
"\"\n\n Chart performance of \"\"Single Ladies (Put a Ring on It)\"\" \n\nPlease take my advice and split up the paragraphs in the section. FAs generally have short paragraphs. It's hard and boring to ingest so much information at once, so splitting the paragraphs will improve the flow. — · [ TALK ]  \"", 
"\"\n\nhahahaha.... good one ......\nI have removed it.\n \""
)

dput(dict)
c("and i", "and the", "as a", "at the", "do not", "for the", 
"from the", "has been", "i am", "i dont", "i have", "i think", 
"if you", "in the", "is a", "is not", "is the", "it is", "of the", 
"on the", "should be", "talk page", "thank you", "that the", 
"that you", "the article", "there is", "this is", "to be", "to do", 
"to the", "with the", "you are", "you have")

1 answer:

Answer 0 (score: 2)

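When you create the dtm this way, the dictionary entries are matched against single-word tokens, so no matches are found and every count is 0. You need to use a bigram tokenizer in the DocumentTermMatrix call. See the example below.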

library(dplyr)
library(tm)
corpus <- VCorpus(VectorSource(text))
corpus_clean <- corpus %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>% 
  tm_map(stripWhitespace)

# Create tokenizer using NLP package
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# create dtm with call to tokenizer and dictionary
dtm <- DocumentTermMatrix(corpus_clean, list(tokenize = NLPBigramTokenizer,
                                             dictionary = dict))


inspect(dtm)
<<DocumentTermMatrix (documents: 11, terms: 34)>>
Non-/sparse entries: 23/351
Sparsity           : 94%
Maximal term length: 11
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs and the as a do not for the has been if you in the is not of the to be
  1        0    0      0       0        0      0      0      0      0     0
  10       0    0      0       0        0      0      1      0      0     0
  11       0    0      0       0        0      0      0      0      0     0
  2        0    0      0       0        0      0      0      1      1     0
  3        0    0      0       0        0      0      0      0      0     0
  4        0    0      0       0        0      0      0      0      1     1
  6        1    0      1       1        1      0      2      2      3     3
  7        0    0      0       0        0      0      0      0      0     0
  8        0    1      0       0        0      1      0      0      0     0
  9        0    0      0       0        0      1      0      0      0     0
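
To see why the tokenizer matters, here is a minimal base-R sketch that does not use tm at all. It only imitates the idea: the sentence is taken from document 6 after cleaning, and toy_dict, unigram_hits and bigram_hits are hypothetical names introduced for illustration. Word-level tokens can never equal a two-word dictionary entry, while tokens built from adjacent word pairs can.

```r
# Base-R sketch (no tm needed): a bigram dictionary only matches bigram tokens.
doc <- "the term fossilized is not a very precise term"
toy_dict <- c("is not", "to be")  # hypothetical two-entry dictionary

# Word-level tokens, roughly what a single-word tokenizer produces
unigrams <- strsplit(doc, " ", fixed = TRUE)[[1]]

# Bigrams: paste every word together with its successor
bigrams <- paste(head(unigrams, -1), tail(unigrams, -1))

# Count how often each dictionary entry occurs under each tokenization
unigram_hits <- sapply(toy_dict, function(term) sum(unigrams == term))
bigram_hits  <- sapply(toy_dict, function(term) sum(bigrams == term))

print(unigram_hits)  # every count is 0: no single word equals "is not"
print(bigram_hits)   # "is not" is found once, "to be" not at all
```

In the tm call above, the tokenize entry plays the role of the bigrams step and the dictionary entry plays the role of the counting step.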

Data:

text <- c("Welcome to Wikipedia ! [bla] Discover Ekopedia, the practical encyclopedia about alternative life techniques. \n\n \n[bla] Discover Ekopedia, the practical encyclopedia about alternative life techniques.", 
          "Including some appropriate mention of the Solomon article is not without some level of support .", 
          "\"\n\nComment. I could not verify the claim.  (talk) \"", "\"\n Czech Republic is in Central Europe. The state of this article is part of the reason why people are making such confusions. Especially more ridiculous is that they you would replace the mention of North Caucasus in favor of \"\"north slope of Caucasus Mountains\"\" which isnt even a geographical area other than denoting the mountains in that region. Countries are located within continents, yet for some reason you refuse to allow this article to be denoted a continent. This single factor alone would have made a massive difference for readers. I'm tired of arguing with people who are essentially wiki-squatters refusing to nudge on a given article. 24.90.230.216  \"", 
          "Thanks, Josette. I enjoyed meeting you, too. I was shocked by the decision, which does not begin to reflect consensus. Does just one Grand Poobah make it alone? Serves me right for stealing time from more pressing real-world duties to indulge in a fun hobby. I've learned my lesson and won't waste time like that again. I'll stick to fixing the little things I run across as I read articles for my own information.", 
          "Paleontologists agree that organic remains must be buried quickly so they can be preserved long enough to be come fossilized.  However, the term fossilized is not a very precise term.  There are several factors and metamorphic mineral processes which occur to organic remains that result in what is typically called a fossil.  One major factor concerns what kind of organisms are to be fossilized  vertebrate, invertebrates, radiolarians, sponges, plants, pollen, foot prints, etc.  And multiple processes may include permineralization, recrystalization, carbonization, replacement, dissolving, diagenesis, etc.  Talking about fossilization is a complex issue, however quick burial is not questioned.\n\nThe major question is, how long does it take for these processes to work on organic reamins in the environment they are found in?  Experimental taphonomy has resulted in an assortment of remains becoming fossilized by various processes in the lab, which of course implies that given the right conditions, vast ages are not an issue.  The metamorphic processes are ongoing until an equilibrium is met between the chemical enviroument of the burial site and the minerals of the organic remains.  Flood catastrophic geologists do not expect that organic remains buried during the flood were completely fossilized within the one year period of the flood, but rather that there has been some 4000 years for the processes to have been working.  Much more work needs to be done on the taphonomy of organic remains.  Yet, how one interprets even those results will depend upon which world view you choose to believe with.", 
          "Also I think Vegetable Basket needs it's own Wikipedia page.", 
          "Bigfoot Reference \n\nThe magazine is better known as just the Engineering and Mining Journal, which you may have a difficult time finding, depending on where you live.  I ran across the article a few years ago while researching something else, and made a copy.  It is clearly derived from press accounts, and treats the incident as a joke.  My whole point in citing it was to show that the incident, whatever it was, was not (entirely) created 40+ years after the fact.  If you leave me your email, I will scan the page and email you a PDF.", 
          "Also see this if you cant trust Murkoth Ramunni\nhttp://books.google.com/books?id=HHev0U1GfpEC&pg;=PA51&dq;=Thiyya+matrilineal&hl;=en&sa;=X&ei;=TlpPUd2aH8mWiQLgvIDgBA&ved;=0CDYQ6AEwAQ#v=onepage&q;=Thiyya%20matrilineal&f;=false", 
          "\"\n\n Chart performance of \"\"Single Ladies (Put a Ring on It)\"\" \n\nPlease take my advice and split up the paragraphs in the section. FAs generally have short paragraphs. It's hard and boring to ingest so much information at once, so splitting the paragraphs will improve the flow. — · [ TALK ]  \"", 
          "\"\n\nhahahaha.... good one ......\nI have removed it.\n \""
)

dict <- c("and i", "and the", "as a", "at the", "do not", "for the", 
  "from the", "has been", "i am", "i dont", "i have", "i think", 
  "if you", "in the", "is a", "is not", "is the", "it is", "of the", 
  "on the", "should be", "talk page", "thank you", "that the", 
  "that you", "the article", "there is", "this is", "to be", "to do", 
  "to the", "with the", "you are", "you have")