How do I exclude punctuation and whitespace from n-grams?

Asked: 2016-04-18 09:48:44

Tags: r whitespace n-gram punctuation

I have a character vector like this:

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

I use textcnt() to generate bigrams as follows:

library(tau)  # textcnt() comes from the tau package
txt <- textcnt(sent, method = "string", split = " ", n=2, tolower = FALSE)

format(txt) gives me all the bigrams:

              frq rank  bytes Encoding
Over the      1   4.5   8     unknown
The quick     2   11.5  9     unknown
brown fox     2   11.5  9     unknown
brown fox.    1   4.5   10    unknown
dog jumped    1   4.5   10    unknown
dog. Over     1   4.5   9     unknown
fox jumps     2   11.5  9     unknown
fox. The      1   4.5   8     unknown
jumped the    1   4.5   10    unknown
jumps over    2   11.5  10    unknown
lazy dog      1   4.5   8     unknown
lazy dog.     2   11.5  9     unknown
over the      2   11.5  8     unknown
quick brown   3   15.5  11    unknown
the lazy      3   15.5  8     unknown
the quick     1   4.5   9     unknown  

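The real data has many more sentences. I have two questions:
1. Is it possible to specify that the period at the end of each sentence should be stripped from the resulting n-grams?
2. Is it possible to prevent n-grams that span two sentences, such as "dog. Over" and "fox. The"?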

1 Answer:

Answer 0: (score: 1)

You can avoid specific n-grams in textcnt by avoiding textcnt. :-) To flesh out @lukeA's comment, here is a full quanteda solution.

require(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.19’

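This tokenizes into bigrams and removes punctuation at the same time. Because each sentence is its own "document", the bigrams never span documents, and so never span sentences.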

(bigramToks <- tokenize(sent, ngrams = 2, removePunct = TRUE, concatenator = " "))
## tokenizedText object from 3 documents.
## Component 1 :
## [1] "The quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"  "over the"    "the lazy"    "lazy dog"   
## 
## Component 2 :
## [1] "Over the"    "the lazy"    "lazy dog"    "dog jumped"  "jumped the"  "the quick"   "quick brown" "brown fox"  
## 
## Component 3 :
## [1] "The quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"  "over the"    "the lazy"    "lazy dog"   
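
To see why treating each sentence as its own document rules out cross-sentence bigrams, here is a minimal base R sketch of the same idea. The helper sentence_bigrams() is hypothetical (it is not part of tau or quanteda), and it assumes that stripping every punctuation character is acceptable, which also removes apostrophes and hyphens inside words.

```r
# Hypothetical helper (not part of tau or quanteda): build word bigrams
# for each sentence separately, after stripping punctuation.
sentence_bigrams <- function(sentences) {
  lapply(sentences, function(s) {
    # Remove punctuation, then split on runs of whitespace
    clean <- gsub("[[:punct:]]+", "", s)
    words <- strsplit(trimws(clean), "[[:space:]]+")[[1]]
    # A sentence with fewer than two words has no bigrams
    if (length(words) < 2) return(character(0))
    # Pair each word with its successor within the same sentence only
    paste(head(words, -1), tail(words, -1))
  })
}

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

bigrams <- sentence_bigrams(sent)

# Pool the per-sentence bigrams and count them
bigram_freq <- sort(table(unlist(bigrams)), decreasing = TRUE)
print(bigram_freq)
```

Because the pairing happens inside each list element, a bigram such as "dog Over" can never be formed.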

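To get the frequencies, build a document-feature matrix with dfm(), which tabulates the bigram tokens. (Note: you can skip the tokenization step and do this directly with dfm(sent, ngrams = 2, toLower = FALSE, concatenator = " ").)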

(bigramDfm <- dfm(bigramToks, toLower = FALSE, verbose = FALSE))
## Document-feature matrix of: 3 documents, 12 features.
## 3 x 12 sparse Matrix of class "dfmSparse"
##        features
## docs    The quick quick brown brown fox fox jumps jumps over over the the lazy lazy dog Over the dog jumped
##   text1         1           1         1         1          1        1        1        1        0          0
##   text2         0           1         1         0          0        0        1        1        1          1
##   text3         1           1         1         1          1        1        1        1        0          0
## features
## docs    jumped the the quick
##   text1          0         0
##   text2          1         1
##   text3          0         0

topfeatures(bigramDfm, n = nfeature(bigramDfm))
## quick brown   brown fox    the lazy    lazy dog   The quick   fox jumps  jumps over    over the    Over the 
##           3           3           3           3           2           2           2           2           1 
##  dog jumped  jumped the   the quick 
##           1           1           1 
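
The calls above are written against quanteda 0.9.5.19. As a hedged sketch, assuming quanteda 3.x or later is installed, the same steps would look roughly like this with the renamed functions and arguments (tokens(), remove_punct, tokens_ngrams(), tolower, nfeat()). The variable names are illustrative only.

```r
# Assumes quanteda 3.x or later; names differ from the 0.9.5.19 calls above
library(quanteda)

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Tokenize each sentence as its own document, dropping punctuation
toks <- tokens(sent, remove_punct = TRUE)

# Form bigrams within each document, joined by a space
bigram_toks <- tokens_ngrams(toks, n = 2, concatenator = " ")

# Tabulate without lowercasing so "The quick" and "the quick" stay distinct
bigram_dfm <- dfm(bigram_toks, tolower = FALSE)

# Frequencies of all bigram features, most frequent first
topfeatures(bigram_dfm, n = nfeat(bigram_dfm))
```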