Question

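How do I get the top K n-grams with the highest counts using Stanford CoreNLP? I know I can write this myself using a HashMap or a Trie, but my corpus is pretty big (200K articles, each about 30KB) and I want 5-grams, so the memory requirement will be huge. So I was wondering whether I can use CoreNLP for this. Given a corpus, it should return only the top K n-grams in this format: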

word1 word2 word3 word4 word5 : frequency

I don't want any probabilistic model.

Answer 1

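CoreNLP doesn't have anything to help you store n-grams efficiently. All it can help you with here is tokenizing the text (and possibly splitting the text into sentences, if you care about that).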

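If your corpus is big enough that you can't just use a hash table to keep the n-gram counts, you'll have to use an alternative, more space-efficient representation (e.g. a prefix trie).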
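For comparison, here is a minimal sketch of the plain hash-table approach. The function name `ngram-counts` is just an illustrative choice. Every key is a complete n-token vector, so n-grams that share a prefix store that prefix again and again, which is the overhead a prefix trie avoids.

(defn ngram-counts
  "Counts every n-gram in tokens with a plain hash map.
  Keys are full n-token vectors; values are occurrence counts."
  [n tokens]
  (frequencies (map vec (partition n 1 tokens))))

;; Example: bigram counts for a tiny token sequence.
(ngram-counts 2 ["a" "b" "a" "b" "a" "c"])

This works fine for small inputs; the memory cost only becomes a problem once the number of distinct n-grams gets large.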

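As an example, I just did a quick test in Clojure where I counted the 5-grams in the Gutenberg King James Version Bible. Storing the counts of 752K distinct 5-grams in a hashmap used 248 MB of heap. Storing the counts in a prefix trie used 57 MB, a 77% reduction.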

For reference, here's the complete Clojure program using prefix tries:

(ns nlp.core
  (:require [clojure.string :as string]))

(defn tokenize
  "Very simplistic tokenizer."
  [text]
  (string/split text #"[\s\:_\-\.\!\,\;]+"))

(defn get-bible-kjv-tokens []
  (tokenize (slurp "/Users/wiseman/nltk_data/corpora/gutenberg/bible-kjv.txt")))

(defn ngrams [n tokens]
  (partition n 1 tokens))

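(defn build-ngram-trie
  "Builds a prefix trie of nested maps keyed by token; each leaf holds an n-gram count."
  [n tokens]
  (->> tokens
       (ngrams n)
       (reduce (fn [trie ngram]
                 ;; fnil treats a missing leaf as 0 before incrementing
                 (update-in trie ngram (fnil inc 0)))
               {})))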

(defn enumerate-trie [trie]
  (if (not (map? trie))
    (list (list trie))
    (apply concat
           (for [[k v] trie]
             (map #(cons k %)
                  (enumerate-trie v))))))

(defn print-trie [trie]
  (doseq [path (enumerate-trie trie)]
    (println (string/join " " (butlast path)) ":" (last path))))


(defn -main []
  (let [ngram-counts (->> (get-bible-kjv-tokens)
                          (build-ngram-trie 5))]
    (print-trie ngram-counts)))

Output for the King James Version Bible:

$ lein run -m nlp.core | sort -r -k7,7 -n | head
And it came to pass : 383
the house of the LORD : 233
the word of the LORD : 219
of the children of Israel : 162
it came to pass when : 142
the tabernacle of the congregation : 131
saith the LORD of hosts : 123
it shall come to pass : 119
And the LORD said unto : 114
And the LORD spake unto : 107
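The `sort | head` pipeline sorts every n-gram just to keep ten of them. If you'd rather pick the top K inside the program, one option is a bounded min-heap that never holds more than K entries. The following is only a sketch under that assumption: `trie-entries` and `top-k` are names I made up for illustration, and `build-ngram-trie` is repeated so the snippet runs on its own.

(require '[clojure.string :as string])
(import 'java.util.PriorityQueue)

(defn build-ngram-trie
  "Nested maps keyed by token, with the n-gram count at each leaf."
  [n tokens]
  (reduce (fn [trie ngram] (update-in trie ngram (fnil inc 0)))
          {}
          (partition n 1 tokens)))

(defn trie-entries
  "Lazily yields [ngram count] pairs by walking the trie."
  [trie]
  (letfn [(walk [prefix node]
            (if (map? node)
              (mapcat (fn [[k v]] (walk (conj prefix k) v)) node)
              [[prefix node]]))]
    (walk [] trie)))

(defn top-k
  "Returns the k highest-count entries, highest first,
  using a min-heap that holds at most k entries at a time."
  [k entries]
  (let [by-count (fn [a b] (compare (second a) (second b)))
        heap     (PriorityQueue. (int (max 1 k)) by-count)]
    (doseq [e entries]
      (.add heap e)
      (when (> (.size heap) k)
        (.poll heap)))                 ; evict the smallest count seen so far
    (sort-by second > (vec heap))))

;; Usage on a toy corpus, printed in the format asked for in the question.
(let [tokens (string/split "the cat sat the cat ran the cat sat" #" ")]
  (doseq [[ngram c] (top-k 2 (trie-entries (build-ngram-trie 2 tokens)))]
    (println (string/join " " ngram) ":" c)))
;; the cat : 3
;; cat sat : 2

This only reduces the cost of the selection step; the trie itself still has to fit in memory.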

For some pointers on getting even more efficient, the following papers discuss efficient n-gram storage for large corpora:

ADtrees for Sequential Data and N-gram Counting - uses a custom data structure.

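Faster and Smaller N-Gram Language Models - "Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date"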
