Question

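I need to use lexical data from Google Books N-grams to construct a (sparse!) term co-occurrence matrix, where the rows are words, the columns are the same words, and each cell reflects how many times the two words appear in the same context window. The resulting tcm would then be used to compute a bunch of lexical statistics and serve as input to vector semantics methods (GloVe, LSA, LDA).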

For reference, the Google Books (v2) dataset is formatted as follows (tab-separated):

ngram      year    match_count    volume_count
some word  1999    32             12            # example bigram

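The problem, of course, is that these data are huge. That said, I only need a subset covering a few decades (about 20 years' worth of ngrams), and I am happy with a context window of at most 2 (i.e. using the trigram corpus). I have a few ideas, but none of them seems particularly good.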
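As a rough sketch of the year filtering described above, the tab-separated lines can be read with data.table and collapsed to one row per ngram by summing `match_count` over the years of interest. The sample lines, the `years_of_interest` range, and the variable names are made up for illustration; a real file would be passed to `fread()` by path rather than through `text =`.

```r
library(data.table)

# Made-up lines in the v2 layout: ngram, year, match_count, volume_count
raw <- c(
  "some word\t1999\t32\t12",
  "some word\t2000\t10\t6",
  "some word\t1850\t5\t3",
  "other word\t2001\t7\t4"
)

ngrams <- fread(
  text = raw, sep = "\t", header = FALSE, quote = "",
  col.names = c("ngram", "year", "match_count", "volume_count")
)

# Hypothetical 20-year window
years_of_interest <- 1990:2009

# Keep only the years of interest and sum the counts per ngram
ngram_counts <- ngrams[year %in% years_of_interest,
                       .(match_count = sum(match_count)),
                       by = ngram]
print(ngram_counts)
```

Collapsing to one row per ngram keeps the frequency in a single column, so no line has to be written out more than once.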

-Idea 1 - initially I was leaning more or less toward this:

# preprocessing (pseudo)
for file in trigram-files:
    download $file
    filter $lines where 'year' tag matches one of years of interest
    find the frequency of each of those ngrams (match_count)
    cat those $lines * $match_count >> file2
     # (write the same line x times according to the match_count tag)  
    remove $file

# tcm construction (using R)
library(text2vec)
grams <- readLines("file2") # read lines from file2 into a character vector
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it         <- itoken(grams)
vocab      <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
# text2vec >= 0.5 takes skip_grams_window in create_tcm();
# older versions took it in vocab_vectorizer() instead
tcm        <- create_tcm(it, vectorizer, skip_grams_window = 2L) # nice and sparse

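However, I have a hunch that this might not be the best solution. The ngram data files already contain co-occurrence data in the form of n-grams, and there is a field that gives the frequency. I feel there should be a more direct way.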

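-Idea 2 - I was also thinking of writing each filtered ngram to the new file only once (instead of replicating it match_count times), then creating an empty tcm and looping over the whole (year-filtered) ngram dataset, recording every instance where two words co-occur (using the match_count field) to populate the tcm. But, again, the data is big, and this kind of loop would probably take ages.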

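-Idea 3 - I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given that most entries are 0s), and (if I got it right) it simply loops through everything (and I assume a Python loop over this much data would be super slow), so it seems to be aimed more at rather small subsets of the data.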
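To put a number on the dense-versus-sparse concern, here is a small back-of-the-envelope sketch in R. The vocabulary size of 100,000 and the toy 1000 x 1000 matrix are assumptions picked only for illustration, not properties of the Ngram data.

```r
library(Matrix)

# Hypothetical vocabulary size, chosen only for the arithmetic
n_vocab <- 1e5
# A dense double matrix needs 8 bytes per cell
dense_gb <- n_vocab^2 * 8 / 1e9
cat("dense", n_vocab, "x", n_vocab, "matrix:", dense_gb, "GB\n")

# Small-scale comparison: 1000 x 1000 with at most 500 non-zero cells
set.seed(1)
n  <- 1000
sp <- sparseMatrix(i = sample(n, 500, replace = TRUE),
                   j = sample(n, 500, replace = TRUE),
                   x = 1, dims = c(n, n))
dense <- as.matrix(sp)

cat("sparse:", format(object.size(sp), units = "Kb"), "\n")
cat("dense: ", format(object.size(dense), units = "Kb"), "\n")
```

The sparse representation stores only the non-zero cells, so its size grows with the number of observed word pairs rather than with the square of the vocabulary.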

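Edit -Idea 4 - I came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer containing a broken link and a comment about MapReduce (I am familiar with neither, so I would not know where to start).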

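I figure, though, that I cannot be the first one who needs to tackle such a task, given the popularity of the Ngram dataset and the popularity of (non-word2vec) distributional semantics methods that operate on a tcm or dtm input; hence ->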

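...the question: what would be a more reasonable/efficient way of constructing a term co-occurrence matrix from Google Books Ngram data? (It can be a variation of the proposed ideas or something completely different; R preferred, but not necessary.)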

Answer 1

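I will give an idea of how you can do this, though it can be improved in several places. I deliberately wrote it "spaghetti-style" for better interpretability, but it can be generalized to more than tri-grams.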

library(data.table)
library(magrittr) # provides %>%

ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words (one column per tri-gram, one row per position)
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = TRUE) %>% simplify2array()

# vocab here is vocabulary from chunk, but you can be interested first 
# to create vocabulary from whole corpus of ngrams and filter non 
# interesting/rare words

# as.vector() is needed: unique() on a matrix drops duplicate rows, not duplicate words
vocab = unique(as.vector(tokens_matrix))
# convert char matrix to integer matrix for faster downstream calculations 
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)

ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]

dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1 / distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]

# column names differ between the three tables, so bind by position explicitly
dt = rbindlist(list(dt_12, dt_13, dt_23), use.names = FALSE)
# "reduce" by word indices again - sum pair co-occurrences which were in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]

# repr = "T" keeps triplet form (Matrix >= 1.3; older versions used giveCsparse = FALSE)
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt, dims = rep(length(vocab), 2), index1 = TRUE, 
                   repr = "T", check = FALSE, dimnames = list(vocab, vocab))
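As a hedged sketch of the generalization mentioned above, the same steps can be wrapped in a function that handles n-grams of any fixed length by enumerating all position pairs and weighting each by 1 / distance. The function name `build_tcm` and its arguments are hypothetical, and this is one possible way to write it, not the answer author's code.

```r
library(data.table)
library(Matrix)

# Hypothetical helper: build a sparse tcm from equal-length n-grams and their counts
build_tcm <- function(ngrams, match_count) {
  tokens <- strsplit(ngrams, " ", fixed = TRUE)
  n <- lengths(tokens)[1]
  stopifnot(all(lengths(tokens) == n))

  # n x length(ngrams) matrix: one column per n-gram, one row per position
  tokens_matrix <- matrix(unlist(tokens), nrow = n)
  vocab <- unique(as.vector(tokens_matrix))
  ids <- matrix(match(tokens_matrix, vocab), nrow = n)

  # every pair of positions (a < b) inside the n-gram
  pos <- combn(n, 2)
  pairs <- rbindlist(lapply(seq_len(ncol(pos)), function(k) {
    a <- pos[1, k]
    b <- pos[2, k]
    # weight by 1 / distance between the two positions
    data.table(row_id = ids[a, ], col_id = ids[b, ],
               cnt = match_count / (b - a))
  }))

  # sum contributions of the same word pair across n-grams and positions
  pairs <- pairs[, .(cnt = sum(cnt)), keyby = .(row_id, col_id)]

  sparseMatrix(i = pairs$row_id, j = pairs$col_id, x = pairs$cnt,
               dims = rep(length(vocab), 2),
               dimnames = list(vocab, vocab))
}

tcm_general <- build_tcm(c("as we know", "i know you"), c(32, 54))
print(tcm_general)
```

On the two example tri-grams this should give the same cell values as the step-by-step version, with rows holding the earlier word of each pair and columns the later one.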
