Memory management: building a sparse matrix by efficiently iterating over another sparse matrix

Date: 2017-12-03 05:10:39

Tags: python memory-management scipy sparse-matrix

I am trying to build a pointwise mutual information matrix. I have a 60k-by-60k scipy matrix of word co-occurrences, and I want to convert it into another sparse matrix in which entry i,j corresponds to log(p(i,j) / (p(i) * p(j))) for words i and j. I then drop the negative values to obtain the PPMI matrix. I am looking for an efficient way to iterate over the first matrix to generate the second one without using too much memory.

I have tried iterating over a copy of the first matrix, and also building the new CSR matrix row by row, using vstack on two sparse matrices to append each new row. Both processes were killed by memory errors. What is the best way to build this matrix, and then save it for later reuse?

import time
import numpy as np
from scipy import sparse

if inplace:
    for i in range(ctxt_matrix.shape[0]):  # row-wise operation
        # For each row (word vector), reweigh it in 3 steps:
        # 1. get the probability of this context instead of the raw count
        #    (divide by the total word count)
        # 2. divide that probability by the probability of this word/context
        #    pair co-occurring by chance (element-wise division by
        #    word_probas * word_probas[i])
        # 3. take the log of the result and reassign the row to it
        row_pmi = np.log((ctxt_matrix[i].toarray().T / total_words) /
                         (word_probas * word_probas[i])).T
        if cutoff_0:
            row_pmi[row_pmi < 0] = 0  # 0 cutoff for PPMI
        ctxt_matrix[i, :] = row_pmi
    print('PMI matrix building took:', time.time() - start)
    return ctxt_matrix

else:
    # same as above, but building a new matrix with vstack
    pmi_matrix = sparse.csr_matrix((1, ctxt_matrix.shape[1]))
    for i in range(ctxt_matrix.shape[0]):  # row-wise operation
        row_pmi = sparse.csr_matrix(
            np.log((ctxt_matrix[i].toarray().T / total_words) /
                   (word_probas * word_probas[i])).T)
        if cutoff_0:
            row_pmi[row_pmi < 0] = 0  # 0 cutoff for PPMI
        pmi_matrix = sparse.vstack((pmi_matrix, row_pmi))
        del row_pmi
    print('PMI matrix building took:', time.time() - start)
    return pmi_matrix

TL;DR - I need to create a sparse matrix by performing row-wise operations while iterating over another one. Here is some simplified code that shows what I am doing:

import time
import numpy as np
from scipy import sparse

start = time.time()
ctxt_matrix = sparse.csr_matrix(sparse.rand(5000, 5000))
for i in range(ctxt_matrix.shape[0]):
    row_pmi = np.log(ctxt_matrix[i, :].toarray().T / 500)  # some row-wise operation on the other matrix
    row_pmi[row_pmi < 0] = 0  # don't store negatives in memory
    ctxt_matrix[i, :] = sparse.csr_matrix(row_pmi).T
ctxt_matrix.eliminate_zeros()  # on a row slice this acts on a copy, so call it on the whole matrix
print('PMI matrix building took:', time.time() - start)

1 Answer:

Answer 0 (score: 0)

I experimented with some variations of your code:

import numpy as np
from scipy.sparse import vstack
from scipy import sparse

n, m = 10, 50000
source = sparse.random(n,m, 0.2, format='csr')*5000
print(repr(source))

ctxt_matrix = source.copy()
for i in range(ctxt_matrix.shape[0]):
    print(ctxt_matrix[i,:].nnz, end=' ')
    row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
    row_pmi[row_pmi<0] = 0 # don't store negatives in memory
    temp = sparse.csr_matrix(row_pmi).T
    print(temp.nnz)
    ctxt_matrix[i,:] = temp
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))

print('\nrow lil')
ctxt_matrix = source.tolil()
for i in range(ctxt_matrix.shape[0]):
    print(ctxt_matrix[i,:].nnz, end=' ')
    row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
    row_pmi[row_pmi<0] = 0 # don't store negatives in memory
    temp = sparse.lil_matrix(row_pmi).T
    print(temp.nnz)
    ctxt_matrix[i,:] = temp
print(repr(ctxt_matrix))

print('\nrow lil data')
ctxt_matrix = source.tolil()
for i in range(ctxt_matrix.shape[0]):
    data = np.array(ctxt_matrix.data[i])
    print(len(data))
    data = np.log(data/500) #some row-wise operation on the other matrix
    data[data<0] = 0 # don't store negatives in memory
    ctxt_matrix.data[i][:] = data
#print(repr(ctxt_matrix))
ctxt_matrix = ctxt_matrix.tocsr()
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))

print('\nwhole csr data')
ctxt_matrix = source.copy()
data = ctxt_matrix.data
data = np.log(data/500)
data[data<0] = 0
ctxt_matrix.data[:] = data
ctxt_matrix.eliminate_zeros()
print(repr(ctxt_matrix))

Results:

1407:~/mypy$ python3 stack47615473.py 
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 100000 stored elements in Compressed Sparse Row format>
stack47615473.py:12: RuntimeWarning: divide by zero encountered in log
  row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
10069 9081
9931 8943
10159 9134
10069 9043
9940 8924
9961 9009
9941 8939
9935 8923
9943 8983
10052 9072
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 90051 stored elements in Compressed Sparse Row format>

row lil
stack47615473.py:24: RuntimeWarning: divide by zero encountered in log
  row_pmi = np.log(ctxt_matrix[i,:].toarray().T/500) #some row-wise operation on the other matrix
10069 9081
9931 8943
10159 9134
10069 9043
9940 8924
9961 9009
9941 8939
9935 8923
9943 8983
10052 9072
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 90051 stored elements in LInked List format>

row lil data
10069
9931
10159
10069
9940
9961
9941
9935
9943
10052
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 90051 stored elements in Compressed Sparse Row format>

whole csr data
<10x50000 sparse matrix of type '<class 'numpy.float64'>'
    with 90051 stored elements in Compressed Sparse Row format>

Row iteration on a lil matrix is slower than on a csr one.

Operating on the lil and csr data attributes directly is nearly instantaneous.

There is also a way to iterate over the data attribute of the csr format directly. That requires indexing it with values from the indptr attribute. This has been discussed in earlier SO questions (and can be error-prone).
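That indptr-based iteration can be sketched as follows; this is a toy example (small matrix, made-up sizes) reusing the same log-and-clamp reweighting from the question:

```python
import numpy as np
from scipy import sparse

m = sparse.random(5, 20, 0.3, format='csr', random_state=0) * 1000

# m.data holds the nonzeros row by row; m.indptr[i]:m.indptr[i+1] is the
# slice of m.data that belongs to row i.
for i in range(m.shape[0]):
    start, stop = m.indptr[i], m.indptr[i + 1]
    # log-and-clamp reweighting, applied only to row i's stored values
    m.data[start:stop] = np.maximum(np.log(m.data[start:stop] / 500), 0)
m.eliminate_zeros()  # drop the entries the clamp set to 0
```

No new sparse matrices are constructed inside the loop, which is why this avoids both the `toarray`/`csr_matrix` overhead and the vstack copies; the caveat is that writing outside a row's `indptr` slice silently corrupts other rows.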

csr row iteration is somewhat slow because it has to construct a new csr matrix at each step. The toarray step is also somewhat slow. It is much faster if you can operate on just the nonzero data values of the row or of the whole matrix.

This does not address the high memory use directly. I would expect the in-place changes to the matrix to use less memory, while the repeated vstack uses a lot. I do wonder, though: is the matrix so large that merely constructing a copy of it triggers the memory error?
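On the "save it for later reuse" part of the question: a minimal sketch using `scipy.sparse.save_npz` / `load_npz` (available since SciPy 0.19; the file path here is just an illustration):

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for the finished PPMI matrix.
pmi_matrix = sparse.random(100, 100, 0.05, format='csr', random_state=0)

# save_npz / load_npz round-trip a CSR matrix losslessly.
path = os.path.join(tempfile.mkdtemp(), 'pmi_matrix.npz')
sparse.save_npz(path, pmi_matrix)
loaded = sparse.load_npz(path)
assert np.allclose(loaded.toarray(), pmi_matrix.toarray())
```

Since only `data`, `indices`, and `indptr` are written, the file stays proportional to the number of nonzeros, which matters at 60k-by-60k.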