Question

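I have a fairly large sparse matrix that I estimate will take up about 1 GB when loaded into memory.

I don't need access to the whole matrix at all times, so some kind of memory mapping would work; however, it doesn't seem to be possible to memory-map a sparse matrix with numpy or scipy (the tools I'm familiar with).

It can easily fit into memory, but it would be a pain to have to load it every time I run the program. Perhaps there is some way to keep it in memory between runs?

So, what do you suggest: 1. Find a way to memory-map a sparse matrix; 2. Just load the whole thing into memory every time; 3. ?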

Answer 1

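The following may work as a general concept, but you will have to figure out a lot of the details... You should first get familiar with the CSR format, in which all the information for a matrix is stored in three arrays: two whose length is the number of non-zero entries, and one whose length is the number of rows plus one: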

>>> import scipy.sparse as sps
>>> a = sps.rand(5, 5, density=0.2, format='csr')
>>> a.toarray()
array([[ 0.        ,  0.46531486,  0.03849468,  0.51743202,  0.        ],
       [ 0.        ,  0.67028033,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.9967058 ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])
>>> a.data
array([ 0.46531486,  0.03849468,  0.51743202,  0.67028033,  0.9967058 ])
>>> a.indices
array([1, 2, 3, 1, 4])
>>> a.indptr
array([0, 3, 4, 4, 5, 5])

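So a.data holds the non-zero entries in row-major order, a.indices holds the corresponding column indices of those entries, and a.indptr holds the offsets into the other two arrays at which the data for each row starts. For example, a.indptr[3] = 4 and a.indptr[3+1] = 5, so the non-zero entries in the fourth row are a.data[4:5], and their column indices are a.indices[4:5].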
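To make the indexing rule concrete, here is a small sketch that walks through every row of the example using nothing but the three arrays printed above (typed in by hand here). The helper name row_entries is made up for illustration and is not part of scipy.

```python
import numpy as np
import scipy.sparse as sps

# The three CSR arrays from the example above, typed in by hand.
data = np.array([0.46531486, 0.03849468, 0.51743202, 0.67028033, 0.9967058])
indices = np.array([1, 2, 3, 1, 4])
indptr = np.array([0, 3, 4, 4, 5, 5])

def row_entries(r):
    """Return (column indices, values) of the non-zero entries in row r.

    Hypothetical helper, only here to illustrate the indptr lookup.
    """
    start, stop = indptr[r], indptr[r + 1]
    return indices[start:stop], data[start:stop]

# Rows with no non-zero entries simply give empty slices.
for r in range(len(indptr) - 1):
    cols, vals = row_entries(r)
    print(r, cols.tolist(), vals.tolist())

# The same three arrays are enough to rebuild the full 5x5 matrix.
a = sps.csr_matrix((data, indices, indptr), shape=(5, 5))
```

Note that the shape is passed explicitly when rebuilding: the row count follows from indptr, but the column count is not stored in any of the three arrays.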

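So you could store these three arrays on disk and access them as memmaps, and then retrieve rows m up to (but not including) n, i.e. the slice m:n, as follows: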

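# indptr, data and indices are the three (memory-mapped) CSR arrays
ip = indptr[m:n+1].copy()      # row offsets for rows m..n-1
d = data[ip[0]:ip[-1]]         # values belonging to those rows
i = indices[ip[0]:ip[-1]]      # matching column indices
ip -= ip[0]                    # make the offsets start at zero
# without shape=, the column count is inferred from the largest index present
rows = sps.csr_matrix((d, i, ip))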

As a general proof of concept:

>>> c = sps.rand(1000, 10, density=0.5, format='csr')
>>> ip = c.indptr[20:25+1].copy()
>>> d = c.data[ip[0]:ip[-1]]
>>> i = c.indices[ip[0]:ip[-1]]
>>> ip -= ip[0]
>>> rows = sps.csr_matrix((d, i, ip))
>>> rows.toarray()
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.55683501,
         0.61426248,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.67789204,  0.        ,  0.71821363,
         0.01409666,  0.        ,  0.        ,  0.58965142,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.1575835 ,  0.08172986,
         0.41741147,  0.72044269,  0.        ,  0.72148343,  0.        ],
       [ 0.        ,  0.73040998,  0.81507086,  0.13405909,  0.        ,
         0.        ,  0.82930945,  0.71799358,  0.8813616 ,  0.51874795],
       [ 0.43353831,  0.00658204,  0.        ,  0.        ,  0.        ,
         0.10863725,  0.        ,  0.        ,  0.        ,  0.57231074]])
>>> c[20:25].toarray()
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.55683501,
         0.61426248,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.67789204,  0.        ,  0.71821363,
         0.01409666,  0.        ,  0.        ,  0.58965142,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.1575835 ,  0.08172986,
         0.41741147,  0.72044269,  0.        ,  0.72148343,  0.        ],
       [ 0.        ,  0.73040998,  0.81507086,  0.13405909,  0.        ,
         0.        ,  0.82930945,  0.71799358,  0.8813616 ,  0.51874795],
       [ 0.43353831,  0.00658204,  0.        ,  0.        ,  0.        ,
         0.10863725,  0.        ,  0.        ,  0.        ,  0.57231074]])
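The proof of concept above slices in-memory arrays. A minimal sketch of the on-disk version could look like the following, assuming the three arrays are saved as separate .npy files and reopened with np.load(..., mmap_mode='r'). The file names, the temporary folder, and the get_rows helper are all made up for illustration; a real setup would also need to store the number of columns somewhere.

```python
import os
import tempfile
import numpy as np
import scipy.sparse as sps

# Build a test matrix in which roughly half of the entries are zero.
rng = np.random.default_rng(0)
dense = rng.random((1000, 10))
dense[dense < 0.5] = 0.0
c = sps.csr_matrix(dense)

# Save the three CSR arrays as separate .npy files (hypothetical names).
folder = tempfile.mkdtemp()
for name in ("data", "indices", "indptr"):
    np.save(os.path.join(folder, name + ".npy"), getattr(c, name))

# Reopen them memory-mapped: the contents are paged in only when sliced.
data = np.load(os.path.join(folder, "data.npy"), mmap_mode="r")
indices = np.load(os.path.join(folder, "indices.npy"), mmap_mode="r")
indptr = np.load(os.path.join(folder, "indptr.npy"), mmap_mode="r")
n_cols = c.shape[1]  # assumption: the column count is known from elsewhere

def get_rows(m, n):
    """Return rows m..n-1 as an ordinary in-memory CSR matrix."""
    ip = np.array(indptr[m:n + 1])        # copy, so it can be modified
    d = np.array(data[ip[0]:ip[-1]])      # copy only the values needed
    i = np.array(indices[ip[0]:ip[-1]])   # and their column indices
    ip -= ip[0]                           # offsets now start at zero
    return sps.csr_matrix((d, i, ip), shape=(n - m, n_cols))

rows = get_rows(20, 25)
```

Copying the slices with np.array means the returned matrix no longer depends on the files, at the cost of reading just those rows into memory, which is presumably what you want here anyway.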

Answer 2

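Scipy supports different kinds of sparse matrices, but you would have to write a routine to read one into memory. Which type you should use depends on what you want to do with it.

If the matrix is very sparse, you could use the struct module to save (row, column, value) tuples to disk as binary data. Assuming portability is not an issue, that would make the data on disk smaller and easier to load.

You could then read the data like this: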

import struct
from functools import partial

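fmt = 'IId'  # two unsigned ints (row, column) and one double (value)
size = struct.calcsize(fmt)

with open('sparse.dat', 'rb') as infile:
    f = partial(infile.read, size)
    # the file is opened in binary mode, so the end-of-file sentinel is b''
    for chunk in iter(f, b''):
        row, col, value = struct.unpack(fmt, chunk)
        # put it in your matrix here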
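For completeness, here is a rough sketch of the writing side and of what "put it in your matrix" might look like, assuming a COO matrix is a reasonable target. The small test matrix and the temporary file path are made up for illustration, and the matrix shape is passed in separately because the record format above does not store it.

```python
import os
import struct
import tempfile
import numpy as np
import scipy.sparse as sps

fmt = 'IId'  # row, column (unsigned ints) and value (double)
size = struct.calcsize(fmt)

# A small matrix to store, held in coordinate (COO) form.
m = sps.coo_matrix(np.array([[0.0, 1.5, 0.0],
                             [0.0, 0.0, 2.5],
                             [3.5, 0.0, 0.0]]))

path = os.path.join(tempfile.mkdtemp(), 'sparse.dat')

# Write one fixed-size binary record per non-zero entry.
with open(path, 'wb') as outfile:
    for row, col, value in zip(m.row, m.col, m.data):
        outfile.write(struct.pack(fmt, int(row), int(col), float(value)))

# Read the records back and collect the three columns.
rows, cols, values = [], [], []
with open(path, 'rb') as infile:
    for chunk in iter(lambda: infile.read(size), b''):
        row, col, value = struct.unpack(fmt, chunk)
        rows.append(row)
        cols.append(col)
        values.append(value)

# The shape is not in the file, so it is supplied separately here.
restored = sps.coo_matrix((values, (rows, cols)), shape=m.shape)
```

From there, restored.tocsr() or restored.tocsc() converts to whichever format suits the later computations.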

Storing and retrieving large sparse matrices
