Question

I am having trouble working with a large matrix. Here is the situation:

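I have a large matrix (rows x columns can be as large as 20 million x 20 million).
Because its rows are sparse, I store it as a scipy sparse CSR matrix.
In one part of my main algorithm I need to draw a random set of rows (for example, 1000 rows) from this matrix.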

If the number of columns is too large, I cannot extract all of the sampled rows at once:

import random

# get a random batch of size b (n is the number of rows of X)
index = random.sample(range(n), b)
X_batch = X[index]

My current solution is to extract the rows in smaller batches and accumulate the computation over them:

import math
import random
import numpy as np

# get a random batch of size b
index = random.sample(range(n), b)

# calculate the number of mini-batches
total_mem_batch = 1e9  # a large number representing the total available memory

# nnzX is the average number of nonzeros per row; keep at least one row per mini-batch
batch_size = max(1, int(total_mem_batch // nnzX))
num_batches = math.ceil(b / batch_size)
result = np.zeros(d)

for j in range(num_batches):
    # calculate start/end positions within `index` for this mini-batch
    startIdx = batch_size * j
    endIdx = min(batch_size * (j + 1), b)
    rows = index[startIdx:endIdx]

    batch_X = X[rows, :]
    batch_Y = Y[rows]
    batch_bias = bias[rows]

    # do main operation
    result += ...

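The bottleneck is now the retrieval of a set of rows from the matrix. Because the index array is in random order, this amounts to random access to the rows of the input matrix X, which is much slower than reading rows sequentially.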
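One cheap thing to try, since the order of rows inside a batch does not matter here, is to sort the sampled indices before indexing, so that the rows are visited in ascending order. The snippet below is only a sketch on a small synthetic matrix; the sizes and the density are made-up values, and whether sorting actually helps on the real data is an assumption that would need to be measured.

```python
import random

import numpy as np
import scipy.sparse as sp

# Small synthetic stand-in for the real matrix (sizes are illustrative only)
n, d, b = 10_000, 200, 1000
X = sp.random(n, d, density=0.01, format="csr", random_state=0)

random.seed(0)
# Sample b distinct rows, then sort so that rows are visited in ascending order
index = np.sort(random.sample(range(n), b))

# Y and bias would have to be indexed with the same sorted `index`
X_batch = X[index]
```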

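My question is: can this be improved in either of the following two ways?

Shuffle the rows of the input matrix once in a while (the order of the rows does not matter, so they can be permuted freely), so that later reads can go through the rows sequentially, or
Is there a faster way to do random access to the rows of a large matrix?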
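To make the first option concrete, here is a minimal sketch of it: permute the rows once (paying the random-access cost a single time), then take each batch as a contiguous row slice of the shuffled matrix. The helper names `shuffle_rows` and `contiguous_batches` are hypothetical, the data is synthetic, and the sketch assumes a shuffled copy of X fits in memory next to the original.

```python
import numpy as np
import scipy.sparse as sp


def shuffle_rows(X, Y, bias, rng):
    """Apply one random row permutation to X, Y and bias (hypothetical helper)."""
    perm = rng.permutation(X.shape[0])
    return X[perm], Y[perm], bias[perm], perm


def contiguous_batches(n, batch_size):
    """Yield (start, end) bounds of consecutive row blocks covering range(n)."""
    for start in range(0, n, batch_size):
        yield start, min(start + batch_size, n)


# Small synthetic stand-in for the real data (sizes are illustrative only)
rng = np.random.default_rng(0)
n, d = 1000, 50
X = sp.random(n, d, density=0.05, format="csr", random_state=0)
Y = rng.standard_normal(n)
bias = rng.standard_normal(n)

# Shuffle once; this is the only step that touches rows in random order
X_shuf, Y_shuf, bias_shuf, perm = shuffle_rows(X, Y, bias, rng)

rows_seen = 0
for start, end in contiguous_batches(n, 128):
    # Contiguous slices of a CSR matrix read consecutive rows
    batch_X = X_shuf[start:end]
    batch_Y = Y_shuf[start:end]
    batch_bias = bias_shuf[start:end]
    rows_seen += batch_X.shape[0]
```

Each pass over the shuffled matrix visits every row exactly once in a fixed random order, so this is sampling without replacement; calling `shuffle_rows` again from time to time would give a fresh order.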

Thanks for reading my post.

Best regards

Randomly shuffling a sparse matrix

0 answers: