Question

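I have a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each of the columns in the large matrix. This is a pretty common task when you are doing things like normalization/standardization, but I can't seem to find the proper way to do it efficiently.

Here's an example to demonstrate: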

import scipy.sparse

# mat is a 3x3 matrix
mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])

#vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1,2,3]).T

""" 
I want to subtract `vec` from each of the columns in `mat` yielding...
    [[0, 1, 2],
     [0, 1, 2],
     [0, 1, 2]]
"""

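One way to accomplish what I want is to hstack vec with itself 3 times, yielding a 3x3 matrix where each column is vec, and then subtract that from mat. But again, I'm looking for a way to do this efficiently, and the hstacked matrix takes a long time to create. I'm sure there's some magical way to do this with slicing and broadcasting, but it eludes me.

Thanks!

EDIT: Removed the "in-place" constraint, because the sparsity structure would be constantly changing in an in-place assignment scenario.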

Answer 1

First, how would we handle this with dense arrays?

mat-vec.A # taking advantage of broadcasting
mat-vec.A[:,[0]*3] # explicit broadcasting
mat-vec[:,[0,0,0]] # that also works with csr matrix
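As a rough, self-contained sketch of the dense case (assuming both operands are first converted with toarray(); the names dense_mat, dense_vec, implicit and explicit are only illustrative), the two broadcasting variants could look like this:

```python
import numpy as np
import scipy.sparse

mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])
vec = scipy.sparse.csc_matrix([1, 2, 3]).T  # shape (3, 1)

# Illustrative names (not from the original answer).
dense_mat = mat.toarray()  # shape (3, 3)
dense_vec = vec.toarray()  # shape (3, 1)

# Implicit broadcasting: the (3, 1) column is stretched across all 3 columns.
implicit = dense_mat - dense_vec

# Explicit broadcasting: select column 0 three times to build a (3, 3) array.
explicit = dense_mat - dense_vec[:, [0] * 3]

print(implicit)
```

Both variants give the same array. The explicit one materializes the repeated columns, so it is mainly useful for seeing what broadcasting does.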

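In https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566 we found that using as_strided on the mat.indptr vector is the most efficient way of stepping through the rows of a sparse matrix. (The x.rows, x.cols of a lil_matrix are nearly as good. getrow is slow.) This function implements such an iteration:

from numpy.lib.stride_tricks import as_strided

def subtract_rows(X, v):
    # X must be in CSR format so that X.indptr delimits rows
    rows, cols = X.shape
    # overlapping (start, stop) pairs: row i spans X.data[indptr[i]:indptr[i+1]]
    row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2*X.indptr.strides)
    for row, (start, stop) in enumerate(row_start_stop):
        data = X.data[start:stop]  # a view into X.data, so the update is in place
        data -= v[row]

mat = mat.tocsr()
subtract_rows(mat, vec.toarray())
print(mat.toarray())

I'm using the dense vec.toarray() (vec.A for short) for simplicity. If we keep vec sparse, we'd have to add a test for a nonzero value at row. Also, this type of iteration only modifies the nonzero elements of mat; the 0's are unchanged.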

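I suspect the time advantage will depend a lot on the sparsity of the matrix and the vector. If vec has lots of zeros, then it makes sense to iterate, modifying only those rows of mat where vec is nonzero. But if vec is nearly dense, as in this example, it may be hard to beat mat-vec.A.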
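For the case where vec has many zeros, a minimal sketch might look like the following. It assumes mat is in CSR format and vec is a sparse (n_rows, 1) column. The helper name subtract_sparse_vec_from_rows and the sample data are hypothetical, not part of the original answer.

```python
import numpy as np
import scipy.sparse

def subtract_sparse_vec_from_rows(X, v):
    """Subtract v[i] from the stored entries of row i of X, in place.

    Hypothetical helper: X is assumed to be CSR, v a sparse (n_rows, 1) column.
    Rows where v has no stored value are skipped entirely.
    """
    v = v.tocoo()  # COO exposes the row index and value of each stored entry
    for row, value in zip(v.row, v.data):
        start, stop = X.indptr[row], X.indptr[row + 1]
        X.data[start:stop] -= value  # slice is a view, so X is modified

# Illustrative data with zeros in both the matrix and the vector.
mat = scipy.sparse.csr_matrix([[1, 0, 3],
                               [2, 3, 0],
                               [0, 4, 5]])
vec = scipy.sparse.csr_matrix([[1], [0], [3]])

subtract_sparse_vec_from_rows(mat, vec)
result = mat.toarray()
print(result)
```

As with the loop above, only the stored entries of mat change. Positions that were zero in mat stay zero, so the result differs from a true dense subtraction wherever mat had an implicit zero and vec did not.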

Answer 2

Summary

In short, if you use CSR instead of CSC, it's a one-liner:

mat.data -= numpy.repeat(vec.toarray()[0], numpy.diff(mat.indptr))

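Explanation

Once you realize that we are deducting the same number from every entry of a row, it is clear that this is best done row-wise. In your example: deduct 1 from the first row, 2 from the second row, and 3 from the third row.

I actually encountered this in a real-life application where I wanted to classify documents, each represented as a row in the matrix, while the columns represent words. Each document has a score, which should be multiplied by the score of each word in that document. Using the row representation of the sparse matrix, I did something similar to this (I modified my code to answer your question):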

import numpy
import scipy.sparse

mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])

#vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1,2,3]).T

# Use the row version
mat_row = mat.tocsr()
vec_row = vec.T

# mat_row.data holds the stored values in a 1d array, in row-wise order from top left to bottom right.
# mat_row.indptr (an n+1 element array) holds the offset in mat_row.data where each row starts, plus a final entry marking the end of mat_row.data.
# The differences of indptr give the number of stored elements per row, so we repeat each element of the row vector that many times.
mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
print(mat_row.todense())

The result is:

[[0 1 2]
 [0 1 2]
 [0 1 2]]

Visualized, it looks like this:

>>> mat_row.data
[1 2 3 2 3 4 3 4 5]
>>> mat_row.indptr
[0 3 6 9]
>>> numpy.diff(mat_row.indptr)
[3 3 3]
>>> numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
[1 1 1 2 2 2 3 3 3]
>>> mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
[0 1 2 0 1 2 0 1 2]
>>> mat_row.todense()
[[0 1 2]
 [0 1 2]
 [0 1 2]]
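The document-scoring case mentioned above (multiplying each row by a per-document score) can be sketched with the same repeat/diff pattern. The data and the names word_scores and doc_scores below are made up for illustration and are not the original application code.

```python
import numpy
import scipy.sparse

# Hypothetical data: 3 documents (rows) x 4 words (columns).
word_scores = scipy.sparse.csr_matrix([[1.0, 0.0, 2.0, 0.0],
                                       [0.0, 3.0, 0.0, 0.0],
                                       [4.0, 0.0, 0.0, 5.0]])
doc_scores = numpy.array([10.0, 2.0, 0.5])  # one score per document

# Number of stored entries in each row, used to expand doc_scores so it
# lines up element-by-element with word_scores.data.
entries_per_row = numpy.diff(word_scores.indptr)
word_scores.data *= numpy.repeat(doc_scores, entries_per_row)

scaled = word_scores.toarray()
print(scaled)
```

For multiplication, touching only the stored entries is exact, since the implicit zeros would stay zero anyway. For subtraction, the same pattern leaves the implicit zeros of the matrix untouched.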

Answer 3

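You can introduce fake dimensions by altering the strides of your vector. Without additional allocation, you can "convert" your vector to a 3 x 3 matrix using np.lib.stride_tricks.as_strided. This page has an example and a bit of discussion about it, along with some discussion of related topics (like views). Search the page for "Example: fake dimensions with strides".

There are also quite a few examples of this around SO... but my searching skills are failing me right now.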
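A minimal sketch of the fake-dimension idea, assuming a dense NumPy vector (the variable names are illustrative, and the result is a dense array, not a sparse one):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

vec = np.array([1, 2, 3])

# Give the second axis a stride of 0 so every "column" re-reads the same
# element of vec. No data is copied; writeable=False guards against
# accidental writes through the overlapping view.
fake = as_strided(vec, shape=(3, 3), strides=(vec.strides[0], 0),
                  writeable=False)

mat = np.array([[1, 2, 3],
                [2, 3, 4],
                [3, 4, 5]])
result = mat - fake

print(fake)
print(result)
```

np.broadcast_to(vec[:, None], (3, 3)) should give an equivalent read-only view without setting strides by hand, which is usually the safer choice.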

Efficiently subtract vector from matrix (Scipy)
