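Computing Jaccard distance on a sparse matrix

Asked: 2015-09-27 08:14:39

Tags: python numpy scipy sparse-matrix

I have a large sparse matrix, stored as a scipy sparse.csr_matrix. The values are binary. For each row, I need to compute the Jaccard distance to every row in the same matrix. What is the most efficient way to do this? Even for a 10,000 x 10,000 matrix, my run takes several minutes to finish.

Current solution: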

import scipy.sparse

def jaccard(a, b):
    intersection = float(len(set(a) & set(b)))
    union = float(len(set(a) | set(b)))
    return 1.0 - (intersection/union)

def regions(csr, p, epsilon):
    neighbors = []
    for index in range(len(csr.indptr)-1):
        if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index+1]]) <= epsilon:
            neighbors.append(index)
    return neighbors
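
csr = scipy.sparse.csr_matrix("file")  # placeholder: the matrix is loaded from a file
p = csr.indices[csr.indptr[0]:csr.indptr[1]]  # column indices of one row (here row 0)
regions(csr, p, 0.51)  # this is called for every row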

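3 answers:

Answer 0 (score: 11)

Vectorizing is relatively easy if you compute the set intersections with a matrix multiplication and then use the rule |union(a, b)| == |a| + |b| - |intersection(a, b)| to determine the unions: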

# Not actually necessary for sparse matrices, but it is for 
# dense matrices and ndarrays, if X.dtype is integer.
from __future__ import division

def pairwise_jaccard(X):
    """Computes the Jaccard distance between the rows of `X`.
    """
    X = X.astype(bool).astype(int)

    intrsct = X.dot(X.T)
    row_sums = intrsct.diagonal()
    unions = row_sums[:,None] + row_sums - intrsct
    dist = 1.0 - intrsct / unions
    return dist

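Note the cast to bool and then to int: the dtype of X must be large enough to accumulate twice the maximum row sum, and the entries of X must be either 0 or 1. The downside of this code is that it is heavy on RAM, because `unions` and `dist` are dense matrices.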
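
To make the matrix-multiplication trick concrete, here is a minimal sketch on a toy 3 x 4 matrix. It checks the identity above against a plain set-based computation. The matrix and the helper name `jaccard_sets` are made up for illustration and are not part of the original answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy binary matrix: each row is a set of column indices.
X = csr_matrix(np.array([[1, 1, 0, 0],
                         [0, 1, 1, 0],
                         [1, 1, 1, 1]]))

# Entry (i, j) of X * X.T counts the columns shared by rows i and j.
intersections = X.dot(X.T).toarray()
sizes = intersections.diagonal()  # |a| for every row
# |union(a, b)| == |a| + |b| - |intersection(a, b)|
unions = sizes[:, None] + sizes[None, :] - intersections
distances = 1.0 - intersections / unions

def jaccard_sets(a, b):
    """Reference Jaccard distance computed on Python sets."""
    return 1.0 - len(a & b) / len(a | b)

# Rebuild each row as a set of column indices and compare pair by pair.
rows = [set(X[i].indices) for i in range(X.shape[0])]
reference = np.array([[jaccard_sets(a, b) for b in rows] for a in rows])
```

Both `distances` and `reference` should hold the same 3 x 3 matrix, with zeros on the diagonal.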

If you are only interested in distances smaller than some cut-off epsilon, the code can be adapted to sparse matrices:

import numpy as np
from scipy.sparse import csr_matrix

def pairwise_jaccard_sparse(csr, epsilon):
    """Computes the Jaccard distance between the rows of `csr`,
    smaller than the cut-off distance `epsilon`.
    """
    assert(0 < epsilon < 1)
    csr = csr_matrix(csr).astype(bool).astype(int)

    csr_rownnz = csr.getnnz(axis=1)
    intrsct = csr.dot(csr.T)

    nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1))
    unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data
    dists = 1.0 - intrsct.data / unions

    mask = (dists > 0) & (dists <= epsilon)
    data = dists[mask]
    indices = intrsct.indices[mask]

    rownnz = np.add.reduceat(mask, intrsct.indptr[:-1])
    indptr = np.r_[0, np.cumsum(rownnz)]

    out = csr_matrix((data, indices, indptr), intrsct.shape)
    return out
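
One caveat worth checking on your own data: `np.add.reduceat` does not return 0 for an empty segment (it returns the element at the segment start, or fails if that index is out of range), so the per-row counting step above may misbehave when the input contains all-zero rows. The following is only a sketch of an alternative that filters through COO format instead. The function name `pairwise_jaccard_sparse_coo` is hypothetical, and the sketch assumes the same 0 < distance <= epsilon filter as the code above.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def pairwise_jaccard_sparse_coo(csr, epsilon):
    """Hypothetical variant: Jaccard distances in (0, epsilon] as a CSR matrix."""
    csr = csr_matrix(csr).astype(bool).astype(int)
    row_nnz = csr.getnnz(axis=1)

    # COO exposes explicit (row, col, value) triplets for every overlapping pair.
    intrsct = csr.dot(csr.T).tocoo()
    unions = row_nnz[intrsct.row] + row_nnz[intrsct.col] - intrsct.data
    dists = 1.0 - intrsct.data / unions

    # Keep only pairs inside the cut-off; zero distances are dropped as above.
    keep = (dists > 0) & (dists <= epsilon)
    filtered = coo_matrix(
        (dists[keep], (intrsct.row[keep], intrsct.col[keep])),
        shape=intrsct.shape,
    )
    return filtered.tocsr()
```

Because the row and column indices travel with each value, no per-row bookkeeping is needed, and empty rows simply produce no entries.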

If computing all pairs at once still takes too much memory, you could try vectorizing along one dimension and looping in Python over the other.
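
A rough sketch of that idea, assuming you want the same result as the `regions` function in the question (for each row, the indices of all rows within distance epsilon). The helper name `neighbors_within` and the `chunk_size` parameter are made up for this example. Treating an empty union as distance 1 is also an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix

def neighbors_within(csr, epsilon, chunk_size=1000):
    """Hypothetical helper: per row, indices of rows with Jaccard distance <= epsilon."""
    csr = csr_matrix(csr).astype(bool).astype(int)
    row_nnz = csr.getnnz(axis=1)
    csr_t = csr.T.tocsr()

    neighbors = []
    # Python loop over blocks of rows; each block is vectorized against all rows.
    for start in range(0, csr.shape[0], chunk_size):
        stop = min(start + chunk_size, csr.shape[0])
        # Dense (stop - start) x n block of intersection sizes.
        intrsct = csr[start:stop].dot(csr_t).toarray()
        unions = row_nnz[start:stop, None] + row_nnz[None, :] - intrsct
        # Assumption: a pair of empty rows (union == 0) gets distance 1.
        dists = 1.0 - intrsct / np.maximum(unions, 1)
        for block_row in dists:
            neighbors.append(np.flatnonzero(block_row <= epsilon))
    return neighbors
```

Only one dense block of `chunk_size` x n values is held in memory at a time, so `chunk_size` trades memory against the number of Python-level iterations.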

Answer 1 (score: 1)

To add to the accepted answer: I have used a weighted version of the approach above, which can be implemented simply as:

import numpy
import scipy.sparse

def pairwise_jaccard_sparse_weighted(csr, epsilon, weight):
    csr = scipy.sparse.csr_matrix(csr).astype(bool).astype(int)
    csr_w = csr * scipy.sparse.diags(weight)

    csr_rowsum = numpy.array(csr_w.sum(axis = 1)).flatten()
    intrsct = csr.dot(csr_w.T)

    rowsum_i = numpy.repeat(csr_rowsum, intrsct.getnnz(axis = 1))
    unions = rowsum_i + csr_rowsum[intrsct.indices] - intrsct.data
    dists = 1.0 - 1.0 * intrsct.data / unions

    mask = (dists > 0) & (dists <= epsilon)
    data = dists[mask]
    indices = intrsct.indices[mask]

    rownnz = numpy.add.reduceat(mask, intrsct.indptr[:-1])
    indptr = numpy.r_[0, numpy.cumsum(rownnz)]

    out = scipy.sparse.csr_matrix((data, indices, indptr), intrsct.shape)
    return out

I doubt this is the most efficient implementation, but it is much faster than the dense implementation in scipy.spatial.distance.jaccard.
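
As far as I can tell from the code above, `weight` holds one weight per column, and the weighted distance for two rows is 1 minus the total weight of the shared columns divided by the total weight of the columns present in either row. A small dense reference for a single pair, useful for spot-checking the sparse version (the function name `weighted_jaccard_distance` and the sample data are made up for this sketch):

```python
import numpy as np

def weighted_jaccard_distance(a, b, weight):
    """Dense reference: 1 - weight(shared columns) / weight(columns in either row)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    weight = np.asarray(weight, dtype=float)
    shared = weight[a & b].sum()   # weighted intersection
    either = weight[a | b].sum()   # weighted union
    return 1.0 - shared / either

a = [1, 1, 0, 0]
b = [0, 1, 1, 0]
weight = [1.0, 2.0, 1.0, 5.0]
d = weighted_jaccard_distance(a, b, weight)
```

With all weights equal to 1 this reduces to the ordinary Jaccard distance.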

Answer 2 (score: 0)

Here is a solution with a scikit-learn-like API.

def pairwise_sparse_jaccard_distance(X, Y=None):
    """
    Computes the Jaccard distance between two sparse matrices or between all pairs in
    one sparse matrix.

    Args:
        X (scipy.sparse.csr_matrix): A sparse matrix.
        Y (scipy.sparse.csr_matrix, optional): A sparse matrix.

    Returns:
        numpy.ndarray: A distance matrix.
    """

    if Y is None:
        Y = X

    assert X.shape[1] == Y.shape[1]

    X = X.astype(bool).astype(int)
    Y = Y.astype(bool).astype(int)

    intersect = X.dot(Y.T)

    x_sum = X.sum(axis=1).A1
    y_sum = Y.sum(axis=1).A1
    xx, yy = np.meshgrid(x_sum, y_sum)
    union = ((xx + yy).T - intersect)

    return (1 - intersect / union).A

Here are some tests and benchmarks:

>>> import timeit

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> from sklearn.metrics import pairwise_distances

>>> X = csr_matrix(np.random.choice(a=[False, True], size=(10000, 1000), p=[0.9, 0.1]))
>>> Y = csr_matrix(np.random.choice(a=[False, True], size=(1000, 1000), p=[0.9, 0.1]))

Assert that all results are approximately equal:

>>> custom_jaccard_distance = pairwise_sparse_jaccard_distance(X, Y)
>>> sklearn_jaccard_distance = pairwise_distances(X.todense(), Y.todense(), "jaccard")

>>> np.allclose(custom_jaccard_distance, sklearn_jaccard_distance)
True

Benchmark the run time (from a Jupyter Notebook):

>>> %timeit pairwise_jaccard_index(X, Y)
795 ms ± 58.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit 1 - pairwise_distances(X.todense(), Y.todense(), "jaccard")
14.7 s ± 694 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)