Question

Suppose I have a list of lists of strings like this:

docs = [["hello", "world", "hello"], ["goodbye", "cruel"]]

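How can I create a sparse matrix in which each row represents one of the sublists above and each column represents a token string, such as "cruel", from the sublists?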

I looked at the scipy docs and some other stackoverflow posts, but this is still not clear to me.

row_idx = 0
col_idx = 0
rows = []
cols = []
vals = []
for doc in tokens_list:
    col_idx = 0
    for token in doc:
        rows.append(row_idx)
        cols.append(col_idx)
        col_idx = col_idx + 1
        vals.append(1)
    row_idx = row_idx + 1
X = csr_matrix((vals, (rows, cols)))

I tried something like the above, but I have a feeling it is not right, and I cannot relate it to the example in the scipy documentation.

Answer 1

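I would build a dictionary instead of using lists. You can then use a tuple (row, col) as the key, and the value is whatever is stored at that row, col index. You get sparsity by adding to the dictionary only the elements of the matrix that are not empty, 0, and so on.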

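You could also replace the tuple with a list, although a Python list is not hashable, so it cannot be used directly as a dictionary key.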
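The answer above does not include code, so here is a minimal sketch of what the dictionary approach might look like. It assumes a `vocabulary` mapping from token to column index (not mentioned in this answer) and a hypothetical `counts` dictionary keyed by `(row, col)`; the conversion to `coo_matrix` at the end is one possible way to get a scipy matrix out of it, not the only one.

```python
from collections import defaultdict
from scipy import sparse

docs = [["hello", "world", "hello"], ["goodbye", "cruel"]]

vocabulary = {}            # token -> column index (an assumption, not part of the answer)
counts = defaultdict(int)  # (row, col) -> number of occurrences

for row, doc in enumerate(docs):
    for token in doc:
        col = vocabulary.setdefault(token, len(vocabulary))
        counts[(row, col)] += 1  # only nonzero cells ever get a key

# Unpack the dictionary into the (data, (row, col)) form that coo_matrix accepts.
rows, cols = zip(*counts.keys())
data = list(counts.values())
X = sparse.coo_matrix(
    (data, (rows, cols)), shape=(len(docs), len(vocabulary))
).tocsr()
print(X.toarray())
```

Because only cells that actually occur get a key, the dictionary never stores zeros, which is where the sparsity comes from.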

Answer 2

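The example in the csr documentation generates the csr attributes directly: indptr, indices and data. The input for coo is row, col and data. The difference is row versus indptr; the other attributes are the same.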
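To make the row versus indptr relationship concrete, here is a small sketch that derives indptr from a row list. The values are those of the two-document example used in this answer; the names `n_rows`, `M_csr` and `M_coo` are only for illustration, and the conversion assumes the entries are already grouped by row in ascending order.

```python
import numpy as np
from scipy import sparse

# Three entries belong to row 0 and three to row 1.
row = np.array([0, 0, 0, 1, 1, 1])
col = np.array([0, 1, 0, 2, 3, 1])
data = np.ones(6, dtype=int)
n_rows = 2

# indptr[i]:indptr[i + 1] is the slice of col/data that belongs to row i,
# so indptr is the running total of entries per row, with a leading 0.
indptr = np.concatenate(([0], np.cumsum(np.bincount(row, minlength=n_rows))))

M_csr = sparse.csr_matrix((data, col, indptr), shape=(n_rows, 4))
M_coo = sparse.coo_matrix((data, (row, col)), shape=(n_rows, 4))
print(indptr)
print(M_csr.toarray())
```

Both constructors describe the same matrix; csr just stores the row information in compressed form.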

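At first glance, what you are missing is the vocabulary dictionary. Matching row to the index of the item in the list is easy, but col has to be mapped to a word list or dictionary in some way.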

In [497]: from scipy import sparse
In [498]: docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
In [499]: indptr = [0]
In [500]: indices = []
In [501]: data = []
In [502]: vocabulary = {}  # a dictionary
In [503]: for d in docs:
     ...:     for term in d:
     ...:         index = vocabulary.setdefault(term, len(vocabulary))
     ...:         indices.append(index)
     ...:         data.append(1)
     ...:     indptr.append(len(indices))
     ...:
In [504]: indptr
Out[504]: [0, 3, 6]
In [505]: indices
Out[505]: [0, 1, 0, 2, 3, 1]
In [506]: data
Out[506]: [1, 1, 1, 1, 1, 1]
In [507]: vocabulary
Out[507]: {'cruel': 3, 'goodbye': 2, 'hello': 0, 'world': 1}
In [508]: M = sparse.csr_matrix((data, indices, indptr), dtype=int)
In [510]: M
Out[510]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 6 stored elements in Compressed Sparse Row format>
In [511]: M.A
Out[511]: 
array([[2, 1, 0, 0],
       [0, 1, 1, 1]])

The coo input looks like this:

In [515]: Mc = M.tocoo()
In [516]: Mc.row
Out[516]: array([0, 0, 0, 1, 1, 1], dtype=int32)
In [517]: Mc.col
Out[517]: array([0, 1, 0, 2, 3, 1], dtype=int32)

So the same iteration works, except that we record the row number in the row list:

In [519]: row, col, data = [],[],[]
In [520]: vocabulary = {}
In [521]: for i,d in enumerate(docs):
     ...:     for term in d:
     ...:         index = vocabulary.setdefault(term, len(vocabulary))
     ...:         col.append(index)
     ...:         data.append(1)
     ...:         row.append(i)
     ...:         
In [522]: row
Out[522]: [0, 0, 0, 1, 1, 1]
In [523]: col
Out[523]: [0, 1, 0, 2, 3, 1]
In [524]: M1 = sparse.coo_matrix((data, (row, col)))
In [525]: M1
Out[525]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 6 stored elements in COOrdinate format>
In [526]: M1.A
Out[526]: 
array([[2, 1, 0, 0],
       [0, 1, 1, 1]])

'hello' occurs twice in the first list; every other word occurs once or not at all. vocabulary holds the mapping between words and column indices.
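Since vocabulary maps words to column indices, it can also be turned around to label the columns. A small sketch, reusing the values from the session above; `columns`, `dense` and `first_doc` are just illustrative names:

```python
from scipy import sparse

vocabulary = {'cruel': 3, 'goodbye': 2, 'hello': 0, 'world': 1}
M = sparse.csr_matrix(([1] * 6, [0, 1, 0, 2, 3, 1], [0, 3, 6]), dtype=int)

# Sort the words by their column index to get one label per column.
columns = sorted(vocabulary, key=vocabulary.get)

# Densify (fine for a toy example) and read off the counts for the first list.
dense = M.toarray()
first_doc = {word: int(dense[0, j]) for j, word in enumerate(columns) if dense[0, j]}
print(columns)
print(first_doc)
```

For a large matrix you would probably want to avoid `toarray()` and work with the sparse rows instead.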

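An alternative would make two passes. The first pass collects all the words and identifies the unique ones, i.e. it generates vocabulary or something equivalent. The second pass then builds the matrix.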
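A rough sketch of that two-pass version might look like the following. Sorting the unique words is an arbitrary choice made here so that the column order is reproducible, which means the columns come out in a different order than in the single-pass sessions above; `words` and `M2` are illustrative names.

```python
from scipy import sparse

docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]

# Pass 1: collect the unique words and give each one a column index.
words = sorted({term for d in docs for term in d})
vocabulary = {term: j for j, term in enumerate(words)}

# Pass 2: record one (row, col, 1) triple per token.
row, col, data = [], [], []
for i, d in enumerate(docs):
    for term in d:
        row.append(i)
        col.append(vocabulary[term])
        data.append(1)

# The shape is known up front; converting to csr sums the duplicate entries.
M2 = sparse.coo_matrix(
    (data, (row, col)), shape=(len(docs), len(vocabulary))
).tocsr()
print(words)
print(M2.toarray())
```

One advantage of knowing the vocabulary first is that the shape can be passed explicitly, so trailing columns are not lost when the last words never appear.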

Generate a sparse matrix given a list of lists of strings
