TF-IDF sparse matrix: ValueError "column index exceeds matrix dimensions"

Time: 2020-03-13 15:32:10

Tags: python-3.x

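In a custom transform function, I compute TF-IDF for only the top n features. At the end of the function I try to return a matrix of shape (length of corpus, number of top-n features), and this raises a ValueError: "column index exceeds matrix dimensions". If I do not limit the number of columns, no error is raised. Why is this happening? Any help would be great!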

    from collections import Counter
    from scipy.sparse import csr_matrix

    def transform(corpus, vocab_t2, idf):
        rows = []
        columns = []
        values = []
        temp = dict()
        max_feature_count = 0
        top_vocab = list(idf.keys())
        if isinstance(corpus, (list,)):
            for idx, row in enumerate(corpus):
                totalwords_inrow = len(row.split())
                word_freq = dict(Counter(row.split()))
                if max_feature_count >= len(top_vocab):
                    break
                for word, freq in word_freq.items():
                    if len(word) < 2:
                        continue
                    if word in top_vocab:
                        col_index = vocab_t2.get(word, -1)
                        temp[word] = (freq / totalwords_inrow) * (idf[word])
                        values.append(temp[word])
                        print("values:", values)
                        if col_index != -1:
                            rows.append(idx)
                            columns.append(col_index)
                            max_feature_count += 1
            # x = len(corpus)
            # y = len(top_vocab)
            # z = len(columns)

            # This raises the error when the vocabulary length is limited:
            test = csr_matrix((values, (rows, columns)), shape=(len(corpus), len(top_vocab)))
            # This does not raise, since len(vocab_t2) is the total number of unique words in the corpus:
            # test = csr_matrix((values, (rows, columns)), shape=(len(corpus), len(vocab_t2)))
            # test = csr_matrix((values, (rows, columns)), shape=(x, y))
            return test
        else:
            print("you need to pass list of strings")

    print(transform(corpus, vocab_t2, idf))
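A likely cause, offered as a sketch rather than a confirmed diagnosis: the column indices are looked up in `vocab_t2`, which numbers the full vocabulary, so a top-n word can carry an index greater than or equal to `len(top_vocab)`. `csr_matrix` rejects any column index that does not fit inside the requested shape. One way around it is to give each top-n word a compact column position from 0 to n-1. The names `transform_top_n` and `col_of` and the sample data below are hypothetical, and the sketch assumes `idf` holds only the top-n words.

```python
from collections import Counter
from scipy.sparse import csr_matrix


def transform_top_n(corpus, idf):
    """TF-IDF restricted to the words in idf (assumed to hold only the top-n words)."""
    if not isinstance(corpus, list):
        raise TypeError("you need to pass a list of strings")

    # Compact column index: position of each top-n word in idf, 0..n-1.
    col_of = {word: i for i, word in enumerate(idf)}

    rows, columns, values = [], [], []
    for idx, doc in enumerate(corpus):
        words = doc.split()
        for word, freq in Counter(words).items():
            col = col_of.get(word)
            if len(word) < 2 or col is None:
                continue
            # Append to all three lists together so they stay the same length.
            rows.append(idx)
            columns.append(col)
            values.append((freq / len(words)) * idf[word])

    return csr_matrix((values, (rows, columns)), shape=(len(corpus), len(col_of)))


# Made-up example data, not taken from the question.
corpus = ["the cat sat", "the dog sat down"]
idf = {"cat": 1.5, "dog": 1.5}  # pretend these are the top-2 words
matrix = transform_top_n(corpus, idf)
print(matrix.shape)
```

Because every column index now comes from `col_of`, it is always smaller than `len(col_of)`, which is the number of columns in the shape. The sketch also appends to `rows`, `columns`, and `values` in one place, so the three lists cannot drift out of step.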

0 answers:

No answers yet.