Question

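I am trying to get the most frequent words in my corpus, and later the most frequent bigrams, trigrams, and so on. I found this question, but it did not work for me, and I would like to avoid using zip because I want a more efficient approach.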

So far I have this code:

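import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

# cached_stopwords, X and y are defined elsewhere in my script
vectorizer_words = CountVectorizer(input=u'content',
                                   analyzer=u'word',
                                   lowercase=True,
                                   stop_words=cached_stopwords,
                                   strip_accents=u'unicode',
                                   ngram_range=(1, 1), binary=False)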

vectors = vectorizer_words.fit_transform(X, y)


N, V = vectors.shape

count_words = np.array(np.sum(vectors, axis=0))
count_words = np.squeeze(count_words)

assert count_words.shape == (V,), "count_words.shape = {}".format(count_words.shape)
# On scikit-learn older than 1.0, use get_feature_names() instead
words = np.array(vectorizer_words.get_feature_names_out())
assert words.shape[0] == V

a = count_words.argsort()[::-1]

print(words[a][:10])
print(count_words[a][:10])

plt.bar(words[a][:10], count_words[a][:10])
plt.title('title')
plt.show()

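I expected the bars in my chart to be in descending order, but they are not, and I cannot understand why. What am I doing wrong, or am I misreading the output?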

Edit: the problem seems to be in plt.bar. Looking more closely at the output of the following lines:

print(words[a][:10])
print(count_words[a][:10])
# Output:
['atencion' 'bien' 'mas' 'banco' 'buena' 'siempre' 'problemas' 'problema' 'tarjeta' 'rapido']
[10442  7594  6322  6121  5382  4953  4316  4202  4041  3097]

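So count_words[a] is sorted as expected, but the chart is ordered alphabetically (as pointed out in the comments, thanks!), so the problem is probably in the plotting step.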
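One possible workaround, as a minimal sketch: plot the bars against integer positions and attach the words as tick labels afterwards, so the bar order comes from the sorted arrays rather than from how the plotting library orders string categories. This assumes the alphabetical ordering comes from Matplotlib's handling of string x values, which I have not confirmed for my version. The arrays are copied from the output above; the output file name and the Agg backend are only illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Values copied from the printed output above (already sorted by count).
words = np.array(['atencion', 'bien', 'mas', 'banco', 'buena',
                  'siempre', 'problemas', 'problema', 'tarjeta', 'rapido'])
counts = np.array([10442, 7594, 6322, 6121, 5382, 4953, 4316, 4202, 4041, 3097])

# Use integer positions on the x axis so the order is fixed by the data.
positions = np.arange(len(words))

fig, ax = plt.subplots()
bars = ax.bar(positions, counts)

# Attach the words as labels for those positions.
ax.set_xticks(positions)
ax.set_xticklabels(words, rotation=45, ha='right')
ax.set_title('Top 10 words')

fig.tight_layout()
fig.savefig('top_words.png')  # hypothetical file name; plt.show() also works interactively
```

With this layout the leftmost bar is always words[0], whatever the labels are.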

I am using sklearn.
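For the bigram step, here is a rough sketch of what I have in mind, with no zip involved: widen ngram_range, sum the sparse matrix column-wise, and pick the top k entries with np.argpartition. The toy corpus docs and the value of k are made up for illustration, and the hasattr check is only there because the method name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely illustrative (not my real data).
docs = ["buena atencion", "buena atencion siempre", "problemas tarjeta"]

# ngram_range=(1, 2) counts unigrams and bigrams together.
vectorizer = CountVectorizer(analyzer='word', lowercase=True, ngram_range=(1, 2))
vectors = vectorizer.fit_transform(docs)

# Column-wise sum of the sparse matrix, flattened to a 1-D array of totals.
counts = np.asarray(vectors.sum(axis=0)).ravel()

# The method name depends on the scikit-learn version.
if hasattr(vectorizer, 'get_feature_names_out'):
    terms = np.asarray(vectorizer.get_feature_names_out())
else:
    terms = np.asarray(vectorizer.get_feature_names())

# Select the k largest counts without sorting the whole vocabulary,
# then sort only those k entries in descending order.
k = 3
top = np.argpartition(counts, -k)[-k:]
top = top[np.argsort(counts[top])[::-1]]

for term, count in zip(terms[top], counts[top]):  # zip here is only for printing
    print(term, count)
```

np.argpartition avoids a full sort of the vocabulary, which should matter more once bigrams and trigrams make it much larger.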

0 answers: