Applying bag of words

Time: 2018-07-07 07:11:25

Tags: python machine-learning nlp word2vec

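Hey, I'm working with a large number of words and trying to implement bag of words. Suppose I have the corpus below. Instead of using the vocabulary that vectorizer.fit_transform(corpus) learns from the corpus, I have created my own, something like {u'all': 0, u'sunshine': 1, u'some': 2, u'down': 3, u'reason': 4}. How can I generate the matrix using that vocabulary?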

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
'All my cats in a row',
'When my cat sits down, she looks like a Furby toy!',
'The cat from outer space',
'Sunshine loves to sit like this for some reason.'
]

vectorizer = CountVectorizer()
print( vectorizer.fit_transform(corpus).todense() )
print( vectorizer.vocabulary_ )

1 answer:

Answer 0: (score: 1)

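Instantiate CountVectorizer with your custom vocabulary, then transform your corpus. Because the vocabulary is supplied up front, there is no need to call fit: transform alone is enough, and any word that is not in the vocabulary is simply ignored.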

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
'All my cats in a row',
'When my cat sits down, she looks like a Furby toy!',
'The cat from outer space',
'Sunshine loves to sit like this for some reason.'
]

vocabulary = {u'all': 0, u'sunshine': 1, u'some': 2, u'down': 3, u'reason': 4}

vectorizer = CountVectorizer(vocabulary=vocabulary)

print( vectorizer.transform(corpus).todense() )
# Output:
# [[1 0 0 0 0]
#  [0 0 0 1 0]
#  [0 0 0 0 0]
#  [0 1 1 0 1]]

print( vectorizer.vocabulary_ )
# Output:
# {'all': 0, 'sunshine': 1, 'some': 2, 'down': 3, 'reason': 4}
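
To see why the matrix comes out that way, here is a rough pure-Python sketch of the counting step. It is not scikit-learn's actual implementation. It assumes the vectorizer's default settings (lowercasing, and tokens made of two or more word characters), and the helper name count_with_vocabulary is made up for this illustration.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's defaults, i.e. lowercase the text
# and keep tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_with_vocabulary(documents, vocabulary):
    """Hypothetical helper: one row of counts per document.

    Column positions are taken from the indices in `vocabulary`;
    words that are not in `vocabulary` are ignored.
    """
    matrix = []
    for document in documents:
        counts = Counter(TOKEN_PATTERN.findall(document.lower()))
        row = [0] * len(vocabulary)
        for word, column in vocabulary.items():
            row[column] = counts[word]  # Counter returns 0 for missing words
        matrix.append(row)
    return matrix


corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]

vocabulary = {'all': 0, 'sunshine': 1, 'some': 2, 'down': 3, 'reason': 4}

rows = count_with_vocabulary(corpus, vocabulary)
for row in rows:
    print(row)
```

The third document contains none of the vocabulary words, so its row is all zeros, which matches the CountVectorizer output above.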