CountVectorizer ignores "I"

Date: 2015-10-21 13:22:39

Tags: python scikit-learn

Why does scikit-learn's CountVectorizer ignore the pronoun "I"?

from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2), min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
<1x3 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>
ngram_vectorizer.get_feature_names()
['gave it', 'he gave', 'it to']

1 answer:

Answer 0 (score: 11):

The default tokenizer only considers words of 2 or more characters.

You can change this behavior by passing an appropriate token_pattern to CountVectorizer.

The default pattern is (see the signature in the docs):

'token_pattern': u'(?u)\\b\\w\\w+\\b'
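The effect of that pattern can be seen with the standard `re` module alone, before involving scikit-learn at all: `\w\w+` requires at least two word characters, so the single-letter token "I" is never matched.

```python
import re

# The default token_pattern requires two or more word characters,
# so a one-letter word like "i" is silently dropped.
default_pattern = r"(?u)\b\w\w+\b"
tokens = re.findall(default_pattern, "HE GAVE IT TO I".lower())
print(tokens)  # ['he', 'gave', 'it', 'to'] -- no 'i'
```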

You can get a CountVectorizer that does not drop one-letter words by changing the default, for example:

from sklearn.feature_extraction.text import CountVectorizer
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                   token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
print(ngram_vectorizer.get_feature_names())

which gives:

['gave it', 'he gave', 'it to', 'to i']
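As a sanity check, the same result can be reproduced with plain `re` and no scikit-learn dependency (a minimal sketch, assuming lowercasing as the only preprocessing): the relaxed pattern `\w+` keeps the token "i", so the bigram "to i" can now be formed.

```python
import re

# Relaxed pattern: one or more word characters, so "i" survives tokenization.
tokens = re.findall(r"(?u)\b\w+\b", "HE GAVE IT TO I".lower())
print(tokens)  # ['he', 'gave', 'it', 'to', 'i']

# Bigrams built from consecutive token pairs; this is the same set of
# features the vectorizer reports (it just lists them alphabetically).
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)  # ['he gave', 'gave it', 'it to', 'to i']
```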