Question

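This question explains how to add your own words to the built-in English stop words of CountVectorizer. I am interested in seeing the effect on a classifier of eliminating any numbers as tokens.

ENGLISH_STOP_WORDS is stored as a frozenset, so I guess my question boils down to (unless there is a method I don't know of) whether it is possible to add an arbitrary representation of numbers to a frozen list?

My feeling is that this is not possible, because the finiteness of the list you have to pass rules it out.

I suppose one way to accomplish the same thing would be to loop through the test corpus and pop the words for which word.isdigit() is True into a set/list that I can then union with ENGLISH_STOP_WORDS, but I would rather be lazy and pass something simpler to the stop_words parameter.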
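
The loop-and-union idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `base_stop_words` is a tiny hypothetical stand-in for ENGLISH_STOP_WORDS so that the snippet runs without scikit-learn, and `corpus`, `token_re` and `digit_tokens` are names made up for the example. It also shows that a frozenset cannot be modified in place, but a union with it returns a new frozenset.

```python
import re

# Assumption: a tiny stand-in for sklearn's ENGLISH_STOP_WORDS, so the
# sketch runs without scikit-learn installed.
base_stop_words = frozenset({"a", "is", "this"})

corpus = ["This is sentence.", "12 dogs eat candy", "1 2 3 45"]

# Mirrors CountVectorizer's default token_pattern: two or more word characters.
token_re = re.compile(r"(?u)\b\w\w+\b")

# Collect every token in the corpus that consists only of digits.
digit_tokens = {
    tok
    for doc in corpus
    for tok in token_re.findall(doc.lower())
    if tok.isdigit()
}

# A frozenset is immutable, but the union operator builds a new frozenset.
stop_words = base_stop_words | digit_tokens
print(sorted(stop_words))
```

With the real ENGLISH_STOP_WORDS in place of the stand-in, `list(stop_words)` could then be passed as the stop_words argument. Note that this only covers numbers that actually occur in the corpus it was built from, which is the limitation the question is worried about.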

Answer 1

Instead of extending the stop word list, you can implement this as a custom preprocessor for CountVectorizer. Below is a simple version, shown in bpython.

>>> import re
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(preprocessor=lambda x: re.sub(r'\d[\d.]*', 'NUM', x.lower()))
>>> cv.fit(['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45'])
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function <lambda> at 0x109bbcb18>, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
>>> cv.vocabulary_
{u'sentence': 6, u'this': 7, u'is': 4, u'candy': 1, u'dogs': 2, u'second': 5, u'NUM': 0, u'eat': 3}

Precompiling the regular expression would probably give some speedup on a large number of samples.
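
A minimal sketch of the precompiled variant, assuming a named function is acceptable in place of the lambda. `NUMBER_RE` and `replace_numbers` are hypothetical names, and the pattern is just one possible choice for matching numbers.

```python
import re

# Compiled once at import time: a digit followed by any further digits or dots.
NUMBER_RE = re.compile(r"\d[\d.]*")


def replace_numbers(text):
    """Lowercase the document and replace each number with the token 'NUM'."""
    return NUMBER_RE.sub("NUM", text.lower())


print(replace_numbers("12 dogs eat 3.5 candies"))
```

The function can be passed as `preprocessor=replace_numbers` when constructing the CountVectorizer. Python's re module already caches recently used patterns, so the gain may be modest in practice.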

Answer 2

import re
from sklearn.feature_extraction.text import CountVectorizer

list_of_texts = ['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45']

def no_number_preprocessor(text):
    # Lowercase the document, then replace every run of digits with 'NUM'.
    r = re.sub(r'\d+', 'NUM', text.lower())
    # This alternative just removes numbers:
    # r = re.sub(r'\d+', '', text.lower())
    return r

for t in list_of_texts:
    no_num_t = no_number_preprocessor(t)
    print(no_num_t)

cv = CountVectorizer(input='content', preprocessor=no_number_preprocessor)
dtm = cv.fit_transform(list_of_texts)
# scikit-learn >= 1.0; older releases use cv.get_feature_names() instead.
cv_vocab = cv.get_feature_names_out().tolist()

print(cv_vocab)

Output:

this is sentence.
this is a second sentence.
NUM dogs eat candy
NUM NUM NUM NUM
['NUM', 'candy', 'dogs', 'eat', 'is', 'second', 'sentence', 'this']

Adding numbers to stop_words in scikit-learn's CountVectorizer

2 answers: