Question

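I don't want terms that are shorter than 3 or longer than 7 characters. There is a straightforward way to do this in R, but I'm not sure how to do it in Python. I tried the following, but it still doesn't work: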

from sklearn.feature_extraction.text import CountVectorizer
regex1 = '/^[a-zA-Z]{3,7}$/'
vectorizer = CountVectorizer( analyzer='word',tokenizer= tokenize,stop_words = stopwords,token_pattern  = regex1,min_df= 2, max_df = 0.9,max_features = 2000)
vectorizer1 = vectorizer.fit_transform(token_dict.values())

I also tried these other regular expressions -

  "^[a-zA-Z]{3,7}$"
r'^[a-zA-Z]{3,7}$'

Answer 1

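According to the CountVectorizer documentation, the default token_pattern selects tokens of 2 or more alphanumeric characters. If you want to change that, pass your own regular expression.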
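To make the default behaviour concrete, here is a minimal sketch that applies the documented default pattern, `(?u)\b\w\w+\b`, with plain `re.findall`. It assumes CountVectorizer lowercases the text and then applies token_pattern with findall semantics; the helper `tokens` and the sample sentence are made up for illustration.

```python
import re

# Default token_pattern as documented for CountVectorizer:
# runs of two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def tokens(pattern, text):
    """Hypothetical helper: lowercase the text, then collect every regex match."""
    return re.findall(pattern, text.lower())


default_tokens = tokens(DEFAULT_TOKEN_PATTERN, "A bb ccc x1 constitutional")
print(default_tokens)  # ['bb', 'ccc', 'x1', 'constitutional']
```

The single-character "A" is dropped, but there is no upper limit on length, which is why a custom pattern is needed here.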

In your case, add token_pattern = "^[a-zA-Z]{3,7}$" to the CountVectorizer options.

Edit

The regular expression to use is [a-zA-Z]{3,7}. See the example below -

doc1 = ["Elon Musk is genius", "Are you mad", "Constitutional Ammendments in Indian Parliament",
        "Constitutional Ammendments in Indian Assembly", "House of Cards", "Indian House"]
from sklearn.feature_extraction.text import CountVectorizer
regex1 = '[a-zA-Z]{3,7}'
vectorizer = CountVectorizer(analyzer='word', stop_words = 'english', token_pattern = regex1)
vectorizer1 = vectorizer.fit_transform(doc1)
vectorizer.vocabulary_

Result -

{u'ammendm': 0, u'assembl': 1, u'cards': 2, u'constit': 3, u'elon': 4, u'ent': 5, u'ents': 6, u'genius': 7, u'house': 8, u'indian': 9, u'mad': 10, u'musk': 11, u'parliam': 12, u'utional': 13}
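Note that the vocabulary above contains fragments such as 'constit' and 'utional': without anchors or word boundaries, the pattern cuts long words into chunks of up to 7 letters instead of discarding them. The sketch below reproduces this with plain `re.findall`, assuming that is how token_pattern is applied to the lowercased text; the variable names are only illustrative.

```python
import re

unanchored = r"[a-zA-Z]{3,7}"     # pattern from the example above
bounded = r"\b[a-zA-Z]{3,7}\b"    # same pattern wrapped in word boundaries

text = "constitutional ammendments in indian parliament"

# Unanchored: long words are split into 7-letter chunks plus leftovers.
fragments = re.findall(unanchored, text)
print(fragments)
# ['constit', 'utional', 'ammendm', 'ents', 'indian', 'parliam', 'ent']

# With \b on both sides only whole words of 3 to 7 letters survive.
whole_words = re.findall(bounded, text)
print(whole_words)  # ['indian']
```

If the goal is to drop words longer than 7 characters entirely, the word-boundary form is probably the one to use.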

Answer 2

I think your regex pattern is wrong here. It is written in JavaScript style (with / delimiters). It should be like

regex1 = r'^[a-zA-Z]{3,7}$'

Also, I am assuming the regex should match the whole string and NOT some substring. So a string like aaaaabbb cc should be discarded.

If not, you should use word boundaries \b instead of the start ^ and end $ anchors. So it should be

regex1 = r'\b[a-zA-Z]{3,7}\b'
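The difference between the two patterns can be checked with `re.findall` alone. This is a small sketch, assuming token_pattern is applied to each whole document with findall semantics; the sample strings are invented for illustration.

```python
import re

anchored = r"^[a-zA-Z]{3,7}$"     # must match the entire string
bounded = r"\b[a-zA-Z]{3,7}\b"    # matches each word of 3 to 7 letters

document = "elon musk is genius"

# The anchored form only matches when the whole document is one short word.
anchored_hits = re.findall(anchored, document)
single_word_hits = re.findall(anchored, "horses")
print(anchored_hits)     # []
print(single_word_hits)  # ['horses']

# The word-boundary form picks out every qualifying word in the document.
bounded_hits = re.findall(bounded, document)
print(bounded_hits)      # ['elon', 'musk', 'genius']
```

With the ^ and $ anchors, a multi-word document yields no tokens at all, which matches the "still not working" symptom in the question.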

Here is a working example

from sklearn.feature_extraction.text import CountVectorizer
regex1 = r'\b[a-zA-Z]{3,7}\b'
token_dict = {123: 'horses', 345: 'ab'}
vectorizer = CountVectorizer(token_pattern  = regex1)
vectorizer1 = vectorizer.fit_transform(token_dict.values())

print(vectorizer.get_feature_names_out())  # use get_feature_names() on scikit-learn older than 1.0

Output

['horses']

How to limit token length when using CountVectorizer?
