Question

I have a corpus like this:

'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?'

I am using this vocabulary: ["this", "document", "this document"]. After fitting the vectorizer, I get the following result:

[[1 1 0]
[1 2 1]
[1 0 0]
[1 1 0]]

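This is correct. Is there a way to use a regex (or some other approach) to get the "this document" feature on the first row of the corpus? More precisely, to get [1 1 1] rather than [1 1 0]?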

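My row is: ["This is the first document."]. Can I somehow "remove" the words "is the first" (or any other words) so that I get the "this document" feature? Maybe using token_pattern?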

Answer 1

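Figured it out. What I actually wanted was to create features from all word combinations (unigrams and bigrams) in my corpus. For example, for my row: This is the first document. Extracted features: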

this, 
is, 
the, 
first, 
document, 
this is, 
this the, 
this document, 
is the, 
is first, 
is document, 
the first, 
the document, 
first document

I did this by writing my own tokenizer and passing it to the tokenizer parameter of CountVectorizer().
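
The answer does not include the tokenizer itself, so the following is only a minimal sketch of what such a tokenizer might look like. The function name `pair_tokenizer`, the word regex, and the use of `itertools.combinations` are assumptions made for illustration, not the author's actual code.

```python
import re
from itertools import combinations

from sklearn.feature_extraction.text import CountVectorizer


def pair_tokenizer(text):
    """Return every word plus every ordered pair of words in the text.

    Hypothetical helper: pairs keep the left-to-right order of the
    sentence and need not be adjacent.
    """
    # Same word pattern as CountVectorizer's default token_pattern.
    words = re.findall(r"(?u)\b\w\w+\b", text)
    pairs = [" ".join(pair) for pair in combinations(words, 2)]
    return words + pairs


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# token_pattern=None avoids a warning about the unused default pattern;
# lowercasing is still applied before the tokenizer is called.
vectorizer = CountVectorizer(
    tokenizer=pair_tokenizer,
    token_pattern=None,
    vocabulary=['this', 'document', 'this document'],
)
X = vectorizer.fit_transform(corpus)

print(X.toarray()[0])  # first row: [1 1 1]
```

With this sketch the first row becomes [1 1 1], which is what the question asked for. Note that it generates every ordered pair, so it also emits "this first", which does not appear in the list above, and a word that occurs twice in a sentence is paired twice.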