Question

I am trying to build a text classifier, but I ran into a problem when encoding documents with the tokenizer.

1) I first fit a tokenizer on my documents, as shown here:

vocabulary_size = 20000
tokenizer = Tokenizer(num_words= vocabulary_size, filters='')
tokenizer.fit_on_texts(df['data'])

2) Then I wanted to check that it was fitted correctly on my data, so I converted the documents to sequences as follows:

sequences = tokenizer.texts_to_sequences(df['data'])
data = pad_sequences(sequences, maxlen= num_words) 
print(data)

This gave me the output I expected, with the words encoded as numbers:

[[ 9628  1743    29 ...   161    52   250]
 [14948     1    70 ...    31   108    78]
 [ 2207  1071   155 ... 37607 37608   215]
 ...
 [  145    74   947 ...     1    76    21]
 [   95 11045  1244 ...   693   693   144]
 [   11   133    61 ...    87    57    24]]

Now I want to convert a single text to a sequence using the same method, like this:

sequences = tokenizer.texts_to_sequences("physics is nice ")
text = pad_sequences(sequences, maxlen=num_words)
print(text)

It gives me this strange output:

[[   0    0    0    0    0    0    0    0    0  394]
 [   0    0    0    0    0    0    0    0    0 3136]
 [   0    0    0    0    0    0    0    0    0 1383]
 [   0    0    0    0    0    0    0    0    0  507]
 [   0    0    0    0    0    0    0    0    0    1]
 [   0    0    0    0    0    0    0    0    0 1261]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0 1114]
 [   0    0    0    0    0    0    0    0    0    1]
 [   0    0    0    0    0    0    0    0    0 1261]
 [   0    0    0    0    0    0    0    0    0  753]]

According to the Keras documentation (Keras):

texts_to_sequences(texts)

Arguments: texts: list of texts to turn to sequences.

Return: list of sequences (one per text input).

Shouldn't it encode each word as its corresponding number, and then pad the text to 50 if it is shorter than 50? Where is the mistake?

Answer 1

You should try calling it like this:

sequences = tokenizer.texts_to_sequences(["physics is nice"])
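The reason the list matters is that texts_to_sequences treats its argument as a collection of texts, and a bare Python string iterates one character at a time, so every character becomes its own "text". Below is a minimal sketch of that effect in plain Python. The function texts_to_sequences_sketch and the word_index values are hypothetical stand-ins made up for illustration; this is not the Keras implementation.

```python
def texts_to_sequences_sketch(texts, word_index):
    """Hypothetical stand-in that mimics how a tokenizer walks its input."""
    sequences = []
    for text in texts:  # iterating over a str yields single characters
        words = text.lower().split()
        # Words missing from the vocabulary are skipped.
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences


# Made-up vocabulary, for illustration only.
word_index = {"physics": 1, "is": 2, "nice": 3, "p": 4, "s": 5}

as_string = texts_to_sequences_sketch("physics is nice ", word_index)
as_list = texts_to_sequences_sketch(["physics is nice "], word_index)

print(len(as_string))  # one sequence per character of the string
print(as_list)         # a single sequence holding the three word ids
```

With the bare string, each character yields a sequence of at most one id (and spaces yield empty sequences), which after padding looks like the rows of zeros in the question.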

Answer 2

The mistake is where you pad the sequences. The value of maxlen should be the maximum number of tokens you want, e.g. 50. So change the lines to:

maxlen = 50
data = pad_sequences(sequences, maxlen=maxlen)
sequences = tokenizer.texts_to_sequences(["physics is nice "])  # pass a list of texts, not a bare string
text = pad_sequences(sequences, maxlen=maxlen)

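This truncates sequences to 50 tokens and pads shorter ones with zeros. Pay attention to the padding option. The default is 'pre', which means that if a sentence is shorter than maxlen, the padded sequence starts with zeros. If you want the zeros added at the end of the sequence instead, pass padding='post' to pad_sequences.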
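To make the 'pre' versus 'post' behaviour concrete, here is a rough pure-Python sketch of the padding and truncation logic. The helper pad_sketch is hypothetical and written only for illustration; it assumes the documented pad_sequences defaults (padding='pre', truncating='pre', value 0) and returns plain lists rather than a NumPy array.

```python
def pad_sketch(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper imitating the pad/truncate rules of pad_sequences."""
    padded = []
    for seq in sequences:
        # Truncate long sequences: 'pre' drops tokens from the front,
        # 'post' drops tokens from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Pad short sequences: 'pre' puts the zeros first, 'post' puts them last.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


print(pad_sketch([[1, 2, 3]], maxlen=5))                   # zeros in front
print(pad_sketch([[1, 2, 3]], maxlen=5, padding="post"))   # zeros at the end
print(pad_sketch([[1, 2, 3, 4, 5, 6]], maxlen=4))          # keeps the last 4 tokens
```

Either way, every row ends up with exactly maxlen entries, which is what lets the rows be stacked into one array.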

Answer 3

I think you should call it like this:

sequences = tokenizer.texts_to_sequences(["physics is nice "])

Answer 4

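pad_sequences pads sequences to the same length, which in your case is num_words = vocabulary_size, and that is why you are getting this output. Just try tokenizer.texts_to_sequences on its own; it will give you the sequence of word indices. Read more about padding: it is only used to make every row of your data the same length. Take two sentences as an example, sentence 1 and sentence 2. Sentence 1 has a length of 5, while sentence 2 has a length of 8. When we send the data for training, if we don't pad sentence 1 with 3 extra values, we cannot perform batch-wise training. Hope this helps.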

Answer 5

You should call the methods like this:

new_sample = ['A new sample to be classified']
seq = tokenizer.texts_to_sequences(new_sample )
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)

Answer 6

You can pass the text as shown below to get the output.

twt = ['He is a lazy person.']
twt = tokenizer.texts_to_sequences(twt)
print (twt)

or

twt = tokenizer.texts_to_sequences(['He is a lazy person.'])
print (twt)

tokenizer.texts_to_sequences Keras Tokenizer gives almost all zeros
