我一直在考虑单词序列的0填充,然后将0填充转换为Embedding层。乍一看,人们会认为您也希望保持嵌入= 0.0。但是,Embedding
中的keras
层会为任何输入令牌生成随机值,并且无法强制其生成0.0。请注意,mask_zero
的功能有所不同,我已经检查过了。
一个人可能会问,为什么要为此担心,即使嵌入不是0.0,只要它们相同,代码似乎也可以正常工作。所以我想出了一个例子,尽管有些人为的,但将0填充令牌的嵌入设置为0.0会有所不同。
我使用了20个新闻组数据集from sklearn.datasets import fetch_20newsgroups
。我做了一些最少的预处理:删除标点符号,停用词和数字。我使用from keras.preprocessing.sequence import pad_sequences
进行0填充。我将约18K帖子分为培训和验证集,其中培训/验证的比例= 4/1。
我创建了一个简单的1密集隐藏层网络,其输入是嵌入的平坦序列:
EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 1100
layer_size = 25
dropout = 0.3
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, name = 'embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)
x = Flatten(name='flatten_dnn')(embedded_sequences)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
该模型具有约14M的可训练参数(如我已经提到的,此示例有些人为设计)。 培训时
earlystop = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=30, batch_size=BATCH_SIZE, callbacks=[earlystop])
在4个时期内,该算法一直在努力摆脱“随机性”:
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 58s 4ms/step - loss: 3.1118 - acc: 0.0519 - val_loss: 2.9894 - val_acc: 0.0534
Epoch 2/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.9820 - acc: 0.0556 - val_loss: 2.9827 - val_acc: 0.0527
Epoch 3/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9712 - acc: 0.0626 - val_loss: 2.9718 - val_acc: 0.0579
Epoch 4/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9259 - acc: 0.0756 - val_loss: 2.8363 - val_acc: 0.0874
Epoch 5/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.7092 - acc: 0.1390 - val_loss: 2.3251 - val_acc: 0.2796
...
Epoch 13/30
15048/15048 [==============================] - 56s 4ms/step - loss: 0.0698 - acc: 0.9807 - val_loss: 0.5010 - val_acc: 0.8736
它最终的精度为〜0.87
print ('Best validation accuracy is ', max(history.history['val_acc']))
Best validation accuracy is 0.874934175379845
但是,当我将填充0的嵌入明确设置为0.0时
def myMask(x):
mask= K.greater(x,0) #will return boolean values
mask= K.cast(mask, dtype=K.floatx())
return mask
layer_size = 25
dropout = 0.3
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, name = 'embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)
y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH,1))(y)
merge_layer = Multiply(name = 'masked_embedding_dnn')([embedded_sequences,y])
x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
参数数量相同的模型会立即摆脱“随机性”:
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 64s 4ms/step - loss: 2.4356 - acc: 0.3060 - val_loss: 1.2424 - val_acc: 0.7754
Epoch 2/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.6973 - acc: 0.8267 - val_loss: 0.5240 - val_acc: 0.8797
...
Epoch 10/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.0496 - acc: 0.9881 - val_loss: 0.4176 - val_acc: 0.8944
并以〜0.9更好的精度结束。
同样,这是一个有些人为的示例,但是仍然表明,将那些“填充的”嵌入保持为0.0可能是有益的。
我在这里错过了什么吗?如果我什么都没错过,那么Keras没有立即提供此功能的原因是什么?
更新
@DanielMöller我尝试了您的建议:
layer_size = 25
dropout = 0.3
init = RandomUniform(minval=0.0001, maxval=0.05, seed=None)
constr = NonNeg()
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1,
EMBEDDING_DIM,
input_length=MAX_SEQUENCE_LENGTH,
name = 'embedding_dnn',
embeddings_initializer=init,
embeddings_constraint=constr)
embedded_sequences = embedding_layer(sequence_input)
y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH,1))(y)
merge_layer = Multiply(name = 'masked_embedding_dnn')([embedded_sequences,y])
x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name ='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name = 'output_dnn')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
不幸的是,网络陷入了“随机性”:
Train on 15197 samples, validate on 3649 samples
Epoch 1/30
15197/15197 [==============================] - 60s 4ms/step - loss: 3.1354 - acc: 0.0505 - val_loss: 2.9943 - val_acc: 0.0496
....
Epoch 24/30
15197/15197 [==============================] - 60s 4ms/step - loss: 2.9905 - acc: 0.0538 - val_loss: 2.9907 - val_acc: 0.0496
我也尝试了没有NonNeg()
约束的情况,结果相同。
答案 0 :(得分:1)
好吧,您消除了与填充步骤有关的权重梯度的计算。
如果填充步骤太多,则有关填充值的嵌入权重将参与大量计算,并将与其他权重明显竞争。但是训练这些权重是浪费计算,并且换句话说肯定会干扰。
还要考虑,例如,某些填充权重的值可能在有意义的单词的值之间。因此,增加权重可能会使它与另一个单词相似。并且也在减少。...
这些额外的计算,对损耗和梯度计算的额外贡献等将产生更多的计算需求和更多的障碍。就像在数据中间有很多垃圾一样。
还请注意,这些零直接进入密集层,这也将消除许多密集权重的梯度。尽管与较短的序列相比很少,但这可能会适合较长的序列。
出于好奇,如果这样做,会发生什么?
from keras.initializers import RandomUniform
from keras.constraints import NonNeg
init = RandomUniform(minval=0.0001, maxval=0.05, seed=None)
constr = NonNeg()
......
embedding_layer = Embedding(len(word_index) + 1,
EMBEDDING_DIM,
input_length=MAX_SEQUENCE_LENGTH,
name = 'embedding_dnn',
embeddings_initializer=init,
embeddings_constraint=constr)
..........