Using pretrained sentence embeddings with a recurrent network

Date: 2019-07-09 20:02:53

Tags: python keras deep-learning nlp embedding

I want to use Universal Sentence Encoder embeddings with a recurrent network.

With conventional word embeddings, an RNN encodes each word as a vector, and the RNN's time_step is the number of words in a sentence.

What I want to do instead is use sentence embeddings, so that each sentence is encoded as a 512-dimensional vector. The RNN's time_step then becomes the number of sentences in the text, which in my case is an IMDB review.
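Concretely, the input layout I have in mind looks like this (shapes only; the sentence counts and batch size are just examples, and reviews would be padded to a common number of sentences):

import numpy as np

# one review: each sentence encoded to a 512-d vector -> shape (num_sentences, 512)
one_review = np.zeros((7, 512), dtype=np.float32)   # e.g. a review with 7 sentences

# a batch of reviews padded to the same sentence count -> (batch, max_sentences, 512);
# the second axis (number of sentences) is the RNN's time_step
batch = np.zeros((32, 20, 512), dtype=np.float32)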

I am trying this on IMDB binary classification. The problem is that the model does not learn no matter how I tune the hyperparameters: training and test accuracy stay at 50%, which means the model only ever predicts one of the two classes.

Any help would be greatly appreciated! The model summary and training output are below.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 128)               131584
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 131,842
Trainable params: 131,842
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
249/249 [==============================] - 55s 220ms/step - loss: 0.6937 - acc: 0.5004 - val_loss: 0.6931 - val_acc: 0.5061
Epoch 2/10
249/249 [==============================] - 68s 274ms/step - loss: 0.6970 - acc: 0.5002 - val_loss: 0.6942 - val_acc: 0.5009
Epoch 3/10
249/249 [==============================] - 71s 285ms/step - loss: 0.6947 - acc: 0.4961 - val_loss: 0.6980 - val_acc: 0.5009
Epoch 4/10
249/249 [==============================] - 70s 279ms/step - loss: 0.6938 - acc: 0.4998 - val_loss: 0.6956 - val_acc: 0.5033
Epoch 5/10
249/249 [==============================] - 66s 267ms/step - loss: 0.6936 - acc: 0.5018 - val_loss: 0.6939 - val_acc: 0.5046
Epoch 6/10
249/249 [==============================] - 63s 251ms/step - loss: 0.6931 - acc: 0.5003 - val_loss: 0.6933 - val_acc: 0.5058
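For reference, here is a minimal sketch of the kind of model I am describing (illustrative rather than my exact code; MAX_SENTS, the masking layer and the one-hot label setup are assumptions):

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

MAX_SENTS = 20    # assumed maximum number of sentences per review after padding
EMBED_DIM = 512   # Universal Sentence Encoder output size

model = Sequential([
    # skip the zero-padded sentence positions
    Masking(mask_value=0.0, input_shape=(MAX_SENTS, EMBED_DIM)),
    LSTM(128),
    Dense(2, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',   # expects one-hot labels of shape (batch, 2)
              metrics=['accuracy'])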

The code used to pre-embed the text is:

import os
import pickle

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from nltk.tokenize import sent_tokenize   # needs nltk's 'punkt' data (nltk.download('punkt'))

file = 'train.csv'
df = pd.read_csv(file)
# df['sentiment'] = [1 if sentiment == 'positive' else 0 for sentiment in df['sentiment'].values]
x = df['review'].values
y = df['sentiment'].values

# split each review into its individual sentences
x_sent = []
for review in x:
    x_sent.append(sent_tokenize(review))

num_sample = len(x)
val_split = int(num_sample * 0.5)
x_train, y_train = x_sent[:val_split], y[:val_split]   # first half for training
x_test, y_test = x_sent[val_split:], y[val_split:]     # second half held out for testing

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/2"
out_dir = 'use(dan)'
os.makedirs(out_dir, exist_ok=True)   # make sure the output directory exists
embed = hub.Module(module_url)        # Universal Sentence Encoder via the TF1 hub.Module API
# session config used below (not defined in the original question; a default is assumed here)
config = tf.ConfigProto()

num_files = 10
n_file = num_sample // num_files   # number of reviews per output file

for n in range(num_files):

    # defined inside the loop so it closes over the current chunk index n
    def batch_embed(batch, labels, lens, set_):
        """
        batch:   flat 1-D list of sentences for every review in this chunk
        labels:  label for each review
        lens:    number of sentences in each review (offsets into `batch`)
        set_:    'train' | 'test'
        """
        with tf.Session(config=config) as session:
            session.run([tf.global_variables_initializer(), tf.tables_initializer()])

            print('Getting embeddings for the {} data'.format(set_))
            path = os.path.join(out_dir, 'embed_{}_{}.bin'.format(set_, n))
            if not os.path.exists(path):
                # one 512-d vector per sentence
                embeddings = session.run(embed(batch))

                # regroup the flat sentence embeddings into one array per review
                offset = 0
                review_embeddings = []
                for l in lens:
                    review_embeddings.append(embeddings[offset : offset + l])
                    offset += l
                with open(path, 'wb') as f:
                    pickle.dump((review_embeddings, labels), f)

                # sanity check: flag any review that ended up with no sentences
                for i, re in enumerate(review_embeddings):
                    if re.shape[0] == 0:
                        print(i)

    train_batch = x_train[n * n_file : min(len(x_train), (n + 1) * n_file)]
    labels = y_train[n * n_file : min(len(y_train), (n + 1) * n_file)]
    lens = [len(review) for review in train_batch]
    sent_batch = [sent for review in train_batch for sent in review]
    print(len(sent_batch))
    batch_embed(sent_batch, labels, lens, 'train')

    test_batch = x_test[n * n_file : min(len(x_test), (n + 1) * n_file)]
    labels = y_test[n * n_file : min(len(y_test), (n + 1) * n_file)]
    lens = [len(review) for review in test_batch]
    sent_batch = [sent for review in test_batch for sent in review]
    print(len(sent_batch))
    batch_embed(sent_batch, labels, lens, 'test')
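For completeness, reading the pickled chunks back for training could look roughly like this (load_embedded is just an illustrative helper name, not part of the code above):

import os
import pickle

def load_embedded(set_, out_dir='use(dan)', num_files=10):
    """Load and concatenate the per-chunk pickles written by batch_embed."""
    reviews, labels = [], []
    for n in range(num_files):
        path = os.path.join(out_dir, 'embed_{}_{}.bin'.format(set_, n))
        if not os.path.exists(path):
            continue
        with open(path, 'rb') as f:
            r, l = pickle.load(f)
        reviews.extend(r)   # each element is a (num_sentences_i, 512) array
        labels.extend(l)
    return reviews, labels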

The model is a very simple LSTM: a single layer with 128 units, as in the summary above. Because each IMDB review contains a different number of sentences, every batch is padded to a common length, as sketched below.
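A sketch of that padding step (illustrative; it assumes each review is a (num_sentences, 512) array as produced by the code above):

import numpy as np

def pad_reviews(review_embeddings, max_sents=None, dim=512):
    """Zero-pad each review's sentence embeddings to a common length."""
    if max_sents is None:
        max_sents = max(len(r) for r in review_embeddings)
    batch = np.zeros((len(review_embeddings), max_sents, dim), dtype=np.float32)
    for i, r in enumerate(review_embeddings):
        n = min(len(r), max_sents)
        if n:
            batch[i, :n] = r[:n]
    return batch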

0 Answers:

No answers yet.