Is there a 1:1 correspondence between the Keras Tokenizer's word index and the NN weights?

Asked: 2019-01-11 17:09:52

Tags: keras

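I'm trying to teach myself Keras, so I wrote a simple NN (really just a matrix transformation) that maps a set of integer-encoded words to pre-trained embeddings of some documents associated with those words. To produce the integer encoding I used the Keras Tokenizer ("texts_to_matrix"). The Tokenizer arranges the words into a dictionary ("word_index").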

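After training the NN, I get a weight matrix whose number of rows equals my vocabulary size. My question is: is it guaranteed that the rows of the weight matrix are ordered strictly by the indices in the Tokenizer's word_index? That would mean, for example, that the weights in row 3 correspond to the word with index 3 in the Tokenizer's dictionary.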

Here is my toy example:

import pandas as pd
import numpy as np

# some random "documents" and embeddings
docs = ['machine learning','deep learning','artificial intelligence','supervised learning']
emds = np.random.rand(4,5).tolist()

df = pd.DataFrame({'searches':docs,'mean_embed':emds})

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

search_train = df['searches']
upc_embed_train = df['mean_embed']

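# we got 6 different words; Tokenizer reserves index 0 and keeps only
# word indices below num_words, so num_words must be 6 + 1 to keep them all
vocab_size = 7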
t = Tokenizer(num_words = vocab_size,char_level=False)
t.fit_on_texts(search_train)

print("word index  = ",t.word_index)

This gives us:

word index  =  {'learning': 1, 'machine': 2, 'deep': 3, 'artificial': 4, 'intelligence': 5, 'supervised': 6}
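The numbering matters for what follows, so here is a minimal NumPy sketch of how I understand the count matrix to be laid out. It is not the Keras source: `count_matrix` is a hypothetical helper written for illustration, and the assumptions are that words are ranked by frequency starting at 1, that column j counts the word whose index is j, that column 0 is never filled, and that indices at or above `num_words` are dropped.

```python
from collections import Counter

import numpy as np

docs = ['machine learning', 'deep learning',
        'artificial intelligence', 'supervised learning']

# Rank words by frequency; ties keep first-seen order. Ranks start at 1,
# which leaves index 0 unused.
counts = Counter(w for d in docs for w in d.lower().split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}


def count_matrix(texts, word_index, num_words):
    """Hypothetical helper: column j holds the count of the word with index j."""
    x = np.zeros((len(texts), num_words))
    for row, text in enumerate(texts):
        for w in text.lower().split():
            j = word_index.get(w)
            # Indices >= num_words do not fit in the matrix and are dropped.
            if j is not None and j < num_words:
                x[row, j] += 1
    return x


m6 = count_matrix(docs, word_index, num_words=6)  # 'supervised' (index 6) is lost
m7 = count_matrix(docs, word_index, num_words=7)  # all six words are kept

print(word_index)
print(m6)
print(m7)
```

Under these assumptions, six distinct words need seven columns: one per word plus the unused column 0.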

# integer encode documents: one row per document, one column per word index
encoded_docs_train = t.texts_to_matrix(search_train, mode='count')
# stack the per-document embeddings into a (4, 5) float array
output_emb_train2 = np.array(upc_embed_train.tolist())

from keras.models import Model, Sequential
from keras.layers import Input, Dense, Flatten, Embedding, LSTM

model = Sequential()
# the 5 is to match the dimensions of the embedding vector 

model.add(Dense(5,input_dim=vocab_size, activation='linear',use_bias=True))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

model.fit(encoded_docs_train,output_emb_train2)

tmp1 = model.layers[0].get_weights()[0]
tmp2 = model.layers[0].get_weights()[1]

# now can we guarantee that weight 2 corresponds to
# word_index with value = 2 ("machine"), etc?
value_of_interest = 2
for word, counter in t.word_index.items():
    if counter == value_of_interest:
        theword = word
        # no offset here: texts_to_matrix puts word index k in column k
        # (column 0 is reserved and unused), so row k of the kernel is the
        # one multiplied by that word's count
        weight_for_word = tmp1[value_of_interest, :]
        print("word = ", theword, ", its weight ", weight_for_word)

It prints:

word =  machine , its weight  [-0.69765383 -0.63167644  0.62771523 -0.68510187  0.5576754 ]

I couldn't find anything in the Keras documentation that assures me this is really the case, so any help would be much appreciated.
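One way to reason about the row order without relying on the documentation is to look at what a linear Dense layer computes, namely `x @ W + b` with `W` of shape `(input_dim, units)`. The sketch below uses plain NumPy with random stand-ins for the trained weights (`W` and `b` are assumptions, not values from the model above) and feeds a one-hot input for a single word. The output minus the bias is exactly the row of `W` at that word's column position, so if the input column for a word equals its word_index value, the matching row is `W[word_index[word]]` with no offset of 1.

```python
import numpy as np

rng = np.random.default_rng(0)

num_words, embed_dim = 7, 5  # 6 words plus the unused index 0
W = rng.normal(size=(num_words, embed_dim))  # stand-in for get_weights()[0]
b = rng.normal(size=embed_dim)               # stand-in for get_weights()[1]

word_index = {'learning': 1, 'machine': 2, 'deep': 3,
              'artificial': 4, 'intelligence': 5, 'supervised': 6}


def dense_linear(x):
    """Forward pass of a Dense layer with linear activation."""
    return x @ W + b


# Encode the single word 'machine' the way a count matrix row would:
# a 1 in the column given by its word index, zeros elsewhere.
k = word_index['machine']
x = np.zeros(num_words)
x[k] = 1.0

contribution = dense_linear(x) - b

print(np.allclose(contribution, W[k]))      # row k is the one that is used
print(np.allclose(contribution, W[k - 1]))  # row k - 1 belongs to another word
```

The same check can be run against a trained model by replacing `W` and `b` with the arrays returned by `get_weights()`. Row 0 then corresponds to the reserved index and never receives a nonzero input.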

0 answers:

No answers yet