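I am building an email classification model. I process the email bodies with TfidfVectorizer and use the resulting tf-idf values as the model's input.
After saving the model to the file "label_email_model.h5", I want to load it and run predictions on a new dataset. I use the function get_tfidf_vectorizer to return a new TfidfVectorizer object with the same parameters as the one used on the training dataset, but I get the error "NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted."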
from keras.models import load_model

model = load_model('label_email_model.h5')

def get_tfidf_vectorizer(max_features=1000):
    from nltk import word_tokenize
    from sklearn.feature_extraction.text import TfidfVectorizer
    return TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='utf-8',
                           ngram_range=(1, 2), stop_words='english',
                           max_features=max_features)

tfidf = get_tfidf_vectorizer(2000)
# The NotFittedError is raised here: tfidf is a new object that was never fitted
result = model.predict(tfidf.transform(data["body"]))
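For reference, here is a minimal sketch of what seems to be happening, and of one common way around it. It assumes the cause is that a newly constructed TfidfVectorizer holds only its parameters, not the vocabulary and idf weights learned from the training data. The tiny corpus, the variable names, and the in-memory pickle are made up for illustration, and min_df, stop_words and max_features are left at their defaults so the toy corpus is not pruned away.

```python
import pickle

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the training emails and the new emails.
train_bodies = [
    "meeting moved to monday morning",
    "invoice attached for last month",
    "monday invoice reminder for the meeting",
]
new_bodies = ["reminder: invoice due monday"]

# A freshly constructed vectorizer has no vocabulary or idf weights yet,
# so calling transform() on it raises NotFittedError.
fresh = TfidfVectorizer(sublinear_tf=True, norm="l2", ngram_range=(1, 2))
try:
    fresh.transform(new_bodies)
    raised = False
except NotFittedError:
    raised = True

# Training time: fit once on the training text, then persist the fitted object.
fitted = TfidfVectorizer(sublinear_tf=True, norm="l2", ngram_range=(1, 2))
train_matrix = fitted.fit_transform(train_bodies)
blob = pickle.dumps(fitted)  # in practice this would be written to a file

# Prediction time: restore the fitted object and call transform() only.
restored = pickle.loads(blob)
new_matrix = restored.transform(new_bodies)

# The new data is mapped onto the same columns the model was trained on.
print(raised, train_matrix.shape[1] == new_matrix.shape[1])
```

With a vectorizer restored this way, the prediction step would presumably become model.predict(restored.transform(data["body"])), possibly with .toarray() if the model does not accept sparse input.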
The question is:
Update: adding model information:
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_141 (Dense)            (None, 128)               256128
_________________________________________________________________
activation_141 (Activation)  (None, 128)               0
_________________________________________________________________
dropout_86 (Dropout)         (None, 128)               0
_________________________________________________________________
dense_142 (Dense)            (None, 256)               33024
_________________________________________________________________
activation_142 (Activation)  (None, 256)               0
_________________________________________________________________
dropout_87 (Dropout)         (None, 256)               0
_________________________________________________________________
dense_143 (Dense)            (None, 7)                 1799
_________________________________________________________________
activation_143 (Activation)  (None, 7)                 0
=================================================================
Total params: 290,951
Trainable params: 290,951
Non-trainable params: 0
_________________________________________________________________
model.input
Out[290]: <tf.Tensor 'dense_141_input_1:0' shape=(?, 2000) dtype=float32>
model.output
Out[291]: <tf.Tensor 'activation_143_1/Softmax:0' shape=(?, 7) dtype=float32>
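As a sanity check on the summary above, the parameter counts can be reproduced from the layer widths alone. This is only a sketch of the arithmetic, with the widths read off the model.input and model.summary() output; layer_sizes and dense_params are names chosen here for illustration. The first count only works out with 2000 input features, which matches max_features = 2000, so whatever vectorizer is used at prediction time has to produce exactly 2000 columns.

```python
# Layer widths read off the output above: 2000 input features, then 128, 256, 7 units.
layer_sizes = [2000, 128, 256, 7]

# A Dense layer has (inputs * units) weights plus one bias per unit.
dense_params = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(dense_params)

print(dense_params, total_params)
# prints: [256128, 33024, 1799] 290951
```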