Movie rating prediction using TF-IDF

Time: 2019-01-26 11:40:46

Tags: scikit-learn tf-idf python-textprocessing

I have a dataset in the following format:

Movie Name, TomatoCritics, Target_Variable

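Here, the TomatoCritics attribute contains free text written by different users about different movies. Target_Variable is a binary value (0 or 1) indicating whether the movie should be watched.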

I am using TF-IDF to process the text, and my code is as follows:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer


# Read textual training data-
text_training = pd.read_csv("Textual-Training_Data.csv")

# Read textual testing data-
text_testing = pd.read_csv("Textual-Testing_Data.csv")

# Get dimensions of training data-
text_training.shape
# (95, 3)

# Get dimensions of testing data-
text_testing.shape
# (224, 3)


# Check for missing values in training data-
text_training.isnull().values.any()
# True

# Check for missing values in testing data-
text_testing.isnull().values.any()
# True

# Remove any row having missing value from training data-
text_training_nona = text_training.dropna(axis = 0, how='any')

# Remove any row having missing value from testing data-
text_testing_nona = text_testing.dropna(axis = 0, how = 'any')

# Get dimensions of training data AFTER removing empty rows-
text_training_nona.shape
# (73, 3)

# Get dimensions of testing data AFTER removing empty rows-
text_testing_nona.shape
# (158, 3)


# Attributes to use for training and testing sets for ML-
cols_train = ['tomatoConsensus', 'goodforairplanes']
cols_test = ['tomatoConsensus', 'goodforairplanes']



# Split training dataset into features (X) and label (y) for training-
X_train = text_training_nona['tomatoConsensus']
y_train = text_training_nona['goodforairplanes']


# Split training dataset into features (X) and label (y) for testing-
X_test = text_testing_nona["tomatoConsensus"]
y_test = text_testing_nona['goodforairplanes']




# Initialize Count Vectorizer using TF-IDF ->
cv = TfidfVectorizer(min_df = 1, stop_words='english')

# Convert text to TF-IDF ->
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)

# Multinomial Naive Bayes classifier-
mnb = MultinomialNB()

# Train model on training data-
mnb.fit(X_train_cv, y_train)

print(X_test_cv[0])
'''
(0, 1168)   0.20066499253877468
  (0, 31)   0.2419027475877309
  (0, 1090) 0.22790133982975397
  (0, 5)    0.2616366234663056
  (0, 877)  0.2616366234663056
  (0, 1279) 0.2419027475877309
  (0, 850)  0.1786670002268731
  (0, 1341) 0.2616366234663056
  (0, 2)    0.2616366234663056
  (0, 695)  0.2616366234663056
  (0, 1221) 0.2419027475877309
  (0, 884)  0.1786670002268731
  (0, 1070) 0.2616366234663056
  (0, 782)  0.2616366234663056
  (0, 252)  0.20066499253877468
  (0, 1259) 0.2419027475877309
  (0, 1093) 0.20816746395117927
  (0, 122)  0.2170410042381541
'''

y_pred = mnb.predict(X_test_cv[0])

The last line, which calls mnb.predict(), raises this error:

  

ValueError: dimension mismatch

What is going wrong?

Thanks!

1 answer:

Answer 0 (score: 1)

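You should call fit_transform only once, on the training data, and then use the already fitted cv object to transform the test data. Change

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)

to

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

and this should solve your problem.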

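If you call fit_transform again on different data, it will probably see a different number of unique words and build a vocabulary of a different size. The mnb model was trained on matrices with as many columns as the training vocabulary has words, while the refitted test matrix has as many columns as the other vocabulary. That is the cause of ValueError: dimension mismatch.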
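The point about vocabulary size can be seen on a toy example. This is only a minimal sketch: train_docs and test_docs are hypothetical stand-ins for X_train and X_test, and the vectorizer uses default settings rather than the question's min_df and stop_words arguments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for X_train and X_test.
train_docs = ["a funny and moving film", "a dull and slow film"]
test_docs = ["a moving story with a great cast and a clever script"]

# Fitting on each corpus separately learns two unrelated vocabularies.
# vocabulary_ maps each learned term to its column index.
train_vocab = TfidfVectorizer().fit(train_docs).vocabulary_
test_vocab = TfidfVectorizer().fit(test_docs).vocabulary_

# The two sizes are the column counts of the matrices each fit would produce.
print(len(train_vocab), len(test_vocab))
```

Since each fit determines the number of columns, a classifier trained on the first matrix cannot score rows taken from the second.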

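Edit
Just check the shapes of X_test_cv and X_train_cv in both cases: if you call fit_transform on both X_train and X_test, the two matrices will have different numbers of columns, but if you replace the second fit_transform with transform, the number of columns will be the same.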
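The shape check described above might look like the following sketch. The documents and labels are hypothetical toy data, and the "wrong" case uses a second vectorizer so that cv itself stays fitted on the training data (in the question, the same cv object was refitted instead).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the question's columns.
train_docs = ["a funny and moving film", "a dull and slow film"]
train_labels = [1, 0]
test_docs = ["a moving story with a great cast and a clever script"]

cv = TfidfVectorizer()
X_train_cv = cv.fit_transform(train_docs)

# Wrong: a fresh fit relearns the vocabulary from the test documents.
X_test_refit = TfidfVectorizer().fit_transform(test_docs)
# Right: reuse the vocabulary learned from the training documents.
X_test_cv = cv.transform(test_docs)

# Compare the column counts of the three matrices.
print(X_train_cv.shape, X_test_refit.shape, X_test_cv.shape)

# With matching column counts, prediction no longer raises ValueError.
mnb = MultinomialNB().fit(X_train_cv, train_labels)
y_pred = mnb.predict(X_test_cv)
```

Words that appear only in the test documents are simply ignored by transform, which is the price of keeping the feature space fixed.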
