I'm currently working on a project where I scrape retailer-related reviews from a review site. The aim is to classify each review in the dataset by topic, 'Delivery' or 'Customer Service', using a random forest classifier.
After inspecting the dataset, over 90% of the reviews (both training and test data) relate to 'Delivery'. My lecturer told me we need to account for sample bias. I've researched this and tried to implement a correction in the Python below using ADASYN (near the bottom of the code):
import pandas as pd
# Read the file in chunks; it was read with header=None, so the original
# header line arrives as the first data row and is dropped with iloc[1:]
chunksize = 10
TextFileReader = pd.read_csv('TestToSentimentAnalyse.csv', chunksize=chunksize, header=None)
dataset = pd.concat(TextFileReader, ignore_index=False)
dataset.columns = ['Reviews', 'Delivery', 'Customer_Service', 'Purchase_Date', 'Likelihood_to_Recommend',
                   'Overall_Satisfaction', 'Location', 'Date_Published', 'Sentiment']
dataset = dataset.iloc[1:]
# Cleaning the texts
import re
corpus = []
for i in range(1, 29779):
    # strip non-letters and lower-case each review before vectorising
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['Reviews'][i])).lower()
    corpus.append(review)
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
# Set up dependent variable - in the branches below delivery is 1, customer service is 0
y = []
for i in range(1, 29779):
    # normalise to strings once, so int and string values take the same branch
    delivery = str(dataset['Delivery'][i])
    service = str(dataset['Customer_Service'][i])
    if delivery == '2':
        y.append(1)
    elif service == '2':
        y.append(0)
    elif delivery == '0' and service == '0':
        y.append(0.5)  # flaw in this as we had to choose one
    elif delivery == '1' and service == '1':
        y.append(0.5)  # flaw in this as we had to choose one
    elif delivery == '0' and service == '1':
        y.append(0)
    elif delivery == '1' and service == '0':
        y.append(1)
    else:
        y.append('Needs Review')
import numpy as np
# indexes of unlabelled rows, removed from the end so earlier positions stay valid
del_idx = [i for i, label in enumerate(y) if label == 'Needs Review']
for item in sorted(del_idx, reverse=True):
    y = np.delete(y, item, axis=0)
    X = np.delete(X, item, axis=0)
from imblearn.over_sampling import ADASYN
ada = ADASYN(random_state=42)
X_ada, y_ada = ada.fit_resample(X, y)  # fit_sample was renamed fit_resample in newer imblearn
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed
X_train, X_test, y_train, y_test = train_test_split(X_ada, y_ada, test_size=0.25, random_state=0)
from sklearn.ensemble import RandomForestClassifier
classifier_10E = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier_10E.fit(X_train, y_train)
y_pred_10E = classifier_10E.predict(X_test)
I ran the code overnight (it runs quickly without the ADASYN step) and it didn't finish. I'm working with roughly 32,000 reviews. The point of running this is to create synthetic entries for the under-represented class in my sample ('Customer Service') so the random forest classifier is better trained. As it stands, I could blindly predict 'Delivery' for every review in my test data and be right over 90% of the time.
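The balancing idea described above - adding minority-class rows until the classes are even - can be sketched with plain NumPy. This is a toy illustration of naive random oversampling, which is what ADASYN automates and refines (ADASYN itself synthesises new points by interpolating between minority-class neighbours rather than copying rows); the 90/10 split and array shapes here are made up to mirror the question:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 90 'delivery' rows (label 1) and 10
# 'customer service' rows (label 0), mirroring the ~90/10 split.
X = rng.random((100, 5))
y = np.array([1] * 90 + [0] * 10)

# Naive random oversampling: resample minority rows (with replacement)
# until both classes have the same count.
minority = np.flatnonzero(y == 0)
extra = rng.choice(minority, size=90 - len(minority), replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # both classes now have 90 rows
```

Because copied (or synthesised) rows multiply the size of an already large dense matrix, the memory cost of `cv.fit_transform(corpus).toarray()` grows quickly at 32,000 reviews, which is one plausible reason the overnight run stalled.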
If anyone can point out what I'm doing wrong, or a better option in Python, I'd be very grateful.
Thanks