I'm currently working on a project where I scrape retailer-related reviews from a review site. The aim is to classify each review in the dataset by topic, 'Delivery' or 'Customer Service', using a random forest classifier.
After inspecting the dataset, over 90% of the reviews (both training and test data) relate to 'Delivery'. My lecturer told me we need to account for sample bias. I've researched this and tried to implement a correction in the Python below using ADASYN (near the bottom of the code):
import pandas as pd
# Read the file in chunks; it was read with header=None, so the original
# header line arrives as the first data row and is dropped with iloc[1:]
chunksize = 10
TextFileReader = pd.read_csv('TestToSentimentAnalyse.csv', chunksize=chunksize, header=None)
dataset = pd.concat(TextFileReader, ignore_index=False)
dataset.columns = ['Reviews', 'Delivery', 'Customer_Service', 'Purchase_Date', 'Likelihood_to_Recommend',
                   'Overall_Satisfaction', 'Location', 'Date_Published', 'Sentiment']
dataset = dataset.iloc[1:]
# Cleaning the texts
import re
corpus = []
for i in range(1, 29779):
    # strip non-letters and lower-case each review before vectorising
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['Reviews'][i])).lower()
    corpus.append(review)
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
# Set up dependent variable - in the branches below delivery is 1, customer service is 0
y = []
for i in range(1, 29779):
    # normalise to strings once, so int and string values take the same branch
    delivery = str(dataset['Delivery'][i])
    service = str(dataset['Customer_Service'][i])
    if delivery == '2':
        y.append(1)
    elif service == '2':
        y.append(0)
    elif delivery == '0' and service == '0':
        y.append(0.5)  # flaw in this as we had to choose one
    elif delivery == '1' and service == '1':
        y.append(0.5)  # flaw in this as we had to choose one
    elif delivery == '0' and service == '1':
        y.append(0)
    elif delivery == '1' and service == '0':
        y.append(1)
    else:
        y.append('Needs Review')
import numpy as np
# indexes of unlabelled rows, removed from the end so earlier positions stay valid
del_idx = [i for i, label in enumerate(y) if label == 'Needs Review']
for item in sorted(del_idx, reverse=True):
    y = np.delete(y, item, axis=0)
    X = np.delete(X, item, axis=0)
from imblearn.over_sampling import ADASYN
ada = ADASYN(random_state=42)
X_ada, y_ada = ada.fit_resample(X, y)  # fit_sample was renamed fit_resample in newer imblearn
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed
X_train, X_test, y_train, y_test = train_test_split(X_ada, y_ada, test_size=0.25, random_state=0)
from sklearn.ensemble import RandomForestClassifier
classifier_10E = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier_10E.fit(X_train, y_train)
y_pred_10E = classifier_10E.predict(X_test)
I ran the code overnight (it runs quickly without the ADASYN step) and it didn't finish. I'm working with roughly 32,000 reviews. The point of running this is to create synthetic entries for the under-represented class in my sample ('Customer Service') so the random forest classifier is better trained. As it stands, I could blindly predict 'Delivery' for every review in my test data and be right over 90% of the time.
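The balancing idea described above - adding minority-class rows until the classes are even - can be sketched with plain NumPy. This is a toy illustration of naive random oversampling, which is what ADASYN automates and refines (ADASYN itself synthesises new points by interpolating between minority-class neighbours rather than copying rows); the 90/10 split and array shapes here are made up to mirror the question:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 90 'delivery' rows (label 1) and 10
# 'customer service' rows (label 0), mirroring the ~90/10 split.
X = rng.random((100, 5))
y = np.array([1] * 90 + [0] * 10)

# Naive random oversampling: resample minority rows (with replacement)
# until both classes have the same count.
minority = np.flatnonzero(y == 0)
extra = rng.choice(minority, size=90 - len(minority), replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # both classes now have 90 rows
```

Because copied (or synthesised) rows multiply the size of an already large dense matrix, the memory cost of `cv.fit_transform(corpus).toarray()` grows quickly at 32,000 reviews, which is one plausible reason the overnight run stalled.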
If anyone can point out what I'm doing wrong, or a better option in Python, I'd be very grateful.
Thanks