Random forest classifier accuracy never exceeds 50%

Time: 2018-11-12 19:08:43

Tags: python machine-learning scikit-learn random-forest

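I am very new to machine learning, and I am trying to classify the UCI Heart Disease Dataset using sklearn's random forest classifier. My approach is very basic, and I would like to know how I can improve the accuracy (any tips, links, etc.). No matter what I try, my accuracy tops out at about 50%. Here is my code: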

import pandas as pd
import numpy as np
import random as random
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_excel('/Users/Mady/Documents/ClevelandData.xlsx')
df.replace('?', -99999, inplace=True)

labels = df.iloc[:,-1]
labels = labels.values

df.drop(df.columns[len(df.columns)-1], axis=1, inplace=True)
riskFactors = df.values

random.seed(123)
random.shuffle(labels)
random.seed(123)
random.shuffle(riskFactors)

labels_train = labels[:(int(len(labels) * 0.8))]
labels_test = labels[(int(len(labels) * 0.8)):]

riskFactors_train = riskFactors[:(int(len(riskFactors) * 0.8))]
riskFactors_test = riskFactors[(int(len(riskFactors) * 0.8)):]

model = RandomForestClassifier(n_estimators = 1000)
model.fit(riskFactors_train,labels_train)
predicted_labels = model.predict(riskFactors_test)
acc = accuracy_score(labels_test,predicted_labels)
print(acc)

1 Answer:

Answer 0 (score: 0)

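I solved this by removing the random (manual shuffling) part, since something was definitely going wrong there. As 张玉林 suggested, I used the train_test_split function provided by sklearn instead.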
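
For anyone hitting the same thing, here is a minimal sketch of what was probably going wrong and what the fix could look like. It is not the exact code from the question: the data is a synthetic stand-in generated with make_classification (an assumption, since the original spreadsheet is not available here), and the variable names and the n_estimators=100 setting are only illustrative. As far as I can tell, random.shuffle swaps items with a tuple assignment, and because the rows of a 2-D NumPy array are views, that swap overwrites rows instead of exchanging them. The 1-D label array is shuffled correctly, so features and labels no longer line up.

```python
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease table (assumption: the real
# features and labels would come from the spreadsheet instead).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Reproduce the suspected problem: random.shuffle on a 2-D array
# overwrites rows, so some original rows disappear and others are duplicated.
corrupted = X.copy()
random.seed(123)
random.shuffle(corrupted)
n_unique_rows = len(np.unique(corrupted, axis=0))
print(n_unique_rows, "distinct rows left out of", len(X))

# The fix: let train_test_split shuffle features and labels together,
# so every row keeps its own label. stratify=y keeps class proportions
# similar in both parts (optional, added here for illustration).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=123)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

With the real data, X and y would be the feature matrix and the label column from the DataFrame. The '?' placeholders would probably still be better handled as missing values than replaced with -99999, but that is a separate issue from the shuffling one.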