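How to tune random forest code for quality prediction

Time: 2019-05-10 05:01:51

Tags: r machine-learning statistics random-forest

I am new to machine learning. I have this dataset: http://archive.ics.uci.edu/ml/datasets/Wine+Quality. I have to predict the last column of the dataset, the wine quality. I considered using either a neural network or a random forest for this: the NN reaches about 55% accuracy, and so far the random forest reaches 73%. I would like to improve the accuracy further. Below is the code I wrote.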

wineq <- read.csv("wine-quality.csv",header = TRUE)
str(wineq)

wineq$taste <- ifelse(wineq$quality < 6, 'bad', 'good')
wineq$taste[wineq$quality == 6] <- 'normal'
wineq$taste <- as.factor(wineq$taste)
set.seed(54321)
train <- sample(1:nrow(wineq), .75 * nrow(wineq))
wineq_train <- wineq[train, ]
wineq_test  <- wineq[-train, ]

library(randomForest)

rf <- randomForest(taste ~ . - quality, data = wineq_train, importance = TRUE, ntree = 100)

rf_preds = predict(rf,wineq_test)
rf_preds
table(rf_preds, wineq_test$taste)

Output:

table(rf_preds, wineq_test$taste)

rf_preds bad good normal
  bad    302   11     81
  good     7  163     36
  normal  93  101    431
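
The accuracy quoted above can be read off this table. A minimal sketch in base R, with the counts re-entered by hand (rows are predictions, columns are the true classes); the names `cm`, `accuracy` and `recall` are introduced here only for illustration:

```r
# Confusion matrix from the output above: rows = predicted, columns = actual.
cm <- matrix(
  c(302,  11,  81,
      7, 163,  36,
     93, 101, 431),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = c("bad", "good", "normal"),
                  actual    = c("bad", "good", "normal"))
)

# Overall accuracy: correctly classified samples (the diagonal) over all samples.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions for a class over all true members of it.
recall <- diag(cm) / colSums(cm)

print(round(accuracy, 3))
print(round(recall, 3))
```

The accuracy comes out at 896 / 1225, roughly 0.731, which matches the 73% mentioned in the question. The per-class recall shows that "good" is the class the model misses most often.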

If I try to use tuneRF, I get the following error:

   fgl.res <- tuneRF(x = wineq[train, ], y= wineq[-train, ], 
   stepFactor=1.5)
  

Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry,  :
  length of response must be the same as predictors

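1 answer:

Answer 0 (score: 0)

You need to pass the feature variables to the x argument of tuneRF and the response variable to its y argument.

So first find the column position of your response variable (taste):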

resp_pos <- which(colnames(wineq) == "taste")

Then pass the training rows to both arguments, so that x and y have the same number of observations:

fgl.res <- tuneRF(x = wineq[train, -resp_pos], y = wineq[train, resp_pos],
   stepFactor=1.5)
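
A fuller sketch of the tuning step, offered as one possible way to write it rather than a definitive version. It assumes the randomForest package is installed and that `wineq` and `train` exist as built in the question. The helper name `tune_mtry` and its arguments `response` and `leak` are made up for this example. Besides the response, it also drops quality from x, because taste is derived from it.

```r
library(randomForest)

# Hypothetical helper (not part of randomForest): tune mtry on training rows only.
tune_mtry <- function(data, train_rows, response = "taste", leak = "quality") {
  # Columns that must not be used as predictors: the response itself and
  # the column it was derived from.
  drop_cols <- which(colnames(data) %in% c(response, leak))

  x <- data[train_rows, -drop_cols, drop = FALSE]  # predictors, training rows
  y <- data[train_rows, response]                  # response, the SAME rows

  # x and y must describe the same observations, otherwise randomForest
  # stops with "length of response must be the same as predictors".
  stopifnot(nrow(x) == length(y), is.factor(y))

  tuneRF(x = x, y = y, stepFactor = 1.5, plot = FALSE)
}

# Usage with the objects from the question:
# fgl.res <- tune_mtry(wineq, train)
# fgl.res   # matrix of the mtry values tried and their OOB error
```

The mtry value with the lowest OOB error in the returned matrix can then be passed to randomForest through its mtry argument.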

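I also noticed that you build a "new" response (taste) from the column quality, with wineq$taste <- ifelse(wineq$quality < 6, 'bad', 'good'). That is fine, but you need to remove the column quality before training.

If you don't, your model will be over-optimistic, because it will pick up rules such as:

quality < 6 always means taste == "bad"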
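
To see the leak concretely, cross-tabulate quality against the derived label. A small illustration in base R, using made-up quality scores (not the real data) and the same labelling rule as in the question; `leak_table` is a name chosen for this example:

```r
# Made-up quality scores, for illustration only (not taken from the dataset).
quality <- c(3, 4, 5, 5, 6, 6, 7, 8, 9)

# Same labelling rule as in the question.
taste <- ifelse(quality < 6, "bad", "good")
taste[quality == 6] <- "normal"
taste <- factor(taste)

# Each quality value falls into exactly one taste class, so a model that
# sees quality can reproduce taste without learning anything about the wine.
leak_table <- table(quality, taste)
print(leak_table)

# TRUE when every row of the table has a single non-zero cell.
all(rowSums(leak_table > 0) == 1)
```

Note that the formula taste ~ . - quality in the question already keeps quality out of the randomForest fit. The x/y interface of tuneRF has no formula, so there the column has to be dropped from x by hand.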