The order of data columns in caret affects the results

Date: 2016-04-25 17:52:32

Tags: r, r-caret

It seems that using the same data, but with the columns in a different order, changes the results.

Minimal, reproducible example:

library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing  <- Sonar[-inTraining,]
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)

set.seed(825)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1

The result:

Stochastic Gradient Boosting 

157 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa    
  1                   50      0.7609191  0.5163703
  1                  100      0.7934216  0.5817734
  1                  150      0.7977230  0.5897796
  2                   50      0.7858235  0.5667749
  2                  100      **0.8188897**  **0.6316548**
  2                  150      **0.8194363**  **0.6329037**
  3                   50      **0.7889436**  **0.5713790**
  3                  100      0.8130564  0.6195719
  3                  150      0.8221348  0.6383441

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were n.trees = 150, interaction.depth =
 3, shrinkage = 0.1 and n.minobsinnode = 10. 

Then I tried:

finalVars <- colnames(training)
# reorder columns
finalVars <- finalVars[order(finalVars)]

set.seed(825)
gbmFit1 <- train(Class ~ ., data = training[, finalVars],
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1

As the bold numbers show, a different column order produces different results:

Stochastic Gradient Boosting 

157 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa    
  1                   50      0.7609191  0.5163703
  1                  100      0.7934216  0.5817734
  1                  150      0.7977230  0.5897796
  2                   50      0.7858235  0.5669550
  2                  100      **0.8194779**  **0.6331626**
  2                  150      **0.8207279**  **0.6354601**
  3                   50      **0.7946936**  **0.5831441**
  3                  100      0.8130564  0.6195719
  3                  150      0.8220931  0.6381234

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were n.trees = 150, interaction.depth =
 3, shrinkage = 0.1 and n.minobsinnode = 10. 

This issue also applies to several other models I checked: rpart and C5.0. Does anyone know why this happens?

1 answer:

Answer 0 (score: 0)

你不是要发现不同的结果,而是使用&#34; gbm&#34;算法本身。在&#34; gbm&#34;中,重新排序列与更改种子非常相似。