Why does a decision tree return different solutions for exactly the same training data?

Asked: 2018-01-14 17:35:16

Tags: python scikit-learn decision-tree

I'm trying out an ML example, and it works most of the time, but when I run the code repeatedly, Python starts spitting out different prediction results. Now, I'm no ML expert, but this seems wrong?

# Example file from Google Developers: "Hello World - Machine Learning Recipes": YouTube: https://youtu.be/cKxRvEZd3Mw
# Category: Supervised Learning                                                                               
# January 14, 2018                                                                                            
from sklearn import tree                                                                                      

# Declarations: Texture                                                                                        
bumpy = 0                                                                                                      
smooth = 1                                                                                                     

# Declarations: Labels                                                                                         
apple = 0                                                                                                      
orange = 1                                                                                                                                                                 

# Step(1): Collect training data                                                                               
# Features: [Weight, Texture]                                                                                  
features = [[140, smooth], [130, smooth], [150, bumpy], [170, bumpy]]                                          

# labels[i] is the class of the sample features[i]
labels = [apple, apple, orange, orange]                                                                        

# Step(2): Train Classifier: Decision Tree                                                                     
# Use the decision tree object and then fit (find) patterns in features and labels
clf = tree.DecisionTreeClassifier()                                                                            
clf = clf.fit(features, labels)                                                                                

# Step(3): Make Predictions                                                                                    
# the predict method will return the best fit from the decision tree
result = clf.predict([[150, bumpy], [130, smooth], [125.5, bumpy], [110, smooth]])                             
# result = clf.predict([[150, bumpy]])                                                                         
print("Step(3): Make Predictions: ")                                                                           
for x in result:
    if x == 0:
        print("Apple")
    elif x == 1:
        print("Orange")


1 Answer:

Answer 0 (score: 6):

There is an element of randomness to (most?) decision tree algorithms, and your very small training set probably exaggerates the effect. The randomness is typically used to decide how many/which features to consider at each split, and in your case there are very few samples to break ties with.

When you create the DecisionTreeClassifier, try setting random_state to some fixed integer. If you want a reproducible test result, you'll need to use the same seed value every time. The scikit-learn example docs use a random seed of zero:

clf = tree.DecisionTreeClassifier(random_state=0)
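A minimal sketch of the point above, reusing the question's toy fruit data: two classifiers built with the same fixed random_state produce identical trees, so repeated runs give the same predictions (the names clf_a and clf_b are just for illustration).

```python
from sklearn import tree

# Same toy data as the question: [weight, texture], texture 0 = bumpy, 1 = smooth
smooth, bumpy = 1, 0
features = [[140, smooth], [130, smooth], [150, bumpy], [170, bumpy]]
labels = [0, 0, 1, 1]  # 0 = apple, 1 = orange

# Fit the same model twice with a fixed seed; any tie-breaking between
# equally good splits is resolved the same way both times.
clf_a = tree.DecisionTreeClassifier(random_state=0).fit(features, labels)
clf_b = tree.DecisionTreeClassifier(random_state=0).fit(features, labels)

test_points = [[150, bumpy], [130, smooth], [125.5, bumpy], [110, smooth]]
print(list(clf_a.predict(test_points)))
print(list(clf_b.predict(test_points)))
```

Note that an ambiguous point like [125.5, bumpy] is exactly where the question's run-to-run differences came from: both "texture == smooth" and a weight threshold separate this training set perfectly, and which split the tree picks can depend on the seed. Fixing random_state pins that choice down.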