Decision Trees with sklearn and Visualization

Time: 2016-12-15 14:16:10

Tags: python tree visualization decision-tree kaggle

I'm working on the Kaggle Titanic dataset. I'm trying to get a better understanding of decision trees; I've used linear regression a bit, but never decision trees. I'm trying to create a visualization of my tree in Python, but something isn't working. Check the code below.

import pandas as pd
from sklearn import tree
from sklearn.datasets import load_iris  # not actually used below
import numpy as np


train_file = '.......\RUN.csv'
train = pd.read_csv(train_file)

# Encode categorical columns as numbers and fill in missing values
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1
train["Embarked"] = train["Embarked"].fillna("S")
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Pclass"] = train["Pclass"].fillna(train["Pclass"].median())
train["Fare"] = train["Fare"].fillna(train["Fare"].median())

target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "Embarked"]].values


# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier(max_depth=10, min_samples_split=5, random_state=1)

iris = load_iris()  # leftover from experimenting with the iris dataset; not used

my_tree_one = my_tree_one.fit(features_one, target)

# Export the fitted tree in Graphviz DOT format
tree.export_graphviz(my_tree_one, out_file='tree.dot')

How do I actually see the decision tree? I'm trying to visualize it.

Any help is appreciated!

3 answers:

Answer 0 (score: 2)

Have you checked http://scikit-learn.org/stable/modules/tree.html? It mentions how to render the tree as a PNG image:

 from IPython.display import Image
 import pydotplus

 # Use out_file=None so export_graphviz returns the DOT source as a string;
 # with out_file='tree.dot' it writes to disk and returns None instead.
 dot_data = tree.export_graphviz(my_tree_one, out_file=None)
 graph = pydotplus.graph_from_dot_data(dot_data)
 Image(graph.create_png())
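If you would rather keep writing tree.dot to disk, as in the question, another option is to render that file with the Graphviz command-line tools. This is a minimal sketch, assuming Graphviz is installed and the dot executable is on your PATH; the file names tree.dot and tree.png are the ones used above:

import subprocess

# Render the DOT file written by export_graphviz into a PNG image
subprocess.check_call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png'])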

Answer 1 (score: 0)

From Wikipedia:

  The DOT language defines a graph, but does not provide facilities for rendering the graph. There are several programs that can be used to render, view, and manipulate graphs in the DOT language:

  - Graphviz - a collection of libraries and utilities to manipulate and render graphs
  - Canviz - a JavaScript library for rendering dot files.
  - Viz.js - a simple Graphviz JavaScript client
  - Grappa - a partial port of Graphviz to Java.[4][5]
  - Beluging - a Python- and Google Cloud-based viewer of DOT and its Beluga extension.[1]
  - Tulip can import dot files for analysis
  - OmniGraffle can import a subset of DOT, producing an editable document. (The result cannot be exported back to DOT, however.)
  - ZGRViewer, a GraphViz/DOT viewer
  - VizierFX, a Flex graph rendering library
  - Gephi - an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs

So any of these programs will be able to visualize your tree.
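For example, the graphviz Python package (a thin wrapper around the Graphviz tools listed above) can render the exported file directly. A minimal sketch, assuming both the package and the Graphviz binaries are installed, and that tree.dot is the file produced by export_graphviz in the question:

import graphviz

# Read the DOT source written by export_graphviz and render it as tree.png
with open('tree.dot') as f:
    dot_source = f.read()

graphviz.Source(dot_source).render('tree', format='png', cleanup=True)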

Answer 2 (score: 0)

I did the visualization with bar plots. The first plot shows the distribution of the classes. The first title states the first split criterion: all data that satisfy this criterion go to the underlying subplot on the left, and the rest go to the one on the right. So each title states the split criterion for the next split.

The percentages are relative to the initial distribution, so by looking at them you can easily tell how much of the initial data is left after each split.

Note that if you set max_depth high, this requires a lot of subplots (max_depth rows by 2^max_depth columns).

[Image: Tree visualization using bar plots]

Code:

import numpy as np

def give_nodes(nodes, amount_of_branches, left, right):
    # Replace each node by its left and right children, doubling the branch count
    amount_of_branches *= 2
    nodes_splits = []
    for node in nodes:
        nodes_splits.append(left[node])
        nodes_splits.append(right[node])
    return (nodes_splits, amount_of_branches)

def plot_tree(tree, feature_names):
    from matplotlib import gridspec
    import matplotlib.pyplot as plt
    import pylab

    color = plt.cm.coolwarm(np.linspace(1, 0, len(feature_names)))

    plt.rc('text', usetex=True)  # requires a working LaTeX installation
    plt.rc('font', family='sans-serif')
    plt.rc('font', size=14)

    params = {'legend.fontsize': 20,
              'axes.labelsize': 20,
              'axes.titlesize': 25,
              'xtick.labelsize': 20,
              'ytick.labelsize': 20}
    plt.rcParams.update(params)

    max_depth = tree.max_depth
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    value = tree.tree_.value

    fig = plt.figure(figsize=(3 * 2**max_depth, 2 * 2**max_depth))
    gs = gridspec.GridSpec(max_depth, 2**max_depth)
    plt.subplots_adjust(hspace=0.6, wspace=0.8)

    # Root node: plot the class distribution of all the data
    amount_of_branches = 1
    nodes = [0]
    normalize = np.sum(value[0][0])

    for i, node in enumerate(nodes):
        # Integer division keeps the GridSpec slice bounds as ints (Python 3)
        ax = fig.add_subplot(gs[0, (2**max_depth * i) // amount_of_branches:
                                   (2**max_depth * (i + 1)) // amount_of_branches])
        ax.set_title(features[node] + "$<= " + str(threshold[node]) + "$")
        if i == 0: ax.set_ylabel(r'$\%$')
        ind = np.arange(1, len(value[node][0]) + 1, 1)
        width = 0.2
        bars = (np.array(value[node][0]) / normalize) * 100
        plt.bar(ind - width / 2, bars, width, color=color, alpha=1, linewidth=0)
        plt.xticks(ind, [int(i) for i in ind - 1])
        pylab.ticklabel_format(axis='y', style='sci', scilimits=(0, 2))

    # Splits: one row of subplots per tree level
    for j in range(1, max_depth):
        nodes, amount_of_branches = give_nodes(nodes, amount_of_branches, left, right)
        for i, node in enumerate(nodes):
            ax = fig.add_subplot(gs[j, (2**max_depth * i) // amount_of_branches:
                                       (2**max_depth * (i + 1)) // amount_of_branches])
            ax.set_title(features[node] + "$<= " + str(threshold[node]) + "$")
            if i == 0: ax.set_ylabel(r'$\%$')
            ind = np.arange(1, len(value[node][0]) + 1, 1)
            width = 0.2
            bars = (np.array(value[node][0]) / normalize) * 100
            plt.bar(ind - width / 2, bars, width, color=color, alpha=1, linewidth=0)
            plt.xticks(ind, [int(i) for i in ind - 1])
            pylab.ticklabel_format(axis='y', style='sci', scilimits=(0, 2))


    plt.tight_layout()
    return fig

Example:

X = []
Y = []
amount_of_labels = 5
feature_names = ['$x_1$', '$x_2$', '$x_3$', '$x_4$', '$x_5$']
# Random toy data: 200 samples with 3 numeric features and a random label
for i in range(200):
    X.append([np.random.normal(), np.random.randint(0, 100), np.random.uniform(200, 500)])
    Y.append(np.random.randint(0, amount_of_labels))

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf = clf.fit(X, Y)
fig = plot_tree(clf, feature_names)
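To view the result outside an interactive session, the returned figure can be saved to disk; a minimal sketch (the file name tree_bars.png is just an example):

# Save the bar-plot visualization returned by plot_tree
fig.savefig('tree_bars.png', bbox_inches='tight')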