Decision Tree Example 4: Understanding the decision tree structure

http://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py

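Purpose of this example

This example takes a closer look at the internal structure of a decision tree. Analyzing that structure reveals how the features relate to the target, and how the tree uses that relationship to make predictions.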

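  1. A tree in which every node has at most two branches is called a binary tree.
  2. Compute the depth of each node and determine whether it is a leaf; in a binary tree, a node that ends a chain of tests and has no further split is called a leaf.
  3. Use decision_path to obtain the decision path of each sample.
  4. Use apply to obtain the leaf that each sample finally reaches; the prediction for the sample is read from that leaf.
  5. Once the tree is built, its rules can be used for prediction.
  6. For a group of samples, the nodes shared by all of their decision paths can be found.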

(1) Importing the libraries and test data

Import the libraries

  • load_iris loads the iris dataset.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

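Build the training set, the test set, and the decision tree classifier

  • X (feature data) and y (target data).
  • train_test_split(X, y, random_state) randomly splits the data into a training set and a test set.

    X is the feature data, y is the target data, and random_state seeds the random number generator so that the split is reproducible.
  • DecisionTreeClassifier(max_leaf_nodes, random_state) creates the decision tree classifier.

    max_leaf_nodes is the maximum number of leaf nodes. random_state, if given, seeds the random number generator; if it is omitted, np.random is used.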
  • fit(X, y) trains the classifier; X is the training feature data and y is the training target data.

    iris = load_iris()
    X = iris.data
    y = iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
    estimator.fit(X_train, y_train)
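As a quick sanity check on the setup above, the following sketch refits the same classifier and inspects the size of the split and of the resulting tree. It assumes the default test_size of train_test_split (25% of the samples) and a scikit-learn release recent enough to provide get_n_leaves().

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# Number of samples and features in each part of the split
print(X_train.shape, X_test.shape)

# Number of leaves and total number of nodes in the fitted tree
print(estimator.get_n_leaves())
print(estimator.tree_.node_count)
```

A binary tree with 3 leaves has 2 split nodes, so capping max_leaf_nodes at 3 gives at most 5 nodes in total.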

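(2) Exploring the decision tree structure

DecisionTreeClassifier has an attribute, tree_, that stores the entire tree structure.

The binary tree is represented as a number of parallel arrays. The i-th element of each array holds information about node i, and node 0 is the root of the tree.

Note that some of the arrays only apply to split nodes (and others only to leaves). In that case, the values stored for nodes of the other type are arbitrary.

The attributes used here are: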

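  1. node_count: the total number of nodes (a single integer rather than an array).
  2. children_left: the ID of each node's left child; -1 means the node has no children below it.
  3. children_right: the ID of each node's right child; -1 means the node has no children below it.
  4. feature: the index of the feature used to split each node; -2 means the node has no split.
  5. threshold: the split threshold of each node. A sample goes to the left child if its value for the node's feature is <= threshold, and to the right child otherwise. For a node without a split the stored value is a placeholder (-2).

    n_nodes = estimator.tree_.node_count
    children_left = estimator.tree_.children_left
    children_right = estimator.tree_.children_right
    feature = estimator.tree_.feature
    threshold = estimator.tree_.threshold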

The contents of these arrays are as follows:

    n_nodes = 5
    children_left [ 1 -1 3 -1 -1]
    children_right [ 2 -1 4 -1 -1]
    feature [ 3 -2 2 -2 -2]
    threshold [ 0.80000001 -2. 4.94999981 -2. -2. ]
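Because the arrays are parallel, the leaves can also be picked out with a single vectorized comparison. The sketch below is a minimal illustration that hard-codes the array values printed above instead of reading them from a fitted tree; the variable names is_leaf, leaf_ids, and split_ids are chosen for this illustration.

```python
import numpy as np

# Values copied from the printed arrays above
children_left = np.array([1, -1, 3, -1, -1])
children_right = np.array([2, -1, 4, -1, -1])
feature = np.array([3, -2, 2, -2, -2])

# A node is a leaf when it has no left child (-1 marks "no child")
is_leaf = children_left == -1

# In these arrays the same nodes also carry the placeholder -2 in `feature`
assert np.array_equal(is_leaf, feature == -2)

leaf_ids = np.flatnonzero(is_leaf)
split_ids = np.flatnonzero(~is_leaf)
print("leaves:", leaf_ids)
print("split nodes:", split_ids)
```

This only classifies the nodes; it does not give their depth, which still requires walking the tree from the root.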

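Various properties of the binary tree can be computed by traversing it, such as the depth of each node and whether the node is a leaf.

  • node_depth: the depth (level) of each node in the decision tree.
  • is_leaves: whether each node is a leaf of the decision tree.
  • stack: holds the nodes that have not yet been checked, each stored as a (node ID, parent depth) pair.

Pop one entry from stack and check whether the ID of the node's left child equals the ID of its right child.

If the two IDs differ, the node is a split node, so push both children onto stack. If they are equal (both are -1), the node has no children, so set its entry in is_leaves to True.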

    node_depth = np.zeros(shape=n_nodes)
    is_leaves = np.zeros(shape=n_nodes, dtype=bool)
    stack = [(0, -1)]  # seed: the root node id and its parent's depth
    while len(stack) > 0:
        node_id, parent_depth = stack.pop()
        node_depth[node_id] = parent_depth + 1

        # If we have a test node, its two children have different ids
        if children_left[node_id] != children_right[node_id]:
            stack.append((children_left[node_id], parent_depth + 1))
            stack.append((children_right[node_id], parent_depth + 1))
        else:
            # Both children are -1, so this node is a leaf
            is_leaves[node_id] = True

Execution trace

    stack len 1
    node_id 0 parent_depth -1
    node_depth [ 0. 0. 0. 0. 0.]
    stack [(1, 0), (2, 0)]

    stack len 2
    node_id 2 parent_depth 0
    node_depth [ 0. 0. 1. 0. 0.]
    stack [(1, 0), (3, 1), (4, 1)]

    stack len 3
    node_id 4 parent_depth 1
    node_depth [ 0. 0. 1. 0. 2.]
    stack [(1, 0), (3, 1)]

    stack len 2
    node_id 3 parent_depth 1
    node_depth [ 0. 0. 1. 2. 2.]
    stack [(1, 0)]

    stack len 1
    node_id 1 parent_depth 0
    node_depth [ 0. 1. 1. 2. 2.]
    stack []
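The traversal traced above can be packaged as a small reusable function. This is a minimal sketch in plain Python; the function name node_depths is hypothetical, and the two child arrays are hard-coded from the values shown earlier so that the block runs without a fitted tree.

```python
def node_depths(children_left, children_right):
    """Return (depths, is_leaf) lists for a tree stored as parallel arrays."""
    n_nodes = len(children_left)
    depths = [0] * n_nodes
    is_leaf = [False] * n_nodes
    stack = [(0, 0)]  # (node id, depth); the root has depth 0
    while stack:
        node_id, depth = stack.pop()
        depths[node_id] = depth
        left, right = children_left[node_id], children_right[node_id]
        if left != right:
            # Split node: visit both children one level deeper
            stack.append((left, depth + 1))
            stack.append((right, depth + 1))
        else:
            # left == right == -1, so the node has no children
            is_leaf[node_id] = True
    return depths, is_leaf


depths, is_leaf = node_depths([1, -1, 3, -1, -1], [2, -1, 4, -1, -1])
print(depths)   # [0, 1, 1, 2, 2]
print(is_leaf)  # [False, True, False, True, True]
```

Storing the node's own depth on the stack, rather than the parent's depth, removes the need for the -1 seed value; the result is the same.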

Figure 1. Ex 4: Understanding the decision tree structure

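The next part prints the decision tree structure programmatically. This decision tree has 5 nodes in total.

At a test node, the threshold decides which child node to move to next, and this repeats until a leaf is reached.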

    print("The binary tree structure has %s nodes and has "
          "the following tree structure:"
          % n_nodes)
    for i in range(n_nodes):
        # Indent each line with one "\t" per level of depth
        # (node_depth holds floats, so cast to int before repeating the string)
        indent = "\t" * int(node_depth[i])
        if is_leaves[i]:
            print("%snode=%s leaf node." % (indent, i))
        else:
            print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
                  "node %s."
                  % (indent,
                     i,
                     children_left[i],
                     feature[i],
                     threshold[i],
                     children_right[i],
                     ))

Output

    The binary tree structure has 5 nodes and has the following tree structure:
    node=0 test node: go to node 1 if X[:, 3] <= 0.800000011921 else to node 2.
        node=1 leaf node.
        node=2 test node: go to node 3 if X[:, 2] <= 4.94999980927 else to node 4.
            node=3 leaf node.
            node=4 leaf node.
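The rule printed above (go left when the feature value is <= the threshold, otherwise go right) can be sketched as a hand-written walk over the parallel arrays. The function name walk and the three sample vectors are hypothetical, and the thresholds are rounded to 0.8 and 4.95 for readability.

```python
# Arrays copied from the tree above (thresholds rounded)
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
feature = [3, -2, 2, -2, -2]
threshold = [0.8, -2.0, 4.95, -2.0, -2.0]


def walk(sample):
    """Return the list of node ids visited by `sample`, from root to leaf."""
    node = 0
    path = [node]
    while children_left[node] != -1:  # stop once a leaf is reached
        if sample[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
        path.append(node)
    return path


# Hypothetical samples: [sepal length, sepal width, petal length, petal width]
print(walk([5.0, 3.5, 1.4, 0.2]))  # [0, 1]
print(walk([6.0, 2.9, 4.5, 1.5]))  # [0, 2, 3]
print(walk([6.5, 3.0, 5.5, 2.0]))  # [0, 2, 4]
```

The last element of each path is the leaf the sample ends up in.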

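Next we explore the decision path of each sample. The decision_path method provides this information, and apply returns the ID of the leaf that each sample finally reaches.

Sample 0 is used as the example. indices stores, one sample after another, the IDs of the nodes that each sample passes through; indptr stores the positions in indices where each sample's entries begin and end; node_index therefore holds the IDs of the nodes visited by sample 0.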

    node_indicator = estimator.decision_path(X_test)

    # Similarly, we can also have the leaves ids reached by each sample.
    leave_id = estimator.apply(X_test)

    # Now, it's possible to get the tests that were used to predict a sample or
    # a group of samples. First, let's make it for the sample.
    sample_id = 0
    node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                        node_indicator.indptr[sample_id + 1]]
    print('node_index', node_index)

    print('Rules used to predict sample %s: ' % sample_id)
    for node_id in node_index:
        # Skip the leaf: it has no test to report
        if leave_id[sample_id] == node_id:
            continue

        if X_test[sample_id, feature[node_id]] <= threshold[node_id]:
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
              % (node_id,
                 sample_id,
                 feature[node_id],
                 X_test[sample_id, feature[node_id]],
                 threshold_sign,
                 threshold[node_id]))

Output

    node_index [0 2 4]
    Rules used to predict sample 0:
    decision id node 0 : (X[0, 3] (= 2.4) > 0.800000011921)
    decision id node 2 : (X[0, 2] (= 5.1) > 4.94999980927)
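The indices and indptr attributes belong to the compressed sparse row (CSR) format in which decision_path returns its result. The following sketch shows how the two arrays encode each row, using a small hand-written indicator matrix for two samples and five nodes rather than a fitted tree; the helper name row_nodes is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hand-written indicator matrix: rows are samples, columns are nodes,
# and a 1 means the sample passes through that node
dense = np.array([[1, 0, 1, 0, 1],
                  [1, 0, 1, 1, 0]])
indicator = csr_matrix(dense)

# Column ids of the non-zero entries, stored row after row
print(indicator.indices)  # [0 2 4 0 2 3]
# Row r occupies indices[indptr[r]:indptr[r + 1]]
print(indicator.indptr)   # [0 3 6]


def row_nodes(matrix, row):
    """Return the node ids (column ids) that are non-zero in one row."""
    start, stop = matrix.indptr[row], matrix.indptr[row + 1]
    return matrix.indices[start:stop]


print(row_nodes(indicator, 0))  # [0 2 4]
print(row_nodes(indicator, 1))  # [0 2 3]
```

Slicing indices this way avoids converting the whole sparse matrix to a dense array, which matters when the tree or the number of samples is large.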

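Next we examine whether several samples pass through the same nodes.

Samples 0 and 1 are used as the example. node_indicator.toarray() returns a matrix with one row per sample, in which 0 means the sample did not pass through the node and 1 means it did. common_nodes holds True and False values: if the column sum for a node equals the number of input samples, every one of those samples passed through that node.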

    # For a group of samples, we have the following common node.
    sample_ids = [0, 1]
    common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
                    len(sample_ids))
    print('node_indicator', node_indicator.toarray()[sample_ids])
    print('common_nodes', common_nodes)

    common_node_id = np.arange(n_nodes)[common_nodes]
    print('common_node_id', common_node_id)

    print("\nThe following samples %s share the node %s in the tree"
          % (sample_ids, common_node_id))
    print("It is %s %% of all nodes." % (100 * len(common_node_id) / n_nodes,))

Output

    node_indicator [[1 0 1 0 1]
     [1 0 1 1 0]]
    common_nodes [ True False True False False]
    common_node_id [0 2]

    The following samples [0, 1] share the node [0 2] in the tree
    It is 40.0 % of all nodes.
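The same test can be written as a logical "all" over the rows instead of comparing a column sum with the number of samples. This is a minimal sketch that hard-codes the two indicator rows printed above; the variable names rows and share are chosen for this illustration.

```python
import numpy as np

# Indicator rows for samples 0 and 1, copied from the output above
rows = np.array([[1, 0, 1, 0, 1],
                 [1, 0, 1, 1, 0]], dtype=bool)

# A node is common when every selected sample passes through it
common_nodes = rows.all(axis=0)
common_node_id = np.flatnonzero(common_nodes)

# Share of the tree's nodes that the samples have in common
share = 100 * len(common_node_id) / rows.shape[1]
print(common_node_id)  # [0 2]
print(share)           # 40.0
```

Both formulations give the same result for a 0/1 indicator matrix; the boolean version states the intent more directly.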

(3) Complete code

    import numpy as np

    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    X = iris.data
    y = iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
    estimator.fit(X_train, y_train)

    # The decision estimator has an attribute called tree_ which stores the entire
    # tree structure and allows access to low level attributes. The binary tree
    # tree_ is represented as a number of parallel arrays. The i-th element of each
    # array holds information about the node `i`. Node 0 is the tree's root. NOTE:
    # Some of the arrays only apply to either leaves or split nodes, resp. In this
    # case the values of nodes of the other type are arbitrary!
    #
    # Among those arrays, we have:
    #   - children_left, id of the left child of the node
    #   - children_right, id of the right child of the node
    #   - feature, feature used for splitting the node
    #   - threshold, threshold value at the node
    #
    # Using those arrays, we can parse the tree structure:
    n_nodes = estimator.tree_.node_count
    children_left = estimator.tree_.children_left
    children_right = estimator.tree_.children_right
    feature = estimator.tree_.feature
    threshold = estimator.tree_.threshold

    # The tree structure can be traversed to compute various properties such
    # as the depth of each node and whether or not it is a leaf.
    node_depth = np.zeros(shape=n_nodes)
    is_leaves = np.zeros(shape=n_nodes, dtype=bool)
    stack = [(0, -1)]  # seed is the root node id and its parent depth
    while len(stack) > 0:
        node_id, parent_depth = stack.pop()
        node_depth[node_id] = parent_depth + 1

        # If we have a test node
        if children_left[node_id] != children_right[node_id]:
            stack.append((children_left[node_id], parent_depth + 1))
            stack.append((children_right[node_id], parent_depth + 1))
        else:
            is_leaves[node_id] = True

    print("The binary tree structure has %s nodes and has "
          "the following tree structure:"
          % n_nodes)
    for i in range(n_nodes):
        # One "\t" per level of depth; node_depth holds floats, so cast to int
        indent = "\t" * int(node_depth[i])
        if is_leaves[i]:
            print("%snode=%s leaf node." % (indent, i))
        else:
            print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
                  "node %s."
                  % (indent,
                     i,
                     children_left[i],
                     feature[i],
                     threshold[i],
                     children_right[i],
                     ))
    print()

    # First let's retrieve the decision path of each sample. The decision_path
    # method allows to retrieve the node indicator functions. A non zero element of
    # indicator matrix at the position (i, j) indicates that the sample i goes
    # through the node j.
    node_indicator = estimator.decision_path(X_test)

    # Similarly, we can also have the leaves ids reached by each sample.
    leave_id = estimator.apply(X_test)

    # Now, it's possible to get the tests that were used to predict a sample or
    # a group of samples. First, let's make it for the sample.
    sample_id = 0
    node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                        node_indicator.indptr[sample_id + 1]]

    print('Rules used to predict sample %s: ' % sample_id)
    for node_id in node_index:
        # Skip the leaf: it has no test to report
        if leave_id[sample_id] == node_id:
            continue

        if X_test[sample_id, feature[node_id]] <= threshold[node_id]:
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
              % (node_id,
                 sample_id,
                 feature[node_id],
                 X_test[sample_id, feature[node_id]],
                 threshold_sign,
                 threshold[node_id]))

    # For a group of samples, we have the following common node.
    sample_ids = [0, 1]
    common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
                    len(sample_ids))

    common_node_id = np.arange(n_nodes)[common_nodes]

    print("\nThe following samples %s share the node %s in the tree"
          % (sample_ids, common_node_id))
    print("It is %s %% of all nodes." % (100 * len(common_node_id) / n_nodes,))
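As a cross-check on the hand-written printer, scikit-learn also ships a text exporter for fitted trees. The sketch below is an alternative to the loop above, not part of the original example; it assumes a scikit-learn release that provides sklearn.tree.export_text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# Render the split rules as indented text, one line per test or leaf
report = export_text(estimator, feature_names=list(iris.feature_names))
print(report)
```

The exporter prints feature names and predicted classes instead of node IDs, so it is convenient for reading the rules, while the parallel arrays remain the way to get at node IDs, depths, and decision paths.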