Classification using decision trees¶
Decision trees are among the most popular classification methods. One of the key properties that makes tree-based classifiers popular is their interpretability. A tree is a well-understood and at the same time easily presentable structure, which is useful whenever we need to explain or present the internal structure of the model (and therefore also how the model "arrived" at a given result).
The decision tree classifier is implemented in the Scikit-learn library by the DecisionTreeClassifier class.
Although many decision tree induction algorithms can work with categorical attributes, the Scikit-learn implementation of this classifier unfortunately cannot handle categorical variables directly. The data must therefore be pre-processed by transforming categorical attributes into numerical ones. It is recommended to transform all nominal (unordered) attributes using one-hot encoding.
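For illustration, the same transformation could also be done with the OneHotEncoder class from Scikit-learn; a minimal sketch on a hypothetical toy column (the exercise below uses pandas.get_dummies instead):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# hypothetical toy column - only to illustrate the encoding itself
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})
encoder = OneHotEncoder()                            # one binary column per category
encoded = encoder.fit_transform(df[["embarked"]])    # returns a sparse matrix
print(encoder.categories_)                           # order of the generated columns
print(encoded.toarray())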
In addition, in the case of decision trees we can remove irrelevant and redundant attributes from the dataset. For the Titanic dataset we can therefore remove the attributes age_ordinal and fare_ordinal, since they were derived from the existing attributes age and fare.
Decision trees, however, do not require attribute normalization.
# Titanic import and preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
titanic = pd.read_csv("../data/titanic-processed.csv")
titanic = titanic.drop(columns=['cabin','ticket','title','deck','fare_ordinal','age_ordinal'])
titanic['sex'] = titanic['sex'].map({"male": 0, "female": 1})
titanic['has_family'] = titanic['has_family'].map({False: 0, True: 1})
titanic = pd.get_dummies(titanic, columns=['embarked', 'title_short'])
titanic.head()
X_titanic = titanic.drop('survived', axis=1) # create feature matrix - use all columns except the target attribute and store in X_titanic
y_titanic = titanic['survived'] # vector of target attribute values as 'survived' column
print(X_titanic.shape) # check the dimensions of the matrix of values and the vector of the target attribute
print(y_titanic.shape)
from sklearn.model_selection import train_test_split # import train_test_split() function
X_train, X_test, y_train, y_test = train_test_split(X_titanic, y_titanic, test_size=0.3, random_state=1) # split the dataset into training and testing parts, so that the testing part will be 30% of the total dataset
We then train the decision tree classifier in the same way as the k-NN model. We will use the DecisionTreeClassifier class, initialize the model (optionally setting its parameters) and train it on the training data.
When training the tree model, we can set the following parameters (an illustrative initialization is sketched after the list):
- criterion - criterion for selecting an attribute: "gini" or "entropy"
- max_depth - maximum depth of the tree (if set to None, the complete tree is expanded)
- min_samples_split - the smallest number of samples needed for node branching
- min_samples_leaf - the smallest possible number of examples in a leaf node
- presort - True/False - pre-sort the data to speed up training (note: this parameter has been deprecated and removed in newer versions of Scikit-learn)
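An illustrative initialization with some of these parameters might look as follows (the values are only examples, not a recommended setting):
from sklearn.tree import DecisionTreeClassifier
# illustrative values only - suitable settings depend on the data
dt_example = DecisionTreeClassifier(criterion="entropy",    # attribute selection criterion
                                    max_depth=4,             # limit the depth of the tree
                                    min_samples_split=10,    # min. number of samples needed to split a node
                                    min_samples_leaf=5)      # min. number of samples required in a leaf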
from sklearn.tree import DecisionTreeClassifier # functions import
from sklearn.metrics import confusion_matrix
dt = DecisionTreeClassifier() # initialize the tree
dt.fit(X_train, y_train) # train model on training set
y_dt = dt.predict(X_test) # test model on testing set
from sklearn.metrics import accuracy_score,precision_score, recall_score # metric computation
print(f"Presnosť (accuracy) modelu: {accuracy_score(y_test, y_dt)}")
print(f"Presnosť (precision) modelu: {precision_score(y_test, y_dt)}")
print(f"Návratnosť (recall) modelu: {recall_score(y_test, y_dt)}")
cm = confusion_matrix(y_test, y_dt) # write confusion matrix
print(cm)
Tree model display¶
As we mentioned at the beginning, a tree model can be visualized, which is essential for understanding the model and how it works. On the other hand, the complexity of the model itself plays a key role: excessively branched trees with a very rich structure are hard to read and confusing, and this benefit is lost.
Let's see what the decision tree looks like for a tree classifier with default settings, trained on the training data of the Titanic dataset.
There are several ways to visualize tree models in Python and the Scikit-learn library. Most of them require the installation of external programs such as GraphViz or various other modules, so for the purposes of this exercise we will only export the tree to a file in GraphViz format. We can then open the created file (in the GraphViz .dot format) and view the tree in the web version of the GraphViz application, available at www.webgraphviz.com.
To export the trained decision tree, we use the Scikit-learn function export_graphviz(). As parameters we specify the tree model we want to render, feature_names containing the list of attribute headers (for rendering the nodes), class_names containing the list of values of the target attribute, and the output file in which the visualization is saved.
from sklearn import tree
from sklearn.tree import export_graphviz
with open("decision_tree.txt", "w") as f:
f = tree.export_graphviz(dt, feature_names=X_titanic.columns.values, class_names=['0','1'], out_file=f)
After running this cell, you will see the decision_tree.txt file in the file browser on the left in JupyterLab. Open it (it can also be opened directly in the Jupyter environment) and copy its entire contents into the window of the web application www.webgraphviz.com. Pressing the Generate Graph button there generates a visualization of the decision tree. Explore it.
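As an alternative that does not require any external tools, newer versions of Scikit-learn (0.21 and later) also provide the plot_tree function, which renders the tree directly with matplotlib; a minimal sketch using the dt model trained above:
from sklearn import tree
import matplotlib.pyplot as plt
# render the trained tree directly in the notebook (no GraphViz needed)
plt.figure(figsize=(20, 10))
tree.plot_tree(dt,
               feature_names=X_titanic.columns.values,
               class_names=['0', '1'],
               filled=True)      # colour the nodes according to the majority class
plt.show()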
Task 7.2.¶
If the resulting tree is too complex and unreadable, which of the parameters would you use, and with what value, to obtain a more general and readable tree? Edit the model training code, set the identified parameter and train such a model. Then visualize it and compare it with the model trained without any parameter settings.
Model representation using rules¶
In addition to visualization, decision trees allow another way of presenting the structure of the model: generating rules from the structure of the tree. Such rules have the form if condition then conclusion. They can be extracted directly from the tree structure and correspond to the individual tests (conditions) and branches (conclusions).
Below is the code of the tree_to_code function, which we can use to transform the tree structure into rules. The function indents the text of the rule parts to improve the readability of the output.
Run the function on the created tree model and compare the extracted rules with the tree structure:
from sklearn.tree import _tree
def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))
    def recurse(node, depth):
        indent = " " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else: # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))
    recurse(0, 1)
# call tree_to_code on the trained model dt, the feature names are X_titanic.columns.values
tree_to_code(dt, X_titanic.columns.values)
We can also calculate the importance of the attributes for classification separately. We can use, for example, SelectKBest, which filters out the attributes most important for model building. We can also use it just to compute the importance scores of the attributes and make the selection manually.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
fs = SelectKBest(score_func=mutual_info_classif, k='all')
fs.fit(X_train, y_train)
We can then print the importance scores of the attributes or plot them in a graph.
for i in range(len(fs.scores_)):
    print('Attribute %d: %f' % (i, fs.scores_[i]))
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()
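A trained decision tree also exposes its own importance scores through the feature_importances_ attribute. A short sketch that pairs both kinds of scores with the attribute names (assuming the fs and dt objects created above):
import pandas as pd
# pair both importance measures with the attribute names for easier reading
importances = pd.DataFrame({"mutual_info": fs.scores_,
                            "tree_importance": dt.feature_importances_},
                           index=X_titanic.columns)
print(importances.sort_values("tree_importance", ascending=False))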
Task 7.3.¶
Use Grid Search to find the optimal combination of decision tree model parameters on the Titanic data. In the Grid Search, use 5-fold cross-validation and accuracy as the evaluation metric. Identify the best model and report its accuracy and recall metrics (also print the confusion matrix).
# YOUR CODE HERE
Task 7.4.¶
From the Grid Search results, determine and plot how the model accuracy depends on the maximum tree depth parameter. Use matplotlib for the visualization.
# YOUR CODE HERE