Classification using decision trees¶
Decision trees are among the most popular classification methods. One of the key properties that makes tree-based classifiers popular is their interpretability. A tree is a well-understood and at the same time easily presentable structure, which is useful whenever we need to explain or present the internal structure of the model (and therefore also how the model "arrived" at a given result).
The decision tree classifier is implemented in the Scikit-learn library by the DecisionTreeClassifier class.
Although many decision tree induction algorithms can work with categorical attributes, the Scikit-learn implementation of this classifier unfortunately cannot handle categorical variables directly. The data must therefore be pre-processed by transforming categorical attributes into numerical ones. It is recommended to transform all nominal (unordered) attributes using one-hot encoding.
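For illustration, the same transformation could also be done with the OneHotEncoder class from Scikit-learn; a minimal sketch on a hypothetical toy column (the exercise below uses pandas.get_dummies instead):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# hypothetical toy column - only to illustrate the encoding itself
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})
encoder = OneHotEncoder()                            # one binary column per category
encoded = encoder.fit_transform(df[["embarked"]])    # returns a sparse matrix
print(encoder.categories_)                           # order of the generated columns
print(encoded.toarray())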
In addition, in the case of decision trees we can remove irrelevant and redundant attributes from the dataset. For the Titanic dataset we can therefore remove the attributes age_ordinal and fare_ordinal, since they were derived from the existing attributes age and fare.
Decision trees, however, do not require attribute normalization.
# Titanic import and preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
titanic = pd.read_csv("../data/titanic-processed.csv")
titanic = titanic.drop(columns=['cabin','ticket','title','deck','fare_ordinal','age_ordinal'])
titanic['sex'] = titanic['sex'].map({"male": 0, "female": 1})
titanic['has_family'] = titanic['has_family'].map({False: 0, True: 1})
titanic = pd.get_dummies(titanic, columns=['embarked', 'title_short'])
titanic.head()
X_titanic = titanic.drop('survived', axis=1) # create feature matrix - use all columns except the target attribute and store in X_titanic
y_titanic = titanic['survived'] # vector of target attribute values as 'survived' column
print(X_titanic.shape) # check the dimensions of the matrix of values and the vector of the target attribute
print(y_titanic.shape)
from sklearn.model_selection import train_test_split # import train_test_split() function
X_train, X_test, y_train, y_test = train_test_split(X_titanic, y_titanic, test_size=0.3, random_state=1) # split the dataset into training and testing parts, so that the testing part will be 30% of the total dataset
We then train the decision tree classifier in the same way as the k-NN model. We will use the DecisionTreeClassifier class, initialize the model (optionally setting its parameters) and train it on the training data.
When training the tree model, we can set the following parameters (an illustrative initialization is sketched after the list):
- criterion - criterion for selecting an attribute: "gini" or "entropy"
- max_depth - maximum depth of the tree (if set to None, the complete tree is expanded)
- min_samples_split - the smallest number of samples needed for node branching
- min_samples_leaf - the smallest possible number of examples in a leaf node
- presort - True/False - pre-sort the data to speed up training (note: this parameter has been deprecated and removed in newer versions of Scikit-learn)
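An illustrative initialization with some of these parameters might look as follows (the values are only examples, not a recommended setting):
from sklearn.tree import DecisionTreeClassifier
# illustrative values only - suitable settings depend on the data
dt_example = DecisionTreeClassifier(criterion="entropy",    # attribute selection criterion
                                    max_depth=4,             # limit the depth of the tree
                                    min_samples_split=10,    # min. number of samples needed to split a node
                                    min_samples_leaf=5)      # min. number of samples required in a leaf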
from sklearn.tree import DecisionTreeClassifier # functions import
from sklearn.metrics import confusion_matrix
dt = DecisionTreeClassifier() # initialize the tree
dt.fit(X_train, y_train) # train model on training set
y_dt = dt.predict(X_test) # test model on testing set
from sklearn.metrics import accuracy_score,precision_score, recall_score # metric computation
print(f"Presnosť (accuracy) modelu: {accuracy_score(y_test, y_dt)}")
print(f"Presnosť (precision) modelu: {precision_score(y_test, y_dt)}")
print(f"Návratnosť (recall) modelu: {recall_score(y_test, y_dt)}")
cm = confusion_matrix(y_test, y_dt) # write confusion matrix
print(cm)
Tree model display¶
As we mentioned at the beginning, a tree model can be visualized, which is essential for understanding the model and how it works. On the other hand, the complexity of the model itself plays a key role: excessively branched trees with a very rich structure are hard to read and confusing, and this benefit is lost.
Let's see what the decision tree looks like for a tree classifier with default settings, trained on the training data of the Titanic dataset.
There are several ways to visualize tree models in Python and the Scikit-learn library. Most of them require the installation of external programs such as GraphViz or various other modules, so for the purposes of this exercise we will only export the tree to a file in GraphViz format. We can then open the created file (in the GraphViz .dot format) and view the tree in the web version of the GraphViz application, available at www.webgraphviz.com.
To export the trained decision tree, we use the Scikit-learn function export_graphviz(). As parameters we specify the tree model we want to render, feature_names containing the list of attribute headers (for rendering the nodes), class_names containing the list of values of the target attribute, and the output file in which the visualization is saved.
from sklearn import tree
from sklearn.tree import export_graphviz
with open("decision_tree.txt", "w") as f:
f = tree.export_graphviz(dt, feature_names=X_titanic.columns.values, class_names=['0','1'], out_file=f)
After running this cell, you will see the decision_tree.txt file in the file browser on the left in JupyterLab. Open it (it can also be opened directly in the Jupyter environment) and copy its entire contents into the window of the web application www.webgraphviz.com. Pressing the Generate Graph button there generates a visualization of the decision tree. Explore it.
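As an alternative that does not require any external tools, newer versions of Scikit-learn (0.21 and later) also provide the plot_tree function, which renders the tree directly with matplotlib; a minimal sketch using the dt model trained above:
from sklearn import tree
import matplotlib.pyplot as plt
# render the trained tree directly in the notebook (no GraphViz needed)
plt.figure(figsize=(20, 10))
tree.plot_tree(dt,
               feature_names=X_titanic.columns.values,
               class_names=['0', '1'],
               filled=True)      # colour the nodes according to the majority class
plt.show()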
Task 7.2.¶
If the resulting tree is too complex and unreadable, which of the parameters would you use, and with what value, to obtain a more general and readable tree? Edit the model training code, set the identified parameter and train such a model. Then visualize it and compare it with the model trained without any parameter settings.
Model representation using rules¶
In addition to visualization, decision trees allow another way of presenting the structure of the model: generating rules from the structure of the tree. Such rules have the form if condition then conclusion. They can be extracted directly from the tree structure and correspond to the individual tests (conditions) and branches (conclusions).
Below is the code of the tree_to_code function, which we can use to transform the tree structure into rules. The function indents the text of the rule parts to improve the readability of the output.
Run the function on the created tree model and compare the extracted rules with the tree structure:
from sklearn.tree import _tree
def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))
    def recurse(node, depth):
        indent = " " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else: # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))
    recurse(0, 1)
# call tree_to_code on the trained model dt, the feature names are X_titanic.columns.values
tree_to_code(dt, X_titanic.columns.values)
We can also calculate the importance of the attributes for classification separately. We can use, for example, SelectKBest, which filters out the attributes most important for model building. We can also use it just to compute the importance scores of the attributes and make the selection manually.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
fs = SelectKBest(score_func=mutual_info_classif, k='all')
fs.fit(X_train, y_train)
We can then print the importance scores of the attributes or plot them in a graph.
for i in range(len(fs.scores_)):
    print('Attribute %d: %f' % (i, fs.scores_[i]))
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()
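A trained decision tree also exposes its own importance scores through the feature_importances_ attribute. A short sketch that pairs both kinds of scores with the attribute names (assuming the fs and dt objects created above):
import pandas as pd
# pair both importance measures with the attribute names for easier reading
importances = pd.DataFrame({"mutual_info": fs.scores_,
                            "tree_importance": dt.feature_importances_},
                           index=X_titanic.columns)
print(importances.sort_values("tree_importance", ascending=False))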
Task 7.3.¶
Use Grid Search to find the optimal combination of decision tree model parameters on the Titanic data. In the Grid Search, use 5-fold cross-validation and accuracy as the evaluation metric. Identify the best model and report its accuracy and recall metrics (also print the confusion matrix).
# YOUR CODE HERE
Task 7.4.¶
From the Grid Search results, determine and plot how the model accuracy depends on the maximum tree depth parameter. Use matplotlib for the visualization.
# YOUR CODE HERE