Classification using composite models - Random Forests
Random Forests are currently one of the most widely used classification models. Scikit-learn provides an implementation of this algorithm in the RandomForestClassifier class. This classifier is used in the same way as the other classifiers.
We load the data again as in the previous examples. This time, however, for the Random Forests algorithm, we will not remove the redundant attributes, and we will also keep the nominal attribute Deck (even with missing values) and transform it using one-hot encoding.
# Titanic import and preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
titanic = pd.read_csv("../data/titanic-processed.csv")
titanic = titanic.drop(columns=['ticket', 'cabin'])
titanic['sex'] = titanic['sex'].map({"male": 0, "female": 1})
titanic['has_family'] = titanic['has_family'].map({False: 0, True: 1})
titanic['fare_ordinal'] = titanic['fare_ordinal'].map({"normal": 0, "more expensive": 1, "most expensive": 2})
titanic['age_ordinal'] = titanic['age_ordinal'].map({"child": 0, "young": 1, "adult": 2, "old": 3})
titanic = pd.get_dummies(titanic, columns=['embarked', 'title_short', 'deck', 'title'])
titanic.head()
We can modify the model with several parameters. Since this model consists of a number of decision-tree classifiers trained on different subsets of the input data, most of the parameters are the same as for decision trees (a short example of setting them follows the list):
n_estimators
- number of trees in the "forest"
bootstrap
- True/False - whether bootstrap samples are used when building the trees
oob_score
- True/False - whether or not to use out-of-bag examples to estimate accuracy
criterion
- criterion for choosing attributes - "gini", "entropy"
max_depth
- maximum tree depth (if set to None, full tree is expanded)
min_samples_split
- the smallest number of samples needed for node branching
min_samples_leaf
- the smallest possible number of examples in a leaf node
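For illustration, a minimal sketch of how these parameters might be set when constructing the classifier (the specific values below are only placeholders, not recommended settings):
# Illustrative example: a Random Forest with the parameters described above set explicitly
from sklearn.ensemble import RandomForestClassifier

rf_example = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    criterion="gini",      # attribute selection criterion ("gini" or "entropy")
    max_depth=None,        # None = each tree is grown fully
    min_samples_split=2,   # smallest number of samples needed to split a node
    min_samples_leaf=1,    # smallest possible number of examples in a leaf node
    bootstrap=True,        # use bootstrap samples when building the trees
    oob_score=True,        # estimate accuracy on out-of-bag examples
    random_state=1)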
X_titanic = titanic.drop('survived', axis=1) # create a feature matrix - use all columns except the target attribute and store in X_titanic
y_titanic = titanic['survived'] # create a vector of target attribute values as column 'survived'
print(X_titanic.shape) # for checking, we can print the dimensions of the matrix of values and the vector of the target attribute
print(y_titanic.shape)
from sklearn.model_selection import train_test_split # we import the function train_test_split()
X_train, X_test, y_train, y_test = train_test_split(X_titanic, y_titanic, test_size=0.3, random_state=1) # split the dataset into training and testing parts, so that the testing part will be 30% of the total dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_rf = rf.predict(X_test)
print(f"Presnosť (accuracy) modelu: {accuracy_score(y_test, y_rf)}")
cm = confusion_matrix(y_test, y_rf)
print(cm)
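To make the confusion matrix easier to read, we can optionally wrap it in a pandas DataFrame with labelled rows and columns (a small sketch; the row/column labels chosen here are only for illustration):
# Optional: display the confusion matrix with labelled rows and columns
# (rows = actual class, columns = predicted class)
cm_df = pd.DataFrame(cm,
                     index=['actual 0', 'actual 1'],
                     columns=['predicted 0', 'predicted 1'])
print(cm_df)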
Attribute importance is an essential output of the Random Forest model. We can access the importances through the feature_importances_ attribute of the trained model, and we can sort them and list them together with the corresponding attribute names.
sorted(zip(rf.feature_importances_, X_train.columns), reverse=True)
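The same information can also be visualised, for example as a horizontal bar chart of the most important attributes (a small sketch using the matplotlib/pandas imports above; showing the top 10 is an arbitrary choice):
# Plot the 10 most important attributes as a horizontal bar chart
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=True).tail(10).plot(kind='barh')
plt.xlabel('importance')
plt.title('Random Forest feature importances (top 10)')
plt.show()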
Task 7.5.
Try training the Random Forests model on the Titanic data.
Try different values of the parameters (especially set n_estimators
to different orders of magnitude, e.g. 10, 100, 1000).
Also vary the tree parameter settings: compare a Random Forest whose trees use the settings that came out as optimal in the previous task with a Random Forest consisting of many shallow trees. Do the results differ in any way?
# YOUR CODE HERE