Automated search for model parameters¶
In the last exercise, we showed how a model is tuned by setting the values of its parameters manually. This process can also take place automatically - by generating a number of models with different parameter settings and evaluating them. The aim of this task is to demonstrate such a search for the most suitable parameters of a classification model, specifically the optimal parameters of the k-NN model.
As in the previous task, we will work with the Titanic dataset, which we preprocessed in exercise no. 7. For the purposes of parameter tuning, we will preprocess it in the same way (using the same transformations) as in the previous exercise.
So first we import all the necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
We load the preprocessed Titanic data from exercise no. 7. They are located in the file ../data/titanic-processed.csv.
titanic = pd.read_csv("../data/titanic-processed.csv")
titanic.head()
Since we will create the same model (k-NN) as in the previous tasks, we will drop some of the attributes (those that contain too many missing values or too many distinct categorical values) and transform the remaining ones using One Hot Encoding or by assigning numerical indexes.
titanic = titanic.drop(columns=['cabin','deck','ticket','title'])
titanic['sex'] = titanic['sex'].map({"male": 0, "female": 1})
titanic['has_family'] = titanic['has_family'].map({False: 0, True: 1})
titanic['fare_ordinal'] = titanic['fare_ordinal'].map({"normal": 0, "more expensive": 1, "most expensive": 2})
titanic['age_ordinal'] = titanic['age_ordinal'].map({"child": 0, "young": 1, "adult": 2, "old": 3})
titanic = pd.get_dummies(titanic, columns=['embarked', 'title_short'])
titanic.head()
Since we are creating a k-NN model, it is also appropriate to preprocess the data by normalization. We therefore use MinMaxScaler again to scale the attributes to a uniform range.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
titanic = pd.DataFrame(scaler.fit_transform(titanic), index=titanic.index, columns=titanic.columns)
titanic.head()
We can now try to train the classification model on this preprocessed set. As in the previous exercise, we first divide the data into a feature matrix and a vector of target attribute values.
The target attribute in this task is survived (it expresses whether the given passenger survived the disaster or not). The target attribute will therefore be the vector y, and the remaining columns will form the feature matrix X.
X_titanic = titanic.drop('survived', axis=1) # create feature matrix - we will use all columns except the target attribute and store in X_titanic
y_titanic = titanic['survived'] # vector of target attribute values as the 'survived' column
print(X_titanic.shape)
print(y_titanic.shape)
Now we divide the data into a training and a test set. We will use the train_test_split() function to split the data, with the test set making up 30% of the dataset and the training set the remaining 70%.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_titanic, y_titanic, test_size=0.3, random_state=1) # split the dataset into training and testing parts, so that the testing part will be 30% of the total dataset
We will create an object of the k-NN model without specifying any parameters - this time we will search for them using Grid Search.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier() # initialize the kNN classifier
Grid Search for finding optimal algorithm settings¶
With the GridSearchCV function, we can automate the search for optimal algorithm parameters. Grid Search is an approach that automatically creates a set of models with different settings and validates them using cross-validation.
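Conceptually, Grid Search behaves like a loop over all candidate parameter values: it builds a model for each setting, cross-validates it and keeps the best-scoring one. The following minimal sketch (our own illustration, not part of the original exercise; it reuses X_train and y_train from the split above and the cross_val_score helper) shows the idea that GridSearchCV automates:
# illustrative sketch only - GridSearchCV performs this search (and the bookkeeping) for us
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

best_score, best_k = 0.0, None
for k in range(1, 50):                               # candidate values of the parameter k
    model = KNeighborsClassifier(n_neighbors=k)      # model with the tested parameter value
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')  # 5-fold cross-validation
    if scores.mean() > best_score:                   # remember the best average score
        best_score, best_k = scores.mean(), k
print(best_k, best_score)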
Setting the Grid Search parameters¶
We specify several input parameters to the Grid Search function in Scikit-learn, which define how the automated testing of the model parameters will take place.
In the example below, we will try to find the optimal value of the parameter k for the k-NN model using Grid Search. First, we define the range of values of the parameter k that we want to test.
from sklearn.model_selection import GridSearchCV # import libraries
# define the hyperparameter values
# for the parameter k, we will generate values from 1 to 49
k_range = list(range(1, 50))
print(k_range)
We will create a grid of model parameters. We have to be careful here - the names of the parameters in the grid must correspond to the names of the parameters of the model being tuned.
In this example, we have generated an array of integers that we want to use as the different values for testing the parameter k. In the k-NN model, this parameter is called n_neighbors (when setting the parameter of the k-NN classifier, we created models as, for example, KNeighborsClassifier(n_neighbors=3)), so in the collection that stores the parameters (param_grid) we assign the array of these values to the n_neighbors parameter.
# we will create the so-called parameter grid: we map the generated values to the parameter name
# in this case, we create an n_neighbors entry to which we assign the array of its examined values
param_grid = dict(n_neighbors=k_range)
print(param_grid)
Now that we have the parameter grid we want to explore, we'll run the Grid Search. GridSearchCV has the following parameters:
estimator - the model we want to train (in our case knn)
param_grid - the collection of model parameters and the lists of their values - beware, the parameter grid must be compatible with the parameters of the model!
cv - the number of cross-validation folds
scoring - the metric used to evaluate the models in cross-validation (e.g. accuracy, precision, recall, etc.)
# apply Grid Search - set the parameters:
# model - knn
# parameter array - param_grid
# we will use 5-fold cross-validation
# we will use the accuracy metric for evaluation
grid = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, scoring='accuracy') #set the Grid Search
grid.fit(X_train, y_train) # apply Grid Search on training data
Evaluation of Grid Search results¶
Now we have a trained set of classifiers with different settings, and through the attributes of the grid object we can look at the concrete results of the models with different values of the input parameter k.
Using best_params_ we can see which parameter setting achieved the best result, and best_score_ gives the corresponding cross-validation score.
print("Best hyperparameters:")
print()
print(grid.best_params_)
print()
print(grid.best_score_)
Using cv_results_ we can get various metrics:
mean_test_score - average test score from cross-validation (using the metric defined as a Grid Search parameter)
std_test_score - standard deviation of the test score
rank_test_score - ranking of the model according to its test score
mean_train_score - average score on the training subsets
std_train_score - standard deviation of the score on the training data
mean_fit_time - average training time of the model
std_fit_time - standard deviation of the model training time
mean_score_time - average time needed to evaluate unseen examples
std_score_time - standard deviation of the evaluation time
params - model parameters
We can list all the metrics and information that the cv_results_ object stores:
sorted(grid.cv_results_.keys())
We can also look at specific results of specific models.
print(grid.cv_results_["mean_test_score"][24]) # result for a specific metric and a specific model (index 24, i.e. k = 25)
We can of course look at the results achieved by all models at once. We will list the average cross-validation score, its standard deviation and the parameters of the given model. To make the output more readable, we format it sensibly.
# see complete results
print("Individual scores for individual values of the parameter k:")
print()
means = grid.cv_results_['mean_test_score'] # we assign the results of test score averages to the variable means
stds = grid.cv_results_['std_test_score'] # we assign a list of standard deviations to the stds variable
params = grid.cv_results_['params']
for mean, std, param in zip(means, stds, params): # for all records, we print formatted output - zip pairs up the elements with the same index from several lists so they can be iterated together
    print("%0.3f (+/-%0.03f) for value %s" % (mean, std, param)) # output formatting
print()
We can also look at the specific results of a selected model by indexing into the individual elements of the results, including the partial scores from the individual cross-validation splits. After specifying the metric, we specify the index of the model we want to access.
# we can examine individual models and their specific results
print('Parameter k of model 0:')
print(grid.cv_results_["params"][0])
# model score with index 0 (k=1) for individual cross-validation splits
print()
print('CV score of model 0:')
print(grid.cv_results_["split0_test_score"][0])
print(grid.cv_results_["split1_test_score"][0])
print(grid.cv_results_["split2_test_score"][0])
print(grid.cv_results_["split3_test_score"][0])
print(grid.cv_results_["split4_test_score"][0])
# Average model score with index 0
print()
print('Average score of model 0')
print(grid.cv_results_["mean_test_score"][0])
Visualization of the dependence of the score on the value of the parameter k¶
To better understand how a single parameter influences the resulting model score, we can visualize this dependence. The relationship between the value of the parameter k and the achieved accuracy (the cross-validation test score) can then be plotted very simply using matplotlib.
# using matplotlib, we plot the dependence of the values of the parameter k and the score between these two quantities:
# YOUR CODE HERE
plt.plot( # YOUR CODE HERE )
plt.xlabel(' ... ')
plt.ylabel(' ... ')
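One possible way to create this plot (a sketch of a solution, assuming the grid search over k_range from above) is to plot the mean cross-validation scores against the candidate values of k:
# possible solution sketch - mean cross-validation accuracy for each candidate value of k
plt.plot(k_range, grid.cv_results_['mean_test_score'])
plt.xlabel('Value of k for k-NN')
plt.ylabel('Cross-validated accuracy')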
Simultaneous search of several parameters¶
We can specify several parameters simultaneously to the Grid Search method. The algorithm will thus search for all combinations of the defined parameters.
We will also try to find a combination of other parameters. For the k-NN algorithm, we can additionally set the parameter specifying distance weighting and the parameter specifying the distance metric used. We therefore create another list with values of the weights parameter and a list with values of the metric parameter.
# create parameter lists for k-NN algorithm weights and metrics
weights_range = # YOUR CODE HERE
metric_range = # YOUR CODE HERE
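As a hint, one possible choice of candidate values (a sketch only - the concrete lists are up to you, but these are valid options of KNeighborsClassifier in scikit-learn) could be:
# possible candidate values - uniform vs. distance-based weighting and several common distance metrics
weights_range = ['uniform', 'distance']
metric_range = ['euclidean', 'manhattan', 'minkowski']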
We then add both of these lists, together with the k_range list, to the parameter grid.
The parameter of the k-NN algorithm that specifies the weighting is called weights and the parameter defining the metric is called metric. We assign them the lists of values that we want to examine, i.e. we insert the value lists for the individual parameters into the param_grid collection.
# create a parameter array for the defined parameters and their ranges
param_grid = # YOUR CODE HERE
print(param_grid)
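One possible shape of this parameter grid (a sketch building on the candidate lists above) is a dictionary with one entry per model parameter:
# possible solution sketch - one entry per k-NN parameter, each with the list of its candidate values
param_grid = dict(n_neighbors=k_range, weights=weights_range, metric=metric_range)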
Now we run Grid Search in the same way - we specify the model, the parameter grid, the cross-validation setting and the evaluation metric.
# set the Grid Search parameters
grid = # YOUR CODE HERE
grid.fit(X_train, y_train)
We can access the results again in the same way.
Let's look at the best of the models and its score, and then, in the same way as in the previous task, print out the complete results.
# We will list the parameters and scores for the best of the models
# YOUR CODE HERE
# Print all the results
# YOUR CODE HERE
Using GridSearchCV, we trained models with different parameters on the training set and, at the same time, validated them on the training set using cross-validation. We thus identified the best parameters of the model. If we want to test the model on the test set to verify its quality, or to use it to predict new, unlabeled examples, we have to train a model with the identified parameters again. Then we can test it on the test set and output the classification contingency table.
Task 7.1.¶
Train the model with the best parameters on the training set and test it on the test set. Write a contingency table of results (confusion matrix).
# YOUR CODE HERE
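A possible solution sketch (our own illustration, assuming the best parameter combination stored in grid.best_params_ and the accuracy_score and confusion_matrix helpers from sklearn.metrics):
# possible solution sketch - retrain k-NN with the best parameters found and evaluate it on the test set
from sklearn.metrics import accuracy_score, confusion_matrix

knn_best = KNeighborsClassifier(**grid.best_params_)   # model with the best parameter combination
knn_best.fit(X_train, y_train)                         # train on the training set
y_pred = knn_best.predict(X_test)                      # predict the labels of the test set

print("Accuracy on the test set:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))                # contingency table (confusion matrix)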