Scikit-learn - predictive modeling using the k-nearest neighbors algorithm¶
The goal of the examples in this lesson is to demonstrate the creation of a nearest neighbor classification model on the Titanic dataset. We choose survived as the target attribute, and the resulting classification model will be able to predict, based on the passenger data, whether a given passenger survived the sinking of the ship or not.
The goal of the last exercise was to become familiar with the basic procedure for working with the Scikit-learn library. The nearest neighbor algorithm was used as an example classifier. In the sample task (and in the homework) we worked with datasets that contained only numerical attributes, and we did not pre-process the data in any way before modeling.
As part of this exercise, we will look at working with k-NN in more detail on the task of predicting the survived attribute of the Titanic dataset. In addition to setting various parameters of the algorithm, this also includes how the data must be pre-processed before applying k-NN and the different ways in which the quality of the model can be evaluated.
So first we import all the necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Now load the pre-processed Titanic data from the previous exercises. They are located in the file ../data/titanic-processed.csv. Use the head() function to list the first 5 examples so that we can see which attributes are described in the dataset.
# YOUR CODE HERE
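One possible solution sketch (assuming the file path above and, as in the rest of the exercise, a data frame named titanic):
titanic = pd.read_csv('../data/titanic-processed.csv')  # load the pre-processed Titanic data
titanic.head()  # list the first 5 examples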
The Titanic dataset contains categorical features in addition to numeric ones. The k-NN model is not able to process such features directly. Therefore, in order to use the k-nearest neighbors model, we first have to transform the data appropriately.
In addition, we can remove some attributes from the dataset. The ticket number will probably not have a significant impact on the classification, so we can remove the ticket attribute. We can also remove the title attribute, since we will use the title_short attribute, which was created from it by transformation. We will also remove the deck and cabin attributes, as they contain a large number of missing values.
Use the drop function on the titanic data frame to carry out the above transformations.
# YOUR CODE HERE
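A minimal sketch of one possible solution (assuming the column names match those listed above):
titanic = titanic.drop(columns=['ticket', 'title', 'deck', 'cabin'])  # remove the attributes that will not be used
titanic.head()  # check the remaining attributes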
Feature transformation¶
In the following steps, we will show how to transform the attributes that are in an unsuitable form.
The following attributes contain string values and cannot be used as such in k-NN modeling:
sex
embarked
has_family
fare_ordinal
title_short
age_ordinal
Transformation of binary (categorical) attributes to numeric using LabelEncoder¶
The attributes sex and has_family take only 2 values. So we can use a simple transformation that replaces their values with integers. We can use LabelEncoder from the Scikit-learn library for such an operation. Using the fit_transform() function on the attribute passed as a parameter, it replaces each distinct value with an integer index.
Alternatively, we could perform the same transformation directly on the data frame using the map() function. The commented-out section contains code that implements the same operation with map() as with the LabelEncoder.
from sklearn.preprocessing import LabelEncoder # function imports
titanic['sex'] = LabelEncoder().fit_transform(titanic['sex']) # create a LabelEncoder, apply it to `sex`, write output to titanic[`sex`]
titanic['has_family'] = LabelEncoder().fit_transform(titanic['has_family']) # create a LabelEncoder, apply it to `has_family`, write output to titanic[`has_family`]
## same transformation using map() function
# titanic['sex'] = titanic['sex'].map({"male": 0, "female": 1})
# titanic['has_family'] = titanic['has_family'].map({False: 0, True: 1})
titanic.head() # write first 5 records
Transformation of categorical attributes to numeric ones using the One Hot Encoding method¶
Not all attributes should be transformed with a simple encoder (by assigning a numerical value to each distinct value of a categorical attribute). For categorical attributes with more than 2 values, we would "inadvertently" create an ordering among them. Some models (including k-NN) could then treat the transformed attribute as ordinal, even though there was no ordering in the original attribute before the transformation. If we want to avoid this, we can use so-called One Hot Encoding. In this transformation, a new binary attribute is derived for each value of the categorical attribute, which specifies whether a given example takes on that value or not.
We can implement such encoding in Python using the Pandas get_dummies() function. Its parameters are the data frame we are working with and the list of columns we want to transform.
In our case, we transform the embarked and title_short attributes in this way, since in both cases they are categorical attributes that have no ordering.
titanic = pd.get_dummies(titanic, columns=['embarked', 'title_short']) # specify the features to be binary encoded for get_dummies function
titanic.head() # write first 5 records
Transformation of ordinal categorical attributes to numerical ones¶
After the transformations above, we still have 2 attributes that need to be converted into numeric ones. In both cases, these are ordinal categorical attributes, i.e. attributes with a clearly defined ordering. For such attributes, we can use the same kind of encoding as for the sex or has_family attributes, but because of the existing order of the values, we must specify the encoding manually.
For the fare_ordinal and age_ordinal attributes, we will define how to replace the original values with indices. In the case of the fare_ordinal attribute, its values are ordered normal < more expensive < most expensive, so we assign the indices 0, 1 and 2 respectively, thus preserving the ordering. We will proceed analogously with the age_ordinal attribute.
titanic['fare_ordinal'] = titanic['fare_ordinal'].map({"normal": 0, "more expensive": 1, "most expensive": 2}) # fare_ordinal feature transformation
titanic['age_ordinal'] = titanic['age_ordinal'].map({"child": 0, "young": 1, "adult": 2, "old": 3}) # age_ordinal feature transformation
titanic.head()
Modeling¶
We can now try to train the classification model on this pre-processed set. As in the previous exercise, we first divide the data into a feature matrix and a vector of target attribute values.
The target attribute in this task is survived (it expresses whether the given passenger survived the disaster or not). The target attribute will therefore be the vector of values y, and the remaining columns will form the feature matrix X.
# YOUR CODE HERE
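One possible way to do the split (the names X and y follow the notation above):
X = titanic.drop('survived', axis=1)  # feature matrix: all columns except the target attribute
y = titanic['survived']               # vector of target attribute values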
Now we divide the data into a training and a test set. It is essential to do all pre-processing (attribute transformations, etc.) before this step; if we do it later, we must be careful to apply the same procedures to both the training and test sets. Both sets must be in the same format so that the trained model can be evaluated on the test set.
We will use the train_test_split() function to split the data into training and test sets; the test set will contain 30% of the data and the training set the remaining 70%.
# YOUR CODE HERE
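A possible sketch of the split (the fixed random_state is only an illustrative choice for reproducibility):
from sklearn.model_selection import train_test_split

# 70% of the data for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)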
Evaluation using the accuracy, precision and recall metrics¶
Next, we will train the k-NN model with default parameters. We will train the model on the training set (X_train and y_train), use the predict() function to evaluate its quality on the test set, and display its accuracy, this time together with the precision and recall metrics.
# YOUR CODE HERE
# store the model prediction to y_model variable (to ensure the following code compatibility)
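A possible sketch of training and prediction with the default parameters (the model is stored in the model variable so that it can be reused in the ROC example below):
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()       # k-NN classifier with default parameters (k = 5)
model.fit(X_train, y_train)          # train the model on the training set
y_model = model.predict(X_test)      # predict the classes of the test examples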
from sklearn.metrics import accuracy_score, precision_score, recall_score # import libraries
print(f"Presnosť (accuracy) modelu: {accuracy_score(y_test, y_model)}") # compute and write the accuracy metric
print(f"Presnosť (precision) modelu: {precision_score(y_test, y_model)}") # compute and write the precision metric
print(f"Návratnosť (recall) modelu: {recall_score(y_test, y_model)}") # compute and write the recall metric
Using the confusion_matrix() function, we can see how the classifier classified the individual classes and where the biggest errors occurred.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_model)
print(cm)
Evaluating models using the ROC curve¶
We can use Scikit-learn to calculate the ROC curve with the roc_curve() function. Its input parameters are:
- the vector of target attribute values of the test set
- the vector of scores predicted by the model (in our case the predicted probabilities of the positive class)
- the pos_label parameter, which indicates which value of the target attribute is considered positive
The outputs of the function are fpr, tpr and thresholds, which represent the false positive rate (the relative frequency of falsely positive examples), the true positive rate (the relative frequency of truly positive examples), and the threshold values.
The auc() function then calculates the AUC value from them.
We can then plot the ROC curve itself using matplotlib.
from sklearn.metrics import roc_curve,auc # import functions
# using roc_curve we can compute:
# fpr - false positive rate
# tpr - true positive rate
# thresholds - cut-off values
y_model_probs = model.predict_proba(X_test) # compute the predicted probability of each class for every test example
preds = y_model_probs[:,1] # keep only the probabilities of the positive class
fpr, tpr, threshold = roc_curve(y_test, preds) # compute FPR, TPR
roc_auc = auc(fpr, tpr) # AUC computation
# plot the ROC curve using matplotlib
plt.title('ROC Curve') # figure title
# plot the ROC curve in green (we can pass the colour name as a parameter) and write the AUC coefficient in the legend
plt.plot(fpr, tpr, color='green', label = 'ROC curve of the model (AUC = %0.2f)' % roc_auc)
plt.legend(loc = 'lower right') # we will set the rendering of the legend at the bottom right
plt.plot([0, 1], [0, 1],linestyle='--', color='red') # plot the diagonal in red (r) dashed color
plt.xlim([0, 1]) # x axis values from 0 to 1
plt.ylim([0, 1]) # y axis values from 0 to 1
plt.ylabel('True Positive Rate') # y axis label
plt.xlabel('False Positive Rate') # x axis label
plt.show() # show plot
Since the k-NN classifier is sensitive to the scales of numerical attributes, it is advisable to ensure that the individual attributes have the same influence during the distance calculation, i.e. that because of different scales some attributes are not suppressed while others dominate.
This can be achieved by normalizing the numeric attributes, which maps a numeric attribute from its original range to a defined interval. It is good to use the same type of normalization for all attributes, so that they all take values in the same range.
Normalization of attributes¶
Examining the transformed dataset, we find that the two attributes fare and age take on values on a significantly different scale from the other attributes.
We can apply normalization using the MinMaxScaler transformation. We can apply it either to selected attributes or to the entire data frame; in our case, we apply it to all attributes. We store the normalized data in the normData data frame.
from sklearn.preprocessing import MinMaxScaler # import libraries
scaler = MinMaxScaler() # initialize the transformation
normData = pd.DataFrame(scaler.fit_transform(titanic), index=titanic.index, columns=titanic.columns) # apply the transformer to the titanic data frame and convert the resulting array back into a data frame with the same structure as the original
normData.head() # write 5 records
## if we want to transform only the selected attributes, we do it like this:
# titanic['fare'] = pd.DataFrame(scaler.fit_transform(pd.DataFrame(titanic['fare'])), columns=['fare'])
# titanic['age'] = pd.DataFrame(scaler.fit_transform(pd.DataFrame(titanic['age'])), columns=['age'])
# normData = titanic
# normData.head()
Task 6.1¶
Divide the transformed dataset, in the same way as the untransformed one, into a training and a test set (the test set being 30% of the total). Train the k-NN model on the transformed dataset and compare its accuracy with that of the model trained on the untransformed data. Use precision, recall and confusion_matrix for the comparison.
# YOUR CODE HERE
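A possible solution sketch; the variable names with the _n suffix are only illustrative, and the split assumes the same 30% test size as before:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

X_n = normData.drop('survived', axis=1)  # feature matrix of the normalized data
y_n = normData['survived']               # target attribute of the normalized data
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X_n, y_n, test_size=0.3, random_state=1)

model_n = KNeighborsClassifier()         # default k-NN on the normalized data
model_n.fit(X_train_n, y_train_n)
y_model_n = model_n.predict(X_test_n)

print(f"Accuracy: {accuracy_score(y_test_n, y_model_n)}")
print(f"Precision: {precision_score(y_test_n, y_model_n)}")
print(f"Recall: {recall_score(y_test_n, y_model_n)}")
print(confusion_matrix(y_test_n, y_model_n))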
Task 6.2¶
Plot the ROC curves of two (or more) models (e.g. two k-NN models with different values of the parameter k) in one graph and compare them.
# YOUR CODE HERE
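A possible sketch comparing two k-NN models with different k on the same test split as above (the values of k and the colours are only illustrative; roc_curve, auc and KNeighborsClassifier are imported earlier):
plt.title('ROC curves of k-NN models')
for k, colour in [(3, 'green'), (11, 'blue')]:
    m = KNeighborsClassifier(n_neighbors=k)
    m.fit(X_train, y_train)
    probs = m.predict_proba(X_test)[:, 1]              # probability of the positive class
    fpr_k, tpr_k, _ = roc_curve(y_test, probs)
    plt.plot(fpr_k, tpr_k, color=colour, label='k = %d (AUC = %0.2f)' % (k, auc(fpr_k, tpr_k)))
plt.plot([0, 1], [0, 1], linestyle='--', color='red')  # diagonal of a random classifier
plt.legend(loc='lower right')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()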
Task 6.3¶
Now try to tune the model on the normalized data by adjusting its parameters. As we mentioned, several parameters of the k-NN algorithm can be set, e.g. the value of k; for the KNeighborsClassifier classifier this is the parameter:
n_neighbors - corresponds to the value of k, the number of nearest neighbors according to which the unlabeled examples are classified
You set it to a specific value when initializing the model as follows: model = KNeighborsClassifier(n_neighbors=3)
Try to follow the instructions from exercise no. 10, i.e. start with the simplest model (parameter k = 1) and keep increasing k until the quality of the model stops improving.
# YOUR CODE HERE
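A possible sketch (assuming the normalized split X_train_n, y_train_n, X_test_n, y_test_n from Task 6.1; the upper bound of 20 is only an illustrative choice):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

for k in range(1, 21):
    m = KNeighborsClassifier(n_neighbors=k)
    m.fit(X_train_n, y_train_n)
    acc = accuracy_score(y_test_n, m.predict(X_test_n))
    print(f"k = {k}: accuracy = {acc:.3f}")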
Task 6.4¶
Train the k-NN model on the Titanic dataset, using cross-validation on the training set to validate it. When training the models, examine the influence of other parameters on the resulting quality of the model:
weights - weighting; the value uniform gives each of the nearest neighbors an equal vote, while the value distance weights their influence according to their distance
metric - specifies the distance metric used, e.g. euclidean or manhattan
You set these e.g. like this: model = KNeighborsClassifier(n_neighbors=10, weights='uniform', metric='manhattan'). As part of the cross-validation, calculate the average score of the models.
Try to find the best combination of parameters and then evaluate the best of the models on the test set. On the test set, calculate the accuracy, precision and recall metrics and output the confusion matrix.
# YOUR CODE HERE
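A possible sketch of the parameter search (the parameter grid and the 5-fold cross-validation are only illustrative choices; the normalized split from Task 6.1 is assumed):
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

best_score, best_params = 0, None
for k in [1, 3, 5, 7, 10, 15]:
    for weights in ['uniform', 'distance']:
        for metric in ['euclidean', 'manhattan']:
            m = KNeighborsClassifier(n_neighbors=k, weights=weights, metric=metric)
            score = cross_val_score(m, X_train_n, y_train_n, cv=5).mean()  # average score over the folds
            if score > best_score:
                best_score, best_params = score, (k, weights, metric)

print('Best parameters (k, weights, metric):', best_params, 'average score:', best_score)

# evaluate the best combination on the test set
k, weights, metric = best_params
best_model = KNeighborsClassifier(n_neighbors=k, weights=weights, metric=metric)
best_model.fit(X_train_n, y_train_n)
y_best = best_model.predict(X_test_n)
print(f"Accuracy: {accuracy_score(y_test_n, y_best)}")
print(f"Precision: {precision_score(y_test_n, y_best)}")
print(f"Recall: {recall_score(y_test_n, y_best)}")
print(confusion_matrix(y_test_n, y_best))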