Introduction to Scikit-learn¶
The examples in this lesson aim to introduce the Scikit-learn library. Scikit-learn is a machine learning library for the Python language that contains implementations of many classification, regression and clustering algorithms used in data analytics.
The following examples demonstrate basic use of the library, including converting data into a format suitable for model training and the general workflow for training models. The examples use the standard Iris dataset to build and evaluate a simple classification model.
First, as in the previous lessons, we import the libraries we will use.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
We will load the dataset that we will work with into a Pandas data frame. As a check, we print the header of the imported data frame.
# using the load_dataset() function, we load the iris dataset from the repository of standard datasets into the iris data frame
iris = sns.load_dataset("iris")
# print the header of the iris data frame
iris.head(10)
To examine the dataset, we use the Seaborn library and draw a pairwise diagram for the individual attributes, with color differentiation according to the value of the target attribute (species).
# set the default rendering style in the Seaborn library using the set() function
sns.set()
# we draw a pairplot for all attributes with color differentiation according to the species attribute
# the optional height parameter sets the height (in inches) of each facet of the rendered plot
g = sns.pairplot(iris, hue='species', height=2);
Modeling with Scikit-learn¶
Typically, modeling using the Scikit-Learn API consists of the following steps:
- Splitting the data into a feature matrix and a vector of target attribute values
- Splitting the data into training and test sets
- Choosing and importing the Scikit-learn class of the model we want to create
- Setting the parameters of the learning algorithm
- Training the model on the training data using the fit() function

The next steps then depend on the type of model we are training:
- For predictive models - using the model to predict the target attribute of the test data using the predict() function
- For descriptive models - deriving model properties using the transform() or predict() functions
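A minimal end-to-end sketch of these steps follows; the choice of the k-nearest neighbors estimator and the variable names here are illustrative, and the same steps are developed in detail in the rest of this lesson.
# sketch of the typical Scikit-learn workflow (illustrative estimator choice)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                                # feature matrix and target vector
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)  # split into training and test sets
model = KNeighborsClassifier(n_neighbors=3)                      # choose the model and set its parameters
model.fit(X_tr, y_tr)                                            # train the model on the training data
y_pred = model.predict(X_te)                                     # predict the target attribute of the test data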
Splitting the data into a feature matrix and a vector of target attribute values¶
In order to use the data for modeling with the Scikit-Learn library, we need to extract the feature matrix and the array of target attribute values from the Pandas data frame. We will thus separate the columns that will be used to train the models from the target attribute. The image below shows how the data frame will be split. We can implement this using data frame operations.
The naming convention is to store the feature matrix in a variable whose name starts with the letter X. The feature matrix is assumed to be two-dimensional, with dimensions n_samples x n_features (where n_samples is the number of examples and n_features is the number of features (attributes)), and is usually stored as a NumPy array or a Pandas data frame.
In addition to the feature matrix, we also work with a vector of target attribute values, which we usually name with the initial letter y. This array is usually one-dimensional, with length n_samples (the number of examples in the dataset), which corresponds to the first dimension of the feature matrix, and is generally a NumPy array or a Pandas column. A vector of target attribute values can contain numeric values or discrete values representing classes.
# from the iris data frame, we use the drop() function to remove the column with the target attribute and store the resulting data frame in X_iris;
# the axis parameter set to 1 specifies that we remove the entire column from the frame
X_iris = iris.drop('species', axis=1)
# we will print out the dimensionality of the feature matrix (size 150 x 4)
X_iris.shape
# analogously, to create the vector of target attribute values, we assign to y_iris the values of the "species" column from the iris data frame
y_iris = iris['species']
# we print out the dimensionality of the target attribute value vector (dimension 150)
y_iris.shape
Division into training and test sets¶
In this step, we will show how to divide the dataset into a training set, on which we will train the classification model, and a test set, which we will use for its evaluation. We will use the train_test_split() function from the sklearn library. This function divides the dataset into a training and a test set; its parameters are the feature matrix (in this case X_iris) and a vector with the values of the target attribute (y_iris). The random_state parameter initializes the internal random number generator used to assign examples to the training and test sets. The test_size parameter (in the range from 0 to 1) specifies the proportion of the dataset placed in the test set (e.g. a value of 0.5 means that the ratio of the test set to the training set will be 50:50).
from sklearn.model_selection import train_test_split # we import the necessary function from the library
# we will use a function to split the feature matrix and the value vector into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size=0.4, random_state=1)
Using train_test_split() will result in:
- X_train - feature matrix of the training set
- X_test - feature matrix of the test set
- y_train - vector of target attribute values of the training set
- y_test - vector of target attribute values of the test set
We can list the dimensions and headers of the individual matrices and vectors using the shape attribute and the head() function.
# YOUR CODE HERE
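For reference, one possible way to perform this check might look as follows (a sketch; any equivalent commands work, and X_train etc. come from the split above):
# print the dimensions of the training and test feature matrices and target vectors
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# print the header of the training feature matrix
X_train.head()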
Model training¶
At the beginning, it is necessary to import the class of the model that we want to train. In the Scikit-Learn library, each model has a corresponding Python class. In this case, we import the KNeighborsClassifier class corresponding to the k-nearest neighbors classifier.
With the model = KNeighborsClassifier() command, we initialize the model, and the fit() function is used to train it. The parameters of the fit() function are the feature matrix of the training examples and the corresponding vector of target attribute values of the training examples (in this case X_train and y_train created in the previous step).
We can use the created model (the model object) to classify the examples in the test set. We do this by applying the predict() function to the created model. Its parameter is the feature matrix (without the vector of target attribute values) of the test set (in this case X_test). The output is a vector of predictions computed by the model, which we store in y_model.
# importing the class corresponding to the model we will train
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X_train, y_train) # training the model on the training set
y_model = model.predict(X_test) # using the model to predict the target attribute of the test set feature matrix
Evaluation of the model¶
We can verify the accuracy of the learned classifier in a simple way. For this, we use the vector of target attribute values of the test set (y_test), which we compare with the predicted values (the y_model vector). Sklearn provides multiple functions for calculating various classifier quality metrics. In the example below, we calculate the accuracy, i.e. the proportion of correctly classified examples out of all examples.
First, we import the necessary function from the sklearn library. Then we calculate the accuracy metric with the command accuracy_score(y_test, y_model). The parameters of the function are the vector of target attribute values of the test set and the vector of values predicted by the model.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_model)
To see in more detail how the classifier predicted the values, we can use a confusion matrix. It shows how the classifier coped with classification into the individual classes: on the diagonal of the matrix we see the examples that were classified correctly, and off the diagonal the misclassified examples. We compute the matrix using the confusion_matrix() function, whose mandatory parameters are the vector of target attribute values and the vector of predictions.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_model)
print(cm)
To make the matrix easier to read, we can add captions with the values of the target attribute to the individual rows and columns by transforming it into a data frame.
print(pd.DataFrame(
    confusion_matrix(y_test, y_model, labels=['setosa', 'versicolor', 'virginica']),
    index=['true:setosa', 'true:versicolor', 'true:virginica'],
    columns=['pred:setosa', 'pred:versicolor', 'pred:virginica']))
Of course, we can also visualize the matrix using the Seaborn library or other libraries.
g = sns.heatmap(cm, cmap='magma', annot=True) # render using the heatmap() function from seaborn
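The heatmap above labels the axes only with numeric indices. A possible refinement (a sketch using heatmap()'s optional xticklabels and yticklabels parameters; the label order must match the labels passed to confusion_matrix()):
# label the heatmap axes with the class names
labels = ['setosa', 'versicolor', 'virginica']
cm = confusion_matrix(y_test, y_model, labels=labels)
g = sns.heatmap(cm, cmap='magma', annot=True, xticklabels=labels, yticklabels=labels)
plt.xlabel('predicted')  # matplotlib.pyplot was imported as plt above
plt.ylabel('true')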
Division of data into training, validation and test sets¶
In the previous example, we divided the dataset into a training and a test set. Methodologically, it is sometimes appropriate to divide the dataset into a training set, on which we will train the model, a validation set, on which we will tune the model parameters, and a test set, which we will use for evaluation.
In this example, we will show how to use Scikit-learn to split the dataset in this way. We will again use the train_test_split() function as for the training/test split, but this time we apply it twice. First, we divide the dataset into two subsets in the chosen ratio, creating a test set and a "temporary" set, which we then divide in two once more - into a training set and a validation set.
The example below splits the dataset into training/validation/test sets in the ratio 60/20/20.
# first, we split the whole dataset in two, creating a "temporary" set and a test set in the ratio 80/20
# then we divide the "temporary" set into training and validation sets in the same way; to obtain the desired overall ratio, this time we split 75/25
X_temp, X_test, y_temp, y_test = train_test_split(X_iris, y_iris, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1)
The result of these operations will be the following feature matrices:
- X_train (dimension 90 x 4)
- X_val (dimension 30 x 4)
- X_test (dimension 30 x 4)
And vectors of target attribute values:
- y_train
- y_val
- y_test
You can check their dimensions and headers below using the shape attribute and the head() function.
# YOUR CODE HERE
We can then train the model on the training set, use the validation set to optimize the model parameters, and evaluate the model on the test set.
The code below trains a k-NN model on the training set with the parameter value k=3 and evaluates it on the validation set; in Task 5.1 you will train a second model with a different value of k and compare the two.
# importing the class corresponding to the model we will train
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3) # training kNN model with parameter value k=3
model.fit(X_train, y_train) # training the model on the training set
y_model = model.predict(X_val) # we use the model for prediction on the validation set
print(f"Accuracy of the first model: {accuracy_score(y_val, y_model)}") # we report the accuracy on the validation set, to be compared with the second model from Task 5.1
Task 5.1:¶
Write code to train a model model2 of type k-nearest neighbors with parameter value k=5. As in the previous paragraph, test the trained model on the validation set and report its accuracy using accuracy_score.
# YOUR CODE HERE
Then take the model that performed better on the validation set, evaluate it on the test set, and calculate its accuracy.
# YOUR CODE HERE
N-fold cross-validation¶
The scikit-learn library also provides functions for cross-validation. We can use the cross_val_score() function from the model_selection package. Its main parameters are:
- model - the model we want to evaluate
- X_train - feature matrix of the training set
- y_train - vector of target attribute values of the training set
- cv - parameter defining the number of folds of the cross-validation
The cv parameter can be an integer indicating the number of folds (e.g. cv=5 divides the training set into 5 subsets, which are then used to train and test 5 models). Instead of an integer, we can also pass an object that defines a specific way of dividing the training set (a sketch follows the example below). The cross_val_score() function returns an array of metric values - in this case, an array of classifier accuracies.
# we import the necessary function
from sklearn.model_selection import cross_val_score
# we will use cross-validation for the model, dividing the X_train/y_train training set into 5 parts
score = cross_val_score(model, X_train, y_train, cv=5)
print(score) # we will print score values
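We can aggregate the per-fold scores into a single estimate and, as mentioned above, pass a splitter object instead of an integer as cv. A sketch, with StratifiedKFold chosen as one possible splitter (the shuffle and random_state settings are illustrative):
# aggregate the per-fold accuracies into a single estimate
print(score.mean(), score.std())
# instead of an integer, cv can be a splitter object defining how the data is divided
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # stratified folds preserve class proportions
print(cross_val_score(model, X_train, y_train, cv=skf))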
If we want to use a different metric to evaluate the quality of the classifier in cross-validation, or to evaluate using several metrics at once, we need to use the cross_validate() function. In addition to the parameters specifying the model, the training set and the number of folds, it also has a scoring parameter, which contains a list of the metrics we want to calculate for the models. In the following example, the list contains only the accuracy metric (we will explain and show some others in the following exercises). In addition to the selected metrics, the cross_validate() function also reports the model training time (fit_time) and the time needed to evaluate the model on the held-out fold (score_time).
A list of all metrics that can be used to evaluate classification models can be found in the documentation here: https://scikit-learn.org/stable/modules/model_evaluation.html#multimetric-scoring
The output of the function is a dictionary mapping each metric to an array with one value per cross-validation fold. The individual keys can be listed using the command scores.keys().
# we import the necessary functions for cross-validation
from sklearn.model_selection import cross_validate
scoring = ['accuracy'] # we choose the metrics we want to calculate for the models
# execute the cross-validation
# parameter return_train_score specifies whether we also want to return the result of the evaluation on the training set in the results
scores = cross_validate(model, X_train, y_train, scoring=scoring, cv=10, return_train_score=False)
# list the keys of the output dictionary, sorted alphabetically
print(sorted(scores.keys()))
print(scores['test_accuracy']) # we will print the selected field of metrics
Tasks¶
Load the Winequality dataset from the data directory for exercise 10. The dataset describes wines using 11 attributes covering various (mostly chemical) properties. The target attribute that we will predict with our models is the wine quality. All attributes in the dataset are numeric, and the dataset does not contain missing values.
Description of individual attributes:
- fixed acidity - concentration of non-volatile acids
- volatile acidity - concentration of volatile acids
- citric acid - amount of citric acid in the wine
- residual sugar - amount of sugar remaining in the wine after fermentation
- chlorides - salt content of the wine
- free sulfur dioxide - amount of free SO2
- total sulfur dioxide - total amount of SO2 (undetectable in small amounts; in larger concentrations it affects the taste of the wine)
- density - ratio of the wine's density to the density of water
- pH - characterizes the acidity/alkalinity of the wine on a scale from 0 (very acidic) to 14 (very basic); most wines have a pH of 3-4
- sulphates - amount of additives (e.g. antioxidants), which can be measured via the amount of sulfates
- alcohol - alcohol content of the wine
- quality - the target attribute, the resulting "grade" characterizing the quality of the wine; takes integer values 0 - 10
Task 5.2¶
Prepare this dataset for use with the Scikit-learn library:
- First combine the files winequality_white.csv and winequality_red.csv into one data frame
- Split the combined data into a feature matrix and a vector of target attribute values
# YOUR CODE HERE
Task 5.3¶
Divide the prepared dataset into a training and a test set in the ratio 80/20. Print the dimensions of the feature matrix and of the vector of target attribute values for both the training and the test set.
# YOUR CODE HERE
Task 5.4¶
Train a k-NN model on the training set with a selected value of the parameter k. Use 10-fold cross-validation to find the best value of the k parameter.
# YOUR CODE HERE
Task 5.5¶
Test the best model on the test set. Print the accuracy of the classifier and plot the confusion matrix in any way you choose.
# YOUR CODE HERE