Clustering - evaluation and interpretation of clusters - example 1¶
The following example demonstrates clustering of transactional data.
This time we will focus on other criteria for evaluating the quality and compactness of clusters, and on interpreting the clusters not with the help of visualizations, but with the help of decision trees.
First, we import the necessary libraries for working with data frames and arrays, and for plotting graphs.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We load the input data from the file into the data frame. We will list the first 5 records.
data = pd.read_csv('../data/wholesale.csv')
data.head()
As in the previous exercise, we transform the data using one-hot encoding (both attributes contain categorical data) and print the first 5 records of the transformed dataset.
data = pd.get_dummies(data, columns=['Channel', 'Region'])
data.head()
Since we will be training a K-Means model, we will normalize the numerical attributes using MinMaxScaler.
from sklearn.preprocessing import MinMaxScaler # import MinMaxScaler
scaler = MinMaxScaler() # initialize the transformer
scaler.fit(data) # fit the transformer to the input data (computes the per-column minima and maxima)
# applying the scaler returns the output as a numpy array
# we can - but do not have to - convert it back to a pandas data frame
# the model-training functions can work with both pandas and numpy
# data_norm = scaler.transform(data)
# note: fit_transform fits and transforms in one step, so the explicit fit above is not strictly necessary
data_norm = pd.DataFrame(scaler.fit_transform(data), index=data.index, columns=data.columns)
The Silhouette criterion¶
In addition to the sum of squared distances from the cluster center, we can use several other metrics that quantify the quality of individual clusters. It makes sense to use them:
- when the clustering method does not create cluster representatives
- when we want to use a different criterion than the one optimized by the algorithm itself
One of the frequently used criteria is the Silhouette index. It is a coefficient calculated for each sample and averaged over the entire data set. The coefficient combines the average intra-cluster distance with the average distance to the nearest other cluster. The coefficient takes values from -1 to 1 (for each sample). A value close to -1 means that the sample is probably assigned to the wrong cluster, values around 0 indicate a sample lying on the boundary between two clusters, and values closer to 1 mean that the sample fits its cluster well and is clearly distinguishable from the others. The silhouette_score function in scikit-learn computes the average value over all samples. This allows multiple clustering models (with different numbers of clusters) to be compared against each other.
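For each sample, the coefficient is computed from two quantities: $a$, the mean distance to the other samples in the same cluster, and $b$, the mean distance to the samples in the nearest other cluster. This is the standard definition that scikit-learn's silhouette_score implements:
$$s = \frac{b - a}{\max(a,\ b)}$$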
As in the case of finding a suitable number of clusters using the sum of squared distances, we can create several models in a loop and evaluate them with this criterion.
from sklearn.cluster import KMeans # Import the library for KMeans
from sklearn.metrics import silhouette_score # Import the function for calculating Silhouette
# we will use the Silhouette score for the number of clusters
# we can then compare the ideal numbers of clusters for different criteria
K = range(2,10) # generate the parameter array (number of clusters)
results = []
# in cycle we create a clustering model for each value of the parameter, the number of clusters corresponds to the value of the iterator
for k in K:
    model = KMeans(n_clusters=k)
    model.fit(data_norm)
    predictions = model.predict(data_norm) # to calculate the silhouette, we assign the examples from the input data to clusters
    results.append(silhouette_score(data_norm, predictions)) # calculate the score and append it to the list in which we store all scores
# we can print the results on the screen
# the list contains the Silhouette scores for the parameters, in the order in which they were created
print(results)
We can visualize the Silhouette score in the same way as in the case of the sum of the squares of the distances from the centroid.
Task 9.1.¶
Use matplotlib as in the tasks from the previous exercise to plot the Silhouette score as a function of the number of clusters.
# YOUR CODE HERE
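A possible sketch (one of several reasonable variants), using the K and results lists computed above:
plt.plot(K, results, marker='o') # Silhouette score for each number of clusters
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette score')
plt.title('Silhouette score vs. number of clusters')
plt.show()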
Now we can train the model with the best score.
model = KMeans(n_clusters=6) # we train a model for the specified number of clusters
model.fit(data_norm) # we learn on the training set
labels = model.predict(data_norm) # we sort the input data into clusters
# we can inspect the cluster membership of the samples by printing the predictions
print(labels)
The frequency of the individual clusters within the input data can also be essential information. We can easily calculate it from labels by counting the occurrences of the distinct values in the array of clustering results.
clusters, counts = np.unique(labels, return_counts=True) # the unique function identifies the distinct values and returns their counts
print(np.asarray((clusters, counts))) # to format the output more readably, we combine them into a numpy array
Interpreting clusters using decision trees¶
One of the possibilities (besides examining attribute values, etc.) for interpreting the resulting clusters is to build classification models on top of them, which makes it possible to describe the examples belonging to each cluster. Decision trees are ideal for this purpose - with their help, we can derive rules that describe the conditions under which examples belong to the individual clusters.
In such a case, the process is as follows: with clustering, we in effect "generate" the target attribute from the classification point of view. The individual clusters then essentially represent its individual values - the classes. We can therefore attach a "target attribute" to the input data that expresses which cluster each example belongs to. Over such data we can then build a tree model - one that is visualizable and comprehensible, since our goal is to understand the created clusters and ideally also to describe them, e.g. using a combination of attribute values.
When we use the input data (data) and the vector of cluster assignments (labels), we essentially obtain the pair of a feature matrix and a vector of target attribute values, as used in classification. The data is thus prepared so that we can use it to train classification models.
# the data frame data essentially corresponds to the feature matrix
# the labels array corresponds to the vector of target attribute values
X_train = data
y_train = labels
print(X_train.shape)
print(y_train.shape)
Task 9.2.¶
Train a decision tree classifier on the input data. If necessary, further pre-process the data. Choose a method for tuning the parameters, or estimate suitable model parameters.
# YOUR CODE HERE
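A minimal sketch of one possible approach: a DecisionTreeClassifier tuned with GridSearchCV. The parameter grid and random_state below are illustrative assumptions, not the only reasonable choice.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
# illustrative grid - other depths and criteria may work equally well
param_grid = {'max_depth': range(2, 11), 'criterion': ['gini', 'entropy']}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)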
Task 9.3.¶
Train the model with suitable parameters on the input data and display the confusion_matrix for it. Compare the resulting matrix with the cluster frequencies computed earlier.
# YOUR CODE HERE
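A minimal sketch, assuming a tree with parameters like those found in the previous task (the max_depth value here is illustrative):
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
tree_model = DecisionTreeClassifier(max_depth=5, random_state=1) # illustrative parameters
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_train)
print(confusion_matrix(y_train, y_pred)) # rows correspond to the true cluster labels
# the row sums can be compared with the cluster frequencies computed earlier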
Task 9.4.¶
Use the code from the examples in the previous exercises and try to visualize the created model. Can you use the knowledge derived from its structure to describe the individual classes, i.e. the clusters?
# YOUR CODE HERE
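A minimal sketch using plot_tree from scikit-learn, assuming the tree_model trained in the previous sketch (the figure size is an arbitrary choice):
from sklearn.tree import plot_tree
plt.figure(figsize=(20, 10))
plot_tree(tree_model,
          feature_names=list(X_train.columns),
          class_names=[str(c) for c in clusters], # cluster identifiers serve as class names
          filled=True)
plt.show()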