Clustering¶
In the following examples, we will describe how to work with clustering models in the Scikit-learn library using sample demonstration tasks. In these tasks we will use either generated or simplified data so that the outputs can be clearly visualized in graphs (the datasets will therefore contain only 2 attributes so that they can be plotted, which is rare in practice).
The examples demonstrate the use of:
- k-means methods
- grid/density-based methods
- hierarchical clustering
K-means methods¶
To demonstrate how k-means methods are used and how they work, we will choose the K-Means method (its parameters and settings are described below).
First, we import the libraries we will work with. We will also need Seaborn and matplotlib for plotting the outputs and numpy for working with arrays.
import matplotlib.pyplot as plt # we import matplotlib for plotting
import seaborn as sns; sns.set() # import seaborn for more advanced visualizations and set the environment
import numpy as np # we import numpy for working with arrays
# we will set rendering of visualizations in Jupyter notebooks
%matplotlib inline
Now we will prepare the data for the demonstration of this method. To show how K-Means works, we will generate sample synthetic data with two numerical attributes. To generate the data, we will use the make_blobs function from the sample data generators of the Scikit-learn library. It is intended precisely for creating sample datasets with a specific distribution. The make_blobs function will create a defined number of data points (n_samples) in four clusters (centers). We can also define how "densely" the generated points should be spread around the centers (cluster_std).
The output of the function is a feature matrix (a numpy array) and a vector of target attribute values, which in this case represents the actual cluster membership of each example. This vector is not used when building clustering models (in practice the real values are often not even available). If such values are available (e.g. from an expert's opinion), they can be used to express the quality of the created clusters (an example will be part of the next lesson).
from sklearn.datasets import make_blobs # import the data generator function (in older Scikit-learn versions it was located in sklearn.datasets.samples_generator)
# we will generate 300 records with a defined distribution
# in four groups, with a defined deviation from the centers
blobs, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# the output is a matrix of example flags (blobs) and a vector of true cluster membership values y_true
# blobs and y_true are numpy arrays
# print one record from the generated data points
print(blobs[:1])
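Both outputs are ordinary numpy arrays, so we can also quickly verify their structure. A small sketch (only a check, not part of the task itself):
# the feature matrix contains 300 examples described by 2 attributes
print(blobs.shape)
# the vector of true cluster memberships contains one value per example
print(y_true.shape)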
We can easily display the generated data using the Seaborn library. Since the data are described by two numerical attributes, we can use scatterplot for visualization. It works both with pandas data frames and with numpy arrays, so we pass the individual columns of the input numpy array as parameters.
# plot the points with a scatterplot
# the column with index 0 goes on the x-axis
# the column with index 1 goes on the y-axis
g = sns.scatterplot(x=blobs[:, 0], y=blobs[:, 1])
Now let's try to build a KMeans model from the Scikit-learn library.
The way of creating models is very similar to creating classification models with this library. We first import the necessary library, initialize the model with the defined parameters and train it on the input data using the fit function. To assign examples to a cluster, we can then use the predict function, whose output this time is the cluster to which the given object (or objects) belongs. The difference is that, since clustering is unsupervised learning, we do not work with values of a target attribute, because they do not exist. The functions for creating clustering models therefore use only the feature matrix, without the vector of target attribute values, both during training (fit) and prediction (predict).
A mandatory parameter of the K-Means algorithm is the number of clusters we are looking for. This is defined by the value of the n_clusters
parameter.
The example below will create a K-Means model on the generated data. Given the structure of the generated data, we will try to train the model for 4 clusters. In the next step we will apply the predict function to the training data - we want to assign all examples from the dataset to the created clusters so that we can clearly visualize the clustering results. We can also obtain individual or all centroids from the created model using the cluster_centers_ attribute.
from sklearn.cluster import KMeans # first we import the necessary library, in this case KMeans for the given model
kmeans = KMeans(n_clusters=4) # initialize the K-Means model, set the parameter value K - the number of clusters - to 4
kmeans.fit(blobs) # train the model on the input data
y_kmeans = kmeans.predict(blobs) # we assign all the data to the created clusters
centers = kmeans.cluster_centers_ # we load the centroids of the created clusters into the variable centers
print("All centroids:") # print centroids for all clusters
print(centers)
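Since cluster_centers_ is an ordinary numpy array, an individual centroid can be obtained simply by indexing it. A minimal sketch (the index 0 is chosen only as an example):
print("Centroid of cluster 0:") # print the centroid of a single cluster by indexing the centers array
print(centers[0])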
If we want to classify an unknown example into one of the clusters, we again use the predict function. As its parameter we must pass the example transformed into a numpy array (in the same shape as the training data). The predict function then returns the identifier of the cluster into which the model classifies the unknown example.
# we will create a sample example and transform it into a numpy array in the desired shape (1 row, 2 columns)
x = np.array([1.98686, 3.76876]).reshape(1, 2)
prediction = kmeans.predict(x) # we will predict its membership in the kmeans model cluster
print(prediction) # print the output on the screen
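Because the data were generated, we also have the vector of true cluster memberships (y_true). As mentioned above, it is not used during training, but it can be used to express the quality of the created clusters. The sketch below is only a small preview using the Adjusted Rand Index from sklearn.metrics (a value close to 1 means the found clusters correspond well to the true groups); the evaluation of clustering quality itself will be part of the next lesson:
from sklearn.metrics import adjusted_rand_score # import an external evaluation metric
# compare the true memberships (y_true) with the assignments produced by the model (y_kmeans)
print(adjusted_rand_score(y_true, y_kmeans))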
In this sample case, since we are working with two attributes, we can use the Seaborn library to visualize the created clusters and their centroids. We will use the training data (blobs), their cluster assignments (y_kmeans) and the computed centroids (centers). We will therefore plot 2 scatter plot visualizations in one graph:
- first we plot the data points, colored according to their clusters
- then we plot the cluster centers
In both cases we plot the values of the first and second column (the first and second attribute, or the centroid coordinates) on the X and Y axes.
# draw the first scatterplot - the X and Y axes correspond to the first and second column of the data array
# we want to distinguish individual points by color according to clusters (we have cluster membership in y_kmeans)
# we set the color palette to Set1
g=sns.scatterplot(x=blobs[:, 0], y=blobs[:, 1], hue=y_kmeans, palette="Set1")
# we also draw the centroids in the created visualization, again with a scatterplot, since they have the same structure as the input data
# we plot the points of the centers array (the coordinates are its two columns)
# we set the size of the points with the s parameter (we want to highlight the centroids a little)
# we define the color and the plotting symbol with the marker parameter
g=sns.scatterplot(x=centers[:, 0], y=centers[:, 1], s=150, color=".1", marker="X")
Now we will show what happens if we do not choose a suitable value of the k parameter. We will create a model with 6 clusters, assign the examples from the input set to the created clusters in the same way and visualize the structure of the clusters with a scatter plot.
This demonstrates the need for a correct choice of this clustering parameter.
# we create a new model, this time for 6 clusters
kmeans2 = KMeans(n_clusters=6)
kmeans2.fit(blobs) # train the model on the input data
y_kmeans2 = kmeans2.predict(blobs) # we will use the model to assign data to the created clusters
# we plot the data with a scatterplot, differentiated by color according to clusters
g = sns.scatterplot(x=blobs[:, 0], y=blobs[:, 1], hue=y_kmeans2, s=50, palette='Set1')
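How could we choose a more suitable value of k if we did not know the structure of the data in advance? One common heuristic is the so-called elbow method: we train K-Means for several values of k and plot the inertia_ attribute of each model (the sum of squared distances of the examples from their nearest centroid). The sketch below only illustrates this idea and is not part of the original task:
# train K-Means for several values of k and store the inertia of each model
inertias = []
ks = range(1, 10)
for k in ks:
    model = KMeans(n_clusters=k, random_state=0)
    model.fit(blobs)
    inertias.append(model.inertia_)
# plot the inertia against k - the "elbow" of the curve suggests a suitable number of clusters
plt.plot(ks, inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('inertia')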
Grid or density-based clustering¶
As we said, k-means methods are not suitable in cases where we cannot assume that the objects form spherical clusters. In such cases it is advisable to use another type of clustering, e.g. grid- or density-based methods. The example below demonstrates on generated sample data what problems K-Means can have and how they can be handled by another model.
For this task, we will again generate data of a specific shape. This time we will not generate "heaps" of data points, but will use the make_moons function to create data grouped in two dimensions in the shape of crescents. As in the previous example, we can define how many data points should be created and with how much noise.
from sklearn.datasets import make_moons # we import the necessary library for dataset generation
# we will generate the input data
# 200 examples
moon_data, y_true = make_moons(200, noise=.05, random_state=0)
# moon_data again contains a numpy array of examples described by 2 attributes and y_true the actual value of cluster membership
# again, using the scatter plot, we plot the generated data
# on the x-axis the first column and on the y-axis the second
sns.scatterplot(x=moon_data[:, 0], y=moon_data[:, 1])
Let's try to train the K-Means model on such a dataset and see how it copes with clustering on such data.
# we initialize the K-Means model, for 2 clusters
kmeans_moons = KMeans(n_clusters=2)
kmeans_moons.fit(moon_data) # train the model on the input data
labels = kmeans_moons.predict(moon_data) # we assign the input data to clusters
Now we can visualize how K-Means coped with data of this shape by plotting the cluster assignments with a scatter plot.
# plot the data points using a scatter plot
# we differentiate by color according to belonging to clusters (labels)
sns.scatterplot(x=moon_data[:, 0], y=moon_data[:, 1], hue=labels, s=50, palette='Set1')
From the output, we can see that K-Means cannot identify clusters well in such structured data.
So we will try to use a different type of method, a method based on density, which should be able to detect clusters of non-spherical shapes as well.
We will use the DBSCAN method from the Scikit-learn library, which we train on the input data in the same way. Its parameter is the eps value - the largest distance between two examples at which they are still considered to be from the same neighborhood. The DBSCAN model in Scikit-learn also provides a fit_predict function, which trains the model and immediately assigns the input data to the clusters they belong to.
from sklearn.cluster import DBSCAN # we import the necessary libraries
dbscan = DBSCAN(eps=0.3) # initialize the DBSCAN model with the defined maximum neighborhood distance (eps=0.3 is an example value that works well for this generated dataset)
labels = dbscan.fit_predict(moon_data) # train the model on the input data and assign the data to clusters
# using a scatter plot, we draw the data points and color them according to their belonging to the clusters
sns.scatterplot(x=moon_data[:, 0], y=moon_data[:, 1], hue=labels, s=50, palette='Set1')
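Unlike K-Means, DBSCAN does not have to assign every example to a cluster: examples that do not have enough neighbors within the eps distance are marked as noise and get the label -1. A short check of the output (only illustrative):
# examples labelled -1 are considered noise by DBSCAN
print("Number of clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Number of noise points:", np.sum(labels == -1))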
Hierarchical clustering¶
Now we will show an example of hierarchical agglomerative clustering. We will use the data from the first task.
First, we draw a dendrogram - a hierarchy of clusters - for this data using the dendrogram function from the Scipy library. Before we can draw it, we first need to create its structure. The linkage function creates a hierarchical clustering model, which we can then use to plot the dendrogram. We pass the input data and the cluster joining algorithm (the method parameter) to the function.
import scipy.cluster.hierarchy as shc # we import the necessary libraries
plt.figure(figsize=(30, 20)) # set the size of the rendered image
plt.title("Dendogram:") # we will write its name
links = shc.linkage(blobs, method='ward') # create a hierarchical cluster model
dend = shc.dendrogram(links) # draw the dendrogram
In this way we visualize the structure of the dendrogram. Using Scikit-learn we can then create an agglomerative model for a defined number of clusters, which corresponds to cutting the dendrogram hierarchy at a certain level.
from sklearn.cluster import AgglomerativeClustering # we import the necessary libraries
aggcl = AgglomerativeClustering(n_clusters=5) # set the parameters and the defined number of clusters (where the agglomerative model "stops")
labels_agg = aggcl.fit_predict(blobs) # train the model and assign the data to the created clusters
We can then visualize the model in the same way using the Seaborn library.
sns.scatterplot(x=blobs[:,0], y=blobs[:,1], hue=labels_agg, palette='Set1')
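The number of clusters passed to AgglomerativeClustering corresponds to the level at which we "cut" the dendrogram. As a small additional sketch, we can create the same model with the Ward joining algorithm set explicitly (the same method we used for the linkage function) and with 4 clusters, which matches the structure visible in the dendrogram; the parameter values are only illustrative:
# the linkage parameter selects the cluster joining algorithm (ward is also the default value)
aggcl4 = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_agg4 = aggcl4.fit_predict(blobs) # train the model and assign the data to clusters
sns.scatterplot(x=blobs[:,0], y=blobs[:,1], hue=labels_agg4, palette='Set1')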