Working with the K-Means algorithm
In this example, we will demonstrate how to work with the K-Means algorithm on a simple task.
We will demonstrate the creation of a clustering model on data describing the clients of a wholesaler. The data characterize annual purchases of fresh, dairy, grocery and other products. Each client is described by the following attributes:
- Fresh - annual spending on fresh products
- Milk - annual spending on dairy products
- Groceries - annual spending on grocery products
- Frozen - annual spending on frozen products
- Detergents_Paper - annual spending on detergents and paper products
- Delicassen - annual spending on delicatessen products
- Channel - method of selling goods to customers - Horeca (Hotel/Restaurant/Cafe) or Retail
- Region - nominal values corresponding to Lisbon, Porto or Other
The goal is to create a clustering model that finds groups of different types of customers in the dataset.
First, we import the necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We load the data from the file into the data frame and print the header.
data = pd.read_csv('../data/wholesale.csv')
data.head()
Since the dataset contains 2 categorical attributes (Channel and Region), we transform these columns using the One Hot Encoder approach.
data = pd.get_dummies(data, columns=['Channel', 'Region'])
data.head()
In the case of clustering using K-Means, a very important part is the calculation of the distances of individual examples from the cluster centers. To give all attributes the same importance, we normalize them to the same range. As in classification, we can use MinMaxScaler for the entire data frame (we normalize all attributes).
We can transform the normalized data back into a data frame (but we don't have to; other functions can also work with the numpy array returned by the normalization).
from sklearn.preprocessing import MinMaxScaler # import MinMaxScaler
scaler = MinMaxScaler() # initialize the transformer
scaler.fit(data) # fit it on the input data (computes the minimum and maximum of each attribute)
# the transform step returns a numpy array
# we can - but don't have to - turn it back into a pandas data frame (if we still want to do some preprocessing)
# functions for training models can work with both pandas and numpy
# data_norm = scaler.transform(data)
data_norm = pd.DataFrame(scaler.fit_transform(data), index=data.index, columns=data.columns)
Now we will create a K-Means clustering model and train it on the input data. The K-Means implementation in Scikit-learn allows the following algorithm settings (we have selected only a few):
- n_clusters - corresponds to the value of the k parameter and defines the number of clusters
- max_iter - the maximum number of iterations of the algorithm (default value - 300)
As an output after creating the model, we can use:
- cluster_centers_ - array of centroid coordinates for individual clusters
- labels_ - cluster membership for all input data objects
- inertia_ - the sum of squared distances of the examples to their closest centroid (the criterion minimized by the algorithm)
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
model.fit(data_norm)
# labels = model.predict(data_norm)
We can now print, e.g., the sum of squared distances within the clusters for the created model, or calculate the distances between the individual centroids.
from sklearn.metrics.pairwise import euclidean_distances # import the function euclidean_distances which calculates the distances between the specified points
print("Inertia:") # display the calculated inertia
print(model.inertia_)
print("Mutual centroid distances:")
dists = euclidean_distances(model.cluster_centers_) # calculate the distances between the cluster centers and print them
print(dists)
We can also look at the content of the individual clusters. Using model.labels_ we can see which clusters the model assigned the individual examples from the input dataset to. We can also look at a particular cluster and the examples that belong to it.
print("Samples and their belonging to clusters:")
print(model.labels_) # print the cluster membership for each example
print("Samples from cluster 0:")
cluster_0 = np.where(model.labels_==0) # we only select examples that belong to cluster 0
print(cluster_0) # we print them on the screen
Now we will show how we can compare two clusters with each other.
We create data frames (non-normalized) from the examples belonging to the individual clusters. We can then look at the contents of the clusters and compare their similarities and differences.
cluster_1 = np.where(model.labels_==1) # find examples assigned to cluster 1
data_cluster_0 = data.iloc[cluster_0] # data frame from examples for cluster 0
data_cluster_1 = data.iloc[cluster_1] # dataframe from examples for cluster 1
data_cluster_0.describe()
data_cluster_1.describe()
We can also make visualizations using Seaborn to compare clusters.
Visualize with Seaborn e.g.:
- Distributions of selected attribute values for individual clusters
- Average values of selected attributes of individual clusters
import seaborn as sns
# YOUR CODE HERE
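For example, a minimal sketch of one possible approach (assuming the data frame data, the fitted model and the attribute names from the header above; adapt the selected columns as needed):
# attach the cluster labels to a copy of the non-normalized data
data_clusters = data.copy()
data_clusters['cluster'] = model.labels_
# distribution of a selected attribute (here Fresh) for the individual clusters
sns.boxplot(x='cluster', y='Fresh', data=data_clusters)
plt.show()
# average value of a selected attribute for the individual clusters (barplot shows the mean by default)
sns.barplot(x='cluster', y='Milk', data=data_clusters)
plt.show()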
How to find the best value of number of clusters? We can use the so-called "elbow" method. Similar to Grid Search, we will create a set of models with different parameter values. However, in this case we will also need some criterion that would tell us something about the clusters themselves.
Since we use the K-Means model, we can search for the optimal number of clusters by calculating, for each model, the sum of the squares of the distances of the examples assigned to a cluster from its centroid. We can get this value from the model as one of its outputs, the inertia_ attribute.
Sum_of_squared_distances = [] # empty list for the sums of squared distances
K = range(1,15) # generate a range of K parameters
# in the loop we train models with different settings
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_norm)
    Sum_of_squared_distances.append(km.inertia_)
print(Sum_of_squared_distances)
We can visualize how the "compactness" of the clusters depends on their number.
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Finding the optimal number of clusters')
plt.show()