Working with the K-Means algorithm
In this example, we will demonstrate how to work with the K-Means algorithm on a simple task.
We will demonstrate the creation of a clustering model on data describing the clients of a wholesaler. The data characterize annual purchases of fresh, dairy, grocery and other products. Each client is described by the following attributes:
- Fresh - annual spending on fresh products
- Milk - annual spending on dairy products
- Groceries - annual spending on grocery products
- Frozen - annual spending on frozen products
- Detergents_Paper - annual spending on detergents and paper products
- Delicassen - annual spending on delicatessen products
- Channel - method of selling goods to customers - Horeca (Hotel/Restaurant/Cafe) or Retail
- Region - nominal values corresponding to Lisbon, Porto or Other
The goal is to create a clustering model that finds groups of different types of customers in the dataset.
First, we import the necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We load the data from the file into the data frame and print the header.
data = pd.read_csv('../data/wholesale.csv')
data.head()
Since the dataset contains 2 categorical attributes (Channel and Region), we transform these columns using the One Hot Encoder approach.
data = pd.get_dummies(data, columns=['Channel', 'Region'])
data.head()
In the case of clustering using K-Means, a very important part is the calculation of the distances of individual examples from the cluster centers. To give all attributes the same importance, we normalize them to the same range. As in classification, we can use MinMaxScaler for the entire data frame (we normalize all attributes).
We can transform the normalized data back into a data frame (but we don't have to; other functions can also work with the numpy array returned by the normalization).
from sklearn.preprocessing import MinMaxScaler # import MinMaxScaler
scaler = MinMaxScaler() # initialize the transformer
scaler.fit(data) # fit it on the input data (computes the minimum and maximum of each attribute)
# the transform step returns a numpy array
# we can - but don't have to - turn it back into a pandas data frame (if we still want to do some preprocessing)
# functions for training models can work with both pandas and numpy
# data_norm = scaler.transform(data)
data_norm = pd.DataFrame(scaler.fit_transform(data), index=data.index, columns=data.columns)
Now we will create a K-Means clustering model and train it on the input data. The K-Means implementation in Scikit-learn allows the following algorithm settings (we have selected only a few):
- n_clusters - corresponds to the value of the k parameter and defines the number of clusters
- max_iter - the maximum number of iterations of the algorithm (default value - 300)
As an output after creating the model, we can use:
- cluster_centers_ - array of centroid coordinates for individual clusters
- labels_ - cluster membership for all input data objects
- inertia_ - the sum of squared distances of the examples to their closest centroid (the criterion minimized by the algorithm)
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
model.fit(data_norm)
# labels = model.predict(data_norm)
We can now print, e.g., the sum of squared distances within the clusters for the created model, or calculate the distances between the individual centroids.
from sklearn.metrics.pairwise import euclidean_distances # import the function euclidean_distances which calculates the distances between the specified points
print("Inertia:") # display the calculated inertia
print(model.inertia_)
print("Mutual centroid distances:")
dists = euclidean_distances(model.cluster_centers_) # calculate the distances between the cluster centers and print them
print(dists)
We can also look at the content of the individual clusters. Using model.labels_ we can see which clusters the model assigned the individual examples from the input dataset to. We can also look at a particular cluster and the examples that belong to it.
print("Samples and their belonging to clusters:")
print(model.labels_) # print the cluster membership for each example
print("Samples from cluster 0:")
cluster_0 = np.where(model.labels_==0) # we only select examples that belong to cluster 0
print(cluster_0) # we print them on the screen
Now we will show how we can compare two clusters with each other.
We create data frames (non-normalized) from the examples belonging to the individual clusters. We can then look at the contents of the clusters and compare their similarities and differences.
cluster_1 = np.where(model.labels_==1) # find examples assigned to cluster 1
data_cluster_0 = data.iloc[cluster_0] # data frame from examples for cluster 0
data_cluster_1 = data.iloc[cluster_1] # dataframe from examples for cluster 1
data_cluster_0.describe()
data_cluster_1.describe()
We can also make visualizations using Seaborn to compare clusters.
Visualize with Seaborn e.g.:
- Distributions of selected attribute values for individual clusters
- Average values of selected attributes of individual clusters
import seaborn as sns
# YOUR CODE HERE
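For example, a minimal sketch of one possible approach (assuming the data frame data, the fitted model and the attribute names from the header above; adapt the selected columns as needed):
# attach the cluster labels to a copy of the non-normalized data
data_clusters = data.copy()
data_clusters['cluster'] = model.labels_
# distribution of a selected attribute (here Fresh) for the individual clusters
sns.boxplot(x='cluster', y='Fresh', data=data_clusters)
plt.show()
# average value of a selected attribute for the individual clusters (barplot shows the mean by default)
sns.barplot(x='cluster', y='Milk', data=data_clusters)
plt.show()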
How to find the best value of number of clusters? We can use the so-called "elbow" method. Similar to Grid Search, we will create a set of models with different parameter values. However, in this case we will also need some criterion that would tell us something about the clusters themselves.
Since we use the K-Means model, we can search for the optimal number of clusters by calculating, for each model, the sum of the squares of the distances of the examples assigned to a cluster from its centroid. We can get this value from the model as one of its outputs, the inertia_ attribute.
Sum_of_squared_distances = [] # empty list for the sums of squared distances
K = range(1,15) # generate a range of K parameters
# in the loop we train models with different settings
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_norm)
    Sum_of_squared_distances.append(km.inertia_)
print(Sum_of_squared_distances)
We can visualize how the "compactness" of the clusters depends on their number.
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Finding the optimal number of clusters')
plt.show()