Clustering - evaluation and interpretation of clusters - example 2
The procedure from the previous example can be demonstrated on another dataset. This one characterizes customers, each of whom is described by the following attributes:
- Gender - the customer's sex
- Age - the customer's age
- Annual Income - annual income in dollars
- Spending Score - an index describing the customer's tendency to buy
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read the dataset and explore first 5 rows
data = pd.read_csv('../data/customers.csv')
data.head()
First, we prepare the data by simple preprocessing:
- remove the customer identifier
- encode the gender attribute
# YOUR CODE HERE
data = data.drop("CustomerID", axis=1)
data["Gender"] = data["Gender"].map({"Male": 0, "Female": 1})
data.head()
As in the previous example, we identify a suitable number of clusters and then train the K-Means model for that number. We will use both criteria, inertia and silhouette, for the decision.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
results_inertia = []
results_silhouette = []
K = range(2,10)
for k in K:
    model = KMeans(n_clusters=k)
    model.fit(data)
    predictions = model.predict(data)
    results_inertia.append(model.inertia_)
    results_silhouette.append(silhouette_score(data, predictions))
print("Inertia:")
print(results_inertia)
print("Silhouette:")
print(results_silhouette)
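To make the first criterion more concrete, the following minimal sketch recomputes inertia by hand: it is the sum of squared distances from each example to the centre of its assigned cluster (lower is better, whereas silhouette ranges from -1 to 1 and higher is better). The data here is synthetic and only stands in for the customers table; the names `X`, `assigned_centres` and `manual_inertia` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (assumption: NOT the customers dataset)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up, for every example, the centre of the cluster it was assigned to
assigned_centres = model.cluster_centers_[model.labels_]

# Inertia: sum of squared distances to the assigned centres
manual_inertia = np.sum((X - assigned_centres) ** 2)

print(manual_inertia, model.inertia_)
```

The two printed values should agree up to floating-point rounding.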
To get a better overview, we can plot both curves. Using subplot(), we can draw the two plots side by side in a single figure.
plt.figure(figsize=(12, 4)) # define the figure size (wide enough to fit 2 plots next to each other)
plt.subplot(1, 2, 1) # 1-2-1 means a grid of 1 row and 2 columns; draw into the 1st cell
plt.plot(K, results_inertia, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Inertia')
plt.subplot(1, 2, 2) # 1-2-2 means a grid of 1 row and 2 columns; draw into the 2nd cell
plt.plot(K, results_silhouette, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette')
plt.title('Silhouette')
plt.show() # call show() only once at the end, so that both plots are rendered together
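Besides reading the plots by eye, the silhouette criterion can also be evaluated programmatically by taking the number of clusters with the highest score. The sketch below does this on synthetic blobs with four known centres, not on the customers data; `best_k` and the blob layout are assumptions made for illustration. Inertia has no such maximum, since it keeps decreasing as k grows, so there the "elbow" still has to be judged visually.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with 4 well-separated groups (assumption)
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

K = range(2, 10)
scores = []
for k in K:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# Pick the k whose silhouette score is the highest
best_k = list(K)[int(np.argmax(scores))]
print(best_k)
```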
Having chosen a suitable number of clusters, we now train the K-Means model with that number and use it to assign each example to a cluster. We store the assignments in a variable, which will later serve as the target for a classification model that describes the clusters.
model = KMeans(n_clusters=6)
model.fit(data)
labels = model.predict(data)
# print out the predictions
print(labels)
Now we train the classifier, which we will use to describe the clusters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
dt = DecisionTreeClassifier(max_depth=4) # initialize the decision tree
dt.fit(data, labels) # train the model with the cluster labels as the target
y_dt = dt.predict(data) # predict on the same data the tree was trained on
# compute the metrics (they show how well the tree reproduces the clusters)
print(f"Accuracy: {accuracy_score(labels, y_dt)}")
cm = confusion_matrix(labels, y_dt) # confusion matrix
print(cm)
We export the tree model in Graphviz DOT format so that it can be visualized, and then try to derive rules that describe the individual clusters.
from sklearn import tree
# write the tree in Graphviz DOT format to a text file
with open("decision_tree.txt", "w") as f:
    tree.export_graphviz(dt, feature_names=data.columns.values, class_names=["0","1","2","3","4","5"], out_file=f)
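If no Graphviz viewer is at hand, the rules can also be read as plain text. The sketch below uses `export_text` from `sklearn.tree` on a small synthetic example; the data, the feature names `feature_a` and `feature_b`, and the tree depth are assumptions chosen only for illustration, not the notebook's actual model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption: NOT the customers dataset)
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)
feature_names = ["feature_a", "feature_b"]  # hypothetical attribute names

# Cluster the data, then fit a shallow tree that predicts the cluster labels
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)

# Render the tree as indented if/else style rules
rules = export_text(dt, feature_names=feature_names)
print(rules)
```

Each path from the root to a `class:` line can be read as one rule describing a cluster.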
Task 9.5.
For the identified clusters, create visualizations of the average values of the numerical attributes. Compare them with the extracted rules, or with what can be seen in the tree structure of the model.
# YOUR CODE HERE
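One possible starting point, shown here on a tiny hypothetical table rather than on the customers data, is to group the rows by cluster label and average each attribute. The values and the `cluster` column name are made up for illustration only.

```python
import pandas as pd

# Tiny hypothetical table (assumption: values are invented for illustration)
toy = pd.DataFrame({
    "Age": [20, 22, 50, 52],
    "Annual Income": [30, 34, 80, 84],
    "cluster": [0, 0, 1, 1],
})

# Average value of every attribute within each cluster
cluster_means = toy.groupby("cluster").mean()
print(cluster_means)
```

Calling `cluster_means.plot(kind="bar")` on such a table would draw one group of bars per cluster, which can then be compared with the rules read from the tree.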