Detection of anomalies and outliers¶
Outliers and anomalies can be searched for in several ways.
Outliers are easy to find, e.g. using different visualizations, if we are looking for them within individual attributes or within combinations of values of a small number of attributes.
Of the simple visualization techniques for detecting outliers, we can use, for example, box plots, histograms, or scatter plots.
Task 9.6.¶
Which visualization techniques supported by the Seaborn library can we directly use for outlier detection? Use selected techniques to detect outliers within the Titanic dataset.
import numpy as np
import pandas as pd
import seaborn as sns
titanic = pd.read_csv("../data/titanic-processed.csv")
titanic.head()
# YOUR CODE HERE
# YOUR CODE HERE
# YOUR CODE HERE
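One possible sketch of a solution - box plots and scatter plots are among the Seaborn techniques that show outliers directly (each plot belongs in its own cell):
g = sns.boxplot(x='fare', data=titanic) # box plot - outliers appear as points beyond the whiskers
g = sns.scatterplot(x='age', y='fare', data=titanic) # scatter plot - outliers lie far from the main cloud of points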
In addition to such visualizations, outliers can be detected using clustering algorithms. In that case, it is advisable to use a method that can set aside small groups of examples lying far from the standard ones. Density-based methods such as DBSCAN are therefore a good choice: we set the distance threshold for points belonging to a cluster so that all standard examples are separated from the distant ones, which we consider outliers.
Using the Titanic dataset as an example, we demonstrate the use of the DBSCAN method for detecting outliers with respect to the age and fare attributes.
# we will preprocess the data in the same way as in the previous exercises:
# - remove attributes that we will not use (e.g. duplicates)
# - map binary and ordinal attributes to numeric indexes
# - transform categorical attributes without an ordering using the One Hot approach
titanic = titanic.drop(columns=['cabin','deck','ticket','title'])
titanic['sex'] = titanic['sex'].map({"male": 0, "female": 1})
titanic['has_family'] = titanic['has_family'].map({False: 0, True: 1})
titanic['fare_ordinal'] = titanic['fare_ordinal'].map({"normal": 0, "more expensive": 1, "most expensive": 2})
titanic['age_ordinal'] = titanic['age_ordinal'].map({"child": 0, "young": 1, "adult": 2, "old": 3})
titanic = pd.get_dummies(titanic, columns=['embarked', 'title_short'])
We will train the DBSCAN model with a chosen value of the eps parameter. We will try to find a value that separates the examples into clusters properly - the goal is to "split off" the outliers from the standard examples. We can then plot the results using the Seaborn library and its scatter plot.
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=100) # initialize the DBSCAN model with the chosen neighborhood distance
labels = dbscan.fit_predict(titanic) # fit the model on the input data and obtain cluster labels (-1 = noise/outlier)
g = sns.scatterplot(x='age', y='fare', hue=labels, data=titanic) # draw a scatter plot, colored according to clusters
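Finding a suitable eps usually takes some experimentation. One common heuristic, sketched below (the choice of k here is illustrative), is to plot the sorted distances of each example to its k-th nearest neighbor; a reasonable eps lies near the "knee" where the curve starts to rise steeply.
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
k = 5 # illustrative choice; often set to DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k).fit(titanic)
distances, _ = nn.kneighbors(titanic) # distances to the k nearest neighbors of each example
k_distances = np.sort(distances[:, -1]) # distance to the k-th neighbor, sorted in ascending order
plt.plot(k_distances) # the "knee" of this curve suggests a reasonable eps
plt.ylabel('distance to the k-th nearest neighbor')
plt.show()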
Besides simple visualization techniques on low-dimensional data, we can use clustering to detect anomalies. These methods are also useful for prediction tasks with a very imbalanced target attribute - the minority class can then be "detected" by means of clustering.
As an example, we will show how to detect suspicious transactions in data describing credit card payments.
Understanding and interpreting the data is difficult - the attributes are transformed and anonymized features; we only know that they describe the payer and the payment itself.
from sklearn.preprocessing import normalize # import the libraries we will use
from sklearn.metrics import confusion_matrix
data = pd.read_csv("../data/creditcard.csv") # load the data from the file into a data frame
data.head() # print the first 5 records
Let's look at the target attribute distribution:
print(data["Class"].value_counts())
g = sns.countplot(x='Class', data=data)
We transform the dataset in the same way as for prediction tasks - we separate the feature matrix from the vector of values of the target attribute Class, which we can then use for verification.
features=data.drop(["Time","Class"],axis=1)
labels=pd.DataFrame(data[["Class"]])
We normalize the feature matrix.
from sklearn.preprocessing import normalize
features=normalize(features)
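Note that normalize() rescales each row to unit (L2) norm, which is not the same as standardizing the individual attributes. A possible alternative (a sketch; whether it works better depends on the data and the distance metric) is per-column standardization with StandardScaler:
from sklearn.preprocessing import StandardScaler
# features = StandardScaler().fit_transform(data.drop(["Time","Class"],axis=1)) # zero mean, unit variance per attribute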
Now let's try to train the clustering model so that it suitably separates the anomalous transactions from the majority ones. We then compare the clustering results with the actual values stored in the vector of target attribute values.
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
#kmeans=KMeans(n_clusters=2, max_iter=300)
#kmeans.fit(features)
#y_kmeans=kmeans.predict(features)
dbscan = DBSCAN(eps=0.5) # initialize DBSCAN with the chosen neighborhood distance
y_dbscan = dbscan.fit_predict(features) # cluster the data; the label -1 marks noise points
The clustering results can also be summarized by the sizes of the resulting clusters. This gives us at least a rough estimate of the quality of the clustering model (the ratio of examples in the clusters). Of course, it says nothing about whether the examples in the individual clusters really correspond to the class assignment.
#clusters, counts = np.unique(y_kmeans, return_counts=True) # we use the unique function to identify different values and return their numbers
#print(np.asarray((clusters, counts)))
clusters, counts = np.unique(y_dbscan, return_counts=True) # we use the unique function to identify different values and return their numbers
print(np.asarray((clusters, counts)))
#print(confusion_matrix(labels,y_kmeans))
y_pred_dbscan = (y_dbscan == -1).astype(int) # recode: noise points (-1) become the positive class 1, clustered points become 0
print(confusion_matrix(labels, y_pred_dbscan))
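From this matrix we can also compute how well the "noise = fraud" labeling works; a quick sketch:
from sklearn.metrics import precision_score, recall_score
print(precision_score(labels, y_pred_dbscan)) # share of noise points that are actual frauds
print(recall_score(labels, y_pred_dbscan)) # share of actual frauds that ended up among the noise points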
Fraud = data[data['Class']==1] # select the records flagged as fraud
Valid = data[data['Class']==0] # select the records flagged as OK
outlier_fraction = len(Fraud)/float(len(Valid)) # the proportion of anomalies (frauds) in the data, which we then use as a parameter of the LOF method
print(outlier_fraction)
We can also use the Local Outlier Factor method to search for outliers.
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=40, metric='euclidean', contamination=outlier_fraction) # create the model; the density around each example is computed from its neighbors, and the contamination parameter gives the expected share of anomalies
y_lof = lof.fit_predict(features) # train the model
#scores_prediction = lof.negative_outlier_factor_
We can evaluate the results of the model using the confusion matrix, but first we need to recode the outputs of the LOF model, which marks anomalies with -1 and regular data with 1. To compare them with the original values of the Class attribute using the confusion_matrix() function, we map regular points (1) to 0 and anomalies (-1) to 1.
y_lof[y_lof == 1] = 0 # regular points: 1 -> 0 (must be done before recoding the anomalies)
y_lof[y_lof == -1] = 1 # anomalies: -1 -> 1
print(confusion_matrix(labels,y_lof))
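For a per-class summary of precision and recall we can also use scikit-learn's classification_report (a sketch):
from sklearn.metrics import classification_report
print(classification_report(labels, y_lof, target_names=['normal', 'fraud']))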