Classification using probabilistic models - Naive Bayes
The probabilistic classifier based on Naive Bayes is implemented in Scikit-learn by the GaussianNB class. We use it in the same way as the other models. Unlike most of them, Naive Bayes has practically no hyperparameters that need tuning (the model-tuning step is omitted), so its accuracy depends mainly on the data and their preprocessing.
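For the Naive Bayes classifier, we remove redundant attributes: the model assumes that the attributes are independent of each other given the class, and strongly correlated attributes violate this assumption. It is not necessary to normalize the attributes.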
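Both statements above can be checked on a small synthetic example. The sketch below is only an illustration under stated assumptions: the data and all variable names (`X_toy`, `y_toy`, `point`, ...) are invented for the example and are unrelated to the Titanic file. It suggests that a duplicated attribute makes GaussianNB count the same evidence twice, which pushes the predicted probability towards 0 or 1, while standardizing the attributes leaves the predicted probabilities practically unchanged.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, balanced two-class data with one informative attribute
# (illustrative only, not the Titanic data).
n = 200
x_class0 = rng.normal(loc=0.0, scale=1.0, size=(n, 1))
x_class1 = rng.normal(loc=1.5, scale=1.0, size=(n, 1))
X_toy = np.vstack([x_class0, x_class1])
y_toy = np.array([0] * n + [1] * n)

# 1) Redundant attribute: the same column is used twice.
X_dup = np.hstack([X_toy, X_toy])
point = np.array([[1.5]])
p_single = GaussianNB().fit(X_toy, y_toy).predict_proba(point)
p_dup = GaussianNB().fit(X_dup, y_toy).predict_proba(np.hstack([point, point]))
print("single attribute:    ", p_single)
print("duplicated attribute:", p_dup)

# 2) Scaling: add a second attribute on a much larger scale, then standardize.
large_scale = 100.0 * rng.normal(size=(2 * n, 1)) + 50.0 * y_toy[:, None]
X_two = np.hstack([X_toy, large_scale])
X_scaled = StandardScaler().fit_transform(X_two)
p_raw = GaussianNB().fit(X_two, y_toy).predict_proba(X_two)
p_std = GaussianNB().fit(X_scaled, y_toy).predict_proba(X_scaled)
print("max probability difference after scaling:", np.abs(p_raw - p_std).max())
```

The tiny remaining difference in the second experiment comes from the variance smoothing that GaussianNB applies for numerical stability, not from the scaling itself.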
# Titanic import and preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
titanic = pd.read_csv("../data/titanic-processed.csv")
titanic = titanic.drop(columns=['cabin','ticket','title', 'deck', 'fare_ordinal', 'age_ordinal'])
titanic['sex'] = titanic['sex'].map({"male": 0, "female": 1})
titanic['has_family'] = titanic['has_family'].map({False: 0, True: 1})
titanic = pd.get_dummies(titanic, columns=['embarked', 'title_short'])
titanic.head()
X_titanic = titanic.drop('survived', axis=1) # feature matrix: all columns except the target attribute
y_titanic = titanic['survived'] # target vector: values of the 'survived' column
print(X_titanic.shape) # as a check, print the dimensions of the feature matrix and of the target vector
print(y_titanic.shape)
from sklearn.model_selection import train_test_split # we import the function train_test_split()
X_train, X_test, y_train, y_test = train_test_split(X_titanic, y_titanic, test_size=0.3, random_state=1) # split the dataset into training and testing parts, so that the testing part will be 30% of the total dataset
Task 7.6.
Try training the GaussianNB() model on the Titanic data. Since the model has practically no hyperparameters to tune, test the effect of preprocessing and attribute selection on its accuracy. With which preprocessing of the data did you achieve the best results?
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb = nb.predict(X_test)
from sklearn.metrics import accuracy_score
print(f"Model accuracy: {accuracy_score(y_test, y_nb)}")
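One possible way to approach the task is to score several attribute subsets with the same procedure and compare the results. The sketch below assumes a hypothetical helper `compare_feature_sets` and a small synthetic DataFrame standing in for the Titanic data; the column names `signal`, `signal_copy` and `noise` are invented for the illustration. It uses 5-fold cross-validation instead of a single train/test split, so that the comparison depends less on one particular random split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def compare_feature_sets(X, y, feature_sets, cv=5):
    """Return the mean cross-validated accuracy of GaussianNB per attribute subset.

    Hypothetical helper for illustration: `feature_sets` maps a label
    to a list of column names of the DataFrame `X`.
    """
    results = {}
    for label, columns in feature_sets.items():
        scores = cross_val_score(GaussianNB(), X[columns], y, cv=cv, scoring="accuracy")
        results[label] = scores.mean()
    return results


# Synthetic stand-in data: 'signal' carries class information,
# 'signal_copy' is a redundant near-duplicate, 'noise' is irrelevant.
rng = np.random.default_rng(1)
y_demo = pd.Series(rng.integers(0, 2, size=300), name="target")
signal = y_demo.to_numpy() + rng.normal(scale=0.8, size=300)
X_demo = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.05, size=300),
    "noise": rng.normal(size=300),
})

feature_sets = {
    "all attributes": ["signal", "signal_copy", "noise"],
    "without redundant copy": ["signal", "noise"],
    "signal only": ["signal"],
}
results = compare_feature_sets(X_demo, y_demo, feature_sets)
for label, acc in results.items():
    print(f"{label}: {acc:.3f}")
```

On the Titanic data, the same helper could be called with `X_titanic`, `y_titanic` and lists of the real column names.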
Naive Bayes belongs to the so-called probabilistic classifiers. This means that, in addition to the predicted class, we can look at the probabilities with which the tested example belongs to the individual classes.
prediction = nb.predict_proba(X_test[:1])
print(prediction)
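The columns of the array returned by `predict_proba` follow the order of the classes stored in the fitted model's `classes_` attribute, and each row sums to 1. The following minimal sketch illustrates this on synthetic data; the data and the names `X_syn`, `y_syn` and `model` are assumptions made for the example, not the Titanic set. It also checks that `predict` returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (illustrative stand-in).
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

model = GaussianNB().fit(X_tr, y_tr)
proba = model.predict_proba(X_te)

print("class order:", model.classes_)
print("probabilities for the first test example:", proba[0])

# Each row is a probability distribution over the classes.
row_sums = proba.sum(axis=1)

# The predicted class is the one with the highest probability.
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]
agrees = np.array_equal(labels_from_proba, model.predict(X_te))
print("argmax of probabilities matches predict():", agrees)
```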