Predictive modeling - regression (using linear regression)
In this example, we demonstrate a sample solution to a regression-type predictive problem, that is, a problem in which the predicted attribute is continuous. The modeling workflow and the data preparation are the same as for classification tasks (splitting the data into training and test sets, then training the model).
First, we import some necessary libraries.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
We load the Diabetes dataset using the `load_diabetes` function. The `diabetes` variable holds the data in vectorized form: the `data` entry contains the values of the input attributes, and the `target` entry contains the values of the target attribute.
from sklearn import datasets
diabetes = datasets.load_diabetes()
To display the data more readably, we can convert it to a data frame, although we can also work directly with the vector form.
`diabetes['data']` contains 10 columns encoding patient parameters (age, sex, BMI, average blood pressure, etc.). The predicted variable `diabetes['target']` expresses a quantitative measure of disease progression one year after the initial measurement.
df = pd.DataFrame(diabetes['data'])
df.head()
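The data frame above has numeric column labels. As a minimal sketch, assuming the `feature_names` entry that `load_diabetes` returns alongside `data` and `target`, the columns can be given readable names; the variable name `df_named` is introduced here only for illustration.

```python
import pandas as pd
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Use the feature names shipped with the dataset as column labels.
df_named = pd.DataFrame(diabetes['data'], columns=diabetes['feature_names'])

# Append the target so inputs and output can be inspected side by side.
df_named['target'] = diabetes['target']

print(df_named.shape)   # (number of rows, 10 input columns + 1 target column)
print(df_named.head())
```

This frame is meant for inspection only; the modeling steps can keep using the arrays directly.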
Using the `train_test_split` function, we split the data into a training set and a test set. Since `diabetes` already holds vectorized data, with the predictor attributes in `data` and the predicted attribute in `target`, we can assign them directly to the variables X (feature matrix) and y (vector of target attribute values).
from sklearn.model_selection import train_test_split
X = diabetes['data']
y = diabetes['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
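The split above is random, so the metrics computed later will differ from run to run. A possible variant, assuming reproducibility is wanted, fixes the `random_state` parameter of `train_test_split`; the seed value 42 is an arbitrary choice made for this sketch.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X = diabetes['data']
y = diabetes['target']

# random_state fixes the shuffling; 42 is an arbitrary illustrative seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The two parts together contain every row of the original data.
print(X_train.shape, X_test.shape)
print(len(y_train) + len(y_test) == len(y))
```

With the same seed, repeated runs produce the same training and test rows, which makes results easier to compare across experiments.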
We import the `LinearRegression` class, which implements the linear regression model. We initialize the model and train it on the training set.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
model = lm.fit(X_train, y_train)
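Once fitted, the model stores one coefficient per input attribute plus an intercept. The following sketch shows one way to inspect them through the `coef_` and `intercept_` attributes of `LinearRegression`; the fixed `random_state` and the pairing with `feature_names` are assumptions added for this illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes['data'], diabetes['target'], test_size=0.3, random_state=42
)

lm = LinearRegression()
model = lm.fit(X_train, y_train)  # fit returns the estimator itself

# The intercept is the prediction when every input attribute equals zero.
print("intercept:", model.intercept_)

# One learned weight per input attribute.
for name, coef in zip(diabetes['feature_names'], model.coef_):
    print(f"{name}: {coef:.2f}")
```

Because `fit` returns the estimator itself, `model` and `lm` refer to the same object.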
After training, we can evaluate the model on the test set. Using the `predict` function, we calculate the predictions for the test data and store them in the `y_model` variable.
To compare the actual and predicted values, we can create a table (a Pandas data frame) listing the actual test-set values `y_test` alongside the values predicted by the model `y_model`.
y_model = model.predict(X_test)
summary_df = pd.DataFrame()
summary_df['target'] = y_test
summary_df['prediction'] = y_model
print(summary_df)
summary_df.plot(kind="scatter", x="target", y="prediction")
As with classification tasks, we can calculate metrics from these results to compare regression models or to express the quality of a given model. Commonly used metrics for regression models include:
- mean absolute error (MAE) - the average magnitude of the prediction errors; every error contributes in proportion to its size, so large errors receive no extra penalty
- mean squared error (MSE) - the mean of the squared differences between the predicted and actual values; because the errors are squared, MSE is affected much more by large errors than by small ones
- R2 score - the coefficient of determination; 1 indicates perfect predictions, 0 corresponds to a model that always predicts the mean of the target, and the value can be negative for a model that performs worse than that
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_model)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_model)
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_model)
print("MAE:", mae)
print("MSE:", mse)
print("R2 Score:", r2)
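To make the definitions of the three metrics concrete, the sketch below computes them by hand with NumPy and compares the results with the scikit-learn functions. The arrays `y_true` and `y_pred` are toy values chosen for illustration only; they are not taken from the Diabetes data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not from the Diabetes dataset).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

# MAE: mean of the absolute errors.
mae_manual = np.mean(np.abs(errors))

# MSE: mean of the squared errors.
mse_manual = np.mean(errors ** 2)

# R2: 1 minus the ratio of the residual sum of squares
# to the total sum of squares around the mean of y_true.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MAE:", mae_manual, mean_absolute_error(y_true, y_pred))
print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R2: ", r2_manual, r2_score(y_true, y_pred))

# Always predicting the mean gives an R2 of 0;
# predictions worse than that give a negative R2.
y_mean = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, y_mean)
r2_bad = r2_score(y_true, y_true[::-1])
print("R2 of mean baseline:", r2_baseline)
print("R2 of reversed targets:", r2_bad)
```

The last two lines illustrate why the R2 score is not limited to the interval from 0 to 1: a model that does worse than the mean baseline receives a negative score.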