Basic settings¶
We import the necessary libraries. The following cell should be run first, before the `pd`, `np`, `plt` and `sns` objects are used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# display of graphs directly in the cells of the notebook
%matplotlib inline
# initialize the seaborn library
sns.set()
Options for loading and writing data¶
The `pandas` library supports reading and writing data in various formats, such as text CSV files or Microsoft Excel files. For each of the supported formats it is possible to adjust the settings, such as the character used to separate values, the decimal point character, or whether the file contains a header with column/attribute names on the first line.
You can read more about the options for loading and writing data in the `pandas` library here.
# for CSV files you can change e.g. value separator (parameter `delimiter`) and decimal point character (`decimal`)
# if the file does not have a header with attribute names on the first line (parameter `header=None`), the attributes can be named
# in the `names` parameter
data = pd.read_csv("../data/iris.csv", delimiter=";", decimal=",", header=None, names=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"])
data.head()
# similarly, you can load data from an Excel file (by default, the table from the first worksheet is loaded)
data = pd.read_excel("../data/iris.xlsx")
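If the workbook contains several sheets, a specific one can be selected with the `sheet_name` parameter, which accepts a sheet name or a zero-based position. A minimal sketch (the sheet name "iris" is only an assumed example, not taken from the file above):
# load a specific worksheet by name (the sheet name here is an assumption)
data = pd.read_excel("../data/iris.xlsx", sheet_name="iris")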
To write data to a file, use the data frame methods directly.
# by default, a column with row indexes is also written to the file; if we want to write only the attributes without the index, we set `index=False`
data.to_csv("../data/iris_processed.csv", index=False)
# when writing, we can choose which attributes are written
data.to_excel("../data/iris_processed.xlsx", index=False, columns=["sepal_length", "sepal_width"])
Joining tables¶
The `concat` function is used for basic data joining; it allows you to join data by rows or by columns, aligned according to the indexes of the individual rows. The `pandas` library also supports joining tables using keys, similar to relational databases and the SQL language.
# first, we will load and pre-process the data we will work with: World Happiness Report from 2015 and 2016
data_2015 = pd.read_csv("../data/2015.csv") # data for 2015
data_2015 = data_2015[["Country", "Rank", "Score"]] # we only select `Country`, `Rank` and `Score` columns
# we will rename the columns `Rank` and `Score` to `Rank 2015` and `Score 2015`
data_2015 = data_2015.rename(columns={"Rank":"Rank 2015", "Score":"Score 2015"})
data_2015.head()
# similarly, we will pre-process the data for 2016
data_2016 = pd.read_csv("../data/2016.csv")
data_2016 = data_2016[["Country", "Rank", "Score"]]
data_2016 = data_2016.rename(columns={"Rank":"Rank 2016", "Score":"Score 2016"})
data_2016.head()
# when `concat` joins by columns (`axis=1`), the rows are aligned according to their indexes
# if the numbers of rows in the joined tables differ, the missing positions are filled with missing values
data_all = pd.concat([data_2015, data_2016], axis=1)
data_all.tail()
# we will use the `merge` function to join the tables according to keys
# the basic parameters set the joined data (left and right tables) and the attributes that will be used as the key
data_all = pd.merge(left=data_2015, right=data_2016, left_on="Country", right_on="Country")
data_all.tail()
# we check the number of rows and missing values in the joined table
print(len(data_2015), len(data_2016), len(data_all))
data_all.isna().sum()
By default, only those rows whose key is found in both joined tables are included in the resulting table (i.e. the `inner` method). The other options are similar to SQL:
`merge` method | SQL | Description |
---|---|---|
`left` | LEFT OUTER JOIN | Only keys from the left table |
`right` | RIGHT OUTER JOIN | Only keys from the right table |
`outer` | FULL OUTER JOIN | Keys found in the left or the right table (union) |
`inner` | INNER JOIN | Keys found in both the left and the right table (intersection) |
# joining according to the keys of the left table (if the key is not in the right table, the missing values are added)
data_all = pd.merge(left=data_2015, right=data_2016, left_on="Country", right_on="Country", how="left")
print(len(data_2015), len(data_2016), len(data_all))
data_all.isna().sum()
# union of the keys from both tables
data_all = pd.merge(left=data_2015, right=data_2016, left_on="Country", right_on="Country", how="outer")
print(len(data_2015), len(data_2016), len(data_all))
data_all.isna().sum()
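To see which rows did not find a match in the other table, we can filter the joined table by the missing values (a small sketch using the columns created above):
# countries present only in the 2015 data (no 2016 match) and vice versa
print(data_all[data_all["Score 2016"].isna()]["Country"])
print(data_all[data_all["Score 2015"].isna()]["Country"])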
Data normalization¶
With some data analysis methods it is necessary to compare attribute values on the same scale, i.e. the data needs to be normalized. One of the basic methods is rescaling to a selected interval (most often 0-1). For data normalization we will use objects from the `sklearn` library, which contains a number of advanced methods and algorithms for data analysis.
# we will import the necessary objects from the `sklearn` library
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# we load the `iris` dataset
iris = pd.read_csv("../../6/data/iris.csv")
# we will store the numerical attributes in the `iris_data` variable
iris_data = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
# we save the names of species (classes) in the variable `iris_labels`
iris_labels = iris["species"]
# we will use the MinMaxScaler object to normalize the data to the interval 0-1
# create a normalization object
min_max = MinMaxScaler()
# we transform the data; the result is a numerical `numpy` array, which we convert back to a `pandas` data frame
norm_array = min_max.fit_transform(iris_data)
# we will create a new data frame from the `numpy` array, we will keep the column names from the original `iris_data` set
iris_data_norm = pd.DataFrame(norm_array, columns=iris_data.columns)
iris_data_norm.describe()
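The same rescaling can also be computed directly with `pandas` operations, which makes it clear what `MinMaxScaler` does: each value is shifted by the column minimum and divided by the column range. A small sketch using the `iris_data` frame from above:
# manual min-max normalization: (x - min) / (max - min), computed column by column
manual_norm = (iris_data - iris_data.min()) / (iris_data.max() - iris_data.min())
manual_norm.describe()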
Standardization is a special method of data normalization in which the data is rescaled to have a mean of 0 and a standard deviation of 1. For each original value, the mean of the whole column is subtracted and the result is divided by the standard deviation of the original data.
standard = StandardScaler()
std_array = standard.fit_transform(iris_data)
iris_data_std = pd.DataFrame(std_array, columns=iris_data.columns)
iris_data_std.describe()
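As with min-max normalization, the standardization described above can be reproduced manually. A small sketch; note that `StandardScaler` uses the population standard deviation (`ddof=0`), while `pandas` defaults to the sample standard deviation:
# manual standardization: (x - mean) / std, with the population standard deviation
manual_std = (iris_data - iris_data.mean()) / iris_data.std(ddof=0)
manual_std.describe()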
# we draw a histogram of the original, normalized and standardized data for the `sepal_length` attribute
# we create a figure divided into 1 row and 3 columns, we display one histogram in each column
fig, axis = plt.subplots(1, 3)
pl = iris_data["sepal_length"].hist(ax=axis[0]) # `ax` parameter determines in which part of the figure the graph will be displayed
pl = iris_data_norm["sepal_length"].hist(ax=axis[1])
pl = iris_data_std["sepal_length"].hist(ax=axis[2])
Data visualization using principal component analysis¶
Visualization is a very useful tool in data analysis. Its main limitation is that only two- and three-dimensional graphs can be displayed, i.e. a single graph can show the relationships between at most 2-3 attributes at once (further attributes can be mapped to other visual aspects such as the color or size of the points).
If we want to visualize data with a larger number of attributes at once, we have to project the data set into a lower-dimensional, 2- or 3-dimensional space. One such projection method is Principal Component Analysis (PCA), which tries to find a projection of the attribute values that preserves as much of the mutual relationships in the original data as possible. PCA computes new attributes from the original ones, the so-called principal components, which are ordered by how much information from the original data they capture. For visualization, we compute and display the two most important components, so that as many of the original relationships in the data as possible are shown.
# import the PCA method from the `sklearn` library
from sklearn.decomposition import PCA
# we only calculate the two most important principal components
pca = PCA(n_components=2)
pca_array = pca.fit_transform(iris_data_norm)
# the result is a numerical `numpy` array, which we convert to a `pandas` data frame
# we name the columns `component_1` and `component_2`
iris_data_pca = pd.DataFrame(pca_array, columns=["component_1", "component_2"])
# let's see how the examples look after the projection; the values of the components can no longer be simply interpreted
iris_data_pca.head()
# let's see how much information (variance) in the original data is projected into individual components
pca.explained_variance_ratio_
# we calculate what percentage of the information is contained in total in the first two principal components that we visualize
print("{0:.4f}".format(np.sum(pca.explained_variance_ratio_) * 100))
# we connect the projected data with the original labels of the examples and display the data on the X-Y graph
# we draw the color according to the assignment to individual plant species
iris_pca = pd.concat([iris_data_pca, iris_labels], axis=1)
pl = sns.scatterplot(data=iris_pca, x="component_1", y="component_2", hue="species")
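To see how each original attribute contributes to the two displayed components, we can inspect the component loadings stored in the fitted PCA object (a small sketch):
# rows are the principal components, columns are the original attributes
pd.DataFrame(pca.components_, columns=iris_data_norm.columns, index=["component_1", "component_2"])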
Tasks¶
Task 3.1¶
Download the World Happiness Report 2016 and 2017 data.
- Select only the `Country`, `Score` and `Rank` attributes.
- Join the tables by country name so that all values from both tables are included. Rename the `Score` and `Rank` attributes appropriately according to the year.
- Add a new attribute `Diff` whose values will be equal to `Score 2017` - `Score 2016`.
- Display the histogram of the `Diff` attribute.
In 2017, are there more states that have improved or worsened?
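A possible starting sketch (the file names `2016.csv` and `2017.csv` and their column names are assumptions, based on the 2015/2016 files used above):
# load and reduce both years (file and column names are assumed)
d16 = pd.read_csv("../data/2016.csv")[["Country", "Rank", "Score"]]
d16 = d16.rename(columns={"Rank": "Rank 2016", "Score": "Score 2016"})
d17 = pd.read_csv("../data/2017.csv")[["Country", "Rank", "Score"]]
d17 = d17.rename(columns={"Rank": "Rank 2017", "Score": "Score 2017"})
# the outer join keeps all countries from both tables
both = pd.merge(left=d16, right=d17, on="Country", how="outer")
both["Diff"] = both["Score 2017"] - both["Score 2016"]
pl = both["Diff"].hist()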
Task 3.2¶
Download the World Happiness Report 2017 data.
- Select only the basic factors (`GDP`, `Family`, `Health`, `Freedom`, `Trust`, `Generosity`) in the table.
- Normalize the data to the interval 0-1.
- Add the names of the states and regions to the normalized table.
- Save the data to an Excel file.
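A possible sketch (the file name and the exact column names such as `GDP` or `Region` are assumptions; the real 2017 file may use longer column names):
# normalize the selected factors and add the identifying columns back (column names are assumed)
d17 = pd.read_csv("../data/2017.csv")
factors = d17[["GDP", "Family", "Health", "Freedom", "Trust", "Generosity"]]
factors_norm = pd.DataFrame(MinMaxScaler().fit_transform(factors), columns=factors.columns)
result = pd.concat([d17[["Country", "Region"]], factors_norm], axis=1)
result.to_excel("../data/2017_normalized.xlsx", index=False)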
Task 3.3¶
Load the preprocessed data from the previous task.
- Calculate the two principal components.
- Visualize the principal components on an X-Y graph, showing the region of the state as the color.
What can you say about Central Europe? How much of the total variance of the original data is preserved in the visualization?
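A possible sketch, continuing from the file written in Task 3.2 (the file name and the `Region` column are the same assumptions as above):
# project the normalized factors onto two principal components and color the points by region
prep = pd.read_excel("../data/2017_normalized.xlsx")
factors = prep.drop(columns=["Country", "Region"])
pca_2 = PCA(n_components=2)
components = pd.DataFrame(pca_2.fit_transform(factors), columns=["component_1", "component_2"])
plot_data = pd.concat([components, prep["Region"]], axis=1)
pl = sns.scatterplot(data=plot_data, x="component_1", y="component_2", hue="Region")
print(np.sum(pca_2.explained_variance_ratio_))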