Basic settings¶
When analyzing the data, we will use several libraries, which must be imported into your notebook at the beginning. The numpy library provides basic types such as arrays representing numerical vectors or matrices. When analyzing data, we will most often work with the pandas library, which builds on numpy and defines basic types for representing data tables and their columns. We will use the matplotlib and seaborn libraries to draw graphs that will be displayed directly as part of the notebook.
# import the required libraries; this cell should be run first, before the libraries are used
# the imported modules are available as the pd, np, plt and sns objects
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# set the display of graphs directly in the cells of the notebook
%matplotlib inline
# initialize the seaborn library
sns.set()
Dataset - World Happiness Report¶
The World Happiness Report is published annually by the United Nations and is based on questionnaire survey data collected from a representative sample of the population of individual countries. The questionnaire survey focuses on evaluating one's own life based on several criteria, which include economic indicators, social care, healthcare, freedom of decision, perception of kindness among people and perception of corruption.
The data available to you are evaluated for the years 2015, 2016 and 2017 and are described by the following attributes:
- Country - Name of the country
- Region - Geopolitical region of the country
- Rank - Overall ranking of countries for a given year, according to the achieved score
- Score - Achieved score expressing the overall happiness of the country
- GDP - Economic situation - gross domestic product per capita
- Family - Social support of the family
- Health - Health status - life expectancy
- Freedom - Freedom of expression and decision-making
- Trust - Perception of corruption
- Generosity - Perception of kindness
The overall rating includes the difference between the given country and Dystopia - the hypothetical country that has the lowest rating for all indicators (i.e. no country can be worse than Dystopia).
Loading data and basic statistics¶
The data table is represented in the pandas library by the type DataFrame (data frame), which is composed of individual columns represented by the type Series (data sequence). pandas directly supports reading and writing data in standard formats such as .csv (Comma-Separated Values - a text format with comma-separated values).
# load the data for the year 2015 from the file 2015.csv
data_2015 = pd.read_csv("../data/2015.csv")
# display the names of the columns (the columns property is an object of type pandas.Index, which represents the sequence
# of indexes for the rows or columns of the table; it is accessed similarly to a Python list)
data_2015.columns
The data is accessed by indexing, in the same way as Python lists or dictionaries. By default, the columns are indexed by their name and the values in the columns by the numerical index of the table row (starting from 0). For each numerical column, we can directly calculate basic statistics, such as the number of non-empty values, minimum, maximum, mean, standard deviation and quantiles.
# you can find out the size of the data table from the shape property
print(data_2015.shape[0]) # first dimension - number of rows
print(data_2015.shape[1]) # second dimension - number of columns
# you can access the data by indexing, e.g. to print the value of the 'Country' column on the first row of the table (index 0)
print(data_2015['Country'][0])
scores = data_2015['Score'] # scores is an object of type pandas.Series that represents one column of the table
# for numeric columns we can calculate basic statistics directly
score_count = scores.count() # number of non-empty values
score_mean = scores.mean() # average
score_std = scores.std() # standard deviation
score_min = scores.min() # minimum value
score_max = scores.max() # maximum value
score_q25 = scores.quantile(0.25) # first quartile (25th percentile)
score_q50 = scores.quantile(0.5) # median (50th percentile)
score_q75 = scores.quantile(0.75) # third quartile (75th percentile)
# statistics for all numerical attributes of the table can also be calculated at once using the describe() method
data_2015.describe()
# for categorical attributes we can display a list of different values and their frequency
# e.g. for 'Region'
data_2015['Region'].value_counts()
Data preparation¶
We will gradually load and modify the data for the years 2015, 2016 and 2017. From the table for the year 2015, we will remove the StdError column, which indicates the standard error of the score among the residents of one country; we will not use it in the analysis.
# columns are removed using the drop method
# by default, the drop method does not change the original data frame, but creates a new one, so we change the variable data_2015
# to the new value
data_2015 = data_2015.drop(columns="StdError")
# we add the 'Year' column to the table, whose value we set to 2015 for all rows
data_2015["Year"] = 2015
# using the head method, we display the first rows of the table (by default, the head method returns a new DataFrame object with the first
# 5 rows of the original table)
data_2015.head()
We load the data for the year 2016, from which we remove the columns LowerConfidence and UpperConfidence, and add the column Year with the value 2016 for all rows.
# load the data for the year 2016
data_2016 = pd.read_csv("../data/2016.csv")
# if we set the inplace parameter of the drop method to True, the columns are removed directly in the original frame instead
# of creating a new one; the columns parameter can also be given a list of several columns to be deleted at once
data_2016.drop(columns=["LowerConfidence", "UpperConfidence"], inplace=True)
# add the Year column
data_2016["Year"] = 2016
# you can specify the number of rows to display for the head method
data_2016.head(1) # print 1 row
From the print, you can notice that the table for 2016 does not contain the total score and ranking of the countries (columns Score and Rank). Since Score and Rank are calculated from the other indicators, we can simply compute them with the eval method. For the eval method, the expression for calculating the values is entered as a string, which can contain names of existing columns, constants (numbers, strings, etc.) and the operators +, -, *, /, **, %.
# we add a new column Score to the table, the value of which we calculate for each row as the sum of the attributes GDP, Family,
# Health, Freedom, Trust, Generosity and Dystopia
data_2016["Score"] = data_2016.eval("GDP + Family + Health + Freedom + Trust + Generosity + Dystopia")
# the Rank column is determined by the overall ranking of the countries sorted by score from the largest to the smallest
# we sort the rows of the table in descending order of the score (sort_values returns a new sorted frame by default,
# so we assign the result back to the data_2016 variable)
data_2016 = data_2016.sort_values(by="Score", ascending=False)
# for checking, we display the first 3 and the last 3 countries
data_2016.head(3)
# the tail method returns the last rows of the table
data_2016.tail(3)
# since the data is sorted in descending order by the total score, we set the values in the Rank column to
# the sequence of numbers 1, 2, ..., number of countries (number of rows in the table)
data_2016["Rank"] = range(1, data_2016.shape[0] + 1)
data_2016.head()
Task 1.1¶
- Load the data for the year 2017 from the file 2017.csv into the variable data_2017.
- Remove the unnecessary columns WhiskerHigh and WhiskerLow.
- Add a Year column and set its value to 2017 for all rows.
- Display the first row of the table.
data_2017 = None
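One possible solution (a sketch that mirrors the preparation of the 2016 data; fill it in place of the None placeholder above):
# load the data for the year 2017 from the file 2017.csv
data_2017 = pd.read_csv("../data/2017.csv")
# remove the unnecessary columns directly in the original frame
data_2017.drop(columns=["WhiskerHigh", "WhiskerLow"], inplace=True)
# add the Year column with the value 2017 for all rows
data_2017["Year"] = 2017
# display the first row of the table
data_2017.head(1)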
# using the concat method, we can combine the data for all years into one table; the parameter sort=False keeps
# the columns in their original order, and ignore_index=True replaces the original row indexes with a new continuous
# numbering that follows the order of the combined tables (i.e. first all 2015 rows, then 2016 and finally 2017)
data = pd.concat([data_2015, data_2016, data_2017], ignore_index=True, sort=False)
data.head(1)
Selection of data¶
It is possible to select only some columns from the table by name, or rows according to a specified condition. To select rows, you can use the query method, where the logical condition is entered simply as a string, similar to entering expressions in the eval method.
# we select only the Country, Rank, Score and Year columns
selected = data[["Country", "Rank", "Score", "Year"]]
# we will select only rows about Slovakia
selected.query("Country == 'Slovakia'")
# when filtering rows, you can enter a more complex condition with the operators <, >, <=, >=, !=, ==, in [list],
# not in [list] and with logical conjunctions and, or and not
# e.g. we select the rows about Slovakia's neighbors for 2017 and sort the result according to the overall ranking of the countries
q = "Country in ['Slovakia', 'Czech Republic', 'Poland', 'Hungary', 'Ukraine', 'Austria'] and Year == 2017"
selected.query(q).sort_values(by="Rank")
Dependencies between attributes¶
One of the basic tasks in the data understanding phase is to analyze the dependencies between pairs of attributes.
Dependencies between categorical and numerical attributes¶
Dependencies between categorical and numerical attributes can be examined e.g. using a pivot table that divides the data into groups according to categorical attribute values and summarizes for each group numerical attribute values using various aggregation functions (e.g. mean/min/max value, sample standard deviation, etc.)
We will create a pivot table using the pivot_table method, where, in addition to the data, we must enter at least one categorical attribute and one numeric one.
# e.g. to calculate the average score for all countries in a given year we can enter
pd.pivot_table(data, index="Year", values="Score")
# we can group data according to several categorical attributes at once,
# e.g. by region and year
pd.pivot_table(data, index=["Region", "Year"], values="Score")
# we can rearrange the pivot table by moving some categorical attributes from the rows of the table
# (parameter index) to columns (parameter columns)
# e.g. we can display the previous table more clearly as follows
table = pd.pivot_table(data, index="Region", columns="Year", values="Score")
table
# values of the pivot table can be directly displayed graphically, e.g. as a horizontal bar graph
table.plot(kind="barh")
# set the description of the x-axis
l = plt.xlabel("Average Happiness Score")
# you can calculate multiple aggregation functions in one table by setting the aggfunc parameter to a list of functions,
# e.g. to calculate the mean value and standard deviation of the score for each region:
pd.pivot_table(data, index="Region", values="Score", aggfunc=["mean", "std"])
# you can also calculate different aggregation functions for different numeric attributes at once
# e.g. in the following table we calculate the average value for the score (Score) and the minimum and maximum value
# for the rank (Rank) for each region
pd.pivot_table(data, index="Region", values=["Score", "Rank"], aggfunc={"Score": "mean", "Rank": ["min", "max"]})
Dependencies between numerical attributes - correlation¶
The basic type of dependence between two numerical attributes is a linear dependence, which can be expressed by a (Pearson) correlation coefficient.
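For two numerical attributes x and y with values $x_i$ and $y_i$ on $n$ rows, the Pearson correlation coefficient is defined as

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the mean values of the attributes. The coefficient ranges from -1 to 1; values close to ±1 indicate a strong linear dependence, while values close to 0 indicate no linear dependence.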
# we select only the numeric attributes
factors = data[["GDP", "Family", "Health", "Freedom", "Trust", "Generosity"]]
# we calculate the correlation table using the corr method
corr_table = factors.corr()
corr_table
We can display the values graphically using a heat map.
p = sns.heatmap(corr_table,
xticklabels=corr_table.columns, yticklabels=corr_table.columns, # label the axes with the column names
vmin=-1, vmax=1, # set the minimum and maximum value for the color palette
cmap='coolwarm', # change the preset color palette
square=True) # display square fields
The highest dependency is between the GDP and Health attributes. We will display the data on an X-Y graph, together with the regression line.
p = sns.lmplot(data=factors, x='GDP', y='Health', fit_reg=True)
You can display the dependence between all numerical attributes at once with a pair plot graph.
p = sns.pairplot(factors)
Dataset - Iris¶
The Iris dataset describes the flowers of the iris species Iris setosa, Iris virginica and Iris versicolor. The data was collected by the biologist Edgar Anderson and was first used in data analysis by the British statistician Ronald Fisher in 1936. The set contains 50 examples of each species, described by 4 attributes: the length and width of the petals (petal_length, petal_width) and the length and width of the sepals (sepal_length, sepal_width).
Task 1.2¶
- Load the data from the Iris dataset (file ../data/iris.csv) into the variable iris.
- Calculate basic statistics for the numeric attributes.
- For the nominal attribute species, display the different values and their frequencies.
- Determine if the dataset contains missing values.
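One possible solution (a sketch; the isna method, which was not used earlier in this notebook, returns a boolean mask of missing values that we can sum per column):
# load the Iris dataset
iris = pd.read_csv("../data/iris.csv")
# basic statistics of the numeric attributes
print(iris.describe())
# different values of the nominal attribute species and their frequencies
print(iris["species"].value_counts())
# number of missing values in each column (all zeros means no missing values)
print(iris.isna().sum())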
Task 1.3¶
Display a pair plot graph for the numeric attributes. Can you determine from the plot which attributes are most correlated?
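A sketch, assuming the iris variable loaded in Task 1.2:
# pairplot draws an X-Y graph for every pair of numeric attributes
p = sns.pairplot(iris)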
Task 1.4¶
Calculate the correlation matrix and display it using a heat map. Determine which two attributes are most and least correlated and display their X-Y plot along with the regression line.
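A sketch, assuming the iris variable from Task 1.2; the pair of attributes in the lmplot call is only illustrative and should be replaced by the pair read off from the correlation matrix:
# correlation matrix of the numeric attributes only
corr_iris = iris.drop(columns="species").corr()
# heat map of the correlation matrix
p = sns.heatmap(corr_iris, vmin=-1, vmax=1, cmap='coolwarm', square=True)
# X-Y plot with a regression line for the chosen pair of attributes
p = sns.lmplot(data=iris, x='petal_length', y='petal_width', fit_reg=True)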
Task 1.5¶
Using a pivot table, calculate the mean value and standard deviation of the sepal_length and sepal_width attributes grouped by the species attribute.
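A sketch, using the same pivot_table parameters as in the examples above and assuming the iris variable from Task 1.2:
# mean and standard deviation of the two attributes, grouped by species
pd.pivot_table(iris, index="species", values=["sepal_length", "sepal_width"], aggfunc=["mean", "std"])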
Task 1.6¶
- Filter from the dataset only the rows (examples) belonging to the class virginica.
- From the filtered rows, create a table with the attributes sepal_ratio = sepal_length / sepal_width and petal_ratio = petal_length / petal_width.
- Display the X-Y dependence between sepal_ratio and petal_ratio and calculate the correlation coefficient between them.
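One possible solution (a sketch, assuming the iris variable from Task 1.2; note that the Series type also has a corr method for computing the correlation of two individual columns, which was not shown earlier):
# filter the rows belonging to the virginica class
virginica = iris.query("species == 'virginica'")
# create a table with the two derived ratio attributes
ratios = pd.DataFrame({
    "sepal_ratio": virginica["sepal_length"] / virginica["sepal_width"],
    "petal_ratio": virginica["petal_length"] / virginica["petal_width"],
})
# display the X-Y dependence with a regression line
p = sns.lmplot(data=ratios, x="sepal_ratio", y="petal_ratio", fit_reg=True)
# calculate the correlation coefficient between the two ratios
print(ratios["sepal_ratio"].corr(ratios["petal_ratio"]))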