Visualizations using the Seaborn library - exercise 1¶
The following tasks demonstrate the capabilities of the Seaborn library in different ways of visualizing Pandas dataframe variables.
At the beginning, we import the Seaborn
library. We also import libraries for working with data frames Pandas
and Numpy
for numerical operations that we will use.
The command %matplotlib inline
sets the rendering of the visualizations within the front-end we use (in our case Jupyter notebooks) directly, below the part of the running code. The visualizations will then be part of the notebook.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# initialize the Seaborn library
sns.set()
Loading dataset¶
Seaborn allows you to use the built-in load_dataset()
function to load standard datasets from the library's examples repository. In this case, we do not load the dataset from disk, it is automatically downloaded from the repository and loaded as a standard data frame. This function can be used when analyzing standard data collections. The only parameter of this function is the name of the dataset, a complete list of datasets that can be loaded in this way can be found here.
In this tutorial we will work with the tips
dataset. This simple dataset contains data on the size of payments and tips in a restaurant. They are characterized by the following attributes:
- total_bill - amount of the entire bill
- tip - tip amount
- sex - the gender of the bill payer
- smoker - identifies whether the payer is a smoker or not
- day - day of the transaction
- time - specifies whether it was dinner or lunch
- size - the size of the group of people
tips = sns.load_dataset('tips') # we load the Tips dataset from the repository of standard datasets
tips.head() # we will have a data frame loaded in the tips variable
We can apply various visualizations provided by the Seaborn library to the Pandas data frame. Individual variables are referenced using the column name in the data frame.
Visualizations of value distribution - numerical attributes¶
The first group of visualizations for numerical attributes are value distribution diagrams (we will show histogram visualizations for categorical attributes later in countplot()
). To visualize them in the Seaborn library, we can use the value distribution visualization using the distplot()
function. It plots the distribution of the values of the selected attribute, where the values of the selected attribute are on the x-axis and their distribution is on the y-axis.
The only required parameter to the distplot()
function is an array or list of variable values. If we have data loaded in a data frame, we will specify the parameter as data_frame['column']
. The function then uses the column name for the x-axis name.
# with the following command, we draw a graph of the distribution of the values of the `total_bill` attribute
g = sns.distplot(tips['total_bill'])
After rendering the graph, we see the distribution of the values of the total_bill
attribute. Note that for rendering purposes the continuous attribute has been discretized. The function distplot()
implicitly automatically recalculates the appropriate number of intervals. In addition to the histogram itself, Seaborn implicitly plots an approximation of the density function of the distribution of values in addition to the abundance.
We can modify the rendering method itself by setting optional parameters. There are several of them:
bins
- the value of this parameter determines the number of intervals that will be used when transforming the continuous attribute during rendering. Takes on integer values.kde
- KDE (kernel density estimation), another of the extension parameters, which turns off or on the rendering of the density approximation of the distribution of values. Its default value isTrue
, we can optionally disable it by setting it toFalse
.rug
- display of data points on the x-axis (valuesTrue
andFalse
)hist
- histogram display (valuesTrue
andFalse
)
# we plot distplot() for the total_bill attribute from the tips data frame without displaying the approximation,
# we use 40 intervals for discretization
g = sns.distplot(tips['total_bill'],kde=False,bins=40)
With different combinations of parameters, we can visualize different combinations of ways of plotting the distribution of values of one variable.
# plot an approximation of the density of the distribution, with data points, without a histogram
g = sns.distplot(tips['total_bill'],kde=True, hist=False, rug=True)
If we want to plot the distribution of values using KDE only, we can use the kdeplot()
function.
Using Seaborn, we can also draw visualizations of value distributions for several attributes at the same time. They are then distinguished by color.
for col in ['tip', 'total_bill']: # we go through all the attributes we want to render in a simple loop
sns.kdeplot(tips[col]) # use kdeplot to plot the KDE curve
Visualizations of the dependence of two numerical variables¶
Scatter plots¶
To visualize the dependence of the values of two numerical attributes, we can use scatter plots using the ``scatterplot()'' function of the Seaborn library. The function has 3 standard parameters at the input:
x
- the variable that will be displayed on the x-axisy
- the variable that will be displayed on the y-axisdata
- data frame of input data
Both analyzed variables must be numeric. In these and the following functions, we already specify the variables only by the name of the column in the used data frame (it is also usually listed as one of the parameters). Again, Seaborn uses the name of the data frame to label the axes in the graph.
The output is a graph where individual points correspond to individual records in the data frame, or combination of the values of both monitored attributes.
The example below shows the association between two numeric attributes. In this case, let's examine the relationship between the tip size and the total_bill using a scatter plot.
# on the x-axis we plot the tip values, on the y-axis the total_bill values
# we use the tips data frame as the source data
g = sns.scatterplot(x='tip', y='total_bill',data=tips)
A scatter plot can also be combined with a categorical attribute. We can thus visually distinguish the values of the combination of two numerical attributes for the specific values of the selected categorical attribute.
We specify it in the scatterplot()
function using the hue
extension parameter. We will then set its value to a categorical attribute, which we will use for the color distinction of the points in the visualization.
Let's say that we want to distinguish the values of the tip
and total_bill
attribute combinations according to whether the guest was a smoker or not.
# let's draw a scatterplot as in the previous example, just use the hue parameter set to the 'smoker' attribute
# to distinguish the points by color
g = sns.scatterplot(x='tip', y='total_bill', hue='smoker', data=tips)
Other parameters with which we can modify the visualization of the dependency are:
style
- if we want to use the shape of points to distinguish individual values of the categorical attributesize
- if we want to use the size continuous attribute to distinguish points using the size of individual points
All the above methods can be combined with each other. We will then be able to plot the dependence of two numerical variables depending on several categorical or other continuous variables in a dot graph.
The example below enriches the previous graph by displaying the size of the group (using the size of the individual points - the size
parameter) and the gender distinction (using the different shape of the points - the style
parameter).
g = sns.scatterplot(x='tip', y='total_bill', hue='smoker', size='size', style='sex', data=tips)
Regression plots¶
The dot graph can also be supplemented with a visualization of the linear functional dependence of the investigated attributes. Such visualization is realized using regplot()
or lmplot()
functions. The function uses the same parameters as the scatterplot()
function, and the resulting dot plot is supplemented with a visualization of the regression model of the functional dependence of the variables.
In the example below, we visualize a linear dependency between the total_bill
and tip
attributes. In this case, the visualization makes sense - as the bill increases, the amount of the tip usually increases as well.
The parameters of the regplot()
function are:
x
- variable on the x-axisy
- variable on the y-axisdata
- data frame
Extending parameters allow you to set the degree of the function used for dependency modeling - a linear function is implicitly set, the `order' parameter specifies the degree of the polynomial.
# the example below shows a combination of the same variables as the previous examples
g = sns.regplot(x='tip',y='total_bill',data=tips)
Similar to the previous visualization methods, regplot
can also be supplemented with color resolution using the values of the selected categorical attribute. As in the previous examples, using the hue
parameter setting.
# analogous to the scatterplot example
# we distinguish the graph using the hue parameter according to the values of the 'smoker' attribute
g = sns.lmplot(x='tip',y='total_bill', hue='smoker', data=tips)
Combined visualizations - Joint plot¶
Visualization using the jointplot()
function allows in general to extend and combine visualizations of the characteristics of the variables themselves together with the visualization of the mutual characteristics of two variables. The parameters of this function are:
x
- the name of the first variabley
- the name of the second variabledata
- data frame of input datakind
- defines the type of visualization combination. Takes values:scatter
,hex
,reg
,kde
.
When it is set to the value scatter
, we get a dot plot like scatterplot()
combined with graphs about the distribution of the values of both variables (like distplot()
).
# the given example combines the visualization of the combination of the values of the two attributes tip and total_bill (as a scatterplot)
# it then supplements them with visualizations of value distributions (like distplot)
g = sns.jointplot(x='tip', y='total_bill',data=tips, kind='scatter')
Task 4.1¶
Try setting different values of the kind
parameter and compare the different rendering methods.
# YOUR CODE HERE
Combined visualizations - Pair graph¶
The paired graph allows you to visualize mutual combinations of all numerical attributes across the entire data frame at once. The result is a matrix of graphs where, outside the diagonals, we display visualizations of mutual combinations of individual attributes. The value distributions of the given attribute are implicitly displayed on the diagonal.
The pairplot()
function requires only one mandatory parameter - the source data frame.
If we want to draw a pair diagram only for the selected columns, we can specify them beforehand in an array that we pass to the function as a parameter.
col = ['tip', 'total_bill', 'size'] # we specify the columns for which we want to draw a graph
g = sns.pairplot(tips[col]) # we call the pairplot function with a parameter
Just like in the previous examples, we can add a color distinction to the pair plot according to the values of the selected categorical attribute by setting the hue
parameter.
Task 4.2¶
In a similar way as in the previous tasks, try to display a pair graph for 3 numerical attributes of the tips
dataset, but differentiated according to the values of the attribute describing gender (sex
).
# YOUR CODE HERE
Visualization of distribution of distribution of values - categorical attributes¶
We can use several types of visualizations to visualize categorical variables.
Among the simplest is the visualization of histograms - the frequency of different values of categorical attributes. We can use the countplot()
function to visualize it.
The function has 2 mandatory parameters at the input, which will make it possible to create basic histograms for one attribute:
x
ory
- column of the data frame visualized on the x or y axis. We choose according to whether we want to draw the graph horizontally or vertically.data
- source data frame
The example below shows the frequency of smoker
attribute values.
g = sns.countplot(x='smoker', data=tips)
The visualization shows the frequency of individual values of the given attribute. We can specify expanding parameters to the function:
hue
- color differentiation using the values of the specified categorical attributepalette
- defines a color set (egrainbow
,Set1
,Blues
orcoolwarm
)
Task 4.3¶
Try the countplot()
visualization by showing the number of smokers also by gender. Draw individual columns horizontally.
### YOUR CODE HERE
Visualizations of interdependence of 2 variables of different types¶
Bar chart¶
countplot
is essentially a simplified form of a barplot
graph, which generally allows displaying the aggregated values of a selected numerical attribute according to the values of a categorical variable (by default it is set as the mean function). The visualization is thus made up of columns whose height represents the average value of the selected attribute. In addition, this information is supplemented by the visualization of the standard deviation.
The barplot()
function has several required parameters:
x
- the name of the categorical variabley
- the name of a numeric variabledata
- data frame of input data
We can further specify rendering with extension parameters. For example:
hue
- color differentiation using the values of the specified categorical attributeestimator
- can explicitly define the estimator function (implicitly the average is set), but we can also use e.g.median
,ci
- explicitly defines the deviation of the estimator (implicitly the standard deviation is set, if it is set toNone
, the deviation is not plotted).palette
- defines a color setorder
- allows you to set the order in which the columns will be displayed, e.g. ['Attribute2', 'Attribute1']
g = sns.barplot(x='sex',y='tip',data=tips) # this command visualizes the average tip amount for men and women
Task 4.4.¶
Plot the median tip amounts for each day of the week, divided by gender.
Note - to recalculate the median, you need to import the median
function from numpy
# YOUR CODE HERE
Box plot¶
We can use boxplot
to display the distribution of values for categorical data. Visualizes the distribution of numeric variables across different categorical attribute values and displays minimum/maximum values, median, and quartiles. Points plotted outside the "box" (so-called fliers) represent outliers.
The boxplot()
function has the same mandatory input parameters as the previous functions:
x
- the name of the categorical variabley
- the name of a numeric variabledata
- data frame of input data
# the code below plots the distribution of tip amount (tip attribute) by gender.
g = sns.boxplot(x="sex", y="tip", data=tips)
We can also modify and expand boxplot
visualizations, similarly to the previous examples. Similarly, we can define:
hue
- color differentiation using the values of the specified categorical attributeorder
- allows you to set the order in which the columns will be displayed, e.g. ['Attribute2', 'Attribute1']palette
- defines a color set
# this example splits the visualization by the ``smoker'' attribute for smokers/non-smokers.
g = sns.boxplot(x="day", y="tip", hue="smoker",data=tips)
Task 4.5¶
Try to draw the last box plot also with this extension parameters palette
and order
. Try setting the first attribute, e.g. to rainbow
, Set1
, Blues
or coolwarm
. Plot the days in reverse order on the x-axis.
# YOUR CODE HERE
You can also try using the extension parameter palette
for other visualizations (except distplot()
graph).
# YOUR CODE HERE
Scatter plots for categorical variables¶
Strip plot¶
The stripplot()
function can be used to plot a dot plot (like a scatterplot
) but in the case when one of the variables is categorical. The resulting graph then visualizes the records and distribution of values for individual categorical values.
The mandatory parameters are:
x
- the name of the categorical variabley
- the name of a numeric variabledata
- data frame of input data
# the same example from the previous demos - on the x-axis we plot the value of the 'day' attribute
# on the y-axis of the 'tip' attribute value, the source dataset is 'tips'
g = sns.stripplot(x="day", y="tip", data=tips)
Similarly, as in the previous cases, these visualizations can also be specified using the hue
parameter according to the values of the selected categorical variable.
Another of the expanding parameters is jitter
. By setting the value of this parameter, we can specify the rendering density.
We can also use the palette
parameter to define the used color palette.
# this graph visualizes the amount of tips (tip) for men and women by individual days of the week (attribute day)
# the density of rendering points (jitter) is set manually to 0.3 and the color palette is also set to coolwarm
g = sns.stripplot(x="day", y="tip", hue="sex",data=tips, jitter = 0.3, palette = 'coolwarm')
Try setting the jitter
parameter to different values in the interval (0,1) and see how it affects the visualization.
Swarm plot¶
The swarmplot()
function plots a similar type of plot to stripplot()
. Its use is therefore the same - for plotting a point graph, where one of the variables is categorical. The difference is only in the rendering itself, where individual points do not overlap. Such a method can then provide a better view of the distribution of values, on the other hand, it may not be suitable for a large number of records.
Task 4.6¶
Now try the same visualization as in the previous example using the swarmplot
function. The same way of specifying input parameters is used - data column for the x
and y
axis, data
for the source data frame, hue
for the categorical attribute and palette
for colors.
# YOUR CODE HERE
We can combine visualizations from the previous examples and create combined graphs. The example below combines swarmplot
and violinplot
into one visualization.
Combined visualizations - Facet grids¶
Facet Grids
allow multiple visualizations to be rendered into a grid at once. Using the FacetGrid()
function, we can create a grid that we can fill with various visualizations.
Facet histograms¶
The example below demonstrates the use of the function to draw multiple histograms in parallel depending on different combinations of attribute values.
Let's say we are interested in knowing how much of the total bill men and women leave during dinner or lunch (the time
attribute). To visualize such a histogram, we first create a new attribute describing the proportion of tips on the total bill and call it tips_pct
.
Then we draw a grid, where we specify the individual rows with the row
parameter and the columns with the col
parameter. Their values are categorical attributes for which individual graphs will be visualized in individual columns.
Then, using the map()
function, we apply a matplotlib histogram (the plt.hist
function) for the selected attribute (in this case, the tip_pct
tip proportion).
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill'] # we will create a new column with the value of the tip share on the bill
g = sns.FacetGrid(tips, row="sex", col="time") # we will create a grid where there will be rows according to the value of gender and columns according to time
g.map(plt.hist, "tip_pct") # plot matplotlib histograms in a grid
Plotting dependence with a third categorical variable¶
By using the catplot()
function, we can draw the visualization of graphs of the mentioned types (for categorical attributes) also depending on another additional categorical variable. The function will make it possible to draw several diagrams of the given type in parallel, each for one of the values of the next categorical variable.
The catplot()
function is actually a bar graph (the same as in the barplot()
function), but it allows drawing to the FacetGrid
automatically. We can therefore directly specify the col
parameter in the function, which, according to the values of the specified attribute in the individual columns, will draw bar graphs for the individual values of the given attribute.
As an example, we can use a visualization of the amount of tips for men and women according to individual days of the week, divided according to whether they are smokers or not. We define the third categorical variable using the col
parameter. The kind
parameter then defines the rendering style (we can use bar
, strip
, point
, swarm
). If we use the count
value, we can plot the number of attribute values on the x-axis on the y-axis.
g = sns.catplot(x="day", y="tip", hue="sex", col="smoker", data=tips, kind="bar")
Heat maps¶
Heatmaps can be used to clearly plot correlations. We calculate the correlation matrix of the numerical attributes in the data frame using tips.corr()
.
We can write the values of the correlation coefficients themselves into a table by default, or we can use the function heatmap()
from the Seaborn library, which will allow us to visualize the correlation table specified as its parameter in color. The individual color shades then indicate the strength of the mutual correlation.
# rendering a heatmap for the correlation table of the tips framework
# tips.corr() function calculates the correlations of all numeric attributes of the tips data frame
g = sns.heatmap(tips.corr())
This is what the heatmap visualization looks like with default settings. As in other visualizations, here too we can specify several ways to modify the display. For example displaying correlation coefficients according to the annot
parameter, which is also set to the value True
, will plot the values of the correlation coefficients in the corresponding fields. We can also define a color palette using the cmap
(colormap) parameter.
g = sns.heatmap(tips.corr(),cmap='coolwarm',annot=True)
With the mask
parameter, we can specify the fields that are "masked". We can thus mask and not render e.g. fields with zero values, or, for example, in the case of correlation matrices, the part above the diagonal.
mask = np.zeros_like(tips.corr(), dtype=np.bool)
mask[np.triu_indices_from(tips.corr())] = True # triu_indices_from() returns the indices of the upper triangle from the input field, the mask is set to True on them
g = sns.heatmap(tips.corr(), mask=mask, annot=True, square=True) # a heatmap with a mask is drawn
Heatmaps can of course be used to display pivot tables
. We will use the table as an input parameter and, similarly to the previous examples, we will use other parameters to set the display of the coefficients of individual fields. With the cbar
parameter, we can set the color scale on/off.
heatmap_data = pd.pivot_table(tips, values='total_bill', index=['size'], columns='day') # we will create a pivot table - values of the total account according to the size of the group and the day
g = sns.heatmap(heatmap_data, annot=True, cmap="YlGnBu", cbar=False) # we draw the heatmap
The use of a heatmap for visualizing missing values is interesting. This way we can quickly get an idea of which attributes the missing values are in and how many there are. To demonstrate the usage, we will load the Titanic dataset, as there are no missing values in the Tips dataset.
titanic = sns.load_dataset('titanic') # load the Titanic dataset from the standard datasets repository
g = sns.heatmap(titanic.isnull(), cbar = False) # we draw a heatmap for those elements of the data frame that are missing, we don't draw the bar
Style and rendering settings¶
In this section, we will provide some examples and demonstrations of how to better customize the rendering of visualizations. The previous examples used the default preset style of the Seaborn library, and we controlled some of the rendering aspects directly for individual graphs using the palette
parameter. Seaborn allows basic style settings to be implemented both for individual graphs and to define the style uniformly for all used visualizations.
Below are some examples of how some aspects of the visualization can be customized.
We will show individual style adjustments using the distplot
example from the beginning of the exercise.
g = sns.distplot(tips['tip'])
The implicitly rendered chart is rendered with a border, white background. Using the set()
and set_style()
functions, we can define a style that will then be used for all graphs in the script/notebook. The style then specifies the basic elements of the visualizations - the color of the axes, the background of the graph and other elements. It is usually set at the beginning, when initializing Seaborn.
The set()
function allows you to set various aspects:
style
- used image style (see styles below)xlabel
,ylable
- axis labelspalette
- used color palette - similar to what we specified for individual graphs, we can define it for the entire notebook, or different contextfont
- specification of the font of the font in the diagramsfont_scale
- font size
The set_style()
function defines one specific aspect of the visualization - the style of the visualizations themselves. It covers the style
part of setting all aspects of visualization. We can set the style for the document separately. The parameter values are:
darkgrid
whitegrid
dark
white
ticks
If we set the style using set()
or set_style()
, the changes will be reflected in the whole context (eg document). If we want to apply the style only locally, to one of the images, we can use the function axes_style('darkgrid')
as in the tasks below.
Task 4.7¶
Try and compare the graph visualizations of the distribution of tip
attribute values for different values of the function's input parameter.
with sns.axes_style('darkgrid'): # we apply the style using the axes_style function
g = sns.distplot(tips['tip']) # draw the graph
Using the despine()
function, we can remove the frame around the diagram, or define its scope with the display method.
g = sns.distplot(tips['tip'])
g = sns.despine()
despine()
also allows you to change the rendering of the axes, their spacing, or trimming, using the offset
and trim
parameters. We can also specify which of the boundaries we want to remove (with parameters left
, right
, bottom
, top
).
g = sns.distplot(tips['tip'])
g = sns.despine(offset=5, trim=True, left=True, bottom=False)
The set_context()
function will allow you to scale the visualization appropriately for the given purpose. Depending on the different values of its parameter, the size of axis labels, legends, line thickness, or axis scale is adjusted. The parameter values of this function are paper
, notebook
, talk
and poster
. Try the differences in rendering.
We can also add other aspects to the context, e.g. font size (font_scale
),
sns.set_context('talk', font_scale = 1.3)
g = sns.distplot(tips['tip'])
We can specify axis labels using matplotlib
, or also in Seaborn using the set()
function. It has 2 parameters for describing the x- and y-axis.
with sns.axes_style('whitegrid'):
g = sns.distplot(tips['tip'])
g.set(xlabel='deň', ylabel='distribúcia')
Setting the axes or image captions can also be implemented by combining Seaborn and matplotlib rendering. The example below demonstrates the rendering of a Seaborn graph and the subsequent addition of a caption using matplotlib.
g = sns.distplot(tips['tip'])
# if we want, we can combine Seaborn with matplotlib - e.g. we will use the title() function to render the graph header
g = plt.title('Distribution plot example', fontsize=14, fontweight='bold')
We have already tried setting the color palette for some graphs using a separate parameter during rendering. The palette can be set for the entire document with the set_palette()
command. We will show you several color palettes that can be used.
sns.set_palette("Dark2")
g = sns.distplot(tips['tip'])
Most color palettes that can be used are presented here: