Data visualization, pt. 2 (seaborn)

Data visualization, pt. 2 (`seaborn`)#

Goals of this lecture#

Introducing seaborn.
Putting seaborn into practice:
- Univariate plots (histograms).
- Bivariate continuous plots (scatterplots and line plots).
- Bivariate categorical plots (bar plots, box plots, and strip plots).

Introducing `seaborn`#

What is `seaborn`?#

seaborn is a data visualization library based on matplotlib.

In general, it’s easier to make nice-looking graphs with seaborn.
The trade-off is that matplotlib offers more flexibility.

import seaborn as sns ### importing seaborn
import pandas as pd
import matplotlib.pyplot as plt ## just in case we need it
import numpy as np

%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
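Because seaborn is built on matplotlib, the two can be mixed freely. The sketch below is only an illustration: `df_toy` and its values are made up, and the `Agg` backend line is there so the snippet also runs as a standalone script (in a notebook it can be dropped). It shows that an axes-level seaborn function hands back an ordinary matplotlib `Axes`, which can then be customized with matplotlib methods.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not part of the lecture datasets
df_toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 5]})

fig, ax = plt.subplots()

# seaborn draws the plot and returns the matplotlib Axes it drew on
ax = sns.scatterplot(data=df_toy, x="x", y="y", ax=ax)

# From here on, plain matplotlib calls work as usual
ax.set_title("A seaborn plot, customized with matplotlib")
ax.set_xlabel("x (toy units)")
```

This is one way to get seaborn's defaults while keeping matplotlib's flexibility for fine-tuning.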

The `seaborn` hierarchy of plot types#

We’ll learn more about exactly what this hierarchy means today (and in the next lecture).

[Figure: the seaborn hierarchy of plot types]

Example dataset#

Today we’ll work with a new dataset, from Gapminder.

Gapminder is an independent Swedish foundation dedicated to publishing and analyzing data to correct misconceptions about the world.
The dataset covers 1952 to 2007 and records life expectancy (life_exp), GDP per capita (gdp_cap), and population for each country.

df_gapminder = pd.read_csv("data/viz/gapminder_full.csv")

df_gapminder.head(2)

	country	year	population	continent	life_exp	gdp_cap
0	Afghanistan	1952	8425333	Asia	28.801	779.445314
1	Afghanistan	1957	9240934	Asia	30.332	820.853030

df_gapminder.shape

(1704, 6)

Univariate plots#

A univariate plot is a visualization of only a single variable, i.e., a distribution.

Check-in#

How would you make a histogram showing the distribution of population values in 2007 alone?

Bonus 1: Modify this graph to show probability, not count.
Bonus 2: What do you notice about this graph, and how might you change it?

### Your code here

Solution (pt. 1)#

### histogram of 2007 population, shown as probability (Bonus 1)
sns.histplot(data = df_gapminder[df_gapminder['year']==2007], x = 'population', stat = 'probability')

<Axes: xlabel='population', ylabel='Probability'>

../_images/77dc1df6bd2fa2e107153cc1bca9b481e74dde28c26f58a4c2cad03c803ed77d.png

Solution (pt. 2)#

The distribution is extremely right-skewed: most countries have small populations, and a few have very large ones. One option is to log-transform population before plotting.

df_gapminder['pop_log'] = np.log10(df_gapminder['population'])
sns.histplot(data = df_gapminder, x = 'pop_log', stat = 'probability')
plt.xlabel("Population (Log 10)")

Text(0.5, 0, 'Population (Log 10)')

../_images/025f39969f1cf80559ac89442ab58366000d34021dc0a6195ecfc4118362d556.png

Solution (pt. 3) using `log_scale`#

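Rather than transforming the data ourselves, we can pass log_scale = True to sns.histplot, applied to the original population column (not pop_log, which is already log-transformed).

sns.histplot(data = df_gapminder, x = 'population', stat = 'probability', log_scale = True)

<Axes: xlabel='population', ylabel='Probability'>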

../_images/72ec35b87f631b9212bddf3d00bd7b72a7f4ab773be9d0edc0b7a19ac9272c74.png

Bivariate continuous plots#

A bivariate continuous plot visualizes the relationship between two continuous variables.

Using `relplot`#

relplot lets you draw either a line plot or a scatter plot, selected with the kind parameter (kind = "line" or kind = "scatter").
relplot also makes it easier to facet (which we’ll discuss momentarily).

sns.relplot(data = df_gapminder, x = "year", y = "life_exp", kind = "line")

/Users/seantrott/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)

<seaborn.axisgrid.FacetGrid at 0x1690401d0>

../_images/f6f44ed18a37d6c8bd387c2611db485456e921408194dcecd8fb6fe6df243f08.png
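For contrast with the line plot above, this sketch uses `kind = "scatter"` on a few made-up rows (`df_toy` is hypothetical; only the column names are borrowed from the Gapminder data). It also illustrates that `relplot` returns a `FacetGrid` rather than an `Axes`, so labels are set through the grid.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed outside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy data using the lecture's column names
df_toy = pd.DataFrame({
    "year": [1952, 1967, 1982, 1997, 2007],
    "life_exp": [40.1, 48.3, 55.0, 61.2, 65.4],
})

# kind="scatter" draws one point per row instead of a line
g = sns.relplot(data=df_toy, x="year", y="life_exp", kind="scatter")

# relplot is figure-level: customize through the returned FacetGrid
g.set_axis_labels("Year", "Life expectancy")
```

Switching between the two plot types only requires changing `kind`; the rest of the call stays the same.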

Faceting into `rows` and `cols`#

We can also plot the same relationship across multiple “windows” or facets by setting the row and/or col parameter to a categorical variable.

sns.relplot(data = df_gapminder, x = "year", y = "life_exp", kind = "line", col = "continent")

<seaborn.axisgrid.FacetGrid at 0x1692edb10>

../_images/18c3c84b154108fbd49ee6f52d526df694ff7fbf169ac2ca0b0f0a84ab870db7.png

Bivariate categorical plots#

A bivariate categorical plot visualizes the relationship between one categorical variable and one continuous variable.

Example dataset#

Here, we’ll return to our Pokemon dataset, which has more examples of categorical variables.

df_pokemon = pd.read_csv("data/pokemon.csv")

Using `catplot`#

seaborn.catplot is a convenient function for plotting bivariate categorical data using a range of plot types (bar, box, strip).

sns.catplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", kind = "bar")

<seaborn.axisgrid.FacetGrid at 0x16878f9d0>

../_images/24c41529817a3e92b7b56d1f17f8a21dea7344266f7e430c82f0cf53fe2b406f.png
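Since the plot type is just a parameter, switching between views of the same data is cheap. The sketch below loops over three values of `kind` on a small made-up DataFrame (`df_toy` is hypothetical; it only reuses the `Legendary` and `Attack` column names).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed outside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy data using the lecture's column names
df_toy = pd.DataFrame({
    "Legendary": [False, False, False, False, True, True, True, True],
    "Attack":    [45, 60, 52, 80, 100, 120, 110, 134],
})

# The same call produces different plot types depending on kind
grids = {}
for kind in ["bar", "strip", "box"]:
    grids[kind] = sns.catplot(
        data=df_toy, x="Legendary", y="Attack", kind=kind
    )
    grids[kind].set_axis_labels("Legendary?", "Attack")
```

Each call creates its own figure, so in a notebook this would display three separate plots.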

`strip` plots#

A strip plot shows each individual point (like a scatterplot), divided by a category label.

sns.catplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", kind = "strip", alpha = .5)

<seaborn.axisgrid.FacetGrid at 0x16c6f0d50>

../_images/f5508e395bf0194068a8bc1c68ca559e446f70587943cfb513e9dde386b373a5.png

Adding a `mean` to our `strip` plot#

We can layer two plots on the same axes, showing both the individual points and the group means: the second call draws on top of the figure created by the first.

sns.catplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", kind = "strip", alpha = .1)
sns.pointplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", hue = "Legendary")

<Axes: xlabel='Legendary', ylabel='Attack'>

../_images/0a6f41567d3800770d8f2e907176f748d5080c1cabf03ebefeab779b1024940f.png
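An alternative way to layer plots, sketched here under the assumption that we prefer to be explicit about where each layer goes, is to create the axes ourselves and pass it to two axes-level functions via `ax`. The data (`df_toy`) are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data using the lecture's column names
df_toy = pd.DataFrame({
    "Legendary": [False, False, False, False, True, True, True, True],
    "Attack":    [45, 60, 52, 80, 100, 120, 110, 134],
})

fig, ax = plt.subplots()

# Layer 1: the individual observations, made faint with alpha
sns.stripplot(data=df_toy, x="Legendary", y="Attack", alpha=0.3, ax=ax)

# Layer 2: the mean of each group (with an error bar), drawn on the same axes
sns.pointplot(data=df_toy, x="Legendary", y="Attack", color="black", ax=ax)
```

Passing `ax` explicitly avoids relying on which axes happens to be "current" when the second function is called.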

`box` plots#

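A box plot shows the median and the interquartile range (the middle 50% of the data). By default, the whiskers extend to the most extreme points within 1.5 times the interquartile range of the box, and any points beyond that are drawn individually as outliers.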

sns.catplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", kind = "box")

<seaborn.axisgrid.FacetGrid at 0x16c696c90>

../_images/42bc8584ef9ef0912def4b86eb206c1d6be729e23304e6be886e74d96bc27ba6.png
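To make the pieces of a box plot concrete, the sketch below computes them by hand with numpy for a made-up set of values (`attack` is hypothetical). It assumes the default whisker rule of 1.5 times the interquartile range and numpy's default (linear) percentile interpolation.

```python
import numpy as np

# Hypothetical Attack values, with one unusually large observation
attack = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 200])

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = np.percentile(attack, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = attack[(attack >= lower_fence) & (attack <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences would be drawn as an individual point
outliers = attack[(attack < lower_fence) | (attack > upper_fence)]

print(q1, median, q3)             # 32.5 55.0 77.5
print(whisker_low, whisker_high)  # 10 90
print(outliers)                   # [200]
```

In this example the upper whisker stops at 90, not at the maximum of 200, because 200 lies above the upper fence of 145.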

Conclusion#

As with our lecture on pyplot, this just scratches the surface.

But now, you’ve had an introduction to:

The seaborn package.
Plotting both univariate and bivariate data.
Creating plots with multiple layers.
