Data visualization, pt. 2 (seaborn)#

Goals of this lecture#

  • Introducing seaborn.

  • Putting seaborn into practice:

    • Univariate plots (histograms).

    • Bivariate continuous plots (scatterplots and line plots).

    • Bivariate categorical plots (bar plots, box plots, and strip plots).

Introducing seaborn#

What is seaborn?#

seaborn is a data visualization library based on matplotlib.

  • In general, it’s easier to make nice-looking graphs with seaborn.

  • The trade-off is that matplotlib offers more flexibility.

import seaborn as sns ### importing seaborn
import pandas as pd
import matplotlib.pyplot as plt ## just in case we need it
import numpy as np
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'

The seaborn hierarchy of plot types#

We’ll learn more about exactly what this hierarchy means today (and in the next lecture).


Example dataset#

Today we’ll work with a new dataset, from Gapminder.

  • Gapminder is an independent Swedish foundation dedicated to publishing and analyzing data to correct misconceptions about the world.

  • The dataset covers the years 1952 to 2007 and includes life_exp, gdp_cap, and population for each country.

df_gapminder = pd.read_csv("data/viz/gapminder_full.csv")
df_gapminder.head(2)
country year population continent life_exp gdp_cap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
df_gapminder.shape
(1704, 6)

Univariate plots#

A univariate plot is a visualization of only a single variable, i.e., a distribution.


Histograms with sns.histplot#

  • We’ve produced histograms with plt.hist.

  • With seaborn, we can use sns.histplot(...).

Rather than use df['col_name'], we can use the syntax:

sns.histplot(data = df, x = 'col_name')

This will become even more useful when we start making bivariate plots.

# Histogram of life expectancy
sns.histplot(df_gapminder['life_exp'])
<Axes: xlabel='life_exp', ylabel='Count'>
../_images/fafae383153215dd4be6faeaa14bbfa08f4608a1accaa8b9447bf3f7b680587c.png

Modifying the number of bins#

As with plt.hist, we can modify the number of bins.

# Fewer bins
sns.histplot(data = df_gapminder, x = 'life_exp', bins = 10, alpha = .6)
<Axes: xlabel='life_exp', ylabel='Count'>
../_images/ca3456de3de6b94349ddd679d06164b561585882eeeba00f639267688723f886.png
# Many more bins!
sns.histplot(data = df_gapminder, x = 'life_exp', bins = 100, alpha = .6)
<Axes: xlabel='life_exp', ylabel='Count'>
../_images/460b459d138d506c045ff537dc2b2579bd6204befe72716ea985a707b47f6fd1.png

Modifying the y-axis with stat#

By default, sns.histplot will plot the count in each bin. However, we can change this using the stat parameter:

  • probability: normalize such that bar heights sum to 1.

  • percent: normalize such that bar heights sum to 100.

  • density: normalize such that total area sums to 1.

# Note the modified y-axis!
sns.histplot(data = df_gapminder, x = 'life_exp', stat = "probability", alpha = .6)
<Axes: xlabel='life_exp', ylabel='Probability'>
../_images/74b812bba55bdb821f51a5d7db46d32a4dfc3b0ff0bdbcfd1636863d6e14fd3b.png

Check-in#

How would you make a histogram showing the distribution of population values in 2007 alone?

  • Bonus 1: Modify this graph to show probability, not count.

  • Bonus 2: What do you notice about this graph, and how might you change it?

### Your code here

Solution (pt. 1)#

### original graph
sns.histplot(data = df_gapminder[df_gapminder['year']==2007], x = 'population', stat = 'probability')
<Axes: xlabel='population', ylabel='Probability'>
../_images/77dc1df6bd2fa2e107153cc1bca9b481e74dde28c26f58a4c2cad03c803ed77d.png

Solution (pt. 2)#

The distribution is extremely right-skewed. One option is to log-transform the population variable before plotting.

df_gapminder['pop_log'] = np.log10(df_gapminder['population'])
sns.histplot(data = df_gapminder, x = 'pop_log', stat = 'probability')
plt.xlabel("Population (Log 10)")
Text(0.5, 0, 'Population (Log 10)')
../_images/025f39969f1cf80559ac89442ab58366000d34021dc0a6195ecfc4118362d556.png

Solution (pt. 3) using log_scale#

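Rather than transforming the data ourselves, we can pass log_scale = True to sns.histplot. Note that x should then be the original population column, not pop_log (otherwise the log transform is applied twice).

sns.histplot(data = df_gapminder, x = 'population', stat = 'probability', log_scale = True)
<Axes: xlabel='population', ylabel='Probability'>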
../_images/72ec35b87f631b9212bddf3d00bd7b72a7f4ab773be9d0edc0b7a19ac9272c74.png

Bivariate continuous plots#

A bivariate continuous plot visualizes the relationship between two continuous variables.


Scatterplots with sns.scatterplot#

A scatterplot visualizes the relationship between two continuous variables.

  • Each observation is plotted as a single dot/mark.

  • The position on the (x, y) axes reflects the value of those variables.

One way to make a scatterplot in seaborn is using sns.scatterplot.

Showing gdp_cap by life_exp#

What do we notice about gdp_cap?

sns.scatterplot(data = df_gapminder, x = 'gdp_cap',
               y = 'life_exp', alpha = .3)
<Axes: xlabel='gdp_cap', ylabel='life_exp'>
../_images/bd0b2965db38376763829b793d9b399758c213ca29263a72d3e1481956c1fedf.png

Showing gdp_cap_log by life_exp#

## Log GDP
df_gapminder['gdp_cap_log'] = np.log10(df_gapminder['gdp_cap']) 
## Show log GDP by life exp
sns.scatterplot(data = df_gapminder, x = 'gdp_cap_log', y = 'life_exp', alpha = .3)
<Axes: xlabel='gdp_cap_log', ylabel='life_exp'>
../_images/c0c2c856cf6bce2fa0f59fb2adfe74afbe32fbd13d9376aacfe2273a3b7ed48b.png

Adding a hue#

  • What if we want to add a third component that’s categorical, like continent?

  • seaborn allows us to do this with hue.

## Log GDP
df_gapminder['gdp_cap_log'] = np.log10(df_gapminder['gdp_cap']) 
## Show log GDP by life exp
sns.scatterplot(data = df_gapminder[df_gapminder['year'] == 2007],
               x = 'gdp_cap_log', y = 'life_exp', hue = "continent", alpha = .7)
<Axes: xlabel='gdp_cap_log', ylabel='life_exp'>
../_images/cac8fb31c7167dd92aa17f6ce0001ac6c6dd9e27f371ef154baca740a6d4bbac.png

Adding a size#

  • What if we want to add a fourth component that’s continuous, like population?

  • seaborn allows us to do this with size.

## Log GDP
df_gapminder['gdp_cap_log'] = np.log10(df_gapminder['gdp_cap']) 
## Show log GDP by life exp
sns.scatterplot(data = df_gapminder[df_gapminder['year'] == 2007],
               x = 'gdp_cap_log', y = 'life_exp',
                hue = "continent", size = 'population', alpha = .7)
<Axes: xlabel='gdp_cap_log', ylabel='life_exp'>
../_images/ae59721b13cc3693fa077181c6037503cda0ea80e0af2a0440ec4f8d9998d24e.png

Changing the position of the legend#

## Show log GDP by life exp
sns.scatterplot(data = df_gapminder[df_gapminder['year'] == 2007],
               x = 'gdp_cap_log', y = 'life_exp',
                hue = "continent", size = 'population', alpha = .7)

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
<matplotlib.legend.Legend at 0x168779910>
../_images/69675be037b7381b1a05c740dd18099d85e482d3401cf7109c16d676e82be61a.png
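The sketch below combines hue, size, and legend placement on a tiny made-up table (the values in df_toy are invented for illustration). As an alternative to plt.legend, it uses sns.move_legend, which repositions the legend that seaborn already created.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; these numbers are not from Gapminder
df_toy = pd.DataFrame({
    "gdp_cap_log": [2.8, 3.1, 3.9, 4.4, 4.6, 3.5],
    "life_exp": [48.0, 55.0, 68.0, 78.0, 81.0, 71.0],
    "continent": ["Africa", "Africa", "Asia", "Europe", "Europe", "Asia"],
    "population": [1.2e7, 3.0e7, 1.3e9, 6.0e7, 8.0e6, 2.0e8],
})

fig, ax = plt.subplots()
sns.scatterplot(data=df_toy, x="gdp_cap_log", y="life_exp",
                hue="continent",    # third variable: color (categorical)
                size="population",  # fourth variable: marker size (continuous)
                alpha=.7, ax=ax)

# Move the existing legend outside the plotting area
sns.move_legend(ax, "upper left", bbox_to_anchor=(1.05, 1))

legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```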

Lineplots with sns.lineplot#

A lineplot also visualizes the relationship between two continuous variables.

  • Typically, the position of the line on the y axis reflects the mean of the y-axis variable for that value of x.

  • Often used for plotting change over time.

One way to make a lineplot in seaborn is using sns.lineplot.

Showing life_exp by year#

What general trend do we notice?

sns.lineplot(data = df_gapminder,
             x = 'year',
             y = 'life_exp')
<Axes: xlabel='year', ylabel='life_exp'>
../_images/24b5a4fe4d7924aa1dd85f572155269794b66c2ee8fd6c8f9d51db6cae3f5c26.png

Modifying how error/uncertainty is displayed#

  • By default, seaborn.lineplot will draw shading around the line representing a confidence interval.

  • We can change this with the err_style parameter.

sns.lineplot(data = df_gapminder,
             x = 'year',
             y = 'life_exp',
            err_style = "bars")
<Axes: xlabel='year', ylabel='life_exp'>
../_images/45705befb1bd3c83044856ab6377f21465288c288c90a0b8b4f3758119f4c18d.png

Adding a hue#

  • We could also show this by continent.

  • There’s (fortunately) a positive trend line for each continent.

sns.lineplot(data = df_gapminder,
             x = 'year',
             y = 'life_exp',
            hue = "continent")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
<matplotlib.legend.Legend at 0x16913d7d0>
../_images/79c442d3b7b17ef421232dc7d18ae766650d073eb3bd2c2ecd03764ee6b0fc49.png

Check-in#

How would you plot the relationship between year and gdp_cap for countries in the Americas only?

### Your code here

Solution#

What do we notice about:

  • The overall trend line?

  • The error bands as year increases?

sns.lineplot(data = df_gapminder[df_gapminder['continent']=="Americas"],
             x = 'year',
             y = 'gdp_cap')
<Axes: xlabel='year', ylabel='gdp_cap'>
../_images/b4ce0ec440e9d936c356c6047812fb7451d8b099e6af82361abc5de85c870b48.png

Heteroskedasticity in gdp_cap by year#

  • Heteroskedasticity is when the variance in one variable (e.g., gdp_cap) changes as a function of another variable (e.g., year).

  • In this case, why do you think that is?
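One way to check for heteroskedasticity numerically is to compute the spread of the variable within each year. The sketch below does this on made-up data in which four hypothetical countries start at the same GDP per capita but grow at different (assumed) rates, so the spread widens over time.

```python
import numpy as np
import pandas as pd

years = np.arange(1952, 2008, 5)         # 1952, 1957, ..., 2007
growth_rates = [1.00, 1.01, 1.02, 1.05]  # hypothetical yearly growth multipliers

# Build a long-format table: one row per (country, year)
rows = []
for i, rate in enumerate(growth_rates):
    for year in years:
        rows.append({
            "country": f"country_{i}",
            "year": int(year),
            "gdp_cap": 3000 * rate ** (year - 1952),
        })
df_toy = pd.DataFrame(rows)

# Standard deviation of gdp_cap across countries, separately for each year
spread_by_year = df_toy.groupby("year")["gdp_cap"].std()
```

Applied to the real data, the same groupby on the Americas subset would show whether the standard deviation of gdp_cap grows with year.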

Plotting by country#

  • There are too many countries to clearly display in the legend.

  • But the top two lines are the United States and Canada.

    • I.e., two countries have gotten much wealthier per capita, while the others have not seen the same economic growth.

sns.lineplot(data = df_gapminder[df_gapminder['continent']=="Americas"],
             x = 'year', y = 'gdp_cap', hue = "country", legend = False)
<Axes: xlabel='year', ylabel='gdp_cap'>
../_images/cdcf3af17f8f394aa11b16d4cf8c101da93d8730e7946636c106f3493d598d06.png

Using relplot#

  • relplot allows you to plot either line plots or scatter plots using kind.

  • relplot also makes it easier to facet (which we’ll discuss momentarily).

sns.relplot(data = df_gapminder, x = "year", y = "life_exp", kind = "line")
/Users/seantrott/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x1690401d0>
../_images/f6f44ed18a37d6c8bd387c2611db485456e921408194dcecd8fb6fe6df243f08.png

Faceting into rows and cols#

We can also plot the same relationship across multiple “windows” or facets by passing a row and/or col parameter.

sns.relplot(data = df_gapminder, x = "year", y = "life_exp", kind = "line", col = "continent")
/Users/seantrott/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x1692edb10>
../_images/18c3c84b154108fbd49ee6f52d526df694ff7fbf169ac2ca0b0f0a84ab870db7.png

Bivariate categorical plots#

A bivariate categorical plot visualizes the relationship between one categorical variable and one continuous variable.


Example dataset#

Here, we’ll return to our Pokemon dataset, which has more examples of categorical variables.

df_pokemon = pd.read_csv("data/pokemon.csv")

Barplots with sns.barplot#

A barplot visualizes the relationship between one continuous variable and a categorical variable.

  • The height of each bar generally indicates the mean of the continuous variable.

  • Each bar represents a different level of the categorical variable.

With seaborn, we can use the function sns.barplot.

Average Attack by Legendary status#

sns.barplot(data = df_pokemon,
           x = "Legendary", y = "Attack")
<Axes: xlabel='Legendary', ylabel='Attack'>
../_images/ffdc35bc97f9dd02091c3c0fbd8b37c9f3c66f111ed165161154185dea3e2fcb.png
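To confirm that bar heights are group means, the sketch below compares the drawn bars with a groupby on a tiny made-up table. The values in df_toy are invented and the "no"/"yes" labels are a stand-in for the real Legendary column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for df_pokemon
df_toy = pd.DataFrame({
    "Legendary": ["no", "no", "no", "yes", "yes"],
    "Attack": [50, 70, 90, 110, 130],
})

fig, ax = plt.subplots()
sns.barplot(data=df_toy, x="Legendary", y="Attack",
            order=["no", "yes"], ax=ax)  # fix the bar order explicitly

bar_heights = [p.get_height() for p in ax.patches]
group_means = (df_toy.groupby("Legendary")["Attack"]
               .mean().loc[["no", "yes"]].tolist())
matches = np.allclose(bar_heights, group_means)
```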

Average Attack by Type 1#

Here, notice that I make the figure bigger, to make sure the labels all fit.

plt.figure(figsize=(15,4))
sns.barplot(data = df_pokemon,
           x = "Type 1", y = "Attack")
<Axes: xlabel='Type 1', ylabel='Attack'>
../_images/9e87f19c255362418728191a216bb9a28bfddafcbf28bb26aeb65f0123327f47.png

Check-in#

How would you plot HP by Type 1?

### Your code here

Solution#

plt.figure(figsize=(15,4))
sns.barplot(data = df_pokemon,
           x = "Type 1", y = "HP")
<Axes: xlabel='Type 1', ylabel='HP'>
../_images/61656be4b0e8a7e2d3ac5e0250b2e0a63d11ff0c4089221935d7897a2e022227.png

Modifying hue#

As with scatterplot and lineplot, we can change the hue to give further granularity.

  • E.g., HP by Type 1, further divided by Legendary status.

plt.figure(figsize=(15,4))
sns.barplot(data = df_pokemon,
           x = "Type 1", y = "HP", hue = "Legendary")
<Axes: xlabel='Type 1', ylabel='HP'>
../_images/0f988cc4c166fbbe5bb7450b8f0635403ee32d050bbe8b6226a89aa0c9791f6e.png

Using catplot#

seaborn.catplot is a convenient function for plotting bivariate categorical data using a range of plot types (bar, box, strip).

sns.catplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", kind = "bar")
/Users/seantrott/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x16878f9d0>
../_images/24c41529817a3e92b7b56d1f17f8a21dea7344266f7e430c82f0cf53fe2b406f.png
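The printed outputs in this lecture hint at a difference between the two families of functions: sns.barplot returns a matplotlib Axes, while sns.catplot returns a seaborn FacetGrid that manages its own figure. A minimal sketch on made-up data (df_toy is an assumption) makes this explicit.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for df_pokemon
df_toy = pd.DataFrame({
    "Legendary": ["no", "no", "no", "yes", "yes"],
    "Attack": [50, 70, 90, 110, 130],
})

# Axes-level function: draws onto the Axes we pass in and returns it
fig, ax_in = plt.subplots()
ax_out = sns.barplot(data=df_toy, x="Legendary", y="Attack", ax=ax_in)
returns_same_axes = ax_out is ax_in

# Figure-level function: creates a new figure and returns a FacetGrid
g = sns.catplot(data=df_toy, x="Legendary", y="Attack", kind="bar")
is_grid = isinstance(g, sns.FacetGrid)
```

This is why plt.figure(figsize=...) affects sns.barplot but not sns.catplot, which sizes its own figure.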

Strip plots#

A strip plot shows each individual point (like a scatterplot), divided by a category label.

sns.catplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", kind = "strip", alpha = .5)
/Users/seantrott/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x16c6f0d50>
../_images/f5508e395bf0194068a8bc1c68ca559e446f70587943cfb513e9dde386b373a5.png

Adding a mean to our strip plot#

We can plot two graphs at the same time, showing both the individual points and the means.

sns.catplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", kind = "strip", alpha = .1)
sns.pointplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", hue = "Legendary")
/Users/seantrott/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
<Axes: xlabel='Legendary', ylabel='Attack'>
../_images/0a6f41567d3800770d8f2e907176f748d5080c1cabf03ebefeab779b1024940f.png
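An alternative way to layer plots, sketched here under the assumption that both layers are drawn with axes-level functions, is to create one Axes and pass it to each call. The data in df_toy are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for df_pokemon
df_toy = pd.DataFrame({
    "Legendary": ["no"] * 5 + ["yes"] * 5,
    "Attack": [40, 55, 70, 85, 100, 90, 105, 120, 135, 150],
})

fig, ax = plt.subplots()
# Layer 1: individual points, made faint so the means stand out
sns.stripplot(data=df_toy, x="Legendary", y="Attack", alpha=.3, ax=ax)
# Layer 2: group means with error bars, drawn on the same Axes
sns.pointplot(data=df_toy, x="Legendary", y="Attack", color="black", ax=ax)

n_point_layers = len(ax.collections)  # scatter collections drawn so far
```

Passing ax explicitly makes it unambiguous which plot each layer lands on.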

Box plots#

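A box plot shows the median and the interquartile range (the middle 50% of the data). By default, the whiskers extend to the most extreme points within 1.5 times the interquartile range of the box, and any points beyond the whiskers are drawn individually as outliers.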

sns.catplot(data = df_pokemon, x = "Legendary", 
             y = "Attack", kind = "box")
/Users/seantrott/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x16c696c90>
../_images/42bc8584ef9ef0912def4b86eb206c1d6be729e23304e6be886e74d96bc27ba6.png
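The quantities behind a box plot can be computed by hand. The sketch below uses a small made-up array and assumes the common 1.5 × IQR whisker rule (seaborn's default, whis = 1.5).

```python
import numpy as np

# Hypothetical toy values with one extreme point
attack = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(attack, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points inside the 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = attack[(attack >= lower_fence) & (attack <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = attack[(attack < lower_fence) | (attack > upper_fence)]
```

Here the box spans 3.25 to 7.75, the upper whisker stops at 9, and the value 30 is flagged as an outlier rather than stretching the whisker.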

Conclusion#

As with our lecture on pyplot, this just scratches the surface.

But now, you’ve had an introduction to:

  • The seaborn package.

  • Plotting both univariate and bivariate data.

  • Creating plots with multiple layers.