Data visualization in Python (pyplot)

Data visualization in Python (`pyplot`)#

Looking ahead: Weeks 3-4#

In week 3, we’ll dive deep into data visualization.
- How do we make visualizations in Python?
- What principles should we keep in mind?
In week 4, we’ll work on managing and cleaning our data.
- How do I deal with missing values?
- What are some basic ways to describe my data?

I view both these weeks as integral to Exploratory Data Analysis in Python.

Goals of this lecture#

What is data visualization and why is it important?
Introducing matplotlib.
Plot types:
- Histograms (univariate).
- Scatterplots (bivariate).
- Bar plots (bivariate).

Introduction: data visualization#

What is data visualization?#

Data visualization refers to the process (and result) of representing data graphically.

For our purposes today, we’ll be talking mostly about common methods of plotting data, including:

Histograms
Scatterplots
Line plots
Bar plots

Why is data visualization important?#

Exploratory data analysis
Communicating insights
Impacting the world

Exploratory Data Analysis: Checking your assumptions#

Anscombe’s Quartet
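
As a rough illustration of why plotting matters, the sketch below compares two of the four datasets in Anscombe's quartet. The values are transcribed from the published quartet and should be treated as an assumption to verify; the variable names are chosen here for illustration.

```python
import numpy as np

# Datasets I and II of Anscombe's quartet (they share the same x values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Summary statistics for each dataset
for name, y in [("I", y1), ("II", y2)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"Dataset {name}: mean(y) = {y.mean():.2f}, r = {r:.3f}")
```

The two datasets have nearly identical means and correlations, yet a scatterplot of each shows that dataset I is roughly linear while dataset II follows a curve. Summary statistics alone would not reveal that difference.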

Communicating Insights#

Reference: Full Stack Economics

Impacting the world#

Florence Nightingale (1820-1910) was a social reformer, statistician, and founder of modern nursing.

Impacting the world (pt. 2)#

John Snow (1813-1858) was a physician whose visualization of cholera outbreaks helped identify the source and spreading mechanism (water supply).

Introducing `matplotlib`#

Loading packages#

Here, we load the core packages we’ll be using.

We also add some lines of code that make sure our visualizations will plot “inline” with our code, and that they’ll have nice, crisp quality.

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as ss

%matplotlib inline 
%config InlineBackend.figure_format = 'retina'

What is `matplotlib`?#

matplotlib is a plotting library for Python.

Many tutorials available online.
Also many examples of matplotlib in use.

Note that seaborn (which we’ll cover soon) uses matplotlib “under the hood”.

What is `pyplot`?#

pyplot is a collection of functions within matplotlib that make it really easy to plot data.

With pyplot, we can easily plot things like:

Histograms (plt.hist)
Scatterplots (plt.scatter)
Line plots (plt.plot)
Bar plots (plt.bar)
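
To give a feel for these four plot types side by side, here is a minimal sketch using made-up data (none of these values come from the lecture's dataset). It calls the plotting methods on individual axes objects, which mirror the `plt` functions listed above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data, made up purely for illustration
rng = np.random.default_rng(0)
values = rng.normal(loc=10, scale=1, size=200)
x = np.arange(10)
y = x ** 2

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].hist(values)                       # histogram of one variable
axes[0, 0].set_title("Histogram")

axes[0, 1].scatter(x, y)                      # scatterplot of two variables
axes[0, 1].set_title("Scatterplot")

axes[1, 0].plot(x, y)                         # line plot of the same two variables
axes[1, 0].set_title("Line plot")

axes[1, 1].bar(["a", "b", "c"], [3, 5, 2])    # bar plot: one bar per category
axes[1, 1].set_title("Bar plot")

fig.tight_layout()
```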

Example dataset#

Let’s load our familiar Pokemon dataset, which can be found in data/pokemon.csv.

df_pokemon = pd.read_csv("data/pokemon.csv")
df_pokemon.head(3)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False

Histograms#

What are histograms?#

A histogram is a visualization of a single continuous, quantitative variable (e.g., income or temperature).

Histograms are useful for seeing how a variable is distributed.
They can be used to judge whether a distribution is roughly normal, skewed, or bimodal.

A histogram is a univariate plot, i.e., it displays only a single variable.

Histograms in `matplotlib`#

To create a histogram, call plt.hist with a single column of a DataFrame (or a numpy.ndarray).

Check-in: What is this graph telling us?

p = plt.hist(df_pokemon['Attack'])

Changing the number of bins#

A histogram puts your continuous data into bins (e.g., 1-10, 11-20, etc.).

The height of each bar reflects the number of observations that fall within that bin's interval.
Increasing or decreasing the number of bins gives you more or less granularity in your distribution.

### This has lots of bins
p = plt.hist(df_pokemon['Attack'], bins = 30)

### This has fewer bins
p = plt.hist(df_pokemon['Attack'], bins = 5)
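
One way to see what the `bins` argument does is to inspect what `plt.hist` returns. This is a minimal sketch, assuming made-up scores in place of the Attack column (the `scores` array is hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = rng.normal(loc=75, scale=30, size=800)  # hypothetical stand-in for a column like Attack

# plt.hist returns the bar heights, the bin boundaries, and the drawn bars
counts, edges, patches = plt.hist(scores, bins=5)

print(counts)  # number of observations in each bin
print(edges)   # bin boundaries: always one more edge than there are bins
```

Every observation lands in exactly one bin, so the counts always add up to the number of data points regardless of how many bins are used.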

Changing the `alpha` level#

The alpha level changes the transparency of your figure.

### This is partially transparent
p = plt.hist(df_pokemon['Attack'], alpha = .6)
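
Transparency is most useful when two histograms share the same axes, because the overlapping region stays visible. A small sketch with two made-up groups (both arrays are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
group_a = rng.normal(loc=60, scale=10, size=500)  # made-up scores for one group
group_b = rng.normal(loc=75, scale=10, size=500)  # made-up scores for another group

# With alpha below 1, the second histogram does not hide the first
_, _, patches_a = plt.hist(group_a, alpha=0.6, label="Group A")
_, _, patches_b = plt.hist(group_b, alpha=0.6, label="Group B")
plt.legend()
```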

Check-in:#

How would you make a histogram of the scores for Defense?

### Your code here

Solution#

p = plt.hist(df_pokemon['Defense'], alpha = .6)

Check-in:#

Could you make a histogram of the scores for Type 1?

### Your code here

Solution#

Not exactly.
Type 1 is a categorical variable, so there’s no intrinsic ordering.
The closest we could do is count the number of each Type 1 and then plot those counts.
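
Counting each category and plotting the counts could look like the sketch below. It uses a small hypothetical DataFrame (`df_toy`) rather than the real dataset, so the numbers are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical mini-dataset standing in for df_pokemon
df_toy = pd.DataFrame({"Type 1": ["Grass", "Fire", "Water", "Grass", "Water", "Grass"]})

# Count how many rows fall into each category
type_counts = df_toy["Type 1"].value_counts()

# One bar per category; bar height is the count
bars = plt.bar(x=type_counts.index, height=type_counts.values, alpha=0.6)
plt.xlabel("Type 1")
plt.ylabel("Count")
```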

Learning from histograms#

Histograms are incredibly useful for learning about the shape of our distribution. We can ask questions like:

Is this distribution relatively normal?
Is the distribution skewed?
Are there outliers?

Normally distributed data#

We can use the numpy.random.normal function to create a normal distribution, then plot it.

A normal distribution has the following characteristics:

Classic “bell” shape (symmetric).
Mean, median, and mode are all identical.

norm = np.random.normal(loc = 10, scale = 1, size = 1000)
p = plt.hist(norm, alpha = .6)
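
The claim that the mean and median coincide can be checked numerically on a sample. A quick sketch (the seed is arbitrary, and a finite sample will only match approximately):

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(loc=10, scale=1, size=1000)

# For a symmetric distribution, these two should be very close
print(f"mean   = {sample.mean():.2f}")
print(f"median = {np.median(sample):.2f}")
```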

Skewed data#

Skew means there are values elongating one of the “tails” of a distribution.

Positive/right skew: the tail is pointing to the right.
Negative/left skew: the tail is pointing to the left.

rskew = ss.skewnorm.rvs(20, size = 1000) # make right-skewed data
lskew = ss.skewnorm.rvs(-20, size = 1000) # make left-skewed data
fig, axes = plt.subplots(1, 2)
axes[0].hist(rskew)
axes[0].set_title("Right-skewed")
axes[1].hist(lskew)
axes[1].set_title("Left-skewed")
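
The direction of skew can also be checked numerically. The sketch below uses `scipy.stats.skew`, which returns a positive value for right-skewed samples and a negative value for left-skewed ones; the seed is arbitrary.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(4)
rskew = ss.skewnorm.rvs(20, size=1000, random_state=rng)   # right-skewed sample
lskew = ss.skewnorm.rvs(-20, size=1000, random_state=rng)  # left-skewed sample

print(ss.skew(rskew))  # positive: the long tail points right
print(ss.skew(lskew))  # negative: the long tail points left
```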

Outliers#

Outliers are data points that differ significantly from other points in a distribution.

Unlike skewed data, outliers are generally discontinuous with the rest of the distribution.
Next week, we’ll talk about more ways to identify outliers; for now, we can rely on histograms.

norm = np.random.normal(loc = 10, scale = 1, size = 1000)
upper_outliers = np.array([21, 21, 21, 21]) ## four artificial outliers, far above the rest of the data
data = np.concatenate((norm, upper_outliers))
p = plt.hist(data, alpha = .6)
plt.arrow(20, 100, dx = 0, dy = -50, width = .3, head_length = 10, facecolor = "red")
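
The "discontinuous" look of outliers shows up in the bin counts as a run of empty bins. A minimal sketch using `np.histogram`, which computes the same kind of counts and bin edges that `plt.hist` draws (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
norm = rng.normal(loc=10, scale=1, size=1000)
upper_outliers = np.array([21, 21, 21, 21])
data = np.concatenate((norm, upper_outliers))

# Ten equal-width bins, matching the default used by plt.hist
counts, edges = np.histogram(data, bins=10)

# Print each bin's range and how many observations it holds
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

The last bin holds the four outliers, and the empty bins before it are the gap that separates them from the rest of the distribution.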

Check-in#

How would you describe the following distribution?

Normal vs. skewed?
With or without outliers?

p = plt.hist(df_pokemon['HP'], alpha = .6)

Check-in#

How would you describe the following distribution?

Normal vs. skewed?
With or without outliers?

p = plt.hist(df_pokemon['Sp. Atk'], alpha = .6)

Check-in#

In a somewhat right-skewed distribution (like below), which is larger: the mean or the median?

p = plt.hist(df_pokemon['Sp. Atk'], alpha = .6)

Solution#

The mean is larger. The mean is the measure most affected by skew, so in a right-skewed distribution it is pulled further to the right than the median.

p = plt.hist(df_pokemon['Sp. Atk'], alpha = .6)
plt.axvline(df_pokemon['Sp. Atk'].mean(), linestyle = "dashed", color = "green")
plt.axvline(df_pokemon['Sp. Atk'].median(), linestyle = "dotted", color = "red")
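
The same pattern can be reproduced with simulated data. This sketch uses an exponential distribution as an assumed example of strong right skew; it is not taken from the Pokemon dataset.

```python
import numpy as np

rng = np.random.default_rng(6)
right_skewed = rng.exponential(scale=1.0, size=1000)  # long tail to the right

# The long right tail pulls the mean above the median
print(f"mean   = {right_skewed.mean():.2f}")
print(f"median = {np.median(right_skewed):.2f}")
```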

Modifying our plot#

A good data visualization should also make it clear what’s being plotted.
- Clearly labeled x and y axes, title.
Sometimes, we may also want to add overlays.
- E.g., a dashed vertical line representing the mean.

Adding axis labels#

p = plt.hist(df_pokemon['Attack'], alpha = .6)
plt.xlabel("Attack")
plt.ylabel("Count")
plt.title("Distribution of Attack Scores")

Adding a vertical line#

The plt.axvline function allows us to draw a vertical line at a particular position, e.g., the mean of the Attack column.

p = plt.hist(df_pokemon['Attack'], alpha = .6)
plt.xlabel("Attack")
plt.ylabel("Count")
plt.title("Distribution of Attack Scores")
plt.axvline(df_pokemon['Attack'].mean(), linestyle = "dotted")
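
When a plot has more than one overlay, a legend helps the reader tell the lines apart. A sketch, assuming made-up scores in place of the Attack column (`scores` is hypothetical), that passes `label` to `plt.axvline` and then calls `plt.legend`:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
scores = rng.normal(loc=75, scale=30, size=800)  # hypothetical stand-in for a column like Attack

plt.hist(scores, alpha=0.6)
# Each labeled line becomes one legend entry
plt.axvline(scores.mean(), linestyle="dotted", color="black", label="Mean")
plt.axvline(np.median(scores), linestyle="dashed", color="red", label="Median")
plt.xlabel("Score")
plt.ylabel("Count")
legend = plt.legend()
```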

Scatterplots#

What are scatterplots?#

A scatterplot is a visualization of how two different continuous variables relate to each other.

Each individual point represents an observation.
Very useful for exploratory data analysis.
- Are these variables positively or negatively correlated?

A scatterplot is a bivariate plot, i.e., it displays at least two variables.

Scatterplots with `matplotlib`#

We can create a scatterplot using plt.scatter(x, y), where x and y are the two variables we want to visualize.

x = np.arange(1, 10)
y = np.arange(11, 20)
p = plt.scatter(x, y)

Check-in#

Are these variables related? If so, how?

x = np.random.normal(loc = 10, scale = 1, size = 100)
y = x * 2 + np.random.normal(loc = 0, scale = 2, size = 100)
plt.scatter(x, y, alpha = .6)

Check-in#

Are these variables related? If so, how?

x = np.random.normal(loc = 10, scale = 1, size = 100)
y = -x * 2 + np.random.normal(loc = 0, scale = 2, size = 100)
plt.scatter(x, y, alpha = .6)
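
What the eye sees in these two scatterplots can be summarized with a correlation coefficient. A sketch using `np.corrcoef` on data generated the same way as in the two check-ins above (the seed is arbitrary, and reusing one noise vector for both cases is a simplification):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(loc=10, scale=1, size=100)
noise = rng.normal(loc=0, scale=2, size=100)

y_pos = x * 2 + noise    # y tends to increase with x
y_neg = -x * 2 + noise   # y tends to decrease as x increases

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
print(f"r (positive case) = {r_pos:.2f}")
print(f"r (negative case) = {r_neg:.2f}")
```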

Scatterplots are useful for detecting non-linear relationships#

x = np.random.normal(loc = 10, scale = 1, size = 100)
y = np.sin(x)
plt.scatter(x, y, alpha = .6)
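
A summary number such as Pearson's correlation can miss a non-linear relationship entirely, which is one reason to plot the data. A sketch with an assumed quadratic relationship (not from the lecture's dataset):

```python
import numpy as np

x = np.linspace(-3, 3, 101)   # evenly spaced and symmetric around zero
y = x ** 2                    # y is completely determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```

Here `r` is essentially zero even though `y` depends perfectly on `x`; a scatterplot of the same data would show the U-shape immediately.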

Check-in#

How would we visualize the relationship between Attack and Speed in our Pokemon dataset?

### Your code here

Solution#

Attack and Speed look somewhat positively correlated, but only weakly.

Side note: what would it mean for the Pokemon game if all these attributes (Speed, Defense, etc.) were extremely positively correlated?

plt.scatter(df_pokemon['Attack'], df_pokemon['Speed'], alpha = .6)
plt.xlabel("Attack")
plt.ylabel("Speed")

Barplots#

What is a barplot?#

A barplot visualizes the relationship between one continuous variable and a categorical variable.

The height of each bar generally indicates the mean of the continuous variable.
Each bar represents a different level of the categorical variable.

A barplot is a bivariate plot, i.e., it displays at least two variables.

Barplots with `matplotlib`#

plt.bar can be used to create a barplot of our data.

E.g., average Attack by Legendary status.
However, we first need to use groupby to calculate the mean Attack per level.

Step 1: Using `groupby`#

summary = df_pokemon[['Legendary', 'Attack']].groupby("Legendary").mean().reset_index()
summary

	Legendary	Attack
0	False	75.669388
1	True	116.676923

### Turn Legendary into a str
summary['Legendary'] = summary['Legendary'].astype(str)
summary

	Legendary	Attack
0	False	75.669388
1	True	116.676923

Step 2: Pass values into `plt.bar`#

Check-in:

What do we learn from this plot?
What is this plot missing?

plt.bar(x = summary['Legendary'],
       height = summary['Attack'],
       alpha = .6)
plt.xlabel("Legendary status")
plt.ylabel("Attack")

Adding error bars#

Without some measure of variance, bar plots just tell us the mean of each level.
Ideally, we’d have a way to measure how much variance there is around that mean.

Typically, error bars are calculated using the standard error of the mean.

Standard error of the mean#

The standard error of the mean is the standard deviation of the distribution of sample means; in practice, it’s an estimate of how much variance there is around our estimate of the mean.

Standard deviation, or \(\sigma\), is a measure of how much scores deviate around the mean.
Standard error of the mean, or \(\sigma_{\bar{x}}\), incorporates standard deviation, but also sample size, or \(n\).

\(\Large \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)

As \(n\) increases, \(\sigma_{\bar{x}}\) decreases.
I.e., a larger sample size decreases the standard error of the mean, which is good for our estimates!
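
The formula can be checked by hand against the built-in helpers. This sketch uses a handful of made-up scores; in practice \(\sigma\) is estimated with the sample standard deviation, which is what pandas and scipy use by default (`ddof=1`).

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

scores = pd.Series([49, 62, 82, 52, 64, 84, 48, 63, 83, 30])  # made-up values

n = len(scores)
sd = scores.std()              # sample standard deviation (ddof=1 by default in pandas)
sem_by_hand = sd / np.sqrt(n)  # standard deviation divided by the square root of n

# All three values should agree
print(sem_by_hand, scores.sem(), ss.sem(scores))
```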

Turning standard error into error bars#

An error bar typically represents a “confidence interval” around the mean.
The lower and upper bounds of an approximate 95% confidence interval are calculated by subtracting \(2 \sigma_{\bar{x}}\) from the mean and adding \(2 \sigma_{\bar{x}}\) to it.

Note: Next week, we’ll learn all about why this is!
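
As a worked example of the arithmetic, with round numbers chosen purely for illustration (a mean of 100 and a standard error of 5):

```python
mean_score = 100.0  # hypothetical mean
sem_score = 5.0     # hypothetical standard error of the mean

# Bounds are two standard errors below and above the mean
lower = mean_score - 2 * sem_score
upper = mean_score + 2 * sem_score
print(f"{lower:.2f} to {upper:.2f}")  # prints "90.00 to 110.00"
```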

Step 1: calculate standard errors with `sem`#

sem_summ = df_pokemon[['Legendary', 'Attack']].groupby("Legendary").sem().reset_index()
sem_summ

	Legendary	Attack
0	False	1.124646
1	True	3.764211

### Turn Legendary into a str
sem_summ['Legendary'] = sem_summ['Legendary'].astype(str)
sem_summ

	Legendary	Attack
0	False	1.124646
1	True	3.764211
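
As an alternative to two separate `groupby` calls, the mean and the standard error can be computed in one pass with `agg`. A sketch on a small hypothetical DataFrame (`df_toy` and its values are made up):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for df_pokemon
df_toy = pd.DataFrame({
    "Legendary": [False, False, False, False, True, True, True],
    "Attack": [49, 62, 82, 52, 110, 134, 90],
})

# One row per group, with a column for each requested statistic
stats = (
    df_toy.groupby("Legendary")["Attack"]
    .agg(["mean", "sem"])
    .reset_index()
)
stats["Legendary"] = stats["Legendary"].astype(str)  # use strings as bar labels
print(stats)
```

Keeping both statistics in one DataFrame guarantees that the rows for the means and the standard errors are in the same order when they are passed to the plotting functions.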

Step 2: Create plot using `plt.errorbar`#

The x and y coordinates are just from our original summary DataFrame.
The yerr is twice the standard error we just calculated, so each bar extends two standard errors above and below the mean.

plt.errorbar(x = summary['Legendary'], # original coordinate
             y = summary['Attack'], # original coordinate
             yerr = sem_summ['Attack'] * 2, # two standard errors in each direction
             ls = 'none', ## 'none' leaves the points unconnected; change it to draw a connecting line
             color = "black"
            )
plt.xlabel("Legendary status")
plt.ylabel("Attack")

Step 3: Combining with `plt.bar`#

plt.errorbar(x = summary['Legendary'], # original coordinate
             y = summary['Attack'], # original coordinate
             yerr = sem_summ['Attack'] * 2, # two standard errors in each direction
             ls = 'none', ## 'none' leaves the points unconnected; change it to draw a connecting line
             color = "black"
            )
plt.bar(x = summary['Legendary'],
       height = summary['Attack'],
       alpha = .6)
plt.xlabel("Legendary status")
plt.ylabel("Attack")
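
A more compact option is to pass the error values directly to `plt.bar` through its `yerr` argument, which draws the bars and error bars in one call. A sketch with hypothetical summary values (the numbers below are made up, not the ones computed above):

```python
import matplotlib.pyplot as plt

# Hypothetical summary values, for illustration only
labels = ["False", "True"]
means = [75.0, 115.0]
sems = [1.0, 4.0]

bars = plt.bar(
    x=labels,
    height=means,
    yerr=[2 * s for s in sems],  # two standard errors in each direction
    capsize=4,                   # small horizontal caps at the ends of each error bar
    alpha=0.6,
)
plt.xlabel("Legendary status")
plt.ylabel("Attack")
```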

Check-in#

Create a barplot with errorbars representing:

mean Speed by Type 1
Focusing only on Pokemon with a Type 1 of Grass or Electric.

### Your code here

Solution#

This is a multi-step one! Steps involved:

Filter our DataFrame to be only Grass or Electric.
Use groupby to calculate the mean Speed by Type 1.
Use groupby to calculate the standard error of the mean for Speed by Type 1.
Use plt.bar and plt.errorbar to plot these data.

Step 1#

df_filtered = df_pokemon[df_pokemon['Type 1'].isin(['Grass', 'Electric'])]
df_filtered['Type 1'].value_counts()

Type 1
Grass       70
Electric    44
Name: count, dtype: int64

Steps 2-3#

summary = df_filtered[['Type 1', 'Speed']].groupby("Type 1").mean().reset_index()
summary

	Type 1	Speed
0	Electric	84.500000
1	Grass	61.928571

sem_speed = df_filtered[['Type 1', 'Speed']].groupby("Type 1").sem().reset_index()
sem_speed

	Type 1	Speed
0	Electric	4.023911
1	Grass	3.407173

Step 4#

plt.errorbar(x = summary['Type 1'], # original coordinate
             y = summary['Speed'], # original coordinate
             yerr = sem_speed['Speed'] * 2, # two standard errors in each direction
             ls = 'none', color = "black"
            )
plt.bar(x = summary['Type 1'],
       height = summary['Speed'],
       alpha = .6)
plt.xlabel("Type 1")
plt.ylabel("Speed")

Conclusion#

This concludes our first introduction to data visualization:

Working with matplotlib.pyplot.
Creating basic plots: histograms, scatterplots, and barplots.

Next time, we’ll move on to discussing seaborn, another very useful package for data visualization.
