Data wrangling#
Goals of this lecture#
A huge \(\%\) of science involves data wrangling.
This could be an entire course on its own, but today we’ll focus on:
What is data wrangling?
What to do about missing values?
How to combine datasets?
Tidy data!
What is it?
How do we make our data tidy?
Importing relevant libraries#
import seaborn as sns ### importing seaborn
import pandas as pd
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
What is data wrangling?#
Data wrangling refers to manipulating, reshaping, or transforming a dataset as needed for your goals (e.g., visualization and/or analysis).
A huge \(\%\) of working with data involves “wrangling”.
Includes:
“Cleaning” data (missing values, recasting variables, etc.).
Merging/combining different datasets.
Reshaping data as needed.
Dealing with missing values#
In practice, real-world data is often messy.
This includes missing values, which take on the value/label NaN.
NaN = “Not a Number”.
Dealing with NaN values is one of the main challenges in CSS!
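As a quick illustration of why NaN needs special handling, here is a minimal sketch using numpy (which pandas relies on for its NaN value):

import numpy as np
import pandas as pd

### NaN is a special float value that is not equal to anything, even itself
print(np.nan == np.nan) ### False
### This is why we detect missing values with pd.isna, not with ==
print(pd.isna(np.nan)) ### True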
Loading a dataset with missing values#
The titanic dataset contains information about different Titanic passengers and whether they Survived (1 vs. 0).
Commonly used as a tutorial for machine learning, regression, and data wrangling.
df_titanic = pd.read_csv("data/wrangling/titanic.csv")
df_titanic.head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Why is missing data a problem?#
If you’re unaware of missing data:
You might be overestimating the size of your dataset.
You might be biasing the results of a visualization or analysis (if missing data are non-randomly distributed).
You might be complicating an analysis.
By default, many analysis packages will “drop” missing data––so you need to be aware of whether this is happening.
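For instance, pandas skips nan values by default when computing summary statistics; a minimal sketch of this behavior:

import pandas as pd

s = pd.Series([1.0, 2.0, None])
### mean() silently skips the missing value: (1 + 2) / 2
print(s.mean()) ### 1.5
### Only with skipna=False does the missing value propagate
print(s.mean(skipna=False)) ### nan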
How to deal with missing data#
Identify whether and where your data has missing values.
Analyze how these missing values are distributed.
Decide how to handle them.
Not an easy problem––especially step 3!
Step 1: Identifying missing values#
The first step is identifying whether and where your data has missing values.
There are several approaches to this:
Using .isna
Using .info
Using .isnull
isna()#
The isna() function tells us whether a given cell of a DataFrame has a missing value or not (True vs. False).
If we call isna().any(), it tells us which columns have missing values.
df_titanic.isna().head(1)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | True | False |
df_titanic.isna().any()
PassengerId False
Survived False
Pclass False
Name False
Sex False
Age True
SibSp False
Parch False
Ticket False
Fare False
Cabin True
Embarked True
dtype: bool
Inspecting columns with nan#
Now we can inspect specific columns that have nan values.
df_titanic[df_titanic['Age'].isna()].head(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
How many nan?#
If we call sum() on the boolean values returned by isna(), we can count exactly how many nan values are in each column.
df_titanic.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
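Relatedly, calling mean (rather than sum) on the isna() output gives the proportion of missing values per column, since True counts as 1 and False as 0. A quick sketch:

### Proportion of nan values per column
df_titanic.isna().mean()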
info()#
The info() function gives us various information about the DataFrame, including the number of not-null (i.e., non-missing) values in each column.
df_titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Check-in#
How many rows of the DataFrame have missing values for Cabin?
### Your code here
Solution#
### How many? (Quite a few!)
df_titanic[df_titanic['Cabin'].isna()].shape
(687, 12)
Visualizing missing values#
Finally, we can visualize the rate of missing values across columns using seaborn.heatmap.
The dark cells are those with not-null values.
The light cells have nan values.
sns.heatmap(df_titanic.isna())
<Axes: >
Step 2: Analyze how the data are distributed#
Having identified missing data, the next step is determining how those missing data are distributed.
Is variable \(Y\) different depending on whether \(X\) is nan?#
One approach is to ask whether some variable of interest (e.g., Survived) is different depending on whether some other variable is nan.
### Mean survival for people without data about Cabin info
df_titanic[df_titanic['Cabin'].isna()]['Survived'].mean()
0.29985443959243085
### Mean survival for people *with* data about Cabin info
df_titanic[~df_titanic['Cabin'].isna()]['Survived'].mean()
0.6666666666666666
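A more compact way to make the same comparison is to group by the missingness indicator itself; a sketch:

### Mean survival, grouped by whether Cabin is missing
df_titanic.groupby(df_titanic['Cabin'].isna())['Survived'].mean()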
Check-in#
What is the mean Survived rate for rows with a nan value for Age vs. those with not-null values? How does this compare to the overall Survived rate?
### Your code here
Solution#
### Mean survival for people without data about Age
df_titanic[df_titanic['Age'].isna()]['Survived'].mean()
0.2937853107344633
### Mean survival for people *with* data about Age
df_titanic[~df_titanic['Age'].isna()]['Survived'].mean()
0.4061624649859944
### Mean survival for people overall
df_titanic['Survived'].mean()
0.3838383838383838
Additional methods#
If you’re interested in diving deeper, you can look into the missingno library in Python (which must be installed separately).
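As a rough sketch of how it might be used (assuming missingno has been installed, e.g., via pip):

### pip install missingno
import missingno as msno

### Matrix plot of missing values, similar to the seaborn heatmap above
msno.matrix(df_titanic)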
Step 3: Determine what to do!#
Having identified missing data, you need to determine how to handle it.
There are several approaches you can take.
Removing all rows with any missing data.
Removing rows with missing data only when that variable is relevant to the analysis or visualization.
Imputing (i.e., guessing) what values missing data should have.
Removing all rows with any missing data#
We can filter our DataFrame using dropna, which will automatically “drop” any rows containing null values.
Caution: if you have lots of missing data, this can substantially reduce the size of your dataset.
df_filtered = df_titanic.dropna()
df_filtered.shape
(183, 12)
Removing all rows with missing data in specific columns#
Here, we specify that we only want to dropna for rows that have nan in the Age column specifically.
We still have nan values for Cabin, but perhaps that’s fine in our case.
df_filtered = df_titanic.dropna(subset = "Age")
df_filtered.shape
(714, 12)
Imputing missing data#
One of the most complex (and controversial) approaches is to impute the values of missing data.
There are (again) multiple ways to do this:
Decide on a constant value and assign it to all nan values.
E.g., assign the mean Age to all people with nan in that column.
Try to guess the value based on specific characteristics of the data.
E.g., based on other characteristics of this person, what is their likely Age?
Imputing a constant value#
We can use fillna to assign some other value to all rows with nan for Age.
## Assign the mean Age to all people with nan for Age
df_titanic['Age_imputed1'] = df_titanic['Age'].fillna(df_titanic['Age'].mean())
## Now let's look at those rows
df_titanic[df_titanic['Age'].isna()].head(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_imputed1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 29.699118 |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S | 29.699118 |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C | 29.699118 |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C | 29.699118 |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | 29.699118 |
Guessing based on other characteristics#
You can try to guess what a passenger’s Age would be, based on other features.
The more sophisticated version of this is to use statistical modeling, e.g., SimpleImputer from the sklearn library.
For now, simply note that Age correlates with other features (like Pclass).
## Passenger Class is correlated with Age
sns.barplot(data = df_titanic, x = 'Pclass', y = 'Age')
<Axes: xlabel='Pclass', ylabel='Age'>
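Since Age correlates with Pclass, one simple version of this approach fills each missing Age with the mean Age of that passenger’s class. A sketch using groupby and transform (the column name Age_by_class is just illustrative):

## Fill missing Age values with the mean Age of each passenger's Pclass
df_titanic['Age_by_class'] = df_titanic['Age'].fillna(
    df_titanic.groupby('Pclass')['Age'].transform('mean')
)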
Check-in#
What would happen if you used fillna with the median Age instead of the mean? Why would this matter?
### Your code here
Solution#
The median Age is slightly lower than the mean.
## Assign the median Age to all people with nan for Age
df_titanic['Age_imputed2'] = df_titanic['Age'].fillna(df_titanic['Age'].median())
## Now let's look at those rows
df_titanic[df_titanic['Age'].isna()].head(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_imputed1 | Age_imputed2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 29.699118 | 28.0 |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S | 29.699118 | 28.0 |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C | 29.699118 | 28.0 |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C | 29.699118 | 28.0 |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | 29.699118 | 28.0 |
Merging datasets#
Merging refers to combining different datasets to leverage the power of additional information.
In Week 1, we discussed this in the context of data linkage.
Can link datasets as a function of:
Shared time window.
Shared identity.
Why merge?#
Each dataset contains limited information.
E.g., GDP by Year.
But merging datasets allows us to see how more variables relate and interact.
Much of social science research involves locating datasets and figuring out how to combine them.
How to merge?#
In Python, pandas.merge allows us to merge two DataFrames on one or more common columns.
pd.merge(df1, df2, on = "shared_column")
merge in practice#
For demonstration, we’ll merge two linguistics datasets:
One dataset contains information about the Age of Acquisition of different English words (Kuperman et al., 2014).
The other dataset contains information about the Frequency and Concreteness of English words (Brysbaert et al., 2014).
Loading datasets#
df_aoa = pd.read_csv("data/wrangling/AoA.csv")
df_aoa.head(1)
Word | AoA | |
---|---|---|
0 | a | 2.89 |
df_conc = pd.read_csv("data/wrangling/concreteness.csv")
df_conc.head(1)
Word | Concreteness | Frequency | Dom_Pos | |
---|---|---|---|---|
0 | sled | 5.0 | 149 | Adjective |
Different kinds of merging#
As we see, the datasets are not the same size. This leaves us with a decision to make when merging.
inner join: Do we preserve only the words in both datasets?
left join: Do we preserve all the words in one dataset (the “left” one), regardless of whether they occur in the other?
right join: Do we preserve all the words in the other dataset (the “right” one), regardless of whether they occur in the first?
outer join: Do we preserve all words in both, leaving empty (nan) values where a word only appears in one dataset?
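To build intuition, here is a toy example (with made-up mini-datasets) contrasting inner and outer joins:

left = pd.DataFrame({'Word': ['cat', 'dog'], 'AoA': [3.1, 3.4]})
right = pd.DataFrame({'Word': ['dog', 'emu'], 'Concreteness': [4.9, 4.5]})

### inner: only 'dog', which appears in both datasets
pd.merge(left, right, on = "Word", how = "inner")
### outer: all three words, with nan where a word is missing from one side
pd.merge(left, right, on = "Word", how = "outer")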
df_aoa.shape
(31124, 2)
df_conc.shape
(28612, 4)
inner join#
For our purposes, it makes the most sense to use an inner join.
This leaves us with fewer words than occur in either dataset.
df_merged = pd.merge(df_aoa, df_conc, on = "Word", how = "inner")
df_merged.head(2)
Word | AoA | Concreteness | Frequency | Dom_Pos | |
---|---|---|---|---|---|
0 | aardvark | 9.89 | 4.68 | 21 | Noun |
1 | abacus | 8.69 | 4.52 | 12 | Noun |
df_merged.shape
(23569, 5)
Check-in#
What happens if you use a different kind of join, e.g., outer or left? What do you notice about the shape of the resulting DataFrame? Do some rows have nan values?
### Your code here
Solution#
df_outer_join = pd.merge(df_aoa, df_conc, on = "Word", how = "outer")
df_outer_join.shape
(36167, 5)
df_outer_join.head(4)
Word | AoA | Concreteness | Frequency | Dom_Pos | |
---|---|---|---|---|---|
0 | a | 2.89 | NaN | NaN | NaN |
1 | aardvark | 9.89 | 4.68 | 21.0 | Noun |
2 | abacus | 8.69 | 4.52 | 12.0 | Noun |
3 | abalone | 12.23 | NaN | NaN | NaN |
Why merge is so useful#
Now that we’ve merged our datasets, we can look at how variables across them relate to each other.
sns.scatterplot(data = df_merged, x = 'Concreteness',
y = 'AoA', alpha = .2 )
<Axes: xlabel='Concreteness', ylabel='AoA'>
Reshaping data#
Reshaping data involves transforming it from one format (e.g., “wide”) to another format (e.g., “long”), to make it more amenable to visualization and analysis.
Often, we need to make our data tidy.
What is tidy data?#
Tidy data is a particular way of formatting data, in which:
Each variable forms a column (e.g., GDP).
Each observation forms a row (e.g., a country).
Each type of observational unit forms a table (tabular data!).
Originally developed by Hadley Wickham, creator of the tidyverse in R.
Tidy vs. “untidy” data#
Now let’s see some examples of tidy vs. untidy data.
Keep in mind:
These datasets all contain the same information, just in different formats.
“Untidy” data can be useful for other things, e.g., presenting in a paper.
The key goal of tidy data is that each row represents an observation.
Tidy data#
Check-in: Why is this data considered tidy?
df_tidy = pd.read_csv("data/wrangling/tidy.csv")
df_tidy
ppt | condition | rt | |
---|---|---|---|
0 | john | Congruent | 200 |
1 | john | Incongruent | 250 |
2 | mary | Congruent | 178 |
3 | mary | Incongruent | 195 |
Untidy data 1#
Check-in: Why is this data not considered tidy?
df_messy1 = pd.read_csv("data/wrangling/messy1.csv")
df_messy1
john | mary | condition | |
---|---|---|---|
0 | 200 | 178 | Congruent |
1 | 250 | 195 | Incongruent |
Untidy data 2#
Check-in: Why is this data not considered tidy?
df_messy2 = pd.read_csv("data/wrangling/messy2.csv")
df_messy2
congruent | incongruent | ppt | |
---|---|---|---|
0 | 200 | 250 | john |
1 | 178 | 195 | mary |
Making data tidy#
Fortunately, pandas makes it possible to turn an “untidy” DataFrame into a tidy one.
The key function here is pandas.melt.
pd.melt(df, ### DataFrame
        id_vars = [...], ### what are the identifying columns?
        var_name = ..., ### name for variable grouping over columns
        value_name = ...) ### name for the value this variable takes on
If this seems abstract, don’t worry––it’ll become clearer with examples!
Using pd.melt#
Let’s start with our first messy DataFrame.
It has columns for each ppt, which contain info about rt.
df_messy1
john | mary | condition | |
---|---|---|---|
0 | 200 | 178 | Congruent |
1 | 250 | 195 | Incongruent |
pd.melt(df_messy1, id_vars = 'condition', ### condition is our ID variable
var_name = 'ppt', ### new row for each ppt observation
value_name = 'rt') ### label for the info we have about each ppt
condition | ppt | rt | |
---|---|---|---|
0 | Congruent | john | 200 |
1 | Incongruent | john | 250 |
2 | Congruent | mary | 178 |
3 | Incongruent | mary | 195 |
Check-in#
Try to use pd.melt to turn df_messy2 into a tidy DataFrame.
Hint: Think about the existing structure of the DataFrame (how is the data grouped?) and what the id_vars would be.
df_messy2
congruent | incongruent | ppt | |
---|---|---|---|
0 | 200 | 250 | john |
1 | 178 | 195 | mary |
### Your code here
Solution#
pd.melt(df_messy2, id_vars = 'ppt', ### here, ppt is our ID variable
var_name = 'condition', ### new row for each ppt observation
value_name = 'rt') ### label for the info we have about each ppt
ppt | condition | rt | |
---|---|---|---|
0 | john | congruent | 200 |
1 | mary | congruent | 178 |
2 | john | incongruent | 250 |
3 | mary | incongruent | 195 |
Hands-on: a real dataset#
Now, we’ll turn to a real dataset, which Timothy Lee, creator of Full Stack Economics, compiled and shared with me.
df_work = pd.read_csv("data/viz/missing_work.csv")
df_work.head(5)
Year | Child care problems | Maternity or paternity leave | Other family or personal obligations | Illness or injury | Vacation | Month | |
---|---|---|---|---|---|---|---|
0 | 2012 | 18 | 313 | 246 | 899 | 1701 | 10 |
1 | 2012 | 35 | 278 | 230 | 880 | 1299 | 11 |
2 | 2012 | 13 | 245 | 246 | 944 | 1005 | 12 |
3 | 2013 | 14 | 257 | 250 | 1202 | 1552 | 1 |
4 | 2013 | 27 | 258 | 276 | 1079 | 1305 | 2 |
Check-in#
Is this dataset tidy? How could we make it tidy, if not––i.e., if we wanted each row to be a single observation corresponding to one of the reasons for missing work?
### Your code here
Solution#
df_melted = pd.melt(df_work, id_vars = ['Year', 'Month'],
var_name = "Reason",
value_name = "Days Missed")
df_melted.head(2)
Year | Month | Reason | Days Missed | |
---|---|---|---|---|
0 | 2012 | 10 | Child care problems | 18 |
1 | 2012 | 11 | Child care problems | 35 |
Why tidy data is useful#
Finally, let’s use this dataset to recreate a graph from Full Stack Economics.
Original graph#
Check-in#
As a first-pass approach, what tools from seaborn could you use to recreate this plot?
### Your code here
Solution#
This is okay, but not really what we want. This groups the data by Year only. But we want to group by both Year and Month.
sns.lineplot(data = df_melted, x = 'Year', y = 'Days Missed', hue = "Reason")
<Axes: xlabel='Year', ylabel='Days Missed'>
Using datetime#
Let’s make a new column called date, which combines the Month and Year.
Then we can use pd.to_datetime to turn that into pandas’s dedicated datetime representation.
## First, let's concatenate each month and year into a single string
df_melted['date'] = df_melted.apply(lambda row: str(row['Month']) + '-' + str(row['Year']), axis = 1)
df_melted.head(2)
Year | Month | Reason | Days Missed | date | |
---|---|---|---|---|---|
0 | 2012 | 10 | Child care problems | 18 | 10-2012 |
1 | 2012 | 11 | Child care problems | 35 | 11-2012 |
## Now, let's create a new "datetime" column using the `pd.to_datetime` function
df_melted['datetime'] = pd.to_datetime(df_melted['date'])
df_melted.head(2)
/var/folders/pn/5zbmv0cj31v6hmyh53njhmdw0000gn/T/ipykernel_1691/714386078.py:2: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
df_melted['datetime'] = pd.to_datetime(df_melted['date'])
Year | Month | Reason | Days Missed | date | datetime | |
---|---|---|---|---|---|---|
0 | 2012 | 10 | Child care problems | 18 | 10-2012 | 2012-10-01 |
1 | 2012 | 11 | Child care problems | 35 | 11-2012 | 2012-11-01 |
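Note the UserWarning above: pd.to_datetime had to infer the date format. A sketch of how we could silence it by passing the format explicitly:

## Specifying the format ('%m-%Y' here) avoids the inference warning
df_melted['datetime'] = pd.to_datetime(df_melted['date'], format = '%m-%Y')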
Plotting again#
Much better!
sns.lineplot(data = df_melted, x = "datetime", y = "Days Missed", hue = "Reason")
<Axes: xlabel='datetime', ylabel='Days Missed'>
Conclusion#
This was an introduction to data wrangling.
As noted, data wrangling is a hugely important topic––and could constitute an entire class.
But today, we’ve focused on:
Identifying and addressing missing data.
Merging datasets.
Making data tidy.