05-Review (Working with Data)

05-Review (Working with Data)#

Goals of this lecture#

This lecture will review tools useful for working with data in Python.

Focus: tabular data in .csv files.
Intro to pandas.
Basic manipulation and analysis using pandas.

What is a file?#

A file is a set of bytes used to store some kind of data.

The format of this data depends on what you’re using it for, but at some level, it is translated into binary bits (1s and 0s).

The file format is usually specified in the file extension.

.csv: comma separated values.
.txt: a plain text file.
.py: an executable Python file.
.png: a portable network graphic file (i.e., an image).

What is tabular data?#

Tabular data is data organized in a table with rows and columns.

This kind of data is two-dimensional.
Typically, each row represents an “observation”.
Typicallly, each column represents an attribute.

Often stored in .csv files.

.csv = “comma-separated values”

Example: Countries#

Check-in: What does each row represent? What about each column?

Country	Population (million)	GDP (Trillions)
USA	329.5	20.94
UK	76.22	2.7
China	1402	14.72

Introducing `pandas`#

pandas is a package that enables fluid and efficient storage, manipulation, and analysis of data.

## Import statement: pandas is a "package"
import pandas as pd

`pandas.read_csv`#

Tabular data is often stored in .csv files.

pandas.read_csv can be used to read in a .csv file.
This is represented as a pandas.DataFrame.

pd.read_csv("path/to/file.csv") ### replace with actual filepath!

### .csv file with data about different Pokemon
df_pokemon = pd.read_csv("data/pokemon.csv")

`read_csv` with a URL#

You can also pass a URL that points to a .csv file into read_csv.

This is a dataset from Brand et al. (2019), which quantified changes in the positivity and negativity in song lyrics over time.
We’ll be working more with that dataset soon!

df_lyrics = pd.read_csv("https://raw.githubusercontent.com/lottybrand/song_lyrics/master/data/billboard_analysis.csv")

df_lyrics.head(2)

	artist	artist_processed	rank	song	year	ID_index	negative	positive	n
0	sam the sham and the pharaohs	sam the sham and the pharaohs	1	wooly bully	1965	1	0.0	0.0	73.0
1	four tops	four tops	2	i cant help myself sugar pie honey bunch	1965	2	3.0	8.0	192.0

Using a `DataFrame`#

Now that we have a DataFrame object, we want to be able to use that DataFrame.
This includes:
- Get basic information about DataFrame (e.g., its shape).
- Accessing specific columns.
- Accessing specific rows.

Using `shape`#

df.shape tells us how many rows and columns are in the DataFrame.

## (#rows, #columns)
df_pokemon.shape

(800, 13)

Using `head` and `tail`#

The head(x) function displays the top x rows of the DataFrame.
Similarly, tail(x) displays the last x rows.

df_pokemon.head(2)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False

df_pokemon.tail(2)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
798	720	HoopaHoopa Unbound	Psychic	Dark	680	80	160	60	170	130	80	6	True
799	721	Volcanion	Fire	Water	600	80	110	120	130	90	70	6	True

Accessing columns#

A column can be accessed using dataframe_name['column_name'].

### What does this bracket syntax (["column_name"]) remind you of?
df_pokemon['Speed'].head(5)

  45
  60
  80
  80
  65
Name: Speed, dtype: int64

Useful operations with `pandas`#

DataFrames enable all sorts of useful operations, including:

sorting a DataFrame by a particular column.
Calculating descriptive statistics (e.g., mean, median, etc.).
Filtering a DataFrame.
Aggregating across levels of a variable using groupby.

Note that we’ll also cover these topics more in-depth in Week 4!

`sort_values`#

### By default, will sort from lowest to highest
df_pokemon.sort_values("HP").head(2)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
316	292	Shedinja	Bug	Ghost	236	1	90	45	30	30	40	3	False
55	50	Diglett	Ground	NaN	265	10	55	25	35	45	95	1	False

### Show the highest HP
df_pokemon.sort_values("HP", ascending = False).head(2)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
261	242	Blissey	Normal	NaN	540	255	10	10	75	135	55	2	False
121	113	Chansey	Normal	NaN	450	250	5	5	35	105	50	1	False

Check-in#

What is the Speed of the Pokemon with the highest Attack?

### Your code here

Solution#

### Pokemon with highest Attack
df_pokemon.sort_values("Attack", ascending = False).head(1)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
163	150	MewtwoMega Mewtwo X	Psychic	Fighting	780	106	190	100	154	100	130	1	True

Descriptive statistics#

Columns of a DataFrame can also be summarized:

mean: average value (for numeric variables)
median: “middle” value (for numeric variables)
mode: most frequent value

df_pokemon['Attack'].mean()

79.00125

df_pokemon['Attack'].median()

75.0

df_pokemon['Attack'].mode()

0    100
Name: Attack, dtype: int64

Filtering a `DataFrame`#

Often, we want to filter a DataFrame so we only see observations that meet certain conditions.
Ultimately, this is similar to using a conditional statement––just with different syntax.

Example 1: filtering on a categorical variable#

The legendary column is a categorical variable, meaning there are several discrete categories.

## How many legendary pokemon?
df_pokemon[df_pokemon['Legendary']==True].head(5)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
156	144	Articuno	Ice	Flying	580	90	85	100	95	125	85	1	True
157	145	Zapdos	Electric	Flying	580	90	90	85	125	90	100	1	True
158	146	Moltres	Fire	Flying	580	90	100	90	125	85	90	1	True
162	150	Mewtwo	Psychic	NaN	680	106	110	90	154	90	130	1	True
163	150	MewtwoMega Mewtwo X	Psychic	Fighting	780	106	190	100	154	100	130	1	True

Example 2: filtering on a continuous variable#

The HP column is a continuous variable.
Let’s show only the rows for Pokemon with a HP > 150.

df_pokemon[df_pokemon['HP'] > 150].head(3)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
121	113	Chansey	Normal	NaN	450	250	5	5	35	105	50	1	False
155	143	Snorlax	Normal	NaN	540	160	110	65	65	110	30	1	False
217	202	Wobbuffet	Psychic	NaN	405	190	33	58	33	58	33	2	False

Using `groupby`#

The groupby function allows you to split data (i.e., along different categories) then apply some function to each split of that data (e.g., mean).

The syntax is as follows:

df_name.groupby("column_to_group_by").mean() ## or median, etc.

Example: mean `Attack` by `Legendary`#

Here, the [[...]] syntax just limits the columns in the DataFrame to the columns we directly care about.

df_pokemon[['Legendary', 'Attack']].groupby("Legendary").mean()

	Attack
Legendary
False	75.669388
True	116.676923

Check-in:#

How would you calculate the median Defense by Legendary status?

### Your code here

Solution#

df_pokemon[['Legendary', 'Defense']].groupby("Legendary").median()

	Defense
Legendary
False	66.0
True	100.0

Check-in:#

How would you calculate the mean HP by Type 1?

### Your code here

Solution#

df_pokemon[['Type 1', 'HP']].groupby("Type 1").mean()

	HP
Type 1
Bug	56.884058
Dark	66.806452
Dragon	83.312500
Electric	59.795455
Fairy	74.117647
Fighting	69.851852
Fire	69.903846
Flying	70.750000
Ghost	64.437500
Grass	67.271429
Ground	73.781250
Ice	72.000000
Normal	77.275510
Poison	67.250000
Psychic	70.631579
Rock	65.363636
Steel	65.222222
Water	72.062500

Conclusion#

This concludes our unit on interacting with data.

Reading in .csv files with pandas.
Summarizing and working with tabular data.

We’ll also spend much more time on tabular data in Week 4, particularly as it regards “cleaning” data.

05-Review (Working with Data)

Contents

05-Review (Working with Data)#

Goals of this lecture#

What is a file?#

What is tabular data?#

Example: Countries#

Introducing pandas#

pandas.read_csv#

read_csv with a URL#

Using a DataFrame#

Using shape#

Using head and tail#

Accessing columns#

Useful operations with pandas#

sort_values#

Check-in#

Solution#

Descriptive statistics#

Filtering a DataFrame#

Example 1: filtering on a categorical variable#

Example 2: filtering on a continuous variable#

Using groupby#

Example: mean Attack by Legendary#

Check-in:#

Solution#

Check-in:#

Solution#

Conclusion#

Introducing `pandas`#

`pandas.read_csv`#

`read_csv` with a URL#

Using a `DataFrame`#

Using `shape`#

Using `head` and `tail`#

Useful operations with `pandas`#

`sort_values`#

Filtering a `DataFrame`#

Using `groupby`#

Example: mean `Attack` by `Legendary`#