05-Review (Working with Data)#
Goals of this lecture#
This lecture will review tools useful for working with data in Python.
Focus: tabular data in
.csv
files.Intro to
pandas
.Basic manipulation and analysis using
pandas
.
What is a file?#
A file is a set of bytes used to store some kind of data.
The format of this data depends on what you’re using it for, but at some level, it is translated into binary bits (1
s and 0
s).
The file format is usually specified in the file extension.
.csv
: comma separated values..txt
: a plain text file..py
: an executable Python file..png
: a portable network graphic file (i.e., an image).
What is tabular data?#
Tabular data is data organized in a table with rows and columns.
This kind of data is two-dimensional.
Typically, each row represents an “observation”.
Typicallly, each column represents an attribute.
Often stored in .csv
files.
.csv
= “comma-separated values”
Example: Countries#
Check-in: What does each row represent? What about each column?
Country |
Population (million) |
GDP (Trillions) |
---|---|---|
USA |
329.5 |
20.94 |
UK |
76.22 |
2.7 |
China |
1402 |
14.72 |
Introducing pandas
#
pandas
is a package that enables fluid and efficient storage, manipulation, and analysis of data.
## Import statement: pandas is a "package"
import pandas as pd
pandas.read_csv
#
Tabular data is often stored in .csv
files.
pandas.read_csv
can be used to read in a.csv
file.This is represented as a
pandas.DataFrame
.
pd.read_csv("path/to/file.csv") ### replace with actual filepath!
### .csv file with data about different Pokemon
df_pokemon = pd.read_csv("data/pokemon.csv")
read_csv
with a URL#
You can also pass a URL that points to a .csv
file into read_csv
.
This is a dataset from Brand et al. (2019), which quantified changes in the positivity and negativity in song lyrics over time.
We’ll be working more with that dataset soon!
df_lyrics = pd.read_csv("https://raw.githubusercontent.com/lottybrand/song_lyrics/master/data/billboard_analysis.csv")
df_lyrics.head(2)
artist | artist_processed | rank | song | year | ID_index | negative | positive | n | |
---|---|---|---|---|---|---|---|---|---|
0 | sam the sham and the pharaohs | sam the sham and the pharaohs | 1 | wooly bully | 1965 | 1 | 0.0 | 0.0 | 73.0 |
1 | four tops | four tops | 2 | i cant help myself sugar pie honey bunch | 1965 | 2 | 3.0 | 8.0 | 192.0 |
Using a DataFrame
#
Now that we have a
DataFrame
object, we want to be able to use thatDataFrame
.This includes:
Get basic information about
DataFrame
(e.g., its shape).Accessing specific columns.
Accessing specific rows.
Using shape
#
df.shape
tells us how many rows and columns are in the DataFrame
.
## (#rows, #columns)
df_pokemon.shape
(800, 13)
Using head
and tail
#
The
head(x)
function displays the topx
rows of theDataFrame
.Similarly,
tail(x)
displays the lastx
rows.
df_pokemon.head(2)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
df_pokemon.tail(2)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
Accessing columns#
A column can be accessed using
dataframe_name['column_name']
.
### What does this bracket syntax (["column_name"]) remind you of?
df_pokemon['Speed'].head(5)
0 45
1 60
2 80
3 80
4 65
Name: Speed, dtype: int64
Useful operations with pandas
#
DataFrame
s enable all sorts of useful operations, including:
sort
ing aDataFrame
by a particular column.Calculating descriptive statistics (e.g.,
mean
,median
, etc.).Filtering a
DataFrame
.Aggregating across levels of a variable using
groupby
.
Note that we’ll also cover these topics more in-depth in Week 4!
sort_values
#
### By default, will sort from lowest to highest
df_pokemon.sort_values("HP").head(2)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
316 | 292 | Shedinja | Bug | Ghost | 236 | 1 | 90 | 45 | 30 | 30 | 40 | 3 | False |
55 | 50 | Diglett | Ground | NaN | 265 | 10 | 55 | 25 | 35 | 45 | 95 | 1 | False |
### Show the highest HP
df_pokemon.sort_values("HP", ascending = False).head(2)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
261 | 242 | Blissey | Normal | NaN | 540 | 255 | 10 | 10 | 75 | 135 | 55 | 2 | False |
121 | 113 | Chansey | Normal | NaN | 450 | 250 | 5 | 5 | 35 | 105 | 50 | 1 | False |
Check-in#
What is the Speed
of the Pokemon with the highest Attack
?
### Your code here
Solution#
### Pokemon with highest Attack
df_pokemon.sort_values("Attack", ascending = False).head(1)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
163 | 150 | MewtwoMega Mewtwo X | Psychic | Fighting | 780 | 106 | 190 | 100 | 154 | 100 | 130 | 1 | True |
Descriptive statistics#
Columns of a DataFrame
can also be summarized:
mean
: average value (for numeric variables)median
: “middle” value (for numeric variables)mode
: most frequent value
df_pokemon['Attack'].mean()
79.00125
df_pokemon['Attack'].median()
75.0
df_pokemon['Attack'].mode()
0 100
Name: Attack, dtype: int64
Filtering a DataFrame
#
Often, we want to filter a
DataFrame
so we only see observations that meet certain conditions.Ultimately, this is similar to using a conditional statement––just with different syntax.
Example 1: filtering on a categorical variable#
The
legendary
column is a categorical variable, meaning there are several discrete categories.
## How many legendary pokemon?
df_pokemon[df_pokemon['Legendary']==True].head(5)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
156 | 144 | Articuno | Ice | Flying | 580 | 90 | 85 | 100 | 95 | 125 | 85 | 1 | True |
157 | 145 | Zapdos | Electric | Flying | 580 | 90 | 90 | 85 | 125 | 90 | 100 | 1 | True |
158 | 146 | Moltres | Fire | Flying | 580 | 90 | 100 | 90 | 125 | 85 | 90 | 1 | True |
162 | 150 | Mewtwo | Psychic | NaN | 680 | 106 | 110 | 90 | 154 | 90 | 130 | 1 | True |
163 | 150 | MewtwoMega Mewtwo X | Psychic | Fighting | 780 | 106 | 190 | 100 | 154 | 100 | 130 | 1 | True |
Example 2: filtering on a continuous variable#
The
HP
column is a continuous variable.Let’s show only the rows for Pokemon with a
HP > 150
.
df_pokemon[df_pokemon['HP'] > 150].head(3)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
121 | 113 | Chansey | Normal | NaN | 450 | 250 | 5 | 5 | 35 | 105 | 50 | 1 | False |
155 | 143 | Snorlax | Normal | NaN | 540 | 160 | 110 | 65 | 65 | 110 | 30 | 1 | False |
217 | 202 | Wobbuffet | Psychic | NaN | 405 | 190 | 33 | 58 | 33 | 58 | 33 | 2 | False |
Using groupby
#
The
groupby
function allows you to split data (i.e., along different categories) then apply some function to each split of that data (e.g.,mean
).
The syntax is as follows:
df_name.groupby("column_to_group_by").mean() ## or median, etc.
Example: mean Attack
by Legendary
#
Here, the [[...]]
syntax just limits the columns in the DataFrame
to the columns we directly care about.
df_pokemon[['Legendary', 'Attack']].groupby("Legendary").mean()
Attack | |
---|---|
Legendary | |
False | 75.669388 |
True | 116.676923 |
Check-in:#
How would you calculate the median
Defense
by Legendary
status?
### Your code here
Solution#
df_pokemon[['Legendary', 'Defense']].groupby("Legendary").median()
Defense | |
---|---|
Legendary | |
False | 66.0 |
True | 100.0 |
Check-in:#
How would you calculate the mean
HP
by Type 1
?
### Your code here
Solution#
df_pokemon[['Type 1', 'HP']].groupby("Type 1").mean()
HP | |
---|---|
Type 1 | |
Bug | 56.884058 |
Dark | 66.806452 |
Dragon | 83.312500 |
Electric | 59.795455 |
Fairy | 74.117647 |
Fighting | 69.851852 |
Fire | 69.903846 |
Flying | 70.750000 |
Ghost | 64.437500 |
Grass | 67.271429 |
Ground | 73.781250 |
Ice | 72.000000 |
Normal | 77.275510 |
Poison | 67.250000 |
Psychic | 70.631579 |
Rock | 65.363636 |
Steel | 65.222222 |
Water | 72.062500 |
Conclusion#
This concludes our unit on interacting with data.
Reading in
.csv
files withpandas
.Summarizing and working with tabular data.
We’ll also spend much more time on tabular data in Week 4, particularly as it regards “cleaning” data.