05-Review (Working with Data)#

Goals of this lecture#

This lecture will review tools useful for working with data in Python.

  • Focus: tabular data in .csv files.

  • Intro to pandas.

  • Basic manipulation and analysis using pandas.

What is a file?#

A file is a set of bytes used to store some kind of data.

The format of this data depends on what you’re using it for, but at some level, it is translated into binary bits (1s and 0s).

The file format is usually specified in the file extension.

  • .csv: comma separated values.

  • .txt: a plain text file.

  • .py: an executable Python file.

  • .png: a portable network graphic file (i.e., an image).

What is tabular data?#

Tabular data is data organized in a table with rows and columns.

  • This kind of data is two-dimensional.

  • Typically, each row represents an “observation”.

  • Typicallly, each column represents an attribute.

Often stored in .csv files.

  • .csv = “comma-separated values”

Example: Countries#

Check-in: What does each row represent? What about each column?

Country

Population (million)

GDP (Trillions)

USA

329.5

20.94

UK

76.22

2.7

China

1402

14.72

Introducing pandas#

pandas is a package that enables fluid and efficient storage, manipulation, and analysis of data.

## Import statement: pandas is a "package"
import pandas as pd

pandas.read_csv#

Tabular data is often stored in .csv files.

  • pandas.read_csv can be used to read in a .csv file.

  • This is represented as a pandas.DataFrame.

pd.read_csv("path/to/file.csv") ### replace with actual filepath!
### .csv file with data about different Pokemon
df_pokemon = pd.read_csv("data/pokemon.csv")

read_csv with a URL#

You can also pass a URL that points to a .csv file into read_csv.

  • This is a dataset from Brand et al. (2019), which quantified changes in the positivity and negativity in song lyrics over time.

  • We’ll be working more with that dataset soon!

df_lyrics = pd.read_csv("https://raw.githubusercontent.com/lottybrand/song_lyrics/master/data/billboard_analysis.csv")
df_lyrics.head(2)
artist artist_processed rank song year ID_index negative positive n
0 sam the sham and the pharaohs sam the sham and the pharaohs 1 wooly bully 1965 1 0.0 0.0 73.0
1 four tops four tops 2 i cant help myself sugar pie honey bunch 1965 2 3.0 8.0 192.0

Using a DataFrame#

  • Now that we have a DataFrame object, we want to be able to use that DataFrame.

  • This includes:

    • Get basic information about DataFrame (e.g., its shape).

    • Accessing specific columns.

    • Accessing specific rows.

Using shape#

df.shape tells us how many rows and columns are in the DataFrame.

## (#rows, #columns)
df_pokemon.shape
(800, 13)

Using head and tail#

  • The head(x) function displays the top x rows of the DataFrame.

  • Similarly, tail(x) displays the last x rows.

df_pokemon.head(2)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
df_pokemon.tail(2)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True

Accessing columns#

  • A column can be accessed using dataframe_name['column_name'].

### What does this bracket syntax (["column_name"]) remind you of?
df_pokemon['Speed'].head(5)
0    45
1    60
2    80
3    80
4    65
Name: Speed, dtype: int64

Useful operations with pandas#

DataFrames enable all sorts of useful operations, including:

  • sorting a DataFrame by a particular column.

  • Calculating descriptive statistics (e.g., mean, median, etc.).

  • Filtering a DataFrame.

  • Aggregating across levels of a variable using groupby.

Note that we’ll also cover these topics more in-depth in Week 4!

sort_values#

### By default, will sort from lowest to highest
df_pokemon.sort_values("HP").head(2)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
316 292 Shedinja Bug Ghost 236 1 90 45 30 30 40 3 False
55 50 Diglett Ground NaN 265 10 55 25 35 45 95 1 False
### Show the highest HP
df_pokemon.sort_values("HP", ascending = False).head(2)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
261 242 Blissey Normal NaN 540 255 10 10 75 135 55 2 False
121 113 Chansey Normal NaN 450 250 5 5 35 105 50 1 False

Check-in#

What is the Speed of the Pokemon with the highest Attack?

### Your code here

Solution#

### Pokemon with highest Attack
df_pokemon.sort_values("Attack", ascending = False).head(1)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
163 150 MewtwoMega Mewtwo X Psychic Fighting 780 106 190 100 154 100 130 1 True

Descriptive statistics#

Columns of a DataFrame can also be summarized:

  • mean: average value (for numeric variables)

  • median: “middle” value (for numeric variables)

  • mode: most frequent value

df_pokemon['Attack'].mean()
79.00125
df_pokemon['Attack'].median()
75.0
df_pokemon['Attack'].mode()
0    100
Name: Attack, dtype: int64

Filtering a DataFrame#

  • Often, we want to filter a DataFrame so we only see observations that meet certain conditions.

  • Ultimately, this is similar to using a conditional statement––just with different syntax.

Example 1: filtering on a categorical variable#

  • The legendary column is a categorical variable, meaning there are several discrete categories.

## How many legendary pokemon?
df_pokemon[df_pokemon['Legendary']==True].head(5)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
156 144 Articuno Ice Flying 580 90 85 100 95 125 85 1 True
157 145 Zapdos Electric Flying 580 90 90 85 125 90 100 1 True
158 146 Moltres Fire Flying 580 90 100 90 125 85 90 1 True
162 150 Mewtwo Psychic NaN 680 106 110 90 154 90 130 1 True
163 150 MewtwoMega Mewtwo X Psychic Fighting 780 106 190 100 154 100 130 1 True

Example 2: filtering on a continuous variable#

  • The HP column is a continuous variable.

  • Let’s show only the rows for Pokemon with a HP > 150.

df_pokemon[df_pokemon['HP'] > 150].head(3)
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
121 113 Chansey Normal NaN 450 250 5 5 35 105 50 1 False
155 143 Snorlax Normal NaN 540 160 110 65 65 110 30 1 False
217 202 Wobbuffet Psychic NaN 405 190 33 58 33 58 33 2 False

Using groupby#

The groupby function allows you to split data (i.e., along different categories) then apply some function to each split of that data (e.g., mean).

The syntax is as follows:

df_name.groupby("column_to_group_by").mean() ## or median, etc.

Example: mean Attack by Legendary#

Here, the [[...]] syntax just limits the columns in the DataFrame to the columns we directly care about.

df_pokemon[['Legendary', 'Attack']].groupby("Legendary").mean()
Attack
Legendary
False 75.669388
True 116.676923

Check-in:#

How would you calculate the median Defense by Legendary status?

### Your code here

Solution#

df_pokemon[['Legendary', 'Defense']].groupby("Legendary").median()
Defense
Legendary
False 66.0
True 100.0

Check-in:#

How would you calculate the mean HP by Type 1?

### Your code here

Solution#

df_pokemon[['Type 1', 'HP']].groupby("Type 1").mean()
HP
Type 1
Bug 56.884058
Dark 66.806452
Dragon 83.312500
Electric 59.795455
Fairy 74.117647
Fighting 69.851852
Fire 69.903846
Flying 70.750000
Ghost 64.437500
Grass 67.271429
Ground 73.781250
Ice 72.000000
Normal 77.275510
Poison 67.250000
Psychic 70.631579
Rock 65.363636
Steel 65.222222
Water 72.062500

Conclusion#

This concludes our unit on interacting with data.

  • Reading in .csv files with pandas.

  • Summarizing and working with tabular data.

We’ll also spend much more time on tabular data in Week 4, particularly as it regards “cleaning” data.