Got Pandas? Practical Data Wrangling with Pandas
Pandas supports a number of data formats out of the box, including:

* CSV
* Excel
* JSON
* SQL databases
The major benefit of using Pandas to load these data formats is a simple, consistent mechanism for each of them: the data lands directly in a Pandas DataFrame in a single operation, without the extra code or overhead of going elsewhere to do the same work.
Pandas I/O can load each of these formats either from local storage or from a URL. The convenience is that the resource string you pass can be a local/network file path or a URL pointing at the data.
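For instance, a minimal sketch of that interchangeability (the remote URL below is a placeholder, not a real resource):

import pandas as pd

df_local = pd.read_csv("./datasets/Batting.csv")              # local file path
# df_remote = pd.read_csv("https://example.com/Batting.csv")  # same call, remote URL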
NOTEBOOK OBJECTIVES
In this notebook we'll:

* load CSV data from a local file into a DataFrame
* load Excel data from a remote URL and clean it up
* load JSON data from a remote URL and reshape it
* load SQL data from a SQLite database and do some light munging
You will most often load the Pandas library with the following line:
import pandas as pd
CSV files are still a staple data file format. They're portable, flexible, flat, usually easy to parse, and ubiquitous. We will begin by showing how to use Pandas to load a CSV file directly into a DataFrame.
DATA SOURCE
US Baseball Statistics Archive by Sean Lahman (CC BY-SA 3.0):
We have put the batting dataset into our local datasets folder.
Loading this into a Pandas DataFrame will require us to use the read_csv function, which will attempt to load the CSV data directly into the DataFrame.
df = pd.read_csv("./datasets/Batting.csv")
If we inspect this DataFrame, we will get exactly what we expect -- each line in the file corresponding to a row in the DataFrame. NOTE: where there are missing values, Pandas will automatically fill the data with NaN.
df
playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | abercda01 | 1871 | 1 | TRO | NaN | 1 | 4 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
1 | addybo01 | 1871 | 1 | RC1 | NaN | 25 | 118 | 30 | 32 | 6 | ... | 13.0 | 8.0 | 1.0 | 4 | 0.0 | NaN | NaN | NaN | NaN | NaN |
2 | allisar01 | 1871 | 1 | CL1 | NaN | 29 | 137 | 28 | 40 | 4 | ... | 19.0 | 3.0 | 1.0 | 2 | 5.0 | NaN | NaN | NaN | NaN | NaN |
3 | allisdo01 | 1871 | 1 | WS3 | NaN | 27 | 133 | 28 | 44 | 10 | ... | 27.0 | 1.0 | 1.0 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN |
4 | ansonca01 | 1871 | 1 | RC1 | NaN | 25 | 120 | 29 | 39 | 11 | ... | 16.0 | 6.0 | 2.0 | 2 | 1.0 | NaN | NaN | NaN | NaN | NaN |
5 | armstbo01 | 1871 | 1 | FW1 | NaN | 12 | 49 | 9 | 11 | 2 | ... | 5.0 | 0.0 | 1.0 | 0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
6 | barkeal01 | 1871 | 1 | RC1 | NaN | 1 | 4 | 0 | 1 | 0 | ... | 2.0 | 0.0 | 0.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN |
7 | barnero01 | 1871 | 1 | BS1 | NaN | 31 | 157 | 66 | 63 | 10 | ... | 34.0 | 11.0 | 6.0 | 13 | 1.0 | NaN | NaN | NaN | NaN | NaN |
8 | barrebi01 | 1871 | 1 | FW1 | NaN | 1 | 5 | 1 | 1 | 1 | ... | 1.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
9 | barrofr01 | 1871 | 1 | BS1 | NaN | 18 | 86 | 13 | 13 | 2 | ... | 11.0 | 1.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
10 | bassjo01 | 1871 | 1 | CL1 | NaN | 22 | 89 | 18 | 27 | 1 | ... | 18.0 | 0.0 | 1.0 | 3 | 4.0 | NaN | NaN | NaN | NaN | NaN |
11 | battijo01 | 1871 | 1 | CL1 | NaN | 1 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN |
12 | bealsto01 | 1871 | 1 | WS3 | NaN | 10 | 36 | 6 | 7 | 0 | ... | 1.0 | 2.0 | 0.0 | 2 | 0.0 | NaN | NaN | NaN | NaN | NaN |
13 | beaveed01 | 1871 | 1 | TRO | NaN | 3 | 15 | 7 | 6 | 0 | ... | 5.0 | 2.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
14 | bechtge01 | 1871 | 1 | PH1 | NaN | 20 | 94 | 24 | 33 | 9 | ... | 21.0 | 4.0 | 0.0 | 2 | 2.0 | NaN | NaN | NaN | NaN | NaN |
15 | bellast01 | 1871 | 1 | TRO | NaN | 29 | 128 | 26 | 32 | 3 | ... | 23.0 | 4.0 | 4.0 | 9 | 2.0 | NaN | NaN | NaN | NaN | NaN |
16 | berkena01 | 1871 | 1 | PH1 | NaN | 1 | 4 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | NaN | NaN | NaN | NaN | NaN |
17 | berryto01 | 1871 | 1 | PH1 | NaN | 1 | 4 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
18 | berthha01 | 1871 | 1 | WS3 | NaN | 17 | 73 | 17 | 17 | 1 | ... | 8.0 | 3.0 | 1.0 | 4 | 2.0 | NaN | NaN | NaN | NaN | NaN |
19 | biermch01 | 1871 | 1 | FW1 | NaN | 1 | 2 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN |
20 | birdge01 | 1871 | 1 | RC1 | NaN | 25 | 106 | 19 | 28 | 2 | ... | 13.0 | 1.0 | 0.0 | 3 | 2.0 | NaN | NaN | NaN | NaN | NaN |
21 | birdsda01 | 1871 | 1 | BS1 | NaN | 29 | 152 | 51 | 46 | 3 | ... | 24.0 | 6.0 | 0.0 | 4 | 4.0 | NaN | NaN | NaN | NaN | NaN |
22 | brainas01 | 1871 | 1 | WS3 | NaN | 30 | 134 | 24 | 30 | 4 | ... | 21.0 | 4.0 | 0.0 | 7 | 2.0 | NaN | NaN | NaN | NaN | NaN |
23 | brannmi01 | 1871 | 1 | CH1 | NaN | 3 | 14 | 2 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
24 | burrohe01 | 1871 | 1 | WS3 | NaN | 12 | 63 | 11 | 15 | 2 | ... | 14.0 | 0.0 | 0.0 | 1 | 1.0 | NaN | NaN | NaN | NaN | NaN |
25 | careyto01 | 1871 | 1 | FW1 | NaN | 19 | 87 | 16 | 20 | 2 | ... | 10.0 | 5.0 | 0.0 | 2 | 1.0 | NaN | NaN | NaN | NaN | NaN |
26 | carleji01 | 1871 | 1 | CL1 | NaN | 29 | 127 | 31 | 32 | 8 | ... | 18.0 | 2.0 | 1.0 | 8 | 3.0 | NaN | NaN | NaN | NaN | NaN |
27 | conefr01 | 1871 | 1 | BS1 | NaN | 19 | 77 | 17 | 20 | 3 | ... | 16.0 | 12.0 | 1.0 | 8 | 2.0 | NaN | NaN | NaN | NaN | NaN |
28 | connone01 | 1871 | 1 | TRO | NaN | 7 | 33 | 6 | 7 | 0 | ... | 2.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
29 | cravebi01 | 1871 | 1 | TRO | NaN | 27 | 118 | 26 | 38 | 8 | ... | 26.0 | 6.0 | 3.0 | 3 | 0.0 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
102786 | wittgni01 | 2016 | 1 | MIA | NL | 48 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102787 | wolteto01 | 2016 | 1 | COL | NL | 71 | 205 | 27 | 53 | 15 | ... | 30.0 | 4.0 | 1.0 | 21 | 53.0 | 2.0 | 0.0 | 4.0 | 0.0 | 1.0 |
102788 | wongko01 | 2016 | 1 | SLN | NL | 121 | 313 | 39 | 75 | 7 | ... | 23.0 | 7.0 | 0.0 | 34 | 52.0 | 2.0 | 9.0 | 0.0 | 5.0 | 3.0 |
102789 | woodal02 | 2016 | 1 | LAN | NL | 15 | 16 | 2 | 4 | 0 | ... | 2.0 | 0.0 | 0.0 | 1 | 9.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
102790 | woodbl01 | 2016 | 1 | CIN | NL | 70 | 2 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102791 | woodtr01 | 2016 | 1 | CHN | NL | 81 | 11 | 0 | 2 | 0 | ... | 1.0 | 0.0 | 0.0 | 1 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102792 | worleva01 | 2016 | 1 | BAL | AL | 35 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102793 | worthda01 | 2016 | 1 | HOU | AL | 16 | 39 | 4 | 7 | 2 | ... | 1.0 | 0.0 | 0.0 | 1 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
102794 | wrighda03 | 2016 | 1 | NYN | NL | 37 | 137 | 18 | 31 | 8 | ... | 14.0 | 3.0 | 2.0 | 26 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102795 | wrighda04 | 2016 | 1 | CIN | NL | 4 | 5 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
102796 | wrighda04 | 2016 | 2 | LAA | AL | 5 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102797 | wrighmi01 | 2016 | 1 | BAL | AL | 18 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102798 | wrighst01 | 2016 | 1 | BOS | AL | 25 | 4 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102799 | yateski01 | 2016 | 1 | NYA | AL | 41 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102800 | yelicch01 | 2016 | 1 | MIA | NL | 155 | 578 | 78 | 172 | 38 | ... | 98.0 | 9.0 | 4.0 | 72 | 138.0 | 4.0 | 4.0 | 0.0 | 5.0 | 20.0 |
102801 | ynoaga01 | 2016 | 1 | NYN | NL | 10 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102802 | ynoami01 | 2016 | 1 | CHA | AL | 23 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102803 | ynoara01 | 2016 | 1 | COL | NL | 3 | 5 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102804 | youngch03 | 2016 | 1 | KCA | AL | 34 | 1 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102805 | youngch04 | 2016 | 1 | BOS | AL | 76 | 203 | 29 | 56 | 18 | ... | 24.0 | 4.0 | 2.0 | 21 | 50.0 | 0.0 | 3.0 | 0.0 | 0.0 | 4.0 |
102806 | younger03 | 2016 | 1 | NYA | AL | 6 | 1 | 2 | 0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102807 | youngma03 | 2016 | 1 | ATL | NL | 8 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102808 | zastrro01 | 2016 | 1 | CHN | NL | 8 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102809 | zieglbr01 | 2016 | 1 | ARI | NL | 36 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102810 | zieglbr01 | 2016 | 2 | BOS | AL | 33 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102811 | zimmejo02 | 2016 | 1 | DET | AL | 19 | 4 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
102812 | zimmery01 | 2016 | 1 | WAS | NL | 115 | 427 | 60 | 93 | 18 | ... | 46.0 | 4.0 | 1.0 | 29 | 104.0 | 1.0 | 5.0 | 0.0 | 6.0 | 12.0 |
102813 | zobribe01 | 2016 | 1 | CHN | NL | 147 | 523 | 94 | 142 | 31 | ... | 76.0 | 6.0 | 4.0 | 96 | 82.0 | 6.0 | 4.0 | 4.0 | 4.0 | 17.0 |
102814 | zuninmi01 | 2016 | 1 | SEA | AL | 55 | 164 | 16 | 34 | 7 | ... | 31.0 | 0.0 | 0.0 | 21 | 65.0 | 0.0 | 6.0 | 0.0 | 1.0 | 0.0 |
102815 | zychto01 | 2016 | 1 | SEA | AL | 12 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
102816 rows × 22 columns
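Given the NaN fills noted above, a quick sanity check (just a sketch) is to count the missing values per column:

# count missing (NaN) values in each column
df.isna().sum()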
We will soon learn that Pandas supports some typical "Pythonic" use cases for accessing data. The first we will encounter is len(): we can get the size of this dataset (in rows) with the standard Python len() function, which returns exactly what we expect.
len(df)
102816
Every DataFrame will have a columns attribute, which contains the column index for our dataset. Thus, getting the length of that attribute returns, again, what we expect.
df.columns
Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH', 'SF', 'GIDP'], dtype='object')
len(df.columns)
22
If we want both row and column counts, DataFrame.shape returns them as a tuple:
df.shape
(102816, 22)
Which returns what we expect (yet again).
Much like Python list slicing, if we want the first n rows of data, we can use the shorthand:
df[:10]
playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | abercda01 | 1871 | 1 | TRO | NaN | 1 | 4 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
1 | addybo01 | 1871 | 1 | RC1 | NaN | 25 | 118 | 30 | 32 | 6 | ... | 13.0 | 8.0 | 1.0 | 4 | 0.0 | NaN | NaN | NaN | NaN | NaN |
2 | allisar01 | 1871 | 1 | CL1 | NaN | 29 | 137 | 28 | 40 | 4 | ... | 19.0 | 3.0 | 1.0 | 2 | 5.0 | NaN | NaN | NaN | NaN | NaN |
3 | allisdo01 | 1871 | 1 | WS3 | NaN | 27 | 133 | 28 | 44 | 10 | ... | 27.0 | 1.0 | 1.0 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN |
4 | ansonca01 | 1871 | 1 | RC1 | NaN | 25 | 120 | 29 | 39 | 11 | ... | 16.0 | 6.0 | 2.0 | 2 | 1.0 | NaN | NaN | NaN | NaN | NaN |
5 | armstbo01 | 1871 | 1 | FW1 | NaN | 12 | 49 | 9 | 11 | 2 | ... | 5.0 | 0.0 | 1.0 | 0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
6 | barkeal01 | 1871 | 1 | RC1 | NaN | 1 | 4 | 0 | 1 | 0 | ... | 2.0 | 0.0 | 0.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN |
7 | barnero01 | 1871 | 1 | BS1 | NaN | 31 | 157 | 66 | 63 | 10 | ... | 34.0 | 11.0 | 6.0 | 13 | 1.0 | NaN | NaN | NaN | NaN | NaN |
8 | barrebi01 | 1871 | 1 | FW1 | NaN | 1 | 5 | 1 | 1 | 1 | ... | 1.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
9 | barrofr01 | 1871 | 1 | BS1 | NaN | 18 | 86 | 13 | 13 | 2 | ... | 11.0 | 1.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
10 rows × 22 columns
Or just like slicing a list, we can do more complex slicing:
df[:50:5]
playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | abercda01 | 1871 | 1 | TRO | NaN | 1 | 4 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
5 | armstbo01 | 1871 | 1 | FW1 | NaN | 12 | 49 | 9 | 11 | 2 | ... | 5.0 | 0.0 | 1.0 | 0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
10 | bassjo01 | 1871 | 1 | CL1 | NaN | 22 | 89 | 18 | 27 | 1 | ... | 18.0 | 0.0 | 1.0 | 3 | 4.0 | NaN | NaN | NaN | NaN | NaN |
15 | bellast01 | 1871 | 1 | TRO | NaN | 29 | 128 | 26 | 32 | 3 | ... | 23.0 | 4.0 | 4.0 | 9 | 2.0 | NaN | NaN | NaN | NaN | NaN |
20 | birdge01 | 1871 | 1 | RC1 | NaN | 25 | 106 | 19 | 28 | 2 | ... | 13.0 | 1.0 | 0.0 | 3 | 2.0 | NaN | NaN | NaN | NaN | NaN |
25 | careyto01 | 1871 | 1 | FW1 | NaN | 19 | 87 | 16 | 20 | 2 | ... | 10.0 | 5.0 | 0.0 | 2 | 1.0 | NaN | NaN | NaN | NaN | NaN |
30 | cuthbne01 | 1871 | 1 | PH1 | NaN | 28 | 150 | 47 | 37 | 7 | ... | 30.0 | 16.0 | 2.0 | 10 | 2.0 | NaN | NaN | NaN | NaN | NaN |
35 | ewellge01 | 1871 | 1 | CL1 | NaN | 1 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
40 | flowedi01 | 1871 | 1 | TRO | NaN | 21 | 105 | 39 | 33 | 5 | ... | 18.0 | 8.0 | 2.0 | 4 | 0.0 | NaN | NaN | NaN | NaN | NaN |
45 | fulmech01 | 1871 | 1 | RC1 | NaN | 16 | 63 | 11 | 17 | 1 | ... | 3.0 | 0.0 | 0.0 | 5 | 1.0 | NaN | NaN | NaN | NaN | NaN |
10 rows × 22 columns
One of the nice things about Pandas is that we can reference the columns of data by their names (or labels). For example, we have a yearID label, a teamID label, a G label for game counts, and so on. To learn what each label in our dataset means in detail, see the documentation at the provided links.
df.yearID[:10]
0    1871
1    1871
2    1871
3    1871
4    1871
5    1871
6    1871
7    1871
8    1871
9    1871
Name: yearID, dtype: int64
df.G[-10:]
102806      6
102807      8
102808      8
102809     36
102810     33
102811     19
102812    115
102813    147
102814     55
102815     12
Name: G, dtype: int64
Let's say we want all the player data for the Washington Nationals from 2015 and 2016 where a player played in 100 or more games:
df_was = df[(df.yearID > 2014) & (df.teamID == 'WAS') & (df.G > 99)]
df_was
playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100193 | desmoia01 | 2015 | 1 | WAS | NL | 156 | 583 | 69 | 136 | 27 | ... | 62.0 | 13.0 | 5.0 | 45 | 187.0 | 0.0 | 3.0 | 6.0 | 4.0 | 9.0 |
100250 | escobyu01 | 2015 | 1 | WAS | NL | 139 | 535 | 75 | 168 | 25 | ... | 56.0 | 2.0 | 2.0 | 45 | 70.0 | 0.0 | 8.0 | 1.0 | 2.0 | 24.0 |
100251 | espinda01 | 2015 | 1 | WAS | NL | 118 | 367 | 59 | 88 | 21 | ... | 37.0 | 5.0 | 2.0 | 33 | 106.0 | 5.0 | 6.0 | 3.0 | 3.0 | 6.0 |
100422 | harpebr03 | 2015 | 1 | WAS | NL | 153 | 521 | 118 | 172 | 38 | ... | 99.0 | 6.0 | 4.0 | 124 | 131.0 | 15.0 | 5.0 | 0.0 | 4.0 | 15.0 |
100950 | ramoswi01 | 2015 | 1 | WAS | NL | 128 | 475 | 41 | 109 | 16 | ... | 68.0 | 0.0 | 0.0 | 21 | 101.0 | 2.0 | 0.0 | 0.0 | 8.0 | 16.0 |
100993 | robincl01 | 2015 | 1 | WAS | NL | 126 | 309 | 44 | 84 | 15 | ... | 34.0 | 0.0 | 0.0 | 37 | 52.0 | 4.0 | 5.0 | 0.0 | 1.0 | 6.0 |
101176 | taylomi02 | 2015 | 1 | WAS | NL | 138 | 472 | 49 | 108 | 15 | ... | 63.0 | 16.0 | 3.0 | 35 | 158.0 | 9.0 | 1.0 | 1.0 | 2.0 | 5.0 |
101725 | espinda01 | 2016 | 1 | WAS | NL | 157 | 516 | 66 | 108 | 15 | ... | 72.0 | 9.0 | 2.0 | 54 | 174.0 | 12.0 | 20.0 | 7.0 | 4.0 | 4.0 |
101895 | harpebr03 | 2016 | 1 | WAS | NL | 147 | 506 | 84 | 123 | 24 | ... | 86.0 | 21.0 | 10.0 | 108 | 117.0 | 20.0 | 3.0 | 0.0 | 10.0 | 11.0 |
102245 | murphda08 | 2016 | 1 | WAS | NL | 142 | 531 | 88 | 184 | 47 | ... | 104.0 | 5.0 | 3.0 | 35 | 57.0 | 10.0 | 8.0 | 0.0 | 8.0 | 4.0 |
102429 | ramoswi01 | 2016 | 1 | WAS | NL | 131 | 482 | 58 | 148 | 25 | ... | 80.0 | 0.0 | 0.0 | 35 | 79.0 | 2.0 | 2.0 | 0.0 | 4.0 | 17.0 |
102449 | rendoan01 | 2016 | 1 | WAS | NL | 156 | 567 | 91 | 153 | 38 | ... | 85.0 | 12.0 | 6.0 | 65 | 117.0 | 2.0 | 7.0 | 0.0 | 8.0 | 5.0 |
102451 | reverbe01 | 2016 | 1 | WAS | NL | 103 | 350 | 44 | 76 | 9 | ... | 24.0 | 14.0 | 5.0 | 18 | 34.0 | 0.0 | 3.0 | 2.0 | 2.0 | 12.0 |
102472 | robincl01 | 2016 | 1 | WAS | NL | 104 | 196 | 16 | 46 | 4 | ... | 26.0 | 0.0 | 0.0 | 20 | 38.0 | 0.0 | 2.0 | 1.0 | 5.0 | 4.0 |
102763 | werthja01 | 2016 | 1 | WAS | NL | 143 | 525 | 84 | 128 | 28 | ... | 69.0 | 5.0 | 1.0 | 71 | 139.0 | 0.0 | 4.0 | 0.0 | 6.0 | 17.0 |
102812 | zimmery01 | 2016 | 1 | WAS | NL | 115 | 427 | 60 | 93 | 18 | ... | 46.0 | 4.0 | 1.0 | 29 | 104.0 | 1.0 | 5.0 | 0.0 | 6.0 | 12.0 |
16 rows × 22 columns
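As an aside, the same filter can be written with DataFrame.query(), which some find more readable; this sketch should be equivalent to the boolean-mask version above:

# equivalent filter expressed as a query string
df_was = df.query("yearID > 2014 and teamID == 'WAS' and G > 99")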
We'll put all these things in motion later, but for now put a thumbnail on this for future reference. NOTE: we'll need to access the dataset that crosswalks playerID with the actual player name and vitals, but we'll leave that as an exercise for the interested (hint: take a look in this dataset archive).
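If you want a head start on that exercise, here is a sketch of the idea; the file name Master.csv and the name columns are assumptions about the archive's player table, so check your copy of the dataset:

# hypothetical: crosswalk playerID to names via the archive's player table
master = pd.read_csv("./datasets/Master.csv")   # assumed file name and columns
df_named = df_was.merge(master[["playerID", "nameFirst", "nameLast"]],
                        on="playerID", how="left")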
Loading Excel data is nearly as easy as loading CSV data. This time we'll use a different data source and access it in a slightly different manner: instead of a local file, we will use a remote URL for the resource, showing just how seamlessly various data resources can be interchanged.
DATA SOURCES
To read the data we will access it by URL and use the pandas.read_excel() method. Note we're using the sheet_name=None parameter so that each sheet is read and assigned its own key in a dictionary, for easy lookup by sheet name. (Older pandas versions spelled this parameter sheetname; current versions use sheet_name.)
xl = pd.read_excel(
    "https://www.bts.gov/sites/bts.dot.gov/files/docs/newsroom/206581/airline-employment-press-tables-web.xlsx",
    sheet_name=None)
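Because sheet_name=None returns a dictionary keyed by sheet name, we can peek at the available sheets just as with any dictionary (the exact names depend on the workbook):

# list the sheet names found in the workbook
list(xl.keys())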
Notice now, if we want to access the sheet called Table1, we can easily do this in a Pythonic way, much like any other dictionary. The result is the DataFrame representation of that sheet.
xl_tbl1 = xl['Table1']
xl_tbl1
Table 1: Yearly Change in Scheduled Passenger Airline Full-time Equivalent Employees* by Airline Group | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | |
---|---|---|---|---|---|---|
0 | Most recent 13 months - percent change from sa... | NaN | NaN | NaN | NaN | NaN |
1 | NaN | Network Airlines | Low-Cost Airlines | Regional Airlines | Other Airlines | All Passenger Airlines ** |
2 | May 2015 - May 2016 | 2.3 | 10.7 | 0.2 | 9.3 | 3.7 |
3 | Jun 2015 - Jun 2016 | 2.3 | 11 | 0.9 | 10.6 | 3.9 |
4 | Jul 2015 - Jul 2016 | 2.4 | 11.3 | 3.3 | 11.2 | 4.3 |
5 | Aug 2015 - Aug 2016 | 2.5 | 11 | 3.3 | 11.9 | 4.3 |
6 | Sep 2015 - Sep 2016 | 2.6 | 10.6 | 2.9 | 13 | 4.3 |
7 | Oct 2015 - Oct 2016 | 2.7 | 10.3 | 0.3 | 12.7 | 4 |
8 | Nov 2015 - Nov 2016 | 2.3 | 9.8 | 0.2 | 13.5 | 3.7 |
9 | Dec 2015 - Dec 2016 | 2.4 | 9.5 | 0.2 | 13.7 | 3.7 |
10 | Jan 2016 - Jan 2017 | 2.3 | 9.7 | 1.9 | 12.7 | 3.9 |
11 | Feb 2016 - Feb 2017 | 2.4 | 9.4 | 2.4 | 11.8 | 3.9 |
12 | Mar 2016 - Mar 2017 | 2.7 | 9.1 | 2 | 11.7 | 4 |
13 | Apr 2016 - Apr 2017 | 2.6 | 8.5 | 2.1 | 10.7 | 3.9 |
14 | May 2016 - May 2017 | 2.4 | 8.3 | 2.5 | 4.2 | 3.6 |
15 | Source: Bureau of Transportation Statistics | NaN | NaN | NaN | NaN | NaN |
16 | * Full-time Equivalent Employee (FTE) calculat... | NaN | NaN | NaN | NaN | NaN |
17 | ** Includes network, low-cost, regional and ot... | NaN | NaN | NaN | NaN | NaN |
18 | Note: Percent changes based on numbers prior t... | NaN | NaN | NaN | NaN | NaN |
19 | Note: See Table 2 for all passenger airlines, ... | NaN | NaN | NaN | NaN | NaN |
One problem we have here is that the data is not exactly as clean as we want it to be. We'll spend more time on the iloc indexer in the next section, but for now, let's get a flavor for how we might clean this up so it is more usable.
# let's select the (row) index
idx = xl_tbl1.iloc[2:15, 0:1]

# let's select the (col) index
col = xl_tbl1.iloc[1, 1:]

print(idx)
print(col)
   Table 1: Yearly Change in Scheduled Passenger Airline Full-time Equivalent Employees* by Airline Group
2                                  May 2015 - May 2016
3                                  Jun 2015 - Jun 2016
4                                  Jul 2015 - Jul 2016
5                                  Aug 2015 - Aug 2016
6                                  Sep 2015 - Sep 2016
7                                  Oct 2015 - Oct 2016
8                                  Nov 2015 - Nov 2016
9                                  Dec 2015 - Dec 2016
10                                 Jan 2016 - Jan 2017
11                                 Feb 2016 - Feb 2017
12                                 Mar 2016 - Mar 2017
13                                 Apr 2016 - Apr 2017
14                                 May 2016 - May 2017
Unnamed: 1             Network Airlines
Unnamed: 2            Low-Cost Airlines
Unnamed: 3            Regional Airlines
Unnamed: 4               Other Airlines
Unnamed: 5    All Passenger Airlines **
Name: 1, dtype: object
# we'll create the index object
idxs = pd.Index([v[0] for v in idx.values])
idxs
Index(['May 2015 - May 2016', 'Jun 2015 - Jun 2016', 'Jul 2015 - Jul 2016', 'Aug 2015 - Aug 2016', 'Sep 2015 - Sep 2016', 'Oct 2015 - Oct 2016', 'Nov 2015 - Nov 2016', 'Dec 2015 - Dec 2016', 'Jan 2016 - Jan 2017', 'Feb 2016 - Feb 2017', 'Mar 2016 - Mar 2017', 'Apr 2016 - Apr 2017', 'May 2016 - May 2017'], dtype='object')
# set the columns
cols = [v for v in col.values]
cols
['Network Airlines', 'Low-Cost Airlines', 'Regional Airlines', 'Other Airlines', 'All Passenger Airlines **']
# now for the data
data = xl_tbl1.iloc[2:15, 1:].values
data
array([[2.3, 10.7, 0.2, 9.3, 3.7],
       [2.3, 11, 0.9, 10.6, 3.9],
       [2.4, 11.3, 3.3, 11.2, 4.3],
       [2.5, 11, 3.3, 11.9, 4.3],
       [2.6, 10.6, 2.9, 13, 4.3],
       [2.7, 10.3, 0.3, 12.7, 4],
       [2.3, 9.8, 0.2, 13.5, 3.7],
       [2.4, 9.5, 0.2, 13.7, 3.7],
       [2.3, 9.7, 1.9, 12.7, 3.9],
       [2.4, 9.4, 2.4, 11.8, 3.9],
       [2.7, 9.1, 2, 11.7, 4],
       [2.6, 8.5, 2.1, 10.7, 3.9],
       [2.4, 8.3, 2.5, 4.2, 3.6]], dtype=object)
# putting it all together ...
df_tbl1 = pd.DataFrame(data=xl_tbl1.iloc[2:15, 1:].values,
                       columns=[v for v in col.values],
                       index=pd.Index([v[0] for v in idx.values]))
df_tbl1
Network Airlines | Low-Cost Airlines | Regional Airlines | Other Airlines | All Passenger Airlines ** | |
---|---|---|---|---|---|
May 2015 - May 2016 | 2.3 | 10.7 | 0.2 | 9.3 | 3.7 |
Jun 2015 - Jun 2016 | 2.3 | 11 | 0.9 | 10.6 | 3.9 |
Jul 2015 - Jul 2016 | 2.4 | 11.3 | 3.3 | 11.2 | 4.3 |
Aug 2015 - Aug 2016 | 2.5 | 11 | 3.3 | 11.9 | 4.3 |
Sep 2015 - Sep 2016 | 2.6 | 10.6 | 2.9 | 13 | 4.3 |
Oct 2015 - Oct 2016 | 2.7 | 10.3 | 0.3 | 12.7 | 4 |
Nov 2015 - Nov 2016 | 2.3 | 9.8 | 0.2 | 13.5 | 3.7 |
Dec 2015 - Dec 2016 | 2.4 | 9.5 | 0.2 | 13.7 | 3.7 |
Jan 2016 - Jan 2017 | 2.3 | 9.7 | 1.9 | 12.7 | 3.9 |
Feb 2016 - Feb 2017 | 2.4 | 9.4 | 2.4 | 11.8 | 3.9 |
Mar 2016 - Mar 2017 | 2.7 | 9.1 | 2 | 11.7 | 4 |
Apr 2016 - Apr 2017 | 2.6 | 8.5 | 2.1 | 10.7 | 3.9 |
May 2016 - May 2017 | 2.4 | 8.3 | 2.5 | 4.2 | 3.6 |
df_tbl1['Network Airlines']
May 2015 - May 2016    2.3
Jun 2015 - Jun 2016    2.3
Jul 2015 - Jul 2016    2.4
Aug 2015 - Aug 2016    2.5
Sep 2015 - Sep 2016    2.6
Oct 2015 - Oct 2016    2.7
Nov 2015 - Nov 2016    2.3
Dec 2015 - Dec 2016    2.4
Jan 2016 - Jan 2017    2.3
Feb 2016 - Feb 2017    2.4
Mar 2016 - Mar 2017    2.7
Apr 2016 - Apr 2017    2.6
May 2016 - May 2017    2.4
Name: Network Airlines, dtype: object
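Notice the dtype is object, since the values came out of a mixed block of spreadsheet cells. A small sketch, assuming all remaining values are numeric, to make the whole frame float-typed:

# cast the object-dtype cells to floats so numeric operations behave
df_tbl1 = df_tbl1.astype(float)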
JSON has become a standard format for many web data sources. It is succinct, readable and very portable -- there are libraries in nearly every modern language that can parse JSON, Python being no exception. We'll load a remote JSON data source to demonstrate remote access as well as the capabilities of using Pandas to load such a source.
JSON DATA SOURCE
If we haven't noticed the pattern yet, loading JSON data will come as no surprise: it's handled by pandas.read_json().
With JSON data you may get the best results with relatively flat JSON objects. If you need different results (or you're getting results that are not as expected), look into the orient parameter to get differently shaped DataFrames. We'll load the data as-is and reshape our DataFrame for some extra practice.
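To get a feel for orient before loading real data, here is a tiny self-contained sketch with made-up, records-style JSON:

from io import StringIO

# 'records' orientation: a JSON list of row objects
toy = StringIO('[{"author": "A", "text": "x"}, {"author": "B", "text": "y"}]')
pd.read_json(toy, orient="records")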
df = pd.read_json(
    "https://raw.githubusercontent.com/fortrabbit/quotes/master/quotes.json")
df
author | text | |
---|---|---|
0 | Martin Golding | Always code as if the guy who ends up maintain... |
1 | Unknown | All computers wait at the same speed. |
2 | Unknown | A misplaced decimal point will always end up w... |
3 | Unknown | A good programmer looks both ways before cross... |
4 | Unknown | A computer program does what you tell it to do... |
5 | Unknown | "Intel Inside" is a Government Warning require... |
6 | Arthur Godfrey | Common sense gets a lot of credit that belongs... |
7 | Unknown | Chuck Norris doesn’t go hunting. Chuck Norris ... |
8 | Unknown | Chuck Norris counted to infinity... twice. |
9 | Unknown | C is quirky, flawed, and an enormous success. |
10 | Unknown | Beta is Latin for still doesn’t work. |
11 | Unknown | ASCII stupid question, get a stupid ANSI! |
12 | Unknown | Artificial Intelligence usually beats natural ... |
13 | Ted Nelson | Any fool can use a computer. Many do. |
14 | Unknown | Hey! It compiles! Ship it! |
15 | Martin Luther King Junior | Hate cannot drive out hate; only love can do t... |
16 | Unknown | Guns don’t kill people. Chuck Norris kills peo... |
17 | Unknown | God is real, unless declared integer. |
18 | John Johnson | First, solve the problem. Then, write the code. |
19 | Oscar Wilde | Experience is the name everyone gives to their... |
20 | Miguel de Icaza | Every piece of software written today is likel... |
21 | Unknown | Computers make very fast, very accurate mistakes. |
22 | Unknown | Computers do not solve problems, they execute ... |
23 | Unknown | I have NOT lost my mind—I have it backed up on... |
24 | Unknown | If brute force doesn’t solve your problems, th... |
25 | Unknown | It works on my machine. |
26 | Unknown | Java is, in many ways, C++??. |
27 | Unknown | Keyboard not found...Press any key to continue. |
28 | Unknown | Life would be so much easier if we only had th... |
29 | Unknown | Mac users swear by their Mac, PC users swear a... |
... | ... | ... |
159 | Paul Graham | OO programming offers a sustainable way to wri... |
160 | Nikita Popov | Ruby is rubbish! PHP is phpantastic! |
161 | Douglas Adams | So long and thanks for all the fish! |
162 | Cicero | If I had more time, I would have written a sho... |
163 | Jeff Atwood | The best reaction to "this is confusing, where... |
164 | Jeff Atwood | The older I get, the more I believe that the o... |
165 | Douglas Crockford | "That hardly ever happens" is another way of s... |
166 | Anna Debenham | Hello, PHP, my old friend. |
167 | Melvin Conway | Organizations which design systems are constra... |
168 | Melvin Conway | In design, complexity is toxic. |
169 | Jeffrey Zeldman | Good is the enemy of great, but great is the e... |
170 | Rick Lemons | Don't make the user provide information that t... |
171 | Donald E. Knuth | You're bound to be unhappy if you optimize eve... |
172 | Anna Nachesa | If the programmers like each other, they play ... |
173 | Edsger W. Dijkstra | Simplicity is prerequisite for reliability. |
174 | Jordi Boggiano | Focus on WHY instead of WHAT in your code will... |
175 | Andrei Herasimchuk | The best engineers I know are artists at heart... |
176 | Barry Boehm | Poor management can increase software costs mo... |
177 | Daniel Bryant | If you can't deploy your services independentl... |
178 | Daniel Bryant | If you can't deploy your services independentl... |
179 | Jeff Atwood | No one hates software more than software devel... |
180 | Robert C. Martin | The proper use of comments is to compensate fo... |
181 | Cory House | Code is like humor. When you have to explain i... |
182 | Steve Maguire | Fix the cause, not the symptom. |
183 | David Heinemeier Hansson | Programmers are constantly making things more ... |
184 | Linus Torvalds | People will realize that software is not a pro... |
185 | Ron Fein | Design is choosing how you will fail. |
186 | Steve Jobs | Focus is saying no to 1000 good ideas. |
187 | Ron Jeffries | Code never lies, comments sometimes do. |
188 | Unknown | Be careful with each other, so you can be dang... |
189 rows × 2 columns
Though not a best practice, say we wanted to set the author as the index and the quote text as the value. In this dataset we will then have repeated index values; it might make sense if we want to access the data this way, but be very careful doing this in practice.
df1 = df.set_index(df['author']).drop('author', axis=1)
df1
text | |
---|---|
author | |
Martin Golding | Always code as if the guy who ends up maintain... |
Unknown | All computers wait at the same speed. |
Unknown | A misplaced decimal point will always end up w... |
Unknown | A good programmer looks both ways before cross... |
Unknown | A computer program does what you tell it to do... |
Unknown | "Intel Inside" is a Government Warning require... |
Arthur Godfrey | Common sense gets a lot of credit that belongs... |
Unknown | Chuck Norris doesn’t go hunting. Chuck Norris ... |
Unknown | Chuck Norris counted to infinity... twice. |
Unknown | C is quirky, flawed, and an enormous success. |
Unknown | Beta is Latin for still doesn’t work. |
Unknown | ASCII stupid question, get a stupid ANSI! |
Unknown | Artificial Intelligence usually beats natural ... |
Ted Nelson | Any fool can use a computer. Many do. |
Unknown | Hey! It compiles! Ship it! |
Martin Luther King Junior | Hate cannot drive out hate; only love can do t... |
Unknown | Guns don’t kill people. Chuck Norris kills peo... |
Unknown | God is real, unless declared integer. |
John Johnson | First, solve the problem. Then, write the code. |
Oscar Wilde | Experience is the name everyone gives to their... |
Miguel de Icaza | Every piece of software written today is likel... |
Unknown | Computers make very fast, very accurate mistakes. |
Unknown | Computers do not solve problems, they execute ... |
Unknown | I have NOT lost my mind—I have it backed up on... |
Unknown | If brute force doesn’t solve your problems, th... |
Unknown | It works on my machine. |
Unknown | Java is, in many ways, C++??. |
Unknown | Keyboard not found...Press any key to continue. |
Unknown | Life would be so much easier if we only had th... |
Unknown | Mac users swear by their Mac, PC users swear a... |
... | ... |
Paul Graham | OO programming offers a sustainable way to wri... |
Nikita Popov | Ruby is rubbish! PHP is phpantastic! |
Douglas Adams | So long and thanks for all the fish! |
Cicero | If I had more time, I would have written a sho... |
Jeff Atwood | The best reaction to "this is confusing, where... |
Jeff Atwood | The older I get, the more I believe that the o... |
Douglas Crockford | "That hardly ever happens" is another way of s... |
Anna Debenham | Hello, PHP, my old friend. |
Melvin Conway | Organizations which design systems are constra... |
Melvin Conway | In design, complexity is toxic. |
Jeffrey Zeldman | Good is the enemy of great, but great is the e... |
Rick Lemons | Don't make the user provide information that t... |
Donald E. Knuth | You're bound to be unhappy if you optimize eve... |
Anna Nachesa | If the programmers like each other, they play ... |
Edsger W. Dijkstra | Simplicity is prerequisite for reliability. |
Jordi Boggiano | Focus on WHY instead of WHAT in your code will... |
Andrei Herasimchuk | The best engineers I know are artists at heart... |
Barry Boehm | Poor management can increase software costs mo... |
Daniel Bryant | If you can't deploy your services independentl... |
Daniel Bryant | If you can't deploy your services independentl... |
Jeff Atwood | No one hates software more than software devel... |
Robert C. Martin | The proper use of comments is to compensate fo... |
Cory House | Code is like humor. When you have to explain i... |
Steve Maguire | Fix the cause, not the symptom. |
David Heinemeier Hansson | Programmers are constantly making things more ... |
Linus Torvalds | People will realize that software is not a pro... |
Ron Fein | Design is choosing how you will fail. |
Steve Jobs | Focus is saying no to 1000 good ideas. |
Ron Jeffries | Code never lies, comments sometimes do. |
Unknown | Be careful with each other, so you can be dang... |
189 rows × 1 columns
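A quick one-liner sketch confirms the index really does contain repeats, which is why label lookups can return many rows:

# False means labels repeat, e.g. df1.loc["Unknown"] returns many rows
df1.index.is_unique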
Though we haven't talked about it, there is a very interesting and useful mechanism for filtering data using the apply() method. In this case, we're going to write a cute anonymous function that finds all the quotes by the author Unknown with "jav" (think Java) somewhere in the quote.
df1.loc["Unknown"][df1.loc["Unknown"]["text"] .apply(lambda v: "jav" in v.lower())]
text | |
---|---|
author | |
Unknown | Java is, in many ways, C++??. |
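For what it's worth, the same filter can be written without apply() using the vectorized str accessor; this sketch should return the same rows:

# equivalent filter via the str accessor instead of apply()
unknown = df1.loc["Unknown"]
unknown[unknown["text"].str.lower().str.contains("jav")]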
Loading SQL data into a DataFrame is also supported by Pandas. You might need to take a look at SQLAlchemy and its documentation on creating database engines, as that is the framework Pandas supports directly.
SQL DATA SOURCE
This file contains a number of tables holding the Jeopardy! game clues, players, wins, categories, etc. We will only use a fraction of the data to demonstrate the SQL capabilities.
Our example will use a SQLite database so we can work in a standalone context. We'll show reading a table in full using read_sql_table() and then how to do ad hoc queries using read_sql_query().
from sqlalchemy import create_engine

engine = create_engine('sqlite:///datasets/database.sqlite')

with engine.connect() as conn, conn.begin():
    data = pd.read_sql_table('final', conn)
data[:10]
game_id | clue_id | value | category | clue | strike1 | strike2 | strike3 | answer | |
---|---|---|---|---|---|---|---|---|---|
0 | 280 | 16720 | 100 | HIGH ROLLERS | After an 1891 roulette run, Charles Wells was ... | What is Atlantic City? | What is Las Vegas? | What is Monaco? | Monte Carlo |
1 | 429 | 25403 | 100 | OH, CRAPS! | The combo that totals one shy of "boxcars" | What is 11? | What is 10? | What is 9? | 5 & 6 |
2 | 866 | 51549 | 100 | ROCK & POP | It was the last decade in which Cher didn't ha... | What are the 1980s? | What are the 1970s? | What are the 1990s? | 1950s |
3 | 1018 | 60582 | 100 | LET'S HAVE A BALL | Sink it & you've scratched | Um... | What is the pinball? | What is the 8-ball? | the cue ball |
4 | 1069 | 63644 | 100 | WHAT A YEAR! | Dewaele won the Tour de France, Coco Chanel wa... | What is 1933? | What is 1987? | What is 1927? | 1929 |
5 | 1473 | 84364 | 100 | EUROPEAN HISTORY | A former Socialist, he formed the anti-Communi... | Who was Lenin? | Who was Franco? | Who was Hitler? | Benito Mussolini |
6 | 1635 | 93864 | 100 | CHRISTIANITY | According to tradition, Dismas & Gestas were t... | Who are the thieves? | What is Cavalry? | What is Mt. Olive? | Calvary |
7 | 4166 | 242419 | 100 | NAME THE DECADE | Paul Revere & William Dawes warn colonists tha... | What is the 16th century? | What is the 18th century? | What is the 18th century? | the 1770s |
8 | 112 | 6679 | 200 | ODD ALPHABETS | In alphabet radio code, "B" is Bravo and "F" s... | What's the Flamingo? | What's a Fandango? | What's the Flamenco? - you have it written the... | Foxtrot |
9 | 354 | 20984 | 200 | SPORTS | A filly becomes a mare at this age | What is 3? | What is 1? | What is 2? | 4 |
Now say we want to find out the distribution of players' occupations over the years. When we look into the players table, we can see we can write a query that allows us to aggregate these occupations easily.
Using read_sql_query() we can get the job done and dump the result into a DataFrame.
query = """ SELECT occupation, count(occupation) as freq FROM players WHERE occupation != '' GROUP BY occupation ORDER BY count(occupation) DESC """ with engine.connect() as conn, conn.begin(): occupation_data = pd.read_sql_query(query, conn)
occupation_data[:10]
occupation | freq | |
---|---|---|
0 | attorney | 380 |
1 | senior | 228 |
2 | graduate student | 212 |
3 | writer | 176 |
4 | teacher | 159 |
5 | junior | 158 |
6 | law student | 120 |
7 | lawyer | 112 |
8 | homemaker | 101 |
9 | actor | 97 |
If we look closely, we can see that many occupations are the same but labeled differently -- for example, "attorney" and "lawyer", or the various kinds of teachers. Thus, if we take the frequencies above at face value, we may be deceived: they are correct for the labels as captured, but not for the coarser groupings that actually make sense.
So let's do some data munging with Pandas and see how we might group all the "teachers" together.
To do this we'll need to do a few things:

* find all the occupations with "teach" in them (or "teacher" if you'd like)
* sum their frequencies into a single combined row labeled "teacher"
* drop the original teacher rows and append the combined row to the DataFrame

Let's get going!
We are going to make use of a nice convenience attribute, str, of the Series object. It operates much like Python's built-in string type and has a contains() method, which will allow us to determine whether the substring we're looking for appears in any of the values of the Series. These methods are indeed very useful to have!
freq_all_occupations = occupation_data.freq.sum()

combined_teacher_freq = \
    occupation_data[occupation_data['occupation'].str.contains('teach')] \
    .sum()
combined_teacher_freq
occupation    teacherhigh school teacherhigh school English ...
freq                                                      830
dtype: object
Notice the occupation is the concatenation of all those teacher labels. We want to change that to the single label "teacher".
combined_teacher_freq['occupation'] = 'teacher'
combined_teacher_freq
occupation    teacher
freq              830
dtype: object
We now need only append the data to our original DataFrame:
occupation_data = \
    occupation_data[~occupation_data['occupation'].str.contains('teach')] \
    .append(combined_teacher_freq, ignore_index=True)
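Note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0; on newer versions an equivalent of the call above (a sketch) uses pd.concat, converting the Series to a one-row frame first:

# pandas >= 2.0 equivalent of the append() above
kept = occupation_data[~occupation_data['occupation'].str.contains('teach')]
occupation_data = pd.concat([kept, combined_teacher_freq.to_frame().T],
                            ignore_index=True)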
occupation_data[-10:]
occupation | freq | |
---|---|---|
4205 | writer for an online magazine | 1 |
4206 | writer's assistant | 1 |
4207 | writer-producer | 1 |
4208 | writing instructor | 1 |
4209 | yoga instructor | 1 |
4210 | yogurt franchise operator | 1 |
4211 | youth ministry consultant | 1 |
4212 | zoo docent | 1 |
4213 | zoo educator | 1 |
4214 | teacher | 830 |
Now let's add the percentage column and call it pct:
occupation_data['pct'] = occupation_data['freq']/occupation_data.freq.sum()
occupation_data.sort_values(by='pct', ascending=False)[:10]
occupation | freq | pct | |
---|---|---|---|
4214 | teacher | 830 | 0.078905 |
0 | attorney | 380 | 0.036125 |
1 | senior | 228 | 0.021675 |
2 | graduate student | 212 | 0.020154 |
3 | writer | 176 | 0.016732 |
4 | junior | 158 | 0.015020 |
5 | law student | 120 | 0.011408 |
6 | lawyer | 112 | 0.010647 |
7 | homemaker | 101 | 0.009602 |
8 | actor | 97 | 0.009221 |
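As a parting exercise, the same recipe could collapse other synonym pairs; a sketch (assuming "attorney" and "lawyer" are the only labels you want merged) might look like:

# combine 'attorney' and 'lawyer' the same way we combined the teachers
mask = occupation_data['occupation'].isin(['attorney', 'lawyer'])
combined = occupation_data[mask].sum()
combined['occupation'] = 'attorney/lawyer'
occupation_data = pd.concat([occupation_data[~mask], combined.to_frame().T],
                            ignore_index=True)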