Talk given at RMACC August 17, 2017 titled "Practical Data Wrangling in Pandas".

Got Pandas? Practical Data Wrangling with Pandas

  1. Data Structures
  2. Importing Data
  3. Manipulating DataFrames
  4. Wrap Up


Importing Data in Pandas

Pandas supports a number of data formats out of the box including:

  • CSV, Excel
  • JSON
  • HDF5
  • SQL databases
  • and others

The major benefit of using Pandas to load these formats is that it provides a simple, consistent interface for each of them and reads the data directly into a DataFrame in a single operation, with no need for extra parsing code or third-party readers.

Pandas I/O can load these formats directly from local storage or from a URL: the resource string you pass can be either a local/network file path or a URL.
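As a minimal sketch of that interchangeability (the URL below is hypothetical; a StringIO buffer stands in for a file so the snippet runs anywhere):

```python
import pandas as pd
from io import StringIO

# read_csv accepts a local path, a URL, or any file-like object interchangeably:
#   pd.read_csv("./datasets/Batting.csv")            # local file
#   pd.read_csv("https://example.com/Batting.csv")   # hypothetical URL
df = pd.read_csv(StringIO("a,b\n1,2\n3,4\n"))        # file-like, runnable anywhere
```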

NOTEBOOK OBJECTIVES

In this notebook we'll:

  • load a local and a remote CSV file,
  • load an Excel data file,
  • load JSON data,
  • load data via SQL queries.

Importing Pandas

You will most often load the Pandas library with the following line:

In [1]:
import pandas as pd

Loading CSV and Excel

CSV

CSV files are still a staple in data file formats. They're portable, flexible, flat, usually easy to parse and ubiquitous. We will begin by showing how to use Pandas to load CSV directly into a DataFrame.

DATA SOURCE

US Baseball Statistics Archive by Sean Lahman (CC BY-SA 3.0).

We have put the dataset for batting data into our local datasets folder.

Loading this into a Pandas DataFrame requires the read_csv function, which parses the CSV data directly into a DataFrame.

In [2]:
df = pd.read_csv("./datasets/Batting.csv")

If we inspect this DataFrame, we get exactly what we expect -- each row corresponding to a line in the file. NOTE: where values are missing, Pandas automatically fills them in with NaN.

In [3]:
df
Out[3]:
playerID yearID stint teamID lgID G AB R H 2B ... RBI SB CS BB SO IBB HBP SH SF GIDP
0 abercda01 1871 1 TRO NaN 1 4 0 0 0 ... 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
1 addybo01 1871 1 RC1 NaN 25 118 30 32 6 ... 13.0 8.0 1.0 4 0.0 NaN NaN NaN NaN NaN
2 allisar01 1871 1 CL1 NaN 29 137 28 40 4 ... 19.0 3.0 1.0 2 5.0 NaN NaN NaN NaN NaN
3 allisdo01 1871 1 WS3 NaN 27 133 28 44 10 ... 27.0 1.0 1.0 0 2.0 NaN NaN NaN NaN NaN
4 ansonca01 1871 1 RC1 NaN 25 120 29 39 11 ... 16.0 6.0 2.0 2 1.0 NaN NaN NaN NaN NaN
5 armstbo01 1871 1 FW1 NaN 12 49 9 11 2 ... 5.0 0.0 1.0 0 1.0 NaN NaN NaN NaN NaN
6 barkeal01 1871 1 RC1 NaN 1 4 0 1 0 ... 2.0 0.0 0.0 1 0.0 NaN NaN NaN NaN NaN
7 barnero01 1871 1 BS1 NaN 31 157 66 63 10 ... 34.0 11.0 6.0 13 1.0 NaN NaN NaN NaN NaN
8 barrebi01 1871 1 FW1 NaN 1 5 1 1 1 ... 1.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
9 barrofr01 1871 1 BS1 NaN 18 86 13 13 2 ... 11.0 1.0 0.0 0 0.0 NaN NaN NaN NaN NaN
10 bassjo01 1871 1 CL1 NaN 22 89 18 27 1 ... 18.0 0.0 1.0 3 4.0 NaN NaN NaN NaN NaN
11 battijo01 1871 1 CL1 NaN 1 3 0 0 0 ... 0.0 0.0 0.0 1 0.0 NaN NaN NaN NaN NaN
12 bealsto01 1871 1 WS3 NaN 10 36 6 7 0 ... 1.0 2.0 0.0 2 0.0 NaN NaN NaN NaN NaN
13 beaveed01 1871 1 TRO NaN 3 15 7 6 0 ... 5.0 2.0 0.0 0 0.0 NaN NaN NaN NaN NaN
14 bechtge01 1871 1 PH1 NaN 20 94 24 33 9 ... 21.0 4.0 0.0 2 2.0 NaN NaN NaN NaN NaN
15 bellast01 1871 1 TRO NaN 29 128 26 32 3 ... 23.0 4.0 4.0 9 2.0 NaN NaN NaN NaN NaN
16 berkena01 1871 1 PH1 NaN 1 4 0 0 0 ... 0.0 0.0 0.0 0 3.0 NaN NaN NaN NaN NaN
17 berryto01 1871 1 PH1 NaN 1 4 0 1 0 ... 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
18 berthha01 1871 1 WS3 NaN 17 73 17 17 1 ... 8.0 3.0 1.0 4 2.0 NaN NaN NaN NaN NaN
19 biermch01 1871 1 FW1 NaN 1 2 0 0 0 ... 0.0 0.0 0.0 1 0.0 NaN NaN NaN NaN NaN
20 birdge01 1871 1 RC1 NaN 25 106 19 28 2 ... 13.0 1.0 0.0 3 2.0 NaN NaN NaN NaN NaN
21 birdsda01 1871 1 BS1 NaN 29 152 51 46 3 ... 24.0 6.0 0.0 4 4.0 NaN NaN NaN NaN NaN
22 brainas01 1871 1 WS3 NaN 30 134 24 30 4 ... 21.0 4.0 0.0 7 2.0 NaN NaN NaN NaN NaN
23 brannmi01 1871 1 CH1 NaN 3 14 2 1 0 ... 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
24 burrohe01 1871 1 WS3 NaN 12 63 11 15 2 ... 14.0 0.0 0.0 1 1.0 NaN NaN NaN NaN NaN
25 careyto01 1871 1 FW1 NaN 19 87 16 20 2 ... 10.0 5.0 0.0 2 1.0 NaN NaN NaN NaN NaN
26 carleji01 1871 1 CL1 NaN 29 127 31 32 8 ... 18.0 2.0 1.0 8 3.0 NaN NaN NaN NaN NaN
27 conefr01 1871 1 BS1 NaN 19 77 17 20 3 ... 16.0 12.0 1.0 8 2.0 NaN NaN NaN NaN NaN
28 connone01 1871 1 TRO NaN 7 33 6 7 0 ... 2.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
29 cravebi01 1871 1 TRO NaN 27 118 26 38 8 ... 26.0 6.0 3.0 3 0.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
102786 wittgni01 2016 1 MIA NL 48 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102787 wolteto01 2016 1 COL NL 71 205 27 53 15 ... 30.0 4.0 1.0 21 53.0 2.0 0.0 4.0 0.0 1.0
102788 wongko01 2016 1 SLN NL 121 313 39 75 7 ... 23.0 7.0 0.0 34 52.0 2.0 9.0 0.0 5.0 3.0
102789 woodal02 2016 1 LAN NL 15 16 2 4 0 ... 2.0 0.0 0.0 1 9.0 0.0 0.0 2.0 0.0 0.0
102790 woodbl01 2016 1 CIN NL 70 2 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
102791 woodtr01 2016 1 CHN NL 81 11 0 2 0 ... 1.0 0.0 0.0 1 5.0 0.0 0.0 0.0 0.0 0.0
102792 worleva01 2016 1 BAL AL 35 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102793 worthda01 2016 1 HOU AL 16 39 4 7 2 ... 1.0 0.0 0.0 1 6.0 0.0 0.0 0.0 0.0 1.0
102794 wrighda03 2016 1 NYN NL 37 137 18 31 8 ... 14.0 3.0 2.0 26 55.0 0.0 0.0 0.0 0.0 0.0
102795 wrighda04 2016 1 CIN NL 4 5 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 1.0 0.0 0.0
102796 wrighda04 2016 2 LAA AL 5 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102797 wrighmi01 2016 1 BAL AL 18 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102798 wrighst01 2016 1 BOS AL 25 4 0 0 0 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
102799 yateski01 2016 1 NYA AL 41 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102800 yelicch01 2016 1 MIA NL 155 578 78 172 38 ... 98.0 9.0 4.0 72 138.0 4.0 4.0 0.0 5.0 20.0
102801 ynoaga01 2016 1 NYN NL 10 3 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102802 ynoami01 2016 1 CHA AL 23 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102803 ynoara01 2016 1 COL NL 3 5 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
102804 youngch03 2016 1 KCA AL 34 1 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102805 youngch04 2016 1 BOS AL 76 203 29 56 18 ... 24.0 4.0 2.0 21 50.0 0.0 3.0 0.0 0.0 4.0
102806 younger03 2016 1 NYA AL 6 1 2 0 0 ... 0.0 1.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102807 youngma03 2016 1 ATL NL 8 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102808 zastrro01 2016 1 CHN NL 8 3 0 0 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0
102809 zieglbr01 2016 1 ARI NL 36 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102810 zieglbr01 2016 2 BOS AL 33 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0
102811 zimmejo02 2016 1 DET AL 19 4 0 1 0 ... 0.0 0.0 0.0 0 2.0 0.0 0.0 1.0 0.0 0.0
102812 zimmery01 2016 1 WAS NL 115 427 60 93 18 ... 46.0 4.0 1.0 29 104.0 1.0 5.0 0.0 6.0 12.0
102813 zobribe01 2016 1 CHN NL 147 523 94 142 31 ... 76.0 6.0 4.0 96 82.0 6.0 4.0 4.0 4.0 17.0
102814 zuninmi01 2016 1 SEA AL 55 164 16 34 7 ... 31.0 0.0 0.0 21 65.0 0.0 6.0 0.0 1.0 0.0
102815 zychto01 2016 1 SEA AL 12 0 0 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0

102816 rows × 22 columns
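To see where those NaN fills landed, isna() works well; a sketch on a tiny CSV with empty fields standing in for the full Batting.csv:

```python
import pandas as pd
from io import StringIO

# two rows, each with one empty field
csv_text = "playerID,SO,IBB\nabercda01,0,\naddybo01,,3\n"
df_small = pd.read_csv(StringIO(csv_text))
missing = df_small.isna().sum()   # NaN count per column
```

Running `df.isna().sum()` on the full batting frame gives the same per-column missing-value census.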

We will soon learn that Pandas supports some typical "Pythonic" idioms for accessing data. The first we encounter is len(): we can get the size of this dataset (in rows) with the standard Python len() function, which returns exactly what we expect.

In [4]:
len(df)
Out[4]:
102816

Every DataFrame has a columns attribute, which holds the column index for the dataset. Getting the length of that attribute returns, again, what we expect.

In [5]:
df.columns
Out[5]:
Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H',
       '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH',
       'SF', 'GIDP'],
      dtype='object')
In [6]:
len(df.columns)
Out[6]:
22

If we want both row and column counts, DataFrame.shape returns them as a tuple:

In [7]:
df.shape
Out[7]:
(102816, 22)

Which returns what we expect (yet again).

Much like slicing a Python list, if we want the first n rows of data, we can use the shorthand:

In [8]:
df[:10]
Out[8]:
playerID yearID stint teamID lgID G AB R H 2B ... RBI SB CS BB SO IBB HBP SH SF GIDP
0 abercda01 1871 1 TRO NaN 1 4 0 0 0 ... 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
1 addybo01 1871 1 RC1 NaN 25 118 30 32 6 ... 13.0 8.0 1.0 4 0.0 NaN NaN NaN NaN NaN
2 allisar01 1871 1 CL1 NaN 29 137 28 40 4 ... 19.0 3.0 1.0 2 5.0 NaN NaN NaN NaN NaN
3 allisdo01 1871 1 WS3 NaN 27 133 28 44 10 ... 27.0 1.0 1.0 0 2.0 NaN NaN NaN NaN NaN
4 ansonca01 1871 1 RC1 NaN 25 120 29 39 11 ... 16.0 6.0 2.0 2 1.0 NaN NaN NaN NaN NaN
5 armstbo01 1871 1 FW1 NaN 12 49 9 11 2 ... 5.0 0.0 1.0 0 1.0 NaN NaN NaN NaN NaN
6 barkeal01 1871 1 RC1 NaN 1 4 0 1 0 ... 2.0 0.0 0.0 1 0.0 NaN NaN NaN NaN NaN
7 barnero01 1871 1 BS1 NaN 31 157 66 63 10 ... 34.0 11.0 6.0 13 1.0 NaN NaN NaN NaN NaN
8 barrebi01 1871 1 FW1 NaN 1 5 1 1 1 ... 1.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
9 barrofr01 1871 1 BS1 NaN 18 86 13 13 2 ... 11.0 1.0 0.0 0 0.0 NaN NaN NaN NaN NaN

10 rows × 22 columns

Or just like slicing a list, we can do more complex slicing:

In [9]:
df[:50:5]
Out[9]:
playerID yearID stint teamID lgID G AB R H 2B ... RBI SB CS BB SO IBB HBP SH SF GIDP
0 abercda01 1871 1 TRO NaN 1 4 0 0 0 ... 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
5 armstbo01 1871 1 FW1 NaN 12 49 9 11 2 ... 5.0 0.0 1.0 0 1.0 NaN NaN NaN NaN NaN
10 bassjo01 1871 1 CL1 NaN 22 89 18 27 1 ... 18.0 0.0 1.0 3 4.0 NaN NaN NaN NaN NaN
15 bellast01 1871 1 TRO NaN 29 128 26 32 3 ... 23.0 4.0 4.0 9 2.0 NaN NaN NaN NaN NaN
20 birdge01 1871 1 RC1 NaN 25 106 19 28 2 ... 13.0 1.0 0.0 3 2.0 NaN NaN NaN NaN NaN
25 careyto01 1871 1 FW1 NaN 19 87 16 20 2 ... 10.0 5.0 0.0 2 1.0 NaN NaN NaN NaN NaN
30 cuthbne01 1871 1 PH1 NaN 28 150 47 37 7 ... 30.0 16.0 2.0 10 2.0 NaN NaN NaN NaN NaN
35 ewellge01 1871 1 CL1 NaN 1 3 0 0 0 ... 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN
40 flowedi01 1871 1 TRO NaN 21 105 39 33 5 ... 18.0 8.0 2.0 4 0.0 NaN NaN NaN NaN NaN
45 fulmech01 1871 1 RC1 NaN 16 63 11 17 1 ... 3.0 0.0 0.0 5 1.0 NaN NaN NaN NaN NaN

10 rows × 22 columns

Accessing column data by label

One of the nice things about Pandas is that we can reference columns of data by their names (or labels). For example, we have a yearID label, a teamID label, a G label for game counts, and so on. To learn in detail what the labels in our dataset mean, see the dataset's documentation.

In [10]:
df.yearID[:10]
Out[10]:
0    1871
1    1871
2    1871
3    1871
4    1871
5    1871
6    1871
7    1871
8    1871
9    1871
Name: yearID, dtype: int64
In [11]:
df.G[-10:]
Out[11]:
102806      6
102807      8
102808      8
102809     36
102810     33
102811     19
102812    115
102813    147
102814     55
102815     12
Name: G, dtype: int64
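One caveat with this attribute-style access: it only works for labels that are valid Python identifiers. A column like 2B in this dataset must use bracket notation instead (a toy sketch):

```python
import pandas as pd

df = pd.DataFrame({"yearID": [1871, 1872], "2B": [6, 4]})
years = df["yearID"]     # bracket access always works
same = df.yearID         # attribute access works for valid identifiers...
# df.2B                  # ...but not for labels like "2B" (SyntaxError)
doubles = df["2B"]
```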

Let's say we want all the player data for the Washington Nationals from 2015 and 2016 where a player played in 100 or more games:

In [12]:
df_was = df[(df.yearID > 2014) & (df.teamID=='WAS') & (df.G > 99)]
df_was
Out[12]:
playerID yearID stint teamID lgID G AB R H 2B ... RBI SB CS BB SO IBB HBP SH SF GIDP
100193 desmoia01 2015 1 WAS NL 156 583 69 136 27 ... 62.0 13.0 5.0 45 187.0 0.0 3.0 6.0 4.0 9.0
100250 escobyu01 2015 1 WAS NL 139 535 75 168 25 ... 56.0 2.0 2.0 45 70.0 0.0 8.0 1.0 2.0 24.0
100251 espinda01 2015 1 WAS NL 118 367 59 88 21 ... 37.0 5.0 2.0 33 106.0 5.0 6.0 3.0 3.0 6.0
100422 harpebr03 2015 1 WAS NL 153 521 118 172 38 ... 99.0 6.0 4.0 124 131.0 15.0 5.0 0.0 4.0 15.0
100950 ramoswi01 2015 1 WAS NL 128 475 41 109 16 ... 68.0 0.0 0.0 21 101.0 2.0 0.0 0.0 8.0 16.0
100993 robincl01 2015 1 WAS NL 126 309 44 84 15 ... 34.0 0.0 0.0 37 52.0 4.0 5.0 0.0 1.0 6.0
101176 taylomi02 2015 1 WAS NL 138 472 49 108 15 ... 63.0 16.0 3.0 35 158.0 9.0 1.0 1.0 2.0 5.0
101725 espinda01 2016 1 WAS NL 157 516 66 108 15 ... 72.0 9.0 2.0 54 174.0 12.0 20.0 7.0 4.0 4.0
101895 harpebr03 2016 1 WAS NL 147 506 84 123 24 ... 86.0 21.0 10.0 108 117.0 20.0 3.0 0.0 10.0 11.0
102245 murphda08 2016 1 WAS NL 142 531 88 184 47 ... 104.0 5.0 3.0 35 57.0 10.0 8.0 0.0 8.0 4.0
102429 ramoswi01 2016 1 WAS NL 131 482 58 148 25 ... 80.0 0.0 0.0 35 79.0 2.0 2.0 0.0 4.0 17.0
102449 rendoan01 2016 1 WAS NL 156 567 91 153 38 ... 85.0 12.0 6.0 65 117.0 2.0 7.0 0.0 8.0 5.0
102451 reverbe01 2016 1 WAS NL 103 350 44 76 9 ... 24.0 14.0 5.0 18 34.0 0.0 3.0 2.0 2.0 12.0
102472 robincl01 2016 1 WAS NL 104 196 16 46 4 ... 26.0 0.0 0.0 20 38.0 0.0 2.0 1.0 5.0 4.0
102763 werthja01 2016 1 WAS NL 143 525 84 128 28 ... 69.0 5.0 1.0 71 139.0 0.0 4.0 0.0 6.0 17.0
102812 zimmery01 2016 1 WAS NL 115 427 60 93 18 ... 46.0 4.0 1.0 29 104.0 1.0 5.0 0.0 6.0 12.0

16 rows × 22 columns

We'll put all these pieces in motion later, but for now keep a thumbnail of this for future reference. NOTE: to get actual player names and vitals we'd need the dataset that crosswalks playerID to that information, but we'll leave that as an exercise for the interested reader (hint: take a look in this dataset).
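Boolean masks like the one above also compose with isin() when you want several values of a column at once; a sketch on a toy frame standing in for the batting data:

```python
import pandas as pd

# toy stand-in for the Batting DataFrame
df = pd.DataFrame({"teamID": ["WAS", "BAL", "BOS"], "G": [150, 120, 90]})

# membership test plus a numeric condition, combined with &
picked = df[df.teamID.isin(["WAS", "BOS"]) & (df.G > 100)]
```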

Excel

Loading Excel data is nearly as easy as CSV data. This time we'll use a different data source and show how to access it in a slightly different manner. Instead of the local file source, we will use a remote URL for the resource. This will show us exactly how easy it is to seamlessly interchange various data resources.

DATA SOURCES

To read the data set we will access it by URL with the pandas.read_excel() method. Note we're passing the sheetname=None parameter so that every sheet is read and assigned its own key in a dictionary, for easy lookup by sheet name.

In [13]:
xl = pd.read_excel(
    "https://www.bts.gov/sites/bts.dot.gov/files/docs/newsroom/206581/airline-employment-press-tables-web.xlsx",
    sheetname=None)

Notice now that if we want to access the sheet called Table1, we can do so in a Pythonic way, much like with any other dictionary. The result is the DataFrame representation of that sheet.

In [14]:
xl_tbl1 = xl['Table1']
xl_tbl1
Out[14]:
Table 1: Yearly Change in Scheduled Passenger Airline Full-time Equivalent Employees* by Airline Group Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 Most recent 13 months - percent change from sa... NaN NaN NaN NaN NaN
1 NaN Network Airlines Low-Cost Airlines Regional Airlines Other Airlines All Passenger Airlines **
2 May 2015 - May 2016 2.3 10.7 0.2 9.3 3.7
3 Jun 2015 - Jun 2016 2.3 11 0.9 10.6 3.9
4 Jul 2015 - Jul 2016 2.4 11.3 3.3 11.2 4.3
5 Aug 2015 - Aug 2016 2.5 11 3.3 11.9 4.3
6 Sep 2015 - Sep 2016 2.6 10.6 2.9 13 4.3
7 Oct 2015 - Oct 2016 2.7 10.3 0.3 12.7 4
8 Nov 2015 - Nov 2016 2.3 9.8 0.2 13.5 3.7
9 Dec 2015 - Dec 2016 2.4 9.5 0.2 13.7 3.7
10 Jan 2016 - Jan 2017 2.3 9.7 1.9 12.7 3.9
11 Feb 2016 - Feb 2017 2.4 9.4 2.4 11.8 3.9
12 Mar 2016 - Mar 2017 2.7 9.1 2 11.7 4
13 Apr 2016 - Apr 2017 2.6 8.5 2.1 10.7 3.9
14 May 2016 - May 2017 2.4 8.3 2.5 4.2 3.6
15 Source: Bureau of Transportation Statistics NaN NaN NaN NaN NaN
16 * Full-time Equivalent Employee (FTE) calculat... NaN NaN NaN NaN NaN
17 ** Includes network, low-cost, regional and ot... NaN NaN NaN NaN NaN
18 Note: Percent changes based on numbers prior t... NaN NaN NaN NaN NaN
19 Note: See Table 2 for all passenger airlines, ... NaN NaN NaN NaN NaN

One problem we have here is that the data is not as clean as we want it to be. We'll spend more time on the iloc indexer in the next section, but for now, let's get a flavor for how we might clean this up so it is more usable.

In [15]:
# let's select the (row) index
idx = xl_tbl1.iloc[2:15, 0:1]

# let's select the (col) index
col = xl_tbl1.iloc[1,1:]

print(idx)
print(col)
   Table 1: Yearly Change in Scheduled Passenger Airline Full-time Equivalent Employees* by Airline Group
2                                 May 2015 - May 2016                                                    
3                                 Jun 2015 - Jun 2016                                                    
4                                 Jul 2015 - Jul 2016                                                    
5                                 Aug 2015 - Aug 2016                                                    
6                                 Sep 2015 - Sep 2016                                                    
7                                 Oct 2015 - Oct 2016                                                    
8                                 Nov 2015 - Nov 2016                                                    
9                                 Dec 2015 - Dec 2016                                                    
10                                Jan 2016 - Jan 2017                                                    
11                                Feb 2016 - Feb 2017                                                    
12                                Mar 2016 - Mar 2017                                                    
13                                Apr 2016 - Apr 2017                                                    
14                                May 2016 - May 2017                                                    
Unnamed: 1             Network Airlines
Unnamed: 2            Low-Cost Airlines
Unnamed: 3            Regional Airlines
Unnamed: 4               Other Airlines
Unnamed: 5    All Passenger Airlines **
Name: 1, dtype: object
In [16]:
# we'll create the index object
idxs = pd.Index([v[0] for v in idx.values])
idxs
Out[16]:
Index(['May 2015 - May 2016', 'Jun 2015 - Jun 2016', 'Jul 2015 - Jul 2016',
       'Aug 2015 - Aug 2016', 'Sep 2015 - Sep 2016', 'Oct 2015 - Oct 2016',
       'Nov 2015 - Nov 2016', 'Dec 2015 - Dec 2016', 'Jan 2016 - Jan 2017',
       'Feb 2016 - Feb 2017', 'Mar 2016 - Mar 2017', 'Apr 2016 - Apr 2017',
       'May 2016 - May 2017'],
      dtype='object')
In [17]:
# set the columns
cols = [v for v in col.values]
cols
Out[17]:
['Network Airlines',
 'Low-Cost Airlines',
 'Regional Airlines',
 'Other Airlines',
 'All Passenger Airlines **']
In [18]:
# now for the data
data = xl_tbl1.iloc[2:15,1:].values
data
Out[18]:
array([[2.3, 10.7, 0.2, 9.3, 3.7],
       [2.3, 11, 0.9, 10.6, 3.9],
       [2.4, 11.3, 3.3, 11.2, 4.3],
       [2.5, 11, 3.3, 11.9, 4.3],
       [2.6, 10.6, 2.9, 13, 4.3],
       [2.7, 10.3, 0.3, 12.7, 4],
       [2.3, 9.8, 0.2, 13.5, 3.7],
       [2.4, 9.5, 0.2, 13.7, 3.7],
       [2.3, 9.7, 1.9, 12.7, 3.9],
       [2.4, 9.4, 2.4, 11.8, 3.9],
       [2.7, 9.1, 2, 11.7, 4],
       [2.6, 8.5, 2.1, 10.7, 3.9],
       [2.4, 8.3, 2.5, 4.2, 3.6]], dtype=object)
In [19]:
# putting it all together ...
df_tbl1 = pd.DataFrame(data=xl_tbl1.iloc[2:15,1:].values,
                       columns=[v for v in col.values], 
                       index=pd.Index([v[0] for v in idx.values]))
df_tbl1
Out[19]:
Network Airlines Low-Cost Airlines Regional Airlines Other Airlines All Passenger Airlines **
May 2015 - May 2016 2.3 10.7 0.2 9.3 3.7
Jun 2015 - Jun 2016 2.3 11 0.9 10.6 3.9
Jul 2015 - Jul 2016 2.4 11.3 3.3 11.2 4.3
Aug 2015 - Aug 2016 2.5 11 3.3 11.9 4.3
Sep 2015 - Sep 2016 2.6 10.6 2.9 13 4.3
Oct 2015 - Oct 2016 2.7 10.3 0.3 12.7 4
Nov 2015 - Nov 2016 2.3 9.8 0.2 13.5 3.7
Dec 2015 - Dec 2016 2.4 9.5 0.2 13.7 3.7
Jan 2016 - Jan 2017 2.3 9.7 1.9 12.7 3.9
Feb 2016 - Feb 2017 2.4 9.4 2.4 11.8 3.9
Mar 2016 - Mar 2017 2.7 9.1 2 11.7 4
Apr 2016 - Apr 2017 2.6 8.5 2.1 10.7 3.9
May 2016 - May 2017 2.4 8.3 2.5 4.2 3.6
In [20]:
df_tbl1['Network Airlines']
Out[20]:
May 2015 - May 2016    2.3
Jun 2015 - Jun 2016    2.3
Jul 2015 - Jul 2016    2.4
Aug 2015 - Aug 2016    2.5
Sep 2015 - Sep 2016    2.6
Oct 2015 - Oct 2016    2.7
Nov 2015 - Nov 2016    2.3
Dec 2015 - Dec 2016    2.4
Jan 2016 - Jan 2017    2.3
Feb 2016 - Feb 2017    2.4
Mar 2016 - Mar 2017    2.7
Apr 2016 - Apr 2017    2.6
May 2016 - May 2017    2.4
Name: Network Airlines, dtype: object
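Notice the dtype: object in that output -- the values came through as raw Python objects rather than floats. A hedged sketch of the fix with pd.to_numeric (toy Series mirroring the mixed strings and ints above):

```python
import pandas as pd

s = pd.Series(["2.3", 11, 2.4], dtype=object)   # mixed object column, as above
nums = pd.to_numeric(s)                          # parses everything to float64
```

On the real frame, `df_tbl1.apply(pd.to_numeric)` would convert every column at once.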

JSON

JSON has become a standard format for many web data sources. It is succinct, readable and very portable -- there are libraries in nearly every modern language that can parse JSON, Python being no exception. We'll load a remote JSON data source to demonstrate remote access as well as the capabilities of using Pandas to load such a source.

JSON DATA SOURCE

If you haven't noticed the pattern yet, it should come as no surprise that JSON data is loaded via pandas.read_json().

With JSON data you tend to get the best results from relatively flat JSON objects. If you need a different shape (or you're getting results that are not what you expected), look into the orient parameter, which controls how the JSON structure maps onto the resulting DataFrame. We'll load the data as-is and reshape our DataFrame for some extra practice.
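A quick sketch of orient on an inline records-style payload (wrapping the string in StringIO, since newer pandas deprecates passing raw JSON strings):

```python
import pandas as pd
from io import StringIO

# orient="records" expects a list of row objects, like the quotes file
records = '[{"author": "Unknown", "text": "It works on my machine."}]'
df_q = pd.read_json(StringIO(records), orient="records")
```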

In [21]:
df = pd.read_json(
    "https://raw.githubusercontent.com/fortrabbit/quotes/master/quotes.json")
df
Out[21]:
author text
0 Martin Golding Always code as if the guy who ends up maintain...
1 Unknown All computers wait at the same speed.
2 Unknown A misplaced decimal point will always end up w...
3 Unknown A good programmer looks both ways before cross...
4 Unknown A computer program does what you tell it to do...
5 Unknown "Intel Inside" is a Government Warning require...
6 Arthur Godfrey Common sense gets a lot of credit that belongs...
7 Unknown Chuck Norris doesn’t go hunting. Chuck Norris ...
8 Unknown Chuck Norris counted to infinity... twice.
9 Unknown C is quirky, flawed, and an enormous success.
10 Unknown Beta is Latin for still doesn’t work.
11 Unknown ASCII stupid question, get a stupid ANSI!
12 Unknown Artificial Intelligence usually beats natural ...
13 Ted Nelson Any fool can use a computer. Many do.
14 Unknown Hey! It compiles! Ship it!
15 Martin Luther King Junior Hate cannot drive out hate; only love can do t...
16 Unknown Guns don’t kill people. Chuck Norris kills peo...
17 Unknown God is real, unless declared integer.
18 John Johnson First, solve the problem. Then, write the code.
19 Oscar Wilde Experience is the name everyone gives to their...
20 Miguel de Icaza Every piece of software written today is likel...
21 Unknown Computers make very fast, very accurate mistakes.
22 Unknown Computers do not solve problems, they execute ...
23 Unknown I have NOT lost my mind—I have it backed up on...
24 Unknown If brute force doesn’t solve your problems, th...
25 Unknown It works on my machine.
26 Unknown Java is, in many ways, C++??.
27 Unknown Keyboard not found...Press any key to continue.
28 Unknown Life would be so much easier if we only had th...
29 Unknown Mac users swear by their Mac, PC users swear a...
... ... ...
159 Paul Graham OO programming offers a sustainable way to wri...
160 Nikita Popov Ruby is rubbish! PHP is phpantastic!
161 Douglas Adams So long and thanks for all the fish!
162 Cicero If I had more time, I would have written a sho...
163 Jeff Atwood The best reaction to "this is confusing, where...
164 Jeff Atwood The older I get, the more I believe that the o...
165 Douglas Crockford "That hardly ever happens" is another way of s...
166 Anna Debenham Hello, PHP, my old friend.
167 Melvin Conway Organizations which design systems are constra...
168 Melvin Conway In design, complexity is toxic.
169 Jeffrey Zeldman Good is the enemy of great, but great is the e...
170 Rick Lemons Don't make the user provide information that t...
171 Donald E. Knuth You're bound to be unhappy if you optimize eve...
172 Anna Nachesa If the programmers like each other, they play ...
173 Edsger W. Dijkstra Simplicity is prerequisite for reliability.
174 Jordi Boggiano Focus on WHY instead of WHAT in your code will...
175 Andrei Herasimchuk The best engineers I know are artists at heart...
176 Barry Boehm Poor management can increase software costs mo...
177 Daniel Bryant If you can't deploy your services independentl...
178 Daniel Bryant If you can't deploy your services independentl...
179 Jeff Atwood No one hates software more than software devel...
180 Robert C. Martin The proper use of comments is to compensate fo...
181 Cory House Code is like humor. When you have to explain i...
182 Steve Maguire Fix the cause, not the symptom.
183 David Heinemeier Hansson Programmers are constantly making things more ...
184 Linus Torvalds People will realize that software is not a pro...
185 Ron Fein Design is choosing how you will fail.
186 Steve Jobs Focus is saying no to 1000 good ideas.
187 Ron Jeffries Code never lies, comments sometimes do.
188 Unknown Be careful with each other, so you can be dang...

189 rows × 2 columns

Though not a best practice, say we want to set the author as the index and the quote text as the value. This leaves us with repeated index values; it might make sense if we want to access the data this way, but be very careful doing this in practice.

In [22]:
df1 = df.set_index(df['author']).drop('author', axis=1)
df1
Out[22]:
text
author
Martin Golding Always code as if the guy who ends up maintain...
Unknown All computers wait at the same speed.
Unknown A misplaced decimal point will always end up w...
Unknown A good programmer looks both ways before cross...
Unknown A computer program does what you tell it to do...
Unknown "Intel Inside" is a Government Warning require...
Arthur Godfrey Common sense gets a lot of credit that belongs...
Unknown Chuck Norris doesn’t go hunting. Chuck Norris ...
Unknown Chuck Norris counted to infinity... twice.
Unknown C is quirky, flawed, and an enormous success.
Unknown Beta is Latin for still doesn’t work.
Unknown ASCII stupid question, get a stupid ANSI!
Unknown Artificial Intelligence usually beats natural ...
Ted Nelson Any fool can use a computer. Many do.
Unknown Hey! It compiles! Ship it!
Martin Luther King Junior Hate cannot drive out hate; only love can do t...
Unknown Guns don’t kill people. Chuck Norris kills peo...
Unknown God is real, unless declared integer.
John Johnson First, solve the problem. Then, write the code.
Oscar Wilde Experience is the name everyone gives to their...
Miguel de Icaza Every piece of software written today is likel...
Unknown Computers make very fast, very accurate mistakes.
Unknown Computers do not solve problems, they execute ...
Unknown I have NOT lost my mind—I have it backed up on...
Unknown If brute force doesn’t solve your problems, th...
Unknown It works on my machine.
Unknown Java is, in many ways, C++??.
Unknown Keyboard not found...Press any key to continue.
Unknown Life would be so much easier if we only had th...
Unknown Mac users swear by their Mac, PC users swear a...
... ...
Paul Graham OO programming offers a sustainable way to wri...
Nikita Popov Ruby is rubbish! PHP is phpantastic!
Douglas Adams So long and thanks for all the fish!
Cicero If I had more time, I would have written a sho...
Jeff Atwood The best reaction to "this is confusing, where...
Jeff Atwood The older I get, the more I believe that the o...
Douglas Crockford "That hardly ever happens" is another way of s...
Anna Debenham Hello, PHP, my old friend.
Melvin Conway Organizations which design systems are constra...
Melvin Conway In design, complexity is toxic.
Jeffrey Zeldman Good is the enemy of great, but great is the e...
Rick Lemons Don't make the user provide information that t...
Donald E. Knuth You're bound to be unhappy if you optimize eve...
Anna Nachesa If the programmers like each other, they play ...
Edsger W. Dijkstra Simplicity is prerequisite for reliability.
Jordi Boggiano Focus on WHY instead of WHAT in your code will...
Andrei Herasimchuk The best engineers I know are artists at heart...
Barry Boehm Poor management can increase software costs mo...
Daniel Bryant If you can't deploy your services independentl...
Daniel Bryant If you can't deploy your services independentl...
Jeff Atwood No one hates software more than software devel...
Robert C. Martin The proper use of comments is to compensate fo...
Cory House Code is like humor. When you have to explain i...
Steve Maguire Fix the cause, not the symptom.
David Heinemeier Hansson Programmers are constantly making things more ...
Linus Torvalds People will realize that software is not a pro...
Ron Fein Design is choosing how you will fail.
Steve Jobs Focus is saying no to 1000 good ideas.
Ron Jeffries Code never lies, comments sometimes do.
Unknown Be careful with each other, so you can be dang...

189 rows × 1 columns
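The repeated index changes how .loc behaves: a duplicated label returns a DataFrame of every matching row, while a unique label returns a single Series (a toy sketch):

```python
import pandas as pd

df_ix = pd.DataFrame({"text": ["a", "b", "c"]},
                     index=["Unknown", "Unknown", "Cicero"])
dup = df_ix.loc["Unknown"]   # repeated label -> DataFrame of all matches
one = df_ix.loc["Cicero"]    # unique label   -> a single Series
```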

Though we haven't discussed it yet, the apply() method provides a very interesting and useful mechanism for filtering data. Here we write a small anonymous function that finds all the quotes by the author Unknown with "jav" in the quote text.

In [23]:
df1.loc["Unknown"][df1.loc["Unknown"]["text"]
                   .apply(lambda v: "jav" in v.lower())]
Out[23]:
text
author
Unknown Java is, in many ways, C++??.
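The same filter can also be written without a lambda, using the vectorized str.contains accessor with case=False (shown here on a self-contained toy frame):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"text": ["Java is, in many ways, C++??.", "It works on my machine."]},
    index=pd.Index(["Unknown", "Unknown"], name="author"))

# case-insensitive substring match, no lambda needed
hits = df1[df1["text"].str.contains("jav", case=False)]
```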

SQL

Loading SQL data into a DataFrame is also supported by Pandas. You may need to take a look at SQLAlchemy and its documentation on creating database engines, as this is the framework supported directly by Pandas.

SQL DATA SOURCE

This database contains a number of tables holding the Jeopardy! game clues, players, wins, categories, etc. We will use only a fraction of the data to demonstrate the SQL capabilities.

Our example will use a SQLite database so we can demonstrate the example in a standalone context. We'll show reading a table in full using the read_sql_table() and then how to do ad hoc queries using read_sql_query().

In [24]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///datasets/database.sqlite')

with engine.connect() as conn, conn.begin():
    data = pd.read_sql_table('final', conn)
In [25]:
data[:10]
Out[25]:
game_id clue_id value category clue strike1 strike2 strike3 answer
0 280 16720 100 HIGH ROLLERS After an 1891 roulette run, Charles Wells was ... What is Atlantic City? What is Las Vegas? What is Monaco? Monte Carlo
1 429 25403 100 OH, CRAPS! The combo that totals one shy of "boxcars" What is 11? What is 10? What is 9? 5 & 6
2 866 51549 100 ROCK & POP It was the last decade in which Cher didn't ha... What are the 1980s? What are the 1970s? What are the 1990s? 1950s
3 1018 60582 100 LET'S HAVE A BALL Sink it & you've scratched Um... What is the pinball? What is the 8-ball? the cue ball
4 1069 63644 100 WHAT A YEAR! Dewaele won the Tour de France, Coco Chanel wa... What is 1933? What is 1987? What is 1927? 1929
5 1473 84364 100 EUROPEAN HISTORY A former Socialist, he formed the anti-Communi... Who was Lenin? Who was Franco? Who was Hitler? Benito Mussolini
6 1635 93864 100 CHRISTIANITY According to tradition, Dismas & Gestas were t... Who are the thieves? What is Cavalry? What is Mt. Olive? Calvary
7 4166 242419 100 NAME THE DECADE Paul Revere & William Dawes warn colonists tha... What is the 16th century? What is the 18th century? What is the 18th century? the 1770s
8 112 6679 200 ODD ALPHABETS In alphabet radio code, "B" is Bravo and "F" s... What's the Flamingo? What's a Fandango? What's the Flamenco? - you have it written the... Foxtrot
9 354 20984 200 SPORTS A filly becomes a mare at this age What is 3? What is 1? What is 2? 4
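If you don't have this Jeopardy! database locally, the same pattern can be exercised end-to-end with an in-memory SQLite engine (a minimal sketch with made-up data):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")     # in-memory SQLite database

# seed a stand-in "players" table
pd.DataFrame({"occupation": ["attorney", "teacher"]}).to_sql(
    "players", engine, index=False)

with engine.connect() as conn:
    players = pd.read_sql_table("players", conn)
```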

Now say we want to find the distribution of players' occupations over the years. Looking at the players table, we can write a query that aggregates these occupations easily.

Using read_sql_query() we can get the job done and dump this into a DataFrame.

In [26]:
query = """
    SELECT occupation, count(occupation) as freq FROM players
    WHERE occupation != ''
    GROUP BY occupation 
    ORDER BY count(occupation) DESC 
    """

with engine.connect() as conn, conn.begin():
    occupation_data = pd.read_sql_query(query, conn)
In [27]:
occupation_data[:10]
Out[27]:
occupation freq
0 attorney 380
1 senior 228
2 graduate student 212
3 writer 176
4 teacher 159
5 junior 158
6 law student 120
7 lawyer 112
8 homemaker 101
9 actor 97

If we look closely, we can see that many occupations are the same but labeled differently -- for example, "attorney" and "lawyer", or the various kinds of "teachers". If we take the frequencies above at face value, we may be deceived: the counts are not correct for groupings at a slightly coarser level of granularity than was captured.

So let's do some data munging with Pandas and see how we might group all the "teachers" together.

To do this we'll need to do a few things:

  • find all occupations that have "teach" in them (or "teacher" if you'd like)
  • remove all of those from the data frame
  • add just the aggregate and apply the generic label "teacher"
  • as a bonus, we'll generate the percentages as an additional column

Let's get going!
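For comparison, those steps can be collapsed into a few lines with where() and groupby() (a sketch on a toy frame; the cells that follow walk through the longer route step by step):

```python
import pandas as pd

occ = pd.DataFrame({"occupation": ["teacher", "high school teacher", "attorney"],
                    "freq": [10, 5, 7]})

# collapse anything containing "teach" to one generic label, then re-aggregate
occ["occupation"] = occ["occupation"].where(
    ~occ["occupation"].str.contains("teach"), "teacher")
combined = occ.groupby("occupation", as_index=False)["freq"].sum()
```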

We are going to make use of the convenient str accessor of the Series object. It behaves much like Python's string methods and has a contains() method, which tells us whether the substring we're looking for appears in each value of the Series. These methods are indeed very useful to have!

In [28]:
freq_all_occupations = occupation_data.freq.sum()

combined_teacher_freq = \
        occupation_data[
            occupation_data['occupation']
                .str.contains('teach')]\
        .sum()
In [29]:
combined_teacher_freq
Out[29]:
occupation    teacherhigh school teacherhigh school English ...
freq                                                        830
dtype: object

Notice the occupation is the concatenation of all those teachers. We want to change that to a single label "teacher".

In [30]:
combined_teacher_freq['occupation'] = 'teacher'
In [31]:
combined_teacher_freq
Out[31]:
occupation    teacher
freq              830
dtype: object

We now need only append the data to our original DataFrame:

In [32]:
occupation_data = \
    occupation_data[
        ~occupation_data['occupation']
            .str.contains('teach')] \
    .append(combined_teacher_freq, ignore_index=True)
In [33]:
occupation_data[-10:]
Out[33]:
occupation freq
4205 writer for an online magazine 1
4206 writer's assistant 1
4207 writer-producer 1
4208 writing instructor 1
4209 yoga instructor 1
4210 yogurt franchise operator 1
4211 youth ministry consultant 1
4212 zoo docent 1
4213 zoo educator 1
4214 teacher 830

Now let's add the percentage column and call it pct:

In [34]:
occupation_data['pct'] = occupation_data['freq']/occupation_data.freq.sum()
In [35]:
occupation_data.sort_values(by='pct', ascending=False)[:10]
Out[35]:
occupation freq pct
4214 teacher 830 0.078905
0 attorney 380 0.036125
1 senior 228 0.021675
2 graduate student 212 0.020154
3 writer 176 0.016732
4 junior 158 0.015020
5 law student 120 0.011408
6 lawyer 112 0.010647
7 homemaker 101 0.009602
8 actor 97 0.009221

You can explore how you might build more complex filters by looking at apply, applymap and aggregate.
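As a parting sketch of apply in two of those roles (toy data, not the Jeopardy! set): element-wise on a single column, and row-wise with axis=1 to build a compound filter.

```python
import pandas as pd

occ = pd.DataFrame({"occupation": ["attorney", "Lawyer"], "freq": [380, 112]})

# element-wise transform on one column
doubled = occ["freq"].apply(lambda f: f * 2)

# row-wise logic for a more complex filter
mask = occ.apply(lambda row: row["occupation"].lower().startswith("law"), axis=1)
lawyers = occ[mask]
```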