NAVIGATION
Got Pandas? Practical Data Wrangling with Pandas
We will review our terminology for a quick moment:
0 is the column-axis and 1 is the row-axis. When dealing with multi-indices, the hierarchy within the axis are referred to as levels and accessed similarly NOTEBOOK OBJECTIVES
In this notebook we'll:
In the example for this section, we're going to go back to our Baseball data set and load the batting statistics into a DataFrame.
import pandas as pd # get the data for players in 2015-16 who played in 100 or more games df = pd.read_csv("./datasets/Batting.csv")
[] operator (again)¶As before basic slice selections can be made with the syntax similar to that found in lists using the convenience of the [] operator. For example, obtaining the first 5 rows of our data, or the last 15.
df[:5]
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abercda01 | 1871 | 1 | TRO | NaN | 1 | 4 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | addybo01 | 1871 | 1 | RC1 | NaN | 25 | 118 | 30 | 32 | 6 | ... | 13.0 | 8.0 | 1.0 | 4 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | allisar01 | 1871 | 1 | CL1 | NaN | 29 | 137 | 28 | 40 | 4 | ... | 19.0 | 3.0 | 1.0 | 2 | 5.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | allisdo01 | 1871 | 1 | WS3 | NaN | 27 | 133 | 28 | 44 | 10 | ... | 27.0 | 1.0 | 1.0 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | ansonca01 | 1871 | 1 | RC1 | NaN | 25 | 120 | 29 | 39 | 11 | ... | 16.0 | 6.0 | 2.0 | 2 | 1.0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 22 columns
df[-15:]
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102801 | ynoaga01 | 2016 | 1 | NYN | NL | 10 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102802 | ynoami01 | 2016 | 1 | CHA | AL | 23 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102803 | ynoara01 | 2016 | 1 | COL | NL | 3 | 5 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102804 | youngch03 | 2016 | 1 | KCA | AL | 34 | 1 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102805 | youngch04 | 2016 | 1 | BOS | AL | 76 | 203 | 29 | 56 | 18 | ... | 24.0 | 4.0 | 2.0 | 21 | 50.0 | 0.0 | 3.0 | 0.0 | 0.0 | 4.0 |
| 102806 | younger03 | 2016 | 1 | NYA | AL | 6 | 1 | 2 | 0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102807 | youngma03 | 2016 | 1 | ATL | NL | 8 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102808 | zastrro01 | 2016 | 1 | CHN | NL | 8 | 3 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102809 | zieglbr01 | 2016 | 1 | ARI | NL | 36 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102810 | zieglbr01 | 2016 | 2 | BOS | AL | 33 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 102811 | zimmejo02 | 2016 | 1 | DET | AL | 19 | 4 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 102812 | zimmery01 | 2016 | 1 | WAS | NL | 115 | 427 | 60 | 93 | 18 | ... | 46.0 | 4.0 | 1.0 | 29 | 104.0 | 1.0 | 5.0 | 0.0 | 6.0 | 12.0 |
| 102813 | zobribe01 | 2016 | 1 | CHN | NL | 147 | 523 | 94 | 142 | 31 | ... | 76.0 | 6.0 | 4.0 | 96 | 82.0 | 6.0 | 4.0 | 4.0 | 4.0 | 17.0 |
| 102814 | zuninmi01 | 2016 | 1 | SEA | AL | 55 | 164 | 16 | 34 | 7 | ... | 31.0 | 0.0 | 0.0 | 21 | 65.0 | 0.0 | 6.0 | 0.0 | 1.0 | 0.0 |
| 102815 | zychto01 | 2016 | 1 | SEA | AL | 12 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
15 rows × 22 columns
We mostly worked on row slicing with the [] selector, but if we pass a column label or list of the columns we'd like, say the RBI and G (games played) data, we get mostly what we'd expect:
df["RBI"][:5]
df[["RBI", "G"]][:10]
. selector on column and index name¶We can obtain column data by column labels (note that the column index was loaded for us when we read the file into the DataFrame). For example to get all the RBI data:
df.RBI[:10]
0 0.0 1 13.0 2 19.0 3 27.0 4 16.0 5 5.0 6 2.0 7 34.0 8 1.0 9 11.0 Name: RBI, dtype: float64
Similarly, we can pass a list of the columns we'd like, so let's get the RBI and G (games played) data:
df[["RBI", "G"]][:10]
| RBI | G | |
|---|---|---|
| 0 | 0.0 | 1 |
| 1 | 13.0 | 25 |
| 2 | 19.0 | 29 |
| 3 | 27.0 | 27 |
| 4 | 16.0 | 25 |
| 5 | 5.0 | 12 |
| 6 | 2.0 | 1 |
| 7 | 34.0 | 31 |
| 8 | 1.0 | 1 |
| 9 | 11.0 | 18 |
We have yet to make more complex selections beyond index values. Now we're ready to introduce selecting by boolean value. With this kinds of selection, we're going to as Pandas to give us the Series or DataFrame that represents the boolean values of what we want, then we will allow iloc to reduce the resulting Series or DataFrame to what we're looking for. Let's see this in action.
Say we want to find all items in our DataFrame where yearID is 2015 or
df.yearID == 2015
Let's first see what this does.
df.yearID == 2015
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
...
102786 False
102787 False
102788 False
102789 False
102790 False
102791 False
102792 False
102793 False
102794 False
102795 False
102796 False
102797 False
102798 False
102799 False
102800 False
102801 False
102802 False
102803 False
102804 False
102805 False
102806 False
102807 False
102808 False
102809 False
102810 False
102811 False
102812 False
102813 False
102814 False
102815 False
Name: yearID, Length: 102816, dtype: bool
We're returned the Series that contains a True or False given our boolean query. We need now pass this boolean Series into loc and we will see the outcome.
df.loc[df.yearID == 2015][:10] # note we're restricting the return to just the first 10 values
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99847 | aardsda01 | 2015 | 1 | ATL | NL | 33 | 1 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 99848 | abadfe01 | 2015 | 1 | OAK | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 99849 | abreujo02 | 2015 | 1 | CHA | AL | 154 | 613 | 88 | 178 | 34 | ... | 101.0 | 0.0 | 0.0 | 39 | 140.0 | 11.0 | 15.0 | 0.0 | 1.0 | 16.0 |
| 99850 | achteaj01 | 2015 | 1 | MIN | AL | 11 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 99851 | ackledu01 | 2015 | 1 | SEA | AL | 85 | 186 | 22 | 40 | 8 | ... | 19.0 | 2.0 | 2.0 | 14 | 38.0 | 0.0 | 1.0 | 3.0 | 3.0 | 3.0 |
| 99852 | ackledu01 | 2015 | 2 | NYA | AL | 23 | 52 | 6 | 15 | 3 | ... | 11.0 | 0.0 | 0.0 | 4 | 7.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 99853 | adamecr01 | 2015 | 1 | COL | NL | 26 | 53 | 4 | 13 | 1 | ... | 3.0 | 0.0 | 1.0 | 3 | 11.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 99854 | adamsau01 | 2015 | 1 | CLE | AL | 28 | 1 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 99855 | adamsma01 | 2015 | 1 | SLN | NL | 60 | 175 | 14 | 42 | 9 | ... | 24.0 | 1.0 | 0.0 | 10 | 41.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 99856 | adcocna01 | 2015 | 1 | CIN | NL | 13 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10 rows × 22 columns
Now what if we wanted the restrict this further by team. Say we wanted to see only the Minesota Twins player data for 2015. That is
df.yearID == 2015
AND
df.teamID == "MIN"
We simply put these in parethesis and use the & operator.
df.loc[(df.yearID == 2015) & (df.teamID == "MIN")].head(10)
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99850 | achteaj01 | 2015 | 1 | MIN | AL | 11 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 99891 | arciaos01 | 2015 | 1 | MIN | AL | 19 | 58 | 6 | 16 | 0 | ... | 8.0 | 0.0 | 0.0 | 4 | 15.0 | 4.0 | 2.0 | 0.0 | 1.0 | 2.0 |
| 99954 | bernido01 | 2015 | 1 | MIN | AL | 4 | 5 | 1 | 1 | 1 | ... | 2.0 | 0.0 | 0.0 | 1 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 99988 | boyerbl01 | 2015 | 1 | MIN | AL | 68 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100030 | buxtoby01 | 2015 | 1 | MIN | AL | 46 | 129 | 16 | 27 | 7 | ... | 6.0 | 2.0 | 2.0 | 6 | 44.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 |
| 100139 | cottsne01 | 2015 | 2 | MIN | AL | 17 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100215 | doziebr01 | 2015 | 1 | MIN | AL | 157 | 628 | 101 | 148 | 39 | ... | 77.0 | 12.0 | 4.0 | 61 | 148.0 | 2.0 | 7.0 | 0.0 | 8.0 | 10.0 |
| 100221 | duensbr01 | 2015 | 1 | MIN | AL | 55 | 1 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100222 | duffety01 | 2015 | 1 | MIN | AL | 10 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100249 | escobed01 | 2015 | 1 | MIN | AL | 127 | 409 | 48 | 107 | 31 | ... | 58.0 | 2.0 | 3.0 | 28 | 86.0 | 1.0 | 2.0 | 2.0 | 5.0 | 7.0 |
10 rows × 22 columns
Now what if we wanted to restrict a subset of columns. This is easy with iloc[] ... we will just use our boolean expression as above for the row selection and then the list of columns for our column selection (in this case a much smaller subset of data).
df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\ ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 99850 | achteaj01 | 11 | 0 | 0 | 0 | 0.0 |
| 99891 | arciaos01 | 19 | 58 | 16 | 2 | 8.0 |
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2.0 |
| 99988 | boyerbl01 | 68 | 0 | 0 | 0 | 0.0 |
| 100030 | buxtoby01 | 46 | 129 | 27 | 2 | 6.0 |
| 100139 | cottsne01 | 17 | 0 | 0 | 0 | 0.0 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 |
| 100221 | duensbr01 | 55 | 1 | 0 | 0 | 0.0 |
| 100222 | duffety01 | 10 | 0 | 0 | 0 | 0.0 |
| 100249 | escobed01 | 127 | 409 | 107 | 12 | 58.0 |
| 100270 | fienca01 | 62 | 0 | 0 | 0 | 0.0 |
| 100302 | fryerer01 | 15 | 22 | 5 | 0 | 2.0 |
| 100333 | gibsoky01 | 32 | 5 | 1 | 0 | 0.0 |
| 100373 | grahajr01 | 39 | 0 | 0 | 0 | 0.0 |
| 100455 | herrmch01 | 45 | 103 | 15 | 2 | 10.0 |
| 100459 | hicksaa01 | 97 | 352 | 90 | 11 | 33.0 |
| 100486 | hugheph01 | 27 | 3 | 0 | 0 | 0.0 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 |
| 100521 | jepseke01 | 29 | 0 | 0 | 0 | 0.0 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0.0 |
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 |
| 100701 | maytr01 | 48 | 3 | 0 | 0 | 0.0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0.0 |
| 100737 | milonto01 | 24 | 2 | 0 | 0 | 0.0 |
| 100807 | nolasri01 | 9 | 3 | 0 | 0 | 0.0 |
| 100816 | nunezed02 | 72 | 188 | 53 | 4 | 20.0 |
| 100837 | orourry01 | 28 | 0 | 0 | 0 | 0.0 |
| 100872 | pelfrmi01 | 30 | 3 | 2 | 0 | 0.0 |
| 100895 | perkigl01 | 60 | 0 | 0 | 0 | 0.0 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 |
| 100917 | polanjo01 | 4 | 10 | 3 | 0 | 1.0 |
| 100925 | pressry01 | 27 | 0 | 0 | 0 | 0.0 |
| 100994 | robinsh01 | 83 | 180 | 45 | 0 | 16.0 |
| 101023 | rosared01 | 122 | 453 | 121 | 13 | 50.0 |
| 101067 | sanomi01 | 80 | 279 | 75 | 18 | 52.0 |
| 101069 | santada01 | 91 | 261 | 56 | 0 | 21.0 |
| 101072 | santaer01 | 17 | 0 | 0 | 0 | 0.0 |
| 101079 | schafjo02 | 27 | 69 | 15 | 0 | 5.0 |
| 101144 | staufti01 | 13 | 0 | 0 | 0 | 0.0 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 |
| 101189 | thielca01 | 6 | 0 | 0 | 0 | 0.0 |
| 101193 | thompaa01 | 41 | 0 | 0 | 0 | 0.0 |
| 101203 | tonkimi01 | 26 | 0 | 0 | 0 | 0.0 |
| 101240 | vargake01 | 58 | 175 | 42 | 5 | 17.0 |
Sorting is facilitated by the sort_values() method. By default, sorting is done in ascending order, specify the parameter ascending=False to get descending order.
df_min_2015 = df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\ ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\ .sort_values('G', ascending=False) df_min_2015.head(20)
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 |
| 100249 | escobed01 | 127 | 409 | 107 | 12 | 58.0 |
| 101023 | rosared01 | 122 | 453 | 121 | 13 | 50.0 |
| 100459 | hicksaa01 | 97 | 352 | 90 | 11 | 33.0 |
| 101069 | santada01 | 91 | 261 | 56 | 0 | 21.0 |
| 100994 | robinsh01 | 83 | 180 | 45 | 0 | 16.0 |
| 101067 | sanomi01 | 80 | 279 | 75 | 18 | 52.0 |
| 100816 | nunezed02 | 72 | 188 | 53 | 4 | 20.0 |
| 99988 | boyerbl01 | 68 | 0 | 0 | 0 | 0.0 |
| 100270 | fienca01 | 62 | 0 | 0 | 0 | 0.0 |
| 100895 | perkigl01 | 60 | 0 | 0 | 0 | 0.0 |
| 101240 | vargake01 | 58 | 175 | 42 | 5 | 17.0 |
| 100221 | duensbr01 | 55 | 1 | 0 | 0 | 0.0 |
| 100701 | maytr01 | 48 | 3 | 0 | 0 | 0.0 |
| 100030 | buxtoby01 | 46 | 129 | 27 | 2 | 6.0 |
| 100455 | herrmch01 | 45 | 103 | 15 | 2 | 10.0 |
We may also do a multi-sort by passing in the list of columns we want sorted. This will sort in the order of the columns provided. For example,
df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\ ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\ .sort_values(['G', 'HR'], ascending=False).tail()
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 101189 | thielca01 | 6 | 0 | 0 | 0 | 0.0 |
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2.0 |
| 100917 | polanjo01 | 4 | 10 | 3 | 0 | 1.0 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0.0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0.0 |
df_min_2015.loc[:,'HtoAB'] = 0 df_min_2015.head()
| playerID | G | AB | H | HR | RBI | HtoAB | |
|---|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 | 0 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 | 0 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 | 0 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 | 0 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 | 0 |
df_min_2015 = df_min_2015.drop('HtoAB', axis=1) df_min_2015.head()
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 |
df_min_2015.H.head(10)
100696 157 100215 148 100915 140 100488 125 101164 104 100249 107 101023 121 100459 90 101069 56 100994 45 Name: H, dtype: int64
df_min_2015.loc[:,'HtoAB'] = 0 df_min_2015.loc[:,'HtoAB'] = [v.H/v.AB if v.AB > 0 else 0 for r, v in df_min_2015.iterrows()]
df_min_2015.head(10)
| playerID | G | AB | H | HR | RBI | HtoAB | |
|---|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 | 0.265203 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 | 0.235669 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 | 0.244328 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 | 0.239923 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 | 0.240185 |
| 100249 | escobed01 | 127 | 409 | 107 | 12 | 58.0 | 0.261614 |
| 101023 | rosared01 | 122 | 453 | 121 | 13 | 50.0 | 0.267108 |
| 100459 | hicksaa01 | 97 | 352 | 90 | 11 | 33.0 | 0.255682 |
| 101069 | santada01 | 91 | 261 | 56 | 0 | 21.0 | 0.214559 |
| 100994 | robinsh01 | 83 | 180 | 45 | 0 | 16.0 | 0.250000 |
df_min_2015[df_min_2015.G>80].sort_values('HtoAB', ascending=False)
| playerID | G | AB | H | HR | RBI | HtoAB | |
|---|---|---|---|---|---|---|---|
| 101023 | rosared01 | 122 | 453 | 121 | 13 | 50.0 | 0.267108 |
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 | 0.265203 |
| 100249 | escobed01 | 127 | 409 | 107 | 12 | 58.0 | 0.261614 |
| 100459 | hicksaa01 | 97 | 352 | 90 | 11 | 33.0 | 0.255682 |
| 100994 | robinsh01 | 83 | 180 | 45 | 0 | 16.0 | 0.250000 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 | 0.244328 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 | 0.240185 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 | 0.239923 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 | 0.235669 |
| 101069 | santada01 | 91 | 261 | 56 | 0 | 21.0 | 0.214559 |
df_min_2015 = df_min_2015.reindex(columns=['playerID', 'HtoAB', 'AB', 'H', 'HR', 'RBI', 'G']) df_min_2015.head()
| playerID | HtoAB | AB | H | HR | RBI | G | |
|---|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 0.265203 | 592 | 157 | 10 | 66.0 | 158 |
| 100215 | doziebr01 | 0.235669 | 628 | 148 | 28 | 77.0 | 157 |
| 100915 | plouftr01 | 0.244328 | 573 | 140 | 22 | 86.0 | 152 |
| 100488 | hunteto01 | 0.239923 | 521 | 125 | 22 | 81.0 | 139 |
| 101164 | suzukku01 | 0.240185 | 433 | 104 | 5 | 50.0 | 131 |
Finally, we can return our DataFrame back to its original columns (and order) by reindexing again. Notice, also that we can effectively perform a drop() by doing this, though the syntax with reindex() is more verbose.
df_min_2015 = df_min_2015.reindex(columns=['playerID', 'G', 'AB', 'H', 'HR', 'RBI']) df_min_2015.head()
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 |
Adding rows can be achieved using loc[] and setting the new index to a dictionary of values using the column labels as keys.
df_min_2015.loc[200000] = \ { 'playerID': 'keith01', 'RBI': '0', 'G': '0', 'H': '0', 'HR': '0', 'AB': '0' } df_min_2015.tail()
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
| 200000 | keith01 | 0 | 0 | 0 | 0 | 0 |
It is also the same with lists and tuples.
df_min_2015.loc[200000] = ('keith01', 1, 1, 1, 1, 1) df_min_2015.loc[200001] = ['keith02', 1, 1, 1, 1, 1] df_min_2015.tail()
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
| 200000 | keith01 | 1 | 1 | 1 | 1 | 1 |
| 200001 | keith02 | 1 | 1 | 1 | 1 | 1 |
Note that we can drop a number of rows at a time by passing a list of the indices we'd like dropped.
df_min_2015 = df_min_2015.drop([200000, 200001], axis=0) df_min_2015.tail()
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 101189 | thielca01 | 6 | 0 | 0 | 0 | 0 |
| 100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
Similar results can be achieved using append(). With append, you can append, Series, DataFrames and/or a list of these.
df_min_2015.append( pd.Series( {'playerID': 'keith01', 'G': 0, 'AB': 0, 'H':0, 'HR': 0, 'RBI': 0}, name='200000')).tail()
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
| 200000 | keith01 | 0 | 0 | 0 | 0 | 0 |
df_min_2015[:5].append(df_min_2015[-5:])
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 |
| 101189 | thielca01 | 6 | 0 | 0 | 0 | 0 |
| 100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
df_min_2015[:5].append([df_min_2015[10:12], df_min_2015[-5:]])
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 |
| 101067 | sanomi01 | 80 | 279 | 75 | 18 | 52 |
| 100816 | nunezed02 | 72 | 188 | 53 | 4 | 20 |
| 101189 | thielca01 | 6 | 0 | 0 | 0 | 0 |
| 100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
The same result can be achieved with pd.concat(), where the defaut axis is 0.
pd.concat([df_min_2015[:5], df_min_2015[-5:]], axis=0)
| playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 |
| 101189 | thielca01 | 6 | 0 | 0 | 0 | 0 |
| 100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
| 99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
But we can use concat() to make a column-wise concatenation using axis=1 (columns).
pd.concat([df_min_2015[:5], df_min_2015[-5:]], axis=1)
| playerID | G | AB | H | HR | RBI | playerID | G | AB | H | HR | RBI | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99954 | NaN | NaN | NaN | NaN | NaN | NaN | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100564 | NaN | NaN | NaN | NaN | NaN | NaN | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100729 | NaN | NaN | NaN | NaN | NaN | NaN | meyeral01 | 2 | 0 | 0 | 0 | 0 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100917 | NaN | NaN | NaN | NaN | NaN | NaN | polanjo01 | 4 | 10 | 3 | 0 | 1 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 | NaN | NaN | NaN | NaN | NaN | NaN |
| 101189 | NaN | NaN | NaN | NaN | NaN | NaN | thielca01 | 6 | 0 | 0 | 0 | 0 |
We can see that the indices are being considered in the concatenation and row indices are being joined. This behavior can be controlled via the join parameter, which we'll leave for the reader to explore.
One last thing we might want to do in an operation like this is to reset the index. To do so, we might start with ignoring the column index using the ignore_index=True so we can set it later to something more appropriate after the concatenation.
pd.concat([df_min_2015[:5], df_min_2015[-5:]], axis=1, ignore_index=True)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99954 | NaN | NaN | NaN | NaN | NaN | NaN | bernido01 | 4 | 5 | 1 | 0 | 2 |
| 100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100564 | NaN | NaN | NaN | NaN | NaN | NaN | keplema01 | 3 | 7 | 1 | 0 | 0 |
| 100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100729 | NaN | NaN | NaN | NaN | NaN | NaN | meyeral01 | 2 | 0 | 0 | 0 | 0 |
| 100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100917 | NaN | NaN | NaN | NaN | NaN | NaN | polanjo01 | 4 | 10 | 3 | 0 | 1 |
| 101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 | NaN | NaN | NaN | NaN | NaN | NaN |
| 101189 | NaN | NaN | NaN | NaN | NaN | NaN | thielca01 | 6 | 0 | 0 | 0 | 0 |
Pandas provides the ability to build more complex indices allowing for highly flexible and natural data access.
We will cover the basics of through the MultiIndex object and will the the remaining exploration to the reader.
Let's get the players on the Washington Nationals who played 100 or more games in 2015 and 2016.
df_was = df[(df.yearID > 2014) & (df.teamID=='WAS') & (df.G > 99)]
df_was.head()
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100193 | desmoia01 | 2015 | 1 | WAS | NL | 156 | 583 | 69 | 136 | 27 | ... | 62.0 | 13.0 | 5.0 | 45 | 187.0 | 0.0 | 3.0 | 6.0 | 4.0 | 9.0 |
| 100250 | escobyu01 | 2015 | 1 | WAS | NL | 139 | 535 | 75 | 168 | 25 | ... | 56.0 | 2.0 | 2.0 | 45 | 70.0 | 0.0 | 8.0 | 1.0 | 2.0 | 24.0 |
| 100251 | espinda01 | 2015 | 1 | WAS | NL | 118 | 367 | 59 | 88 | 21 | ... | 37.0 | 5.0 | 2.0 | 33 | 106.0 | 5.0 | 6.0 | 3.0 | 3.0 | 6.0 |
| 100422 | harpebr03 | 2015 | 1 | WAS | NL | 153 | 521 | 118 | 172 | 38 | ... | 99.0 | 6.0 | 4.0 | 124 | 131.0 | 15.0 | 5.0 | 0.0 | 4.0 | 15.0 |
| 100950 | ramoswi01 | 2015 | 1 | WAS | NL | 128 | 475 | 41 | 109 | 16 | ... | 68.0 | 0.0 | 0.0 | 21 | 101.0 | 2.0 | 0.0 | 0.0 | 8.0 | 16.0 |
5 rows × 22 columns
One obvious problem if we were to access the data here by player and year, we have to build a much more involved query and even more so if we needed to ignore data.
We are going to create a hierarchical index or MultiIndex to solve this problem. We'll take take liberty to drop columns we don't need (teamID, ldID, stint) and reorganize the index hierarchically.
We will use MultiIndex using a tuple of the data we need and provide the index first by player, then by year. To do this we'll just grab all the player IDs and zip them with the year. This will look something like this:
tuple( zip( df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'], df_was[['playerID','yearID']].sort_values(by='playerID')['yearID'] ) )
(('desmoia01', 2015),
('escobyu01', 2015),
('espinda01', 2015),
('espinda01', 2016),
('harpebr03', 2015),
('harpebr03', 2016),
('murphda08', 2016),
('ramoswi01', 2015),
('ramoswi01', 2016),
('rendoan01', 2016),
('reverbe01', 2016),
('robincl01', 2015),
('robincl01', 2016),
('taylomi02', 2015),
('werthja01', 2016),
('zimmery01', 2016))
# create an index to be used over the data we're interested in idx = \ pd.MultiIndex.from_tuples( tuple( zip( df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'], df_was[['playerID','yearID']].sort_values(by='playerID')['yearID'])) ) idx
MultiIndex(levels=[['desmoia01', 'escobyu01', 'espinda01', 'harpebr03', 'murphda08', 'ramoswi01', 'rendoan01', 'reverbe01', 'robincl01', 'taylomi02', 'werthja01', 'zimmery01'], [2015, 2016]],
labels=[[0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10, 11], [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]])
Notice now that we have two levels in our row axis (axis 0) and we will now use that index to build the hierachically indexed DataFrame.
# sorting the indices is critical for lining up the data in the tuples df_was = df_was.sort_values(by=['playerID']).\ set_index(idx).\ drop(['playerID', 'yearID', 'teamID', 'lgID', 'stint'], axis=1) df_was
| G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| desmoia01 | 2015 | 156 | 583 | 69 | 136 | 27 | 2 | 19 | 62.0 | 13.0 | 5.0 | 45 | 187.0 | 0.0 | 3.0 | 6.0 | 4.0 | 9.0 |
| escobyu01 | 2015 | 139 | 535 | 75 | 168 | 25 | 1 | 9 | 56.0 | 2.0 | 2.0 | 45 | 70.0 | 0.0 | 8.0 | 1.0 | 2.0 | 24.0 |
| espinda01 | 2015 | 118 | 367 | 59 | 88 | 21 | 1 | 13 | 37.0 | 5.0 | 2.0 | 33 | 106.0 | 5.0 | 6.0 | 3.0 | 3.0 | 6.0 |
| 2016 | 157 | 516 | 66 | 108 | 15 | 0 | 24 | 72.0 | 9.0 | 2.0 | 54 | 174.0 | 12.0 | 20.0 | 7.0 | 4.0 | 4.0 | |
| harpebr03 | 2015 | 153 | 521 | 118 | 172 | 38 | 1 | 42 | 99.0 | 6.0 | 4.0 | 124 | 131.0 | 15.0 | 5.0 | 0.0 | 4.0 | 15.0 |
| 2016 | 147 | 506 | 84 | 123 | 24 | 2 | 24 | 86.0 | 21.0 | 10.0 | 108 | 117.0 | 20.0 | 3.0 | 0.0 | 10.0 | 11.0 | |
| murphda08 | 2016 | 142 | 531 | 88 | 184 | 47 | 5 | 25 | 104.0 | 5.0 | 3.0 | 35 | 57.0 | 10.0 | 8.0 | 0.0 | 8.0 | 4.0 |
| ramoswi01 | 2015 | 128 | 475 | 41 | 109 | 16 | 0 | 15 | 68.0 | 0.0 | 0.0 | 21 | 101.0 | 2.0 | 0.0 | 0.0 | 8.0 | 16.0 |
| 2016 | 131 | 482 | 58 | 148 | 25 | 0 | 22 | 80.0 | 0.0 | 0.0 | 35 | 79.0 | 2.0 | 2.0 | 0.0 | 4.0 | 17.0 | |
| rendoan01 | 2016 | 156 | 567 | 91 | 153 | 38 | 2 | 20 | 85.0 | 12.0 | 6.0 | 65 | 117.0 | 2.0 | 7.0 | 0.0 | 8.0 | 5.0 |
| reverbe01 | 2016 | 103 | 350 | 44 | 76 | 9 | 7 | 2 | 24.0 | 14.0 | 5.0 | 18 | 34.0 | 0.0 | 3.0 | 2.0 | 2.0 | 12.0 |
| robincl01 | 2015 | 126 | 309 | 44 | 84 | 15 | 1 | 10 | 34.0 | 0.0 | 0.0 | 37 | 52.0 | 4.0 | 5.0 | 0.0 | 1.0 | 6.0 |
| 2016 | 104 | 196 | 16 | 46 | 4 | 0 | 5 | 26.0 | 0.0 | 0.0 | 20 | 38.0 | 0.0 | 2.0 | 1.0 | 5.0 | 4.0 | |
| taylomi02 | 2015 | 138 | 472 | 49 | 108 | 15 | 2 | 14 | 63.0 | 16.0 | 3.0 | 35 | 158.0 | 9.0 | 1.0 | 1.0 | 2.0 | 5.0 |
| werthja01 | 2016 | 143 | 525 | 84 | 128 | 28 | 0 | 21 | 69.0 | 5.0 | 1.0 | 71 | 139.0 | 0.0 | 4.0 | 0.0 | 6.0 | 17.0 |
| zimmery01 | 2016 | 115 | 427 | 60 | 93 | 18 | 1 | 15 | 46.0 | 4.0 | 1.0 | 29 | 104.0 | 1.0 | 5.0 | 0.0 | 6.0 | 12.0 |
df_was.loc[('robincl01', ),['G', 'AB', 'H', 'SO']]
| G | AB | H | SO | |
|---|---|---|---|---|
| 2015 | 126 | 309 | 84 | 52.0 |
| 2016 | 104 | 196 | 46 | 38.0 |
df_was.loc[('robincl01', 2016),['G', 'AB', 'H', 'SO']]
G 104.0 AB 196.0 H 46.0 SO 38.0 Name: (robincl01, 2016), dtype: float64
For the sake of the example, let's take the DataFrame for all rows of data past 2016 and create a multi-index using year, league, team and player as the groupings of the index.
df.head()
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abercda01 | 1871 | 1 | TRO | NaN | 1 | 4 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | addybo01 | 1871 | 1 | RC1 | NaN | 25 | 118 | 30 | 32 | 6 | ... | 13.0 | 8.0 | 1.0 | 4 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | allisar01 | 1871 | 1 | CL1 | NaN | 29 | 137 | 28 | 40 | 4 | ... | 19.0 | 3.0 | 1.0 | 2 | 5.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | allisdo01 | 1871 | 1 | WS3 | NaN | 27 | 133 | 28 | 44 | 10 | ... | 27.0 | 1.0 | 1.0 | 0 | 2.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | ansonca01 | 1871 | 1 | RC1 | NaN | 25 | 120 | 29 | 39 | 11 | ... | 16.0 | 6.0 | 2.0 | 2 | 1.0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 22 columns
df_mi = df[df.yearID>2006].copy() idx_labels = ['yearID', 'lgID', 'teamID', 'playerID'] tuple( zip( df_mi[idx_labels]\ .sort_values(idx_labels)['yearID'], df_mi[idx_labels]\ .sort_values(idx_labels)['lgID'], df_mi[idx_labels]\ .sort_values(idx_labels)['teamID'], df_mi[idx_labels]\ .sort_values(idx_labels)['playerID']))[-10:]
((2016, 'NL', 'WAS', 'rzepcma01'), (2016, 'NL', 'WAS', 'scherma01'), (2016, 'NL', 'WAS', 'severpe01'), (2016, 'NL', 'WAS', 'solissa01'), (2016, 'NL', 'WAS', 'strasst01'), (2016, 'NL', 'WAS', 'taylomi02'), (2016, 'NL', 'WAS', 'treinbl01'), (2016, 'NL', 'WAS', 'turnetr01'), (2016, 'NL', 'WAS', 'werthja01'), (2016, 'NL', 'WAS', 'zimmery01'))
idx = \ pd.MultiIndex.from_tuples( tuple( zip( df_mi[idx_labels]\ .sort_values(idx_labels)['yearID'], df_mi[idx_labels]\ .sort_values(idx_labels)['lgID'], df_mi[idx_labels]\ .sort_values(idx_labels)['teamID'], df_mi[idx_labels]\ .sort_values(idx_labels)['playerID'])) )
df_mi = df_mi.sort_values(['yearID', 'teamID']).set_index(idx)#.drop(['playerID', 'yearID', 'teamID', 'stint'], axis=1)
df_mi.head()
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2007 | AL | BAL | baezda01 | bardebr01 | 2007 | 1 | ARI | NL | 8 | 12 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| bakopa01 | bonifem01 | 2007 | 1 | ARI | NL | 11 | 23 | 2 | 5 | 1 | ... | 2.0 | 0.0 | 1.0 | 4 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
| bedarer01 | byrneer01 | 2007 | 1 | ARI | NL | 160 | 626 | 103 | 179 | 30 | ... | 83.0 | 50.0 | 7.0 | 57 | 98.0 | 5.0 | 10.0 | 1.0 | 4.0 | 12.0 | |||
| bellro01 | callaal01 | 2007 | 1 | ARI | NL | 56 | 144 | 10 | 31 | 8 | ... | 7.0 | 1.0 | 1.0 | 9 | 14.0 | 0.0 | 1.0 | 1.0 | 1.0 | 8.0 | |||
| birkiku01 | choatra01 | 2007 | 1 | ARI | NL | 2 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 22 columns
df_mi.tail()
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016 | NL | WAS | taylomi02 | taylomi02 | 2016 | 1 | WAS | NL | 76 | 221 | 28 | 51 | 11 | ... | 16.0 | 14.0 | 3.0 | 14 | 77.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 |
| treinbl01 | treinbl01 | 2016 | 1 | WAS | NL | 73 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
| turnetr01 | turnetr01 | 2016 | 1 | WAS | NL | 73 | 307 | 53 | 105 | 14 | ... | 40.0 | 33.0 | 6.0 | 14 | 59.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | |||
| werthja01 | werthja01 | 2016 | 1 | WAS | NL | 143 | 525 | 84 | 128 | 28 | ... | 69.0 | 5.0 | 1.0 | 71 | 139.0 | 0.0 | 4.0 | 0.0 | 6.0 | 17.0 | |||
| zimmery01 | zimmery01 | 2016 | 1 | WAS | NL | 115 | 427 | 60 | 93 | 18 | ... | 46.0 | 4.0 | 1.0 | 29 | 104.0 | 1.0 | 5.0 | 0.0 | 6.0 | 12.0 |
5 rows × 22 columns
Now we can use this multi-index to out advantage, using the tuple of the index values we want and restricting the columns to just the data of interest.
df_mi.loc[(2007, 'AL', 'TOR'), ['G', 'AB']].head()
| G | AB | |
|---|---|---|
| accarje01 | 152 | 509 |
| adamsru01 | 62 | 1 |
| banksjo01 | 26 | 5 |
| burneaj01 | 8 | 14 |
| chacigu01 | 65 | 0 |
Ξ