Let's quickly review our terminology: axis 0 is the row axis and axis 1 is the column axis. When dealing with multi-indices, the hierarchy within an axis is referred to in terms of levels, which are accessed in much the same way.

For the examples in this section, we're going to go back to our Baseball data set and load the batting statistics into a DataFrame.
import pandas as pd
# load the full Batting table; we'll filter it down as we go
df = pd.read_csv("./datasets/Batting.csv")
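As a quick, hedged illustration of the axis convention on the freshly loaded frame (these two throwaway calls are not part of the original walkthrough):

df.drop(0, axis=0).head()      # axis=0: drop the row labeled 0
df.drop('G', axis=1).head()    # axis=1: drop the 'G' column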
The [] operator (again)

As before, basic slice selections can be made with a syntax similar to that of lists, using the convenience of the [] operator; for example, obtaining the first 5 rows of our data, or the last 15.
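A minimal sketch of both slices:

df[:5]      # the first 5 rows
df[-15:]    # the last 15 rows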
So far we have mostly used the [] selector for row slicing, but if we pass a column label, or a list of the columns we'd like, say the RBI and G (games played) data, we get mostly what we'd expect:
df["RBI"][:5]
0     0.0
1    13.0
2    19.0
3    27.0
4    16.0
Name: RBI, dtype: float64
df[["RBI", "G"]][:10]
 | RBI | G |
---|---|---|
0 | 0.0 | 1 |
1 | 13.0 | 25 |
2 | 19.0 | 29 |
3 | 27.0 | 27 |
4 | 16.0 | 25 |
5 | 5.0 | 12 |
6 | 2.0 | 1 |
7 | 34.0 | 31 |
8 | 1.0 | 1 |
9 | 11.0 | 18 |
So far our selections have been limited to index values and column labels. Now we're ready to introduce selecting by boolean values. With this kind of selection, we ask pandas for a Series or DataFrame of booleans describing what we want, then pass it to loc to reduce the DataFrame to just the rows we're looking for. Let's see this in action.
Say we want to find all rows in our DataFrame where yearID is 2015, that is, where df.yearID == 2015. Let's first see what that expression does on its own.
df.yearID == 2015
0         False
1         False
2         False
3         False
4         False
          ...
102811    False
102812    False
102813    False
102814    False
102815    False
Name: yearID, Length: 102816, dtype: bool
df.loc[df.yearID == 2015][:10] # note we're restricting the return to just the first 10 values
 | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
99847 | aardsda01 | 2015 | 1 | ATL | NL | 33 | 1 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
99848 | abadfe01 | 2015 | 1 | OAK | AL | 62 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
99849 | abreujo02 | 2015 | 1 | CHA | AL | 154 | 613 | 88 | 178 | 34 | ... | 101.0 | 0.0 | 0.0 | 39 | 140.0 | 11.0 | 15.0 | 0.0 | 1.0 | 16.0 |
99850 | achteaj01 | 2015 | 1 | MIN | AL | 11 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
99851 | ackledu01 | 2015 | 1 | SEA | AL | 85 | 186 | 22 | 40 | 8 | ... | 19.0 | 2.0 | 2.0 | 14 | 38.0 | 0.0 | 1.0 | 3.0 | 3.0 | 3.0 |
99852 | ackledu01 | 2015 | 2 | NYA | AL | 23 | 52 | 6 | 15 | 3 | ... | 11.0 | 0.0 | 0.0 | 4 | 7.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
99853 | adamecr01 | 2015 | 1 | COL | NL | 26 | 53 | 4 | 13 | 1 | ... | 3.0 | 0.0 | 1.0 | 3 | 11.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
99854 | adamsau01 | 2015 | 1 | CLE | AL | 28 | 1 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
99855 | adamsma01 | 2015 | 1 | SLN | NL | 60 | 175 | 14 | 42 | 9 | ... | 24.0 | 1.0 | 0.0 | 10 | 41.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
99856 | adcocna01 | 2015 | 1 | CIN | NL | 13 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10 rows × 22 columns
Now what if we wanted to restrict this further by team? Say we wanted to see only the Minnesota Twins player data for 2015, that is,

df.yearID == 2015 AND df.teamID == "MIN"

We simply put each condition in parentheses and combine them with the & operator.
df.loc[(df.yearID == 2015) & (df.teamID == "MIN")].head(10)
 | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
99850 | achteaj01 | 2015 | 1 | MIN | AL | 11 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
99891 | arciaos01 | 2015 | 1 | MIN | AL | 19 | 58 | 6 | 16 | 0 | ... | 8.0 | 0.0 | 0.0 | 4 | 15.0 | 4.0 | 2.0 | 0.0 | 1.0 | 2.0 |
99954 | bernido01 | 2015 | 1 | MIN | AL | 4 | 5 | 1 | 1 | 1 | ... | 2.0 | 0.0 | 0.0 | 1 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
99988 | boyerbl01 | 2015 | 1 | MIN | AL | 68 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100030 | buxtoby01 | 2015 | 1 | MIN | AL | 46 | 129 | 16 | 27 | 7 | ... | 6.0 | 2.0 | 2.0 | 6 | 44.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 |
100139 | cottsne01 | 2015 | 2 | MIN | AL | 17 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100215 | doziebr01 | 2015 | 1 | MIN | AL | 157 | 628 | 101 | 148 | 39 | ... | 77.0 | 12.0 | 4.0 | 61 | 148.0 | 2.0 | 7.0 | 0.0 | 8.0 | 10.0 |
100221 | duensbr01 | 2015 | 1 | MIN | AL | 55 | 1 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100222 | duffety01 | 2015 | 1 | MIN | AL | 10 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
100249 | escobed01 | 2015 | 1 | MIN | AL | 127 | 409 | 48 | 107 | 31 | ... | 58.0 | 2.0 | 3.0 | 28 | 86.0 | 1.0 | 2.0 | 2.0 | 5.0 | 7.0 |
10 rows × 22 columns
Now what if we wanted to restrict the result to a subset of columns? This is easy with loc[]: we just use our boolean expression as above for the row selection, followed by the list of columns we'd like for the column selection (in this case a much smaller subset of the data).
df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
99850 | achteaj01 | 11 | 0 | 0 | 0 | 0.0 |
99891 | arciaos01 | 19 | 58 | 16 | 2 | 8.0 |
99954 | bernido01 | 4 | 5 | 1 | 0 | 2.0 |
99988 | boyerbl01 | 68 | 0 | 0 | 0 | 0.0 |
100030 | buxtoby01 | 46 | 129 | 27 | 2 | 6.0 |
100139 | cottsne01 | 17 | 0 | 0 | 0 | 0.0 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 |
100221 | duensbr01 | 55 | 1 | 0 | 0 | 0.0 |
100222 | duffety01 | 10 | 0 | 0 | 0 | 0.0 |
100249 | escobed01 | 127 | 409 | 107 | 12 | 58.0 |
100270 | fienca01 | 62 | 0 | 0 | 0 | 0.0 |
100302 | fryerer01 | 15 | 22 | 5 | 0 | 2.0 |
100333 | gibsoky01 | 32 | 5 | 1 | 0 | 0.0 |
100373 | grahajr01 | 39 | 0 | 0 | 0 | 0.0 |
100455 | herrmch01 | 45 | 103 | 15 | 2 | 10.0 |
100459 | hicksaa01 | 97 | 352 | 90 | 11 | 33.0 |
100486 | hugheph01 | 27 | 3 | 0 | 0 | 0.0 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 |
100521 | jepseke01 | 29 | 0 | 0 | 0 | 0.0 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0.0 |
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 |
100701 | maytr01 | 48 | 3 | 0 | 0 | 0.0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0.0 |
100737 | milonto01 | 24 | 2 | 0 | 0 | 0.0 |
100807 | nolasri01 | 9 | 3 | 0 | 0 | 0.0 |
100816 | nunezed02 | 72 | 188 | 53 | 4 | 20.0 |
100837 | orourry01 | 28 | 0 | 0 | 0 | 0.0 |
100872 | pelfrmi01 | 30 | 3 | 2 | 0 | 0.0 |
100895 | perkigl01 | 60 | 0 | 0 | 0 | 0.0 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 |
100917 | polanjo01 | 4 | 10 | 3 | 0 | 1.0 |
100925 | pressry01 | 27 | 0 | 0 | 0 | 0.0 |
100994 | robinsh01 | 83 | 180 | 45 | 0 | 16.0 |
101023 | rosared01 | 122 | 453 | 121 | 13 | 50.0 |
101067 | sanomi01 | 80 | 279 | 75 | 18 | 52.0 |
101069 | santada01 | 91 | 261 | 56 | 0 | 21.0 |
101072 | santaer01 | 17 | 0 | 0 | 0 | 0.0 |
101079 | schafjo02 | 27 | 69 | 15 | 0 | 5.0 |
101144 | staufti01 | 13 | 0 | 0 | 0 | 0.0 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 |
101189 | thielca01 | 6 | 0 | 0 | 0 | 0.0 |
101193 | thompaa01 | 41 | 0 | 0 | 0 | 0.0 |
101203 | tonkimi01 | 26 | 0 | 0 | 0 | 0.0 |
101240 | vargake01 | 58 | 175 | 42 | 5 | 17.0 |
Sorting is facilitated by the sort_values() method. By default, sorting is done in ascending order; specify the parameter ascending=False to get descending order.
df_min_2015 = df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\
.sort_values('G', ascending=False)
df_min_2015.head(20)
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 |
100249 | escobed01 | 127 | 409 | 107 | 12 | 58.0 |
101023 | rosared01 | 122 | 453 | 121 | 13 | 50.0 |
100459 | hicksaa01 | 97 | 352 | 90 | 11 | 33.0 |
101069 | santada01 | 91 | 261 | 56 | 0 | 21.0 |
100994 | robinsh01 | 83 | 180 | 45 | 0 | 16.0 |
101067 | sanomi01 | 80 | 279 | 75 | 18 | 52.0 |
100816 | nunezed02 | 72 | 188 | 53 | 4 | 20.0 |
99988 | boyerbl01 | 68 | 0 | 0 | 0 | 0.0 |
100270 | fienca01 | 62 | 0 | 0 | 0 | 0.0 |
100895 | perkigl01 | 60 | 0 | 0 | 0 | 0.0 |
101240 | vargake01 | 58 | 175 | 42 | 5 | 17.0 |
100221 | duensbr01 | 55 | 1 | 0 | 0 | 0.0 |
100701 | maytr01 | 48 | 3 | 0 | 0 | 0.0 |
100030 | buxtoby01 | 46 | 129 | 27 | 2 | 6.0 |
100455 | herrmch01 | 45 | 103 | 15 | 2 | 10.0 |
We may also do a multi-sort by passing in a list of the columns we want sorted. The sort is applied in the order the columns are provided, so ties in the first column are broken by the second, and so on. For example,
df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\
.sort_values(['G', 'HR'], ascending=False).tail()
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
101189 | thielca01 | 6 | 0 | 0 | 0 | 0.0 |
99954 | bernido01 | 4 | 5 | 1 | 0 | 2.0 |
100917 | polanjo01 | 4 | 10 | 3 | 0 | 1.0 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0.0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0.0 |
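We can also add a new column simply by assigning to it with loc[]. Here we create a placeholder column, HtoAB (the ratio of hits to at-bats), initialized to 0 for every row: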
df_min_2015.loc[:,'HtoAB'] = 0
df_min_2015.head()
 | playerID | G | AB | H | HR | RBI | HtoAB |
---|---|---|---|---|---|---|---|
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 | 0 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 | 0 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 | 0 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 | 0 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 | 0 |
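We can drop the column again just as easily with drop(), passing axis=1 to indicate that we're dropping along the column axis: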
df_min_2015 = df_min_2015.drop('HtoAB', axis=1)
df_min_2015.head()
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 |
df_min_2015.H.head(10)
100696    157
100215    148
100915    140
100488    125
101164    104
100249    107
101023    121
100459     90
101069     56
100994     45
Name: H, dtype: int64
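Now let's fill HtoAB with each player's actual ratio of hits to at-bats, iterating over the rows with iterrows() and guarding against division by zero for players with no at-bats: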
df_min_2015.loc[:,'HtoAB'] = 0
df_min_2015.loc[:,'HtoAB'] = [v.H / v.AB if v.AB > 0 else 0
                              for r, v in df_min_2015.iterrows()]
df_min_2015.head(10)
 | playerID | G | AB | H | HR | RBI | HtoAB |
---|---|---|---|---|---|---|---|
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 | 0.265203 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 | 0.235669 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 | 0.244328 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 | 0.239923 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 | 0.240185 |
100249 | escobed01 | 127 | 409 | 107 | 12 | 58.0 | 0.261614 |
101023 | rosared01 | 122 | 453 | 121 | 13 | 50.0 | 0.267108 |
100459 | hicksaa01 | 97 | 352 | 90 | 11 | 33.0 | 0.255682 |
101069 | santada01 | 91 | 261 | 56 | 0 | 21.0 | 0.214559 |
100994 | robinsh01 | 83 | 180 | 45 | 0 | 16.0 | 0.250000 |
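As an aside, the same column can be computed without iterating row by row. A vectorized sketch (assuming numpy is imported as np; this is not part of the original walkthrough):

import numpy as np

# divide element-wise, substituting 0 where a player had no at-bats
df_min_2015['HtoAB'] = np.where(df_min_2015.AB > 0,
                                df_min_2015.H / df_min_2015.AB, 0)

With the ratio in place, we can look at just the regulars, players with more than 80 games, sorted by HtoAB: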
df_min_2015[df_min_2015.G>80].sort_values('HtoAB', ascending=False)
 | playerID | G | AB | H | HR | RBI | HtoAB |
---|---|---|---|---|---|---|---|
101023 | rosared01 | 122 | 453 | 121 | 13 | 50.0 | 0.267108 |
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 | 0.265203 |
100249 | escobed01 | 127 | 409 | 107 | 12 | 58.0 | 0.261614 |
100459 | hicksaa01 | 97 | 352 | 90 | 11 | 33.0 | 0.255682 |
100994 | robinsh01 | 83 | 180 | 45 | 0 | 16.0 | 0.250000 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 | 0.244328 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 | 0.240185 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 | 0.239923 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 | 0.235669 |
101069 | santada01 | 91 | 261 | 56 | 0 | 21.0 | 0.214559 |
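Columns can be reordered with reindex() by passing the full list of column labels in the order we want; here we move HtoAB up next to playerID (and G to the end):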
df_min_2015 = df_min_2015.reindex(columns=['playerID', 'HtoAB', 'AB', 'H', 'HR', 'RBI', 'G'])
df_min_2015.head()
 | playerID | HtoAB | AB | H | HR | RBI | G |
---|---|---|---|---|---|---|---|
100696 | mauerjo01 | 0.265203 | 592 | 157 | 10 | 66.0 | 158 |
100215 | doziebr01 | 0.235669 | 628 | 148 | 28 | 77.0 | 157 |
100915 | plouftr01 | 0.244328 | 573 | 140 | 22 | 86.0 | 152 |
100488 | hunteto01 | 0.239923 | 521 | 125 | 22 | 81.0 | 139 |
101164 | suzukku01 | 0.240185 | 433 | 104 | 5 | 50.0 | 131 |
Finally, we can return our DataFrame to its original columns (and order) by reindexing again. Notice also that this effectively performs a drop() of HtoAB, though the reindex() syntax is more verbose.
df_min_2015 = df_min_2015.reindex(columns=['playerID', 'G', 'AB', 'H', 'HR', 'RBI'])
df_min_2015.head()
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66.0 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77.0 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86.0 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81.0 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50.0 |
Adding rows can be achieved using loc[]: assign a dictionary of values, keyed by column label, to a new index label.
df_min_2015.loc[200000] = \
{ 'playerID': 'keith01',
'RBI': '0',
'G': '0',
'H': '0',
'HR': '0',
'AB': '0' }
df_min_2015.tail()
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
200000 | keith01 | 0 | 0 | 0 | 0 | 0 |
The same works with lists and tuples, as long as the values are given in the DataFrame's column order.
df_min_2015.loc[200000] = ('keith01', 1, 1, 1, 1, 1)
df_min_2015.loc[200001] = ['keith02', 1, 1, 1, 1, 1]
df_min_2015.tail()
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
200000 | keith01 | 1 | 1 | 1 | 1 | 1 |
200001 | keith02 | 1 | 1 | 1 | 1 | 1 |
Note that we can drop a number of rows at a time by passing a list of the indices we'd like dropped.
df_min_2015 = df_min_2015.drop([200000, 200001], axis=0)
df_min_2015.tail()
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
101189 | thielca01 | 6 | 0 | 0 | 0 | 0 |
100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
Similar results can be achieved using append(). With append() you can append a Series, a DataFrame, or a list of these. (Note that DataFrame.append() has since been deprecated and was removed in pandas 2.0; on current versions, pd.concat(), covered next, does the same job.)
df_min_2015.append(
pd.Series(
{'playerID': 'keith01',
'G': 0,
'AB': 0,
'H':0,
'HR': 0,
'RBI': 0}, name='200000')).tail()
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
200000 | keith01 | 0 | 0 | 0 | 0 | 0 |
df_min_2015[:5].append(df_min_2015[-5:])
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 |
101189 | thielca01 | 6 | 0 | 0 | 0 | 0 |
100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
df_min_2015[:5].append([df_min_2015[10:12], df_min_2015[-5:]])
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 |
101067 | sanomi01 | 80 | 279 | 75 | 18 | 52 |
100816 | nunezed02 | 72 | 188 | 53 | 4 | 20 |
101189 | thielca01 | 6 | 0 | 0 | 0 | 0 |
100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
The same result can be achieved with pd.concat(), where the default axis is 0.
pd.concat([df_min_2015[:5],
df_min_2015[-5:]], axis=0)
 | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 |
101189 | thielca01 | 6 | 0 | 0 | 0 | 0 |
100917 | polanjo01 | 4 | 10 | 3 | 0 | 1 |
99954 | bernido01 | 4 | 5 | 1 | 0 | 2 |
100564 | keplema01 | 3 | 7 | 1 | 0 | 0 |
100729 | meyeral01 | 2 | 0 | 0 | 0 | 0 |
But we can also use concat() to make a column-wise concatenation by passing axis=1 (columns).
pd.concat([df_min_2015[:5],
df_min_2015[-5:]], axis=1)
 | playerID | G | AB | H | HR | RBI | playerID | G | AB | H | HR | RBI |
---|---|---|---|---|---|---|---|---|---|---|---|---|
99954 | NaN | NaN | NaN | NaN | NaN | NaN | bernido01 | 4 | 5 | 1 | 0 | 2 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 | NaN | NaN | NaN | NaN | NaN | NaN |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 | NaN | NaN | NaN | NaN | NaN | NaN |
100564 | NaN | NaN | NaN | NaN | NaN | NaN | keplema01 | 3 | 7 | 1 | 0 | 0 |
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 | NaN | NaN | NaN | NaN | NaN | NaN |
100729 | NaN | NaN | NaN | NaN | NaN | NaN | meyeral01 | 2 | 0 | 0 | 0 | 0 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 | NaN | NaN | NaN | NaN | NaN | NaN |
100917 | NaN | NaN | NaN | NaN | NaN | NaN | polanjo01 | 4 | 10 | 3 | 0 | 1 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 | NaN | NaN | NaN | NaN | NaN | NaN |
101189 | NaN | NaN | NaN | NaN | NaN | NaN | thielca01 | 6 | 0 | 0 | 0 | 0 |
We can see that the row indices are taken into account in the concatenation: the two frames are aligned (outer-joined) on their index labels, which is why rows missing from one side are filled with NaN. This behavior can be controlled via the join parameter, which we'll largely leave for the reader to explore.
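As a quick, hedged sketch of what join can do (using overlapping slices so the two frames share some index labels; this is not part of the original example), join='inner' keeps only the rows present in both frames:

pd.concat([df_min_2015[:5],
           df_min_2015[3:8]], axis=1, join='inner')   # only the overlapping index labels survive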
One last thing we might want to do in an operation like this is to reset the labels along the concatenation axis. With axis=1, passing ignore_index=True discards the column labels and replaces them with 0, 1, 2, ..., which we can then rename to something more appropriate after the concatenation.
pd.concat([df_min_2015[:5],
df_min_2015[-5:]], axis=1, ignore_index=True)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
99954 | NaN | NaN | NaN | NaN | NaN | NaN | bernido01 | 4 | 5 | 1 | 0 | 2 |
100215 | doziebr01 | 157 | 628 | 148 | 28 | 77 | NaN | NaN | NaN | NaN | NaN | NaN |
100488 | hunteto01 | 139 | 521 | 125 | 22 | 81 | NaN | NaN | NaN | NaN | NaN | NaN |
100564 | NaN | NaN | NaN | NaN | NaN | NaN | keplema01 | 3 | 7 | 1 | 0 | 0 |
100696 | mauerjo01 | 158 | 592 | 157 | 10 | 66 | NaN | NaN | NaN | NaN | NaN | NaN |
100729 | NaN | NaN | NaN | NaN | NaN | NaN | meyeral01 | 2 | 0 | 0 | 0 | 0 |
100915 | plouftr01 | 152 | 573 | 140 | 22 | 86 | NaN | NaN | NaN | NaN | NaN | NaN |
100917 | NaN | NaN | NaN | NaN | NaN | NaN | polanjo01 | 4 | 10 | 3 | 0 | 1 |
101164 | suzukku01 | 131 | 433 | 104 | 5 | 50 | NaN | NaN | NaN | NaN | NaN | NaN |
101189 | NaN | NaN | NaN | NaN | NaN | NaN | thielca01 | 6 | 0 | 0 | 0 | 0 |
Pandas provides the ability to build more complex indices, allowing for highly flexible and natural data access. We will cover the basics through the MultiIndex object and leave the remaining exploration to the reader.
Let's get the players on the Washington Nationals who played 100 or more games in 2015 and 2016.
df_was = df[(df.yearID > 2014) & (df.teamID=='WAS') & (df.G > 99)]
df_was.head()
 | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100193 | desmoia01 | 2015 | 1 | WAS | NL | 156 | 583 | 69 | 136 | 27 | ... | 62.0 | 13.0 | 5.0 | 45 | 187.0 | 0.0 | 3.0 | 6.0 | 4.0 | 9.0 |
100250 | escobyu01 | 2015 | 1 | WAS | NL | 139 | 535 | 75 | 168 | 25 | ... | 56.0 | 2.0 | 2.0 | 45 | 70.0 | 0.0 | 8.0 | 1.0 | 2.0 | 24.0 |
100251 | espinda01 | 2015 | 1 | WAS | NL | 118 | 367 | 59 | 88 | 21 | ... | 37.0 | 5.0 | 2.0 | 33 | 106.0 | 5.0 | 6.0 | 3.0 | 3.0 | 6.0 |
100422 | harpebr03 | 2015 | 1 | WAS | NL | 153 | 521 | 118 | 172 | 38 | ... | 99.0 | 6.0 | 4.0 | 124 | 131.0 | 15.0 | 5.0 | 0.0 | 4.0 | 15.0 |
100950 | ramoswi01 | 2015 | 1 | WAS | NL | 128 | 475 | 41 | 109 | 16 | ... | 68.0 | 0.0 | 0.0 | 21 | 101.0 | 2.0 | 0.0 | 0.0 | 8.0 | 16.0 |
5 rows × 22 columns
One obvious problem: if we were to access this data by player and year, we would have to build a much more involved query, and even more so if we needed to ignore some of the data.

We are going to create a hierarchical index, or MultiIndex, to solve this problem. We'll take the liberty of dropping columns we don't need (teamID, lgID, stint) and reorganizing the index hierarchically.

We will build the MultiIndex from tuples of the data we need, indexing first by player, then by year. To do this we'll just grab all the player IDs and zip them with the years. The raw tuples look something like this:
tuple(
zip(
df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'],
df_was[['playerID','yearID']].sort_values(by='playerID')['yearID']
)
)
(('desmoia01', 2015), ('escobyu01', 2015), ('espinda01', 2015), ('espinda01', 2016), ('harpebr03', 2015), ('harpebr03', 2016), ('murphda08', 2016), ('ramoswi01', 2015), ('ramoswi01', 2016), ('rendoan01', 2016), ('reverbe01', 2016), ('robincl01', 2015), ('robincl01', 2016), ('taylomi02', 2015), ('werthja01', 2016), ('zimmery01', 2016))
# create an index to be used over the data we're interested in
idx = \
pd.MultiIndex.from_tuples(
tuple(
zip(
df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'],
df_was[['playerID','yearID']].sort_values(by='playerID')['yearID']))
)
idx
MultiIndex(levels=[['desmoia01', 'escobyu01', 'espinda01', 'harpebr03', 'murphda08', 'ramoswi01', 'rendoan01', 'reverbe01', 'robincl01', 'taylomi02', 'werthja01', 'zimmery01'], [2015, 2016]], labels=[[0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10, 11], [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]])
Notice that we now have two levels in our row axis (axis 0). We will now use this index to build the hierarchically indexed DataFrame.
# sorting the indices is critical for lining up the data in the tuples
df_was = df_was.sort_values(by=['playerID']).\
set_index(idx).\
drop(['playerID', 'yearID', 'teamID', 'lgID', 'stint'], axis=1)
df_was
 |  | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
desmoia01 | 2015 | 156 | 583 | 69 | 136 | 27 | 2 | 19 | 62.0 | 13.0 | 5.0 | 45 | 187.0 | 0.0 | 3.0 | 6.0 | 4.0 | 9.0 |
escobyu01 | 2015 | 139 | 535 | 75 | 168 | 25 | 1 | 9 | 56.0 | 2.0 | 2.0 | 45 | 70.0 | 0.0 | 8.0 | 1.0 | 2.0 | 24.0 |
espinda01 | 2015 | 118 | 367 | 59 | 88 | 21 | 1 | 13 | 37.0 | 5.0 | 2.0 | 33 | 106.0 | 5.0 | 6.0 | 3.0 | 3.0 | 6.0 |
 | 2016 | 157 | 516 | 66 | 108 | 15 | 0 | 24 | 72.0 | 9.0 | 2.0 | 54 | 174.0 | 12.0 | 20.0 | 7.0 | 4.0 | 4.0 |
harpebr03 | 2015 | 153 | 521 | 118 | 172 | 38 | 1 | 42 | 99.0 | 6.0 | 4.0 | 124 | 131.0 | 15.0 | 5.0 | 0.0 | 4.0 | 15.0 |
 | 2016 | 147 | 506 | 84 | 123 | 24 | 2 | 24 | 86.0 | 21.0 | 10.0 | 108 | 117.0 | 20.0 | 3.0 | 0.0 | 10.0 | 11.0 |
murphda08 | 2016 | 142 | 531 | 88 | 184 | 47 | 5 | 25 | 104.0 | 5.0 | 3.0 | 35 | 57.0 | 10.0 | 8.0 | 0.0 | 8.0 | 4.0 |
ramoswi01 | 2015 | 128 | 475 | 41 | 109 | 16 | 0 | 15 | 68.0 | 0.0 | 0.0 | 21 | 101.0 | 2.0 | 0.0 | 0.0 | 8.0 | 16.0 |
 | 2016 | 131 | 482 | 58 | 148 | 25 | 0 | 22 | 80.0 | 0.0 | 0.0 | 35 | 79.0 | 2.0 | 2.0 | 0.0 | 4.0 | 17.0 |
rendoan01 | 2016 | 156 | 567 | 91 | 153 | 38 | 2 | 20 | 85.0 | 12.0 | 6.0 | 65 | 117.0 | 2.0 | 7.0 | 0.0 | 8.0 | 5.0 |
reverbe01 | 2016 | 103 | 350 | 44 | 76 | 9 | 7 | 2 | 24.0 | 14.0 | 5.0 | 18 | 34.0 | 0.0 | 3.0 | 2.0 | 2.0 | 12.0 |
robincl01 | 2015 | 126 | 309 | 44 | 84 | 15 | 1 | 10 | 34.0 | 0.0 | 0.0 | 37 | 52.0 | 4.0 | 5.0 | 0.0 | 1.0 | 6.0 |
 | 2016 | 104 | 196 | 16 | 46 | 4 | 0 | 5 | 26.0 | 0.0 | 0.0 | 20 | 38.0 | 0.0 | 2.0 | 1.0 | 5.0 | 4.0 |
taylomi02 | 2015 | 138 | 472 | 49 | 108 | 15 | 2 | 14 | 63.0 | 16.0 | 3.0 | 35 | 158.0 | 9.0 | 1.0 | 1.0 | 2.0 | 5.0 |
werthja01 | 2016 | 143 | 525 | 84 | 128 | 28 | 0 | 21 | 69.0 | 5.0 | 1.0 | 71 | 139.0 | 0.0 | 4.0 | 0.0 | 6.0 | 17.0 |
zimmery01 | 2016 | 115 | 427 | 60 | 93 | 18 | 1 | 15 | 46.0 | 4.0 | 1.0 | 29 | 104.0 | 1.0 | 5.0 | 0.0 | 6.0 | 12.0 |
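With the MultiIndex in place, selection with loc[] takes tuples: a partial key such as ('robincl01', ) returns every year for that player, while a full key like ('robincl01', 2016) drills down to a single row.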
df_was.loc[('robincl01', ),['G', 'AB', 'H', 'SO']]
 | G | AB | H | SO |
---|---|---|---|---|
2015 | 126 | 309 | 84 | 52.0 |
2016 | 104 | 196 | 46 | 38.0 |
df_was.loc[('robincl01', 2016),['G', 'AB', 'H', 'SO']]
G     104.0
AB    196.0
H      46.0
SO     38.0
Name: (robincl01, 2016), dtype: float64
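As one more hedged illustration of working with levels (not part of the original walkthrough), the xs() cross-section method selects on a single level of the index; here, every player's 2016 line:

# take a cross-section on the second index level (the year)
df_was.xs(2016, level=1)[['G', 'AB', 'H', 'SO']]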
A deeper hierarchy, for example indexing by year, league, team and then player, can be built in exactly the same way; we leave that construction to the reader.
... on to Part IV: Wrapping Up.