index : the column and row indices of your Series or DataFrame, the index for each of these may be hiearchical
- row index : the index along the horizontal dimension, and typically used as the primary index
- column index : the index along the vertical dimension

axis : the numeric designation for the column or row indices; typically 0 is the column-axis and 1 is the row-axis. When dealing with multi-indices, the hierarchy within the axis are referred to as levels and accessed similarly

NOTEBOOK OBJECTIVES

In this notebook we'll:

explore more complex slicing and selecting,
look at DataFrame concatenation and appending,
explore Multi-Indices / hierarchical indexing in Pandas.

More Selecting¶

In the example for this section, we're going to go back to our Baseball data set and load the batting statistics into a DataFrame.

In [1]:

import pandas as pd

# get the data for players in 2015-16 who played in 100 or more games
df = pd.read_csv("./datasets/Batting.csv")

The convenient `[]` operator (again)¶

As before basic slice selections can be made with the syntax similar to that found in lists using the convenience of the [] operator. For example, obtaining the first 5 rows of our data, or the last 15.

In [2]:

df[:5]

Out[2]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
0	abercda01	1871	1	TRO	NaN	1	4	0	0	0	...	0.0	0.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
1	addybo01	1871	1	RC1	NaN	25	118	30	32	6	...	13.0	8.0	1.0	4	0.0	NaN	NaN	NaN	NaN	NaN
2	allisar01	1871	1	CL1	NaN	29	137	28	40	4	...	19.0	3.0	1.0	2	5.0	NaN	NaN	NaN	NaN	NaN
3	allisdo01	1871	1	WS3	NaN	27	133	28	44	10	...	27.0	1.0	1.0	0	2.0	NaN	NaN	NaN	NaN	NaN
4	ansonca01	1871	1	RC1	NaN	25	120	29	39	11	...	16.0	6.0	2.0	2	1.0	NaN	NaN	NaN	NaN	NaN

5 rows × 22 columns

In [3]:

df[-15:]

Out[3]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
102801	ynoaga01	2016	1	NYN	NL	10	3	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102802	ynoami01	2016	1	CHA	AL	23	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102803	ynoara01	2016	1	COL	NL	3	5	0	0	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	0.0	0.0	0.0
102804	youngch03	2016	1	KCA	AL	34	1	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102805	youngch04	2016	1	BOS	AL	76	203	29	56	18	...	24.0	4.0	2.0	21	50.0	0.0	3.0	0.0	0.0	4.0
102806	younger03	2016	1	NYA	AL	6	1	2	0	0	...	0.0	1.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102807	youngma03	2016	1	ATL	NL	8	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102808	zastrro01	2016	1	CHN	NL	8	3	0	0	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	0.0	0.0	0.0
102809	zieglbr01	2016	1	ARI	NL	36	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102810	zieglbr01	2016	2	BOS	AL	33	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
102811	zimmejo02	2016	1	DET	AL	19	4	0	1	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	1.0	0.0	0.0
102812	zimmery01	2016	1	WAS	NL	115	427	60	93	18	...	46.0	4.0	1.0	29	104.0	1.0	5.0	0.0	6.0	12.0
102813	zobribe01	2016	1	CHN	NL	147	523	94	142	31	...	76.0	6.0	4.0	96	82.0	6.0	4.0	4.0	4.0	17.0
102814	zuninmi01	2016	1	SEA	AL	55	164	16	34	7	...	31.0	0.0	0.0	21	65.0	0.0	6.0	0.0	1.0	0.0
102815	zychto01	2016	1	SEA	AL	12	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0

15 rows × 22 columns

We mostly worked on row slicing with the [] selector, but if we pass a column label or list of the columns we'd like, say the RBI and G (games played) data, we get mostly what we'd expect:

In [ ]:

df["RBI"][:5]

In [ ]:

df[["RBI", "G"]][:10]

Selecting data by `.` selector on column and index name¶

We can obtain column data by column labels (note that the column index was loaded for us when we read the file into the DataFrame). For example to get all the RBI data:

In [4]:

df.RBI[:10]

Out[4]:

0     0.0
1    13.0
2    19.0
3    27.0
4    16.0
5     5.0
6     2.0
7    34.0
8     1.0
9    11.0
Name: RBI, dtype: float64

Similarly, we can pass a list of the columns we'd like, so let's get the RBI and G (games played) data:

In [5]:

df[["RBI", "G"]][:10]

Out[5]:

	RBI	G
0	0.0	1
1	13.0	25
2	19.0	29
3	27.0	27
4	16.0	25
5	5.0	12
6	2.0	1
7	34.0	31
8	1.0	1
9	11.0	18

Boolean selecting¶

We have yet to make more complex selections beyond index values. Now we're ready to introduce selecting by boolean value. With this kinds of selection, we're going to as Pandas to give us the Series or DataFrame that represents the boolean values of what we want, then we will allow iloc to reduce the resulting Series or DataFrame to what we're looking for. Let's see this in action.

Say we want to find all items in our DataFrame where yearID is 2015 or

df.yearID == 2015

Let's first see what this does.

In [6]:

df.yearID == 2015

Out[6]:

0         False
1         False
2         False
3         False
4         False
5         False
6         False
7         False
8         False
9         False
10        False
11        False
12        False
13        False
14        False
15        False
16        False
17        False
18        False
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27        False
28        False
29        False
          ...  
102786    False
102787    False
102788    False
102789    False
102790    False
102791    False
102792    False
102793    False
102794    False
102795    False
102796    False
102797    False
102798    False
102799    False
102800    False
102801    False
102802    False
102803    False
102804    False
102805    False
102806    False
102807    False
102808    False
102809    False
102810    False
102811    False
102812    False
102813    False
102814    False
102815    False
Name: yearID, Length: 102816, dtype: bool

We're returned the Series that contains a True or False given our boolean query. We need now pass this boolean Series into loc and we will see the outcome.

In [7]:

df.loc[df.yearID == 2015][:10] # note we're restricting the return to just the first 10 values

Out[7]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
99847	aardsda01	2015	1	ATL	NL	33	1	0	0	0	...	0.0	0.0	0.0	0	1.0	0.0	0.0	0.0	0.0	0.0
99848	abadfe01	2015	1	OAK	AL	62	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
99849	abreujo02	2015	1	CHA	AL	154	613	88	178	34	...	101.0	0.0	0.0	39	140.0	11.0	15.0	0.0	1.0	16.0
99850	achteaj01	2015	1	MIN	AL	11	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
99851	ackledu01	2015	1	SEA	AL	85	186	22	40	8	...	19.0	2.0	2.0	14	38.0	0.0	1.0	3.0	3.0	3.0
99852	ackledu01	2015	2	NYA	AL	23	52	6	15	3	...	11.0	0.0	0.0	4	7.0	0.0	0.0	0.0	1.0	0.0
99853	adamecr01	2015	1	COL	NL	26	53	4	13	1	...	3.0	0.0	1.0	3	11.0	1.0	1.0	1.0	0.0	0.0
99854	adamsau01	2015	1	CLE	AL	28	1	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	1.0
99855	adamsma01	2015	1	SLN	NL	60	175	14	42	9	...	24.0	1.0	0.0	10	41.0	1.0	0.0	0.0	1.0	1.0
99856	adcocna01	2015	1	CIN	NL	13	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0

10 rows × 22 columns

Now what if we wanted the restrict this further by team. Say we wanted to see only the Minesota Twins player data for 2015. That is

df.yearID == 2015
AND
df.teamID == "MIN"

We simply put these in parethesis and use the & operator.

In [8]:

df.loc[(df.yearID == 2015) & (df.teamID == "MIN")].head(10)

Out[8]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
99850	achteaj01	2015	1	MIN	AL	11	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
99891	arciaos01	2015	1	MIN	AL	19	58	6	16	0	...	8.0	0.0	0.0	4	15.0	4.0	2.0	0.0	1.0	2.0
99954	bernido01	2015	1	MIN	AL	4	5	1	1	1	...	2.0	0.0	0.0	1	3.0	0.0	0.0	0.0	0.0	0.0
99988	boyerbl01	2015	1	MIN	AL	68	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
100030	buxtoby01	2015	1	MIN	AL	46	129	16	27	7	...	6.0	2.0	2.0	6	44.0	0.0	1.0	2.0	0.0	1.0
100139	cottsne01	2015	2	MIN	AL	17	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
100215	doziebr01	2015	1	MIN	AL	157	628	101	148	39	...	77.0	12.0	4.0	61	148.0	2.0	7.0	0.0	8.0	10.0
100221	duensbr01	2015	1	MIN	AL	55	1	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
100222	duffety01	2015	1	MIN	AL	10	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
100249	escobed01	2015	1	MIN	AL	127	409	48	107	31	...	58.0	2.0	3.0	28	86.0	1.0	2.0	2.0	5.0	7.0

10 rows × 22 columns

Now what if we wanted to restrict a subset of columns. This is easy with iloc[] ... we will just use our boolean expression as above for the row selection and then the list of columns for our column selection (in this case a much smaller subset of data).

In [9]:

df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
       ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]

Out[9]:

	playerID	G	AB	H	HR	RBI
99850	achteaj01	11	0	0	0	0.0
99891	arciaos01	19	58	16	2	8.0
99954	bernido01	4	5	1	0	2.0
99988	boyerbl01	68	0	0	0	0.0
100030	buxtoby01	46	129	27	2	6.0
100139	cottsne01	17	0	0	0	0.0
100215	doziebr01	157	628	148	28	77.0
100221	duensbr01	55	1	0	0	0.0
100222	duffety01	10	0	0	0	0.0
100249	escobed01	127	409	107	12	58.0
100270	fienca01	62	0	0	0	0.0
100302	fryerer01	15	22	5	0	2.0
100333	gibsoky01	32	5	1	0	0.0
100373	grahajr01	39	0	0	0	0.0
100455	herrmch01	45	103	15	2	10.0
100459	hicksaa01	97	352	90	11	33.0
100486	hugheph01	27	3	0	0	0.0
100488	hunteto01	139	521	125	22	81.0
100521	jepseke01	29	0	0	0	0.0
100564	keplema01	3	7	1	0	0.0
100696	mauerjo01	158	592	157	10	66.0
100701	maytr01	48	3	0	0	0.0
100729	meyeral01	2	0	0	0	0.0
100737	milonto01	24	2	0	0	0.0
100807	nolasri01	9	3	0	0	0.0
100816	nunezed02	72	188	53	4	20.0
100837	orourry01	28	0	0	0	0.0
100872	pelfrmi01	30	3	2	0	0.0
100895	perkigl01	60	0	0	0	0.0
100915	plouftr01	152	573	140	22	86.0
100917	polanjo01	4	10	3	0	1.0
100925	pressry01	27	0	0	0	0.0
100994	robinsh01	83	180	45	0	16.0
101023	rosared01	122	453	121	13	50.0
101067	sanomi01	80	279	75	18	52.0
101069	santada01	91	261	56	0	21.0
101072	santaer01	17	0	0	0	0.0
101079	schafjo02	27	69	15	0	5.0
101144	staufti01	13	0	0	0	0.0
101164	suzukku01	131	433	104	5	50.0
101189	thielca01	6	0	0	0	0.0
101193	thompaa01	41	0	0	0	0.0
101203	tonkimi01	26	0	0	0	0.0
101240	vargake01	58	175	42	5	17.0

Sorting¶

Sorting is facilitated by the sort_values() method. By default, sorting is done in ascending order, specify the parameter ascending=False to get descending order.

In [10]:

df_min_2015 = df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
                     ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\
            .sort_values('G', ascending=False)
df_min_2015.head(20)

Out[10]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66.0
100215	doziebr01	157	628	148	28	77.0
100915	plouftr01	152	573	140	22	86.0
100488	hunteto01	139	521	125	22	81.0
101164	suzukku01	131	433	104	5	50.0
100249	escobed01	127	409	107	12	58.0
101023	rosared01	122	453	121	13	50.0
100459	hicksaa01	97	352	90	11	33.0
101069	santada01	91	261	56	0	21.0
100994	robinsh01	83	180	45	0	16.0
101067	sanomi01	80	279	75	18	52.0
100816	nunezed02	72	188	53	4	20.0
99988	boyerbl01	68	0	0	0	0.0
100270	fienca01	62	0	0	0	0.0
100895	perkigl01	60	0	0	0	0.0
101240	vargake01	58	175	42	5	17.0
100221	duensbr01	55	1	0	0	0.0
100701	maytr01	48	3	0	0	0.0
100030	buxtoby01	46	129	27	2	6.0
100455	herrmch01	45	103	15	2	10.0

We may also do a multi-sort by passing in the list of columns we want sorted. This will sort in the order of the columns provided. For example,

In [11]:

df.loc[(df.yearID == 2015) & (df.teamID == "MIN"),\
        ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\
        .sort_values(['G', 'HR'], ascending=False).tail()

Out[11]:

	playerID	G	AB	H	RBI
101189	thielca01	6	0	0	0.0
99954	bernido01	4	5	1	2.0
100917	polanjo01	4	10	3	1.0
100564	keplema01	3	7	1	0.0
100729	meyeral01	2	0	0	0.0

DataFrame manipulation¶

Adding and dropping columns¶

In [12]:

df_min_2015.loc[:,'HtoAB'] = 0
df_min_2015.head()

Out[12]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66.0
100215	doziebr01	157	628	148	28	77.0
100915	plouftr01	152	573	140	22	86.0
100488	hunteto01	139	521	125	22	81.0
101164	suzukku01	131	433	104	5	50.0

In [13]:

df_min_2015 = df_min_2015.drop('HtoAB', axis=1)
df_min_2015.head()

Out[13]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66.0
100215	doziebr01	157	628	148	28	77.0
100915	plouftr01	152	573	140	22	86.0
100488	hunteto01	139	521	125	22	81.0
101164	suzukku01	131	433	104	5	50.0

In [14]:

df_min_2015.H.head(10)

Out[14]:

100696    157
100215    148
100915    140
100488    125
101164    104
100249    107
101023    121
100459     90
101069     56
100994     45
Name: H, dtype: int64

In [15]:

df_min_2015.loc[:,'HtoAB'] = 0
df_min_2015.loc[:,'HtoAB'] = [v.H/v.AB 
                              if v.AB > 0 else 0 
                              for r, v in df_min_2015.iterrows()]

In [16]:

df_min_2015.head(10)

Out[16]:

	playerID	G	AB	H	HR	RBI	HtoAB
100696	mauerjo01	158	592	157	10	66.0	0.265203
100215	doziebr01	157	628	148	28	77.0	0.235669
100915	plouftr01	152	573	140	22	86.0	0.244328
100488	hunteto01	139	521	125	22	81.0	0.239923
101164	suzukku01	131	433	104	5	50.0	0.240185
100249	escobed01	127	409	107	12	58.0	0.261614
101023	rosared01	122	453	121	13	50.0	0.267108
100459	hicksaa01	97	352	90	11	33.0	0.255682
101069	santada01	91	261	56	0	21.0	0.214559
100994	robinsh01	83	180	45	0	16.0	0.250000

In [17]:

df_min_2015[df_min_2015.G>80].sort_values('HtoAB', ascending=False)

Out[17]:

	playerID	G	AB	H	HR	RBI	HtoAB
101023	rosared01	122	453	121	13	50.0	0.267108
100696	mauerjo01	158	592	157	10	66.0	0.265203
100249	escobed01	127	409	107	12	58.0	0.261614
100459	hicksaa01	97	352	90	11	33.0	0.255682
100994	robinsh01	83	180	45	0	16.0	0.250000
100915	plouftr01	152	573	140	22	86.0	0.244328
101164	suzukku01	131	433	104	5	50.0	0.240185
100488	hunteto01	139	521	125	22	81.0	0.239923
100215	doziebr01	157	628	148	28	77.0	0.235669
101069	santada01	91	261	56	0	21.0	0.214559

In [18]:

df_min_2015 = df_min_2015.reindex(columns=['playerID', 'HtoAB',  'AB', 'H', 'HR', 'RBI', 'G'])
df_min_2015.head()

Out[18]:

	playerID	HtoAB	AB	H	HR	RBI	G
100696	mauerjo01	0.265203	592	157	10	66.0	158
100215	doziebr01	0.235669	628	148	28	77.0	157
100915	plouftr01	0.244328	573	140	22	86.0	152
100488	hunteto01	0.239923	521	125	22	81.0	139
101164	suzukku01	0.240185	433	104	5	50.0	131

Finally, we can return our DataFrame back to its original columns (and order) by reindexing again. Notice, also that we can effectively perform a drop() by doing this, though the syntax with reindex() is more verbose.

In [19]:

df_min_2015 = df_min_2015.reindex(columns=['playerID', 'G',  'AB', 'H', 'HR', 'RBI'])
df_min_2015.head()

Out[19]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66.0
100215	doziebr01	157	628	148	28	77.0
100915	plouftr01	152	573	140	22	86.0
100488	hunteto01	139	521	125	22	81.0
101164	suzukku01	131	433	104	5	50.0

Adding and dropping rows¶

Adding rows can be achieved using loc[] and setting the new index to a dictionary of values using the column labels as keys.

In [20]:

df_min_2015.loc[200000] = \
    {   'playerID': 'keith01',
        'RBI': '0',
        'G': '0',
        'H': '0',
        'HR': '0',
        'AB': '0' }
    
df_min_2015.tail()

Out[20]:

	playerID	G	AB	H	RBI
100917	polanjo01	4	10	3	1
99954	bernido01	4	5	1	2
100564	keplema01	3	7	1	0
100729	meyeral01	2	0	0	0
200000	keith01	0	0	0	0

It is also the same with lists and tuples.

In [21]:

df_min_2015.loc[200000] = ('keith01', 1, 1, 1, 1, 1)
df_min_2015.loc[200001] = ['keith02', 1, 1, 1, 1, 1]

df_min_2015.tail()

Out[21]:

	playerID	G	AB	H	HR	RBI
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0
200000	keith01	1	1	1	1	1
200001	keith02	1	1	1	1	1

Note that we can drop a number of rows at a time by passing a list of the indices we'd like dropped.

In [22]:

df_min_2015 = df_min_2015.drop([200000, 200001], axis=0)
df_min_2015.tail()

Out[22]:

	playerID	G	AB	H	RBI
101189	thielca01	6	0	0	0
100917	polanjo01	4	10	3	1
99954	bernido01	4	5	1	2
100564	keplema01	3	7	1	0
100729	meyeral01	2	0	0	0

Similar results can be achieved using append(). With append, you can append, Series, DataFrames and/or a list of these.

In [23]:

df_min_2015.append(
    pd.Series( 
     {'playerID': 'keith01', 
             'G': 0, 
             'AB': 0, 
             'H':0, 
             'HR': 0, 
             'RBI': 0}, name='200000')).tail()

Out[23]:

	playerID	G	AB	H	RBI
100917	polanjo01	4	10	3	1
99954	bernido01	4	5	1	2
100564	keplema01	3	7	1	0
100729	meyeral01	2	0	0	0
200000	keith01	0	0	0	0

In [24]:

df_min_2015[:5].append(df_min_2015[-5:])

Out[24]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66
100215	doziebr01	157	628	148	28	77
100915	plouftr01	152	573	140	22	86
100488	hunteto01	139	521	125	22	81
101164	suzukku01	131	433	104	5	50
101189	thielca01	6	0	0	0	0
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0

In [25]:

df_min_2015[:5].append([df_min_2015[10:12], df_min_2015[-5:]])

Out[25]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66
100215	doziebr01	157	628	148	28	77
100915	plouftr01	152	573	140	22	86
100488	hunteto01	139	521	125	22	81
101164	suzukku01	131	433	104	5	50
101067	sanomi01	80	279	75	18	52
100816	nunezed02	72	188	53	4	20
101189	thielca01	6	0	0	0	0
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0

The same result can be achieved with pd.concat(), where the defaut axis is 0.

In [26]:

pd.concat([df_min_2015[:5], 
          df_min_2015[-5:]], axis=0)

Out[26]:

	playerID	G	AB	H	HR	RBI
100696	mauerjo01	158	592	157	10	66
100215	doziebr01	157	628	148	28	77
100915	plouftr01	152	573	140	22	86
100488	hunteto01	139	521	125	22	81
101164	suzukku01	131	433	104	5	50
101189	thielca01	6	0	0	0	0
100917	polanjo01	4	10	3	0	1
99954	bernido01	4	5	1	0	2
100564	keplema01	3	7	1	0	0
100729	meyeral01	2	0	0	0	0

But we can use concat() to make a column-wise concatenation using axis=1 (columns).

In [27]:

pd.concat([df_min_2015[:5], 
          df_min_2015[-5:]], axis=1)

Out[27]:

	playerID	G	AB	H	HR	RBI	playerID	G	AB	H	HR	RBI
99954	NaN	NaN	NaN	NaN	NaN	NaN	bernido01	4	5	1	0	2
100215	doziebr01	157	628	148	28	77	NaN	NaN	NaN	NaN	NaN	NaN
100488	hunteto01	139	521	125	22	81	NaN	NaN	NaN	NaN	NaN	NaN
100564	NaN	NaN	NaN	NaN	NaN	NaN	keplema01	3	7	1	0	0
100696	mauerjo01	158	592	157	10	66	NaN	NaN	NaN	NaN	NaN	NaN
100729	NaN	NaN	NaN	NaN	NaN	NaN	meyeral01	2	0	0	0	0
100915	plouftr01	152	573	140	22	86	NaN	NaN	NaN	NaN	NaN	NaN
100917	NaN	NaN	NaN	NaN	NaN	NaN	polanjo01	4	10	3	0	1
101164	suzukku01	131	433	104	5	50	NaN	NaN	NaN	NaN	NaN	NaN
101189	NaN	NaN	NaN	NaN	NaN	NaN	thielca01	6	0	0	0	0

We can see that the indices are being considered in the concatenation and row indices are being joined. This behavior can be controlled via the join parameter, which we'll leave for the reader to explore.

One last thing we might want to do in an operation like this is to reset the index. To do so, we might start with ignoring the column index using the ignore_index=True so we can set it later to something more appropriate after the concatenation.

In [28]:

pd.concat([df_min_2015[:5], 
          df_min_2015[-5:]], axis=1, ignore_index=True)

Out[28]:

	0	1	2	3	4	5	6	7	8	9	10	11
99954	NaN	NaN	NaN	NaN	NaN	NaN	bernido01	4	5	1	0	2
100215	doziebr01	157	628	148	28	77	NaN	NaN	NaN	NaN	NaN	NaN
100488	hunteto01	139	521	125	22	81	NaN	NaN	NaN	NaN	NaN	NaN
100564	NaN	NaN	NaN	NaN	NaN	NaN	keplema01	3	7	1	0	0
100696	mauerjo01	158	592	157	10	66	NaN	NaN	NaN	NaN	NaN	NaN
100729	NaN	NaN	NaN	NaN	NaN	NaN	meyeral01	2	0	0	0	0
100915	plouftr01	152	573	140	22	86	NaN	NaN	NaN	NaN	NaN	NaN
100917	NaN	NaN	NaN	NaN	NaN	NaN	polanjo01	4	10	3	0	1
101164	suzukku01	131	433	104	5	50	NaN	NaN	NaN	NaN	NaN	NaN
101189	NaN	NaN	NaN	NaN	NaN	NaN	thielca01	6	0	0	0	0

Advanced indexing¶

Pandas provides the ability to build more complex indices allowing for highly flexible and natural data access.

We will cover the basics of through the MultiIndex object and will the the remaining exploration to the reader.

Let's get the players on the Washington Nationals who played 100 or more games in 2015 and 2016.

In [29]:

df_was = df[(df.yearID > 2014) & (df.teamID=='WAS') & (df.G > 99)]

In [30]:

df_was.head()

Out[30]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
100193	desmoia01	2015	1	WAS	NL	156	583	69	136	27	...	62.0	13.0	5.0	45	187.0	0.0	3.0	6.0	4.0	9.0
100250	escobyu01	2015	1	WAS	NL	139	535	75	168	25	...	56.0	2.0	2.0	45	70.0	0.0	8.0	1.0	2.0	24.0
100251	espinda01	2015	1	WAS	NL	118	367	59	88	21	...	37.0	5.0	2.0	33	106.0	5.0	6.0	3.0	3.0	6.0
100422	harpebr03	2015	1	WAS	NL	153	521	118	172	38	...	99.0	6.0	4.0	124	131.0	15.0	5.0	0.0	4.0	15.0
100950	ramoswi01	2015	1	WAS	NL	128	475	41	109	16	...	68.0	0.0	0.0	21	101.0	2.0	0.0	0.0	8.0	16.0

5 rows × 22 columns

One obvious problem if we were to access the data here by player and year, we have to build a much more involved query and even more so if we needed to ignore data.

We are going to create a hierarchical index or MultiIndex to solve this problem. We'll take take liberty to drop columns we don't need (teamID, ldID, stint) and reorganize the index hierarchically.

We will use MultiIndex using a tuple of the data we need and provide the index first by player, then by year. To do this we'll just grab all the player IDs and zip them with the year. This will look something like this:

In [31]:

tuple(
zip(
    df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'],
    df_was[['playerID','yearID']].sort_values(by='playerID')['yearID']
)
)

Out[31]:

(('desmoia01', 2015),
 ('escobyu01', 2015),
 ('espinda01', 2015),
 ('espinda01', 2016),
 ('harpebr03', 2015),
 ('harpebr03', 2016),
 ('murphda08', 2016),
 ('ramoswi01', 2015),
 ('ramoswi01', 2016),
 ('rendoan01', 2016),
 ('reverbe01', 2016),
 ('robincl01', 2015),
 ('robincl01', 2016),
 ('taylomi02', 2015),
 ('werthja01', 2016),
 ('zimmery01', 2016))

In [32]:

# create an index to be used over the data we're interested in
idx = \
    pd.MultiIndex.from_tuples(
        tuple(
            zip(
                df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'],
                df_was[['playerID','yearID']].sort_values(by='playerID')['yearID']))
    )
idx

Out[32]:

MultiIndex(levels=[['desmoia01', 'escobyu01', 'espinda01', 'harpebr03', 'murphda08', 'ramoswi01', 'rendoan01', 'reverbe01', 'robincl01', 'taylomi02', 'werthja01', 'zimmery01'], [2015, 2016]],
           labels=[[0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10, 11], [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]])

Notice now that we have two levels in our row axis (axis 0) and we will now use that index to build the hierachically indexed DataFrame.

In [33]:

# sorting the indices is critical for lining up the data in the tuples
df_was = df_was.sort_values(by=['playerID']).\
            set_index(idx).\
            drop(['playerID', 'yearID', 'teamID', 'lgID', 'stint'], axis=1)
df_was

Out[33]:

		G	AB	R	H	2B	3B	HR	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
desmoia01	2015	156	583	69	136	27	2	19	62.0	13.0	5.0	45	187.0	0.0	3.0	6.0	4.0	9.0
escobyu01	2015	139	535	75	168	25	1	9	56.0	2.0	2.0	45	70.0	0.0	8.0	1.0	2.0	24.0
espinda01	2015	118	367	59	88	21	1	13	37.0	5.0	2.0	33	106.0	5.0	6.0	3.0	3.0	6.0
espinda01	2016	157	516	66	108	15	0	24	72.0	9.0	2.0	54	174.0	12.0	20.0	7.0	4.0	4.0
harpebr03	2015	153	521	118	172	38	1	42	99.0	6.0	4.0	124	131.0	15.0	5.0	0.0	4.0	15.0
harpebr03	2016	147	506	84	123	24	2	24	86.0	21.0	10.0	108	117.0	20.0	3.0	0.0	10.0	11.0
murphda08	2016	142	531	88	184	47	5	25	104.0	5.0	3.0	35	57.0	10.0	8.0	0.0	8.0	4.0
ramoswi01	2015	128	475	41	109	16	0	15	68.0	0.0	0.0	21	101.0	2.0	0.0	0.0	8.0	16.0
ramoswi01	2016	131	482	58	148	25	0	22	80.0	0.0	0.0	35	79.0	2.0	2.0	0.0	4.0	17.0
rendoan01	2016	156	567	91	153	38	2	20	85.0	12.0	6.0	65	117.0	2.0	7.0	0.0	8.0	5.0
reverbe01	2016	103	350	44	76	9	7	2	24.0	14.0	5.0	18	34.0	0.0	3.0	2.0	2.0	12.0
robincl01	2015	126	309	44	84	15	1	10	34.0	0.0	0.0	37	52.0	4.0	5.0	0.0	1.0	6.0
robincl01	2016	104	196	16	46	4	0	5	26.0	0.0	0.0	20	38.0	0.0	2.0	1.0	5.0	4.0
taylomi02	2015	138	472	49	108	15	2	14	63.0	16.0	3.0	35	158.0	9.0	1.0	1.0	2.0	5.0
werthja01	2016	143	525	84	128	28	0	21	69.0	5.0	1.0	71	139.0	0.0	4.0	0.0	6.0	17.0
zimmery01	2016	115	427	60	93	18	1	15	46.0	4.0	1.0	29	104.0	1.0	5.0	0.0	6.0	12.0

In [34]:

df_was.loc[('robincl01', ),['G', 'AB', 'H', 'SO']]

Out[34]:

	G	AB	H	SO
2015	126	309	84	52.0
2016	104	196	46	38.0

In [35]:

df_was.loc[('robincl01', 2016),['G', 'AB', 'H', 'SO']]

Out[35]:

G     104.0
AB    196.0
H      46.0
SO     38.0
Name: (robincl01, 2016), dtype: float64

For the sake of the example, let's take the DataFrame for all rows of data past 2016 and create a multi-index using year, league, team and player as the groupings of the index.

In [36]:

df.head()

Out[36]:

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
0	abercda01	1871	1	TRO	NaN	1	4	0	0	0	...	0.0	0.0	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN
1	addybo01	1871	1	RC1	NaN	25	118	30	32	6	...	13.0	8.0	1.0	4	0.0	NaN	NaN	NaN	NaN	NaN
2	allisar01	1871	1	CL1	NaN	29	137	28	40	4	...	19.0	3.0	1.0	2	5.0	NaN	NaN	NaN	NaN	NaN
3	allisdo01	1871	1	WS3	NaN	27	133	28	44	10	...	27.0	1.0	1.0	0	2.0	NaN	NaN	NaN	NaN	NaN
4	ansonca01	1871	1	RC1	NaN	25	120	29	39	11	...	16.0	6.0	2.0	2	1.0	NaN	NaN	NaN	NaN	NaN

5 rows × 22 columns

In [37]:

df_mi = df[df.yearID>2006].copy()
idx_labels = ['yearID', 'lgID', 'teamID', 'playerID']

tuple(
    zip(
        df_mi[idx_labels]\
            .sort_values(idx_labels)['yearID'],

        df_mi[idx_labels]\
            .sort_values(idx_labels)['lgID'],

        df_mi[idx_labels]\
            .sort_values(idx_labels)['teamID'],

        df_mi[idx_labels]\
            .sort_values(idx_labels)['playerID']))[-10:]

Out[37]:

((2016, 'NL', 'WAS', 'rzepcma01'),
 (2016, 'NL', 'WAS', 'scherma01'),
 (2016, 'NL', 'WAS', 'severpe01'),
 (2016, 'NL', 'WAS', 'solissa01'),
 (2016, 'NL', 'WAS', 'strasst01'),
 (2016, 'NL', 'WAS', 'taylomi02'),
 (2016, 'NL', 'WAS', 'treinbl01'),
 (2016, 'NL', 'WAS', 'turnetr01'),
 (2016, 'NL', 'WAS', 'werthja01'),
 (2016, 'NL', 'WAS', 'zimmery01'))

In [38]:

idx = \
    pd.MultiIndex.from_tuples(
        tuple(
            zip(
                df_mi[idx_labels]\
                    .sort_values(idx_labels)['yearID'],
                
                df_mi[idx_labels]\
                    .sort_values(idx_labels)['lgID'],
                
                df_mi[idx_labels]\
                    .sort_values(idx_labels)['teamID'],
                
                df_mi[idx_labels]\
                    .sort_values(idx_labels)['playerID']))
    )

In [39]:

df_mi = df_mi.sort_values(['yearID', 'teamID']).set_index(idx)#.drop(['playerID', 'yearID', 'teamID', 'stint'], axis=1)

In [40]:

df_mi.head()

Out[40]:

				playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
2007	AL	BAL	baezda01	bardebr01	2007	1	ARI	NL	8	12	0	1	0	...	0.0	0.0	0.0	0	3.0	0.0	0.0	0.0	0.0	0.0
			bakopa01	bonifem01	2007	1	ARI	NL	11	23	2	5	1	...	2.0	0.0	1.0	4	3.0	0.0	0.0	0.0	0.0	0.0
			bedarer01	byrneer01	2007	1	ARI	NL	160	626	103	179	30	...	83.0	50.0	7.0	57	98.0	5.0	10.0	1.0	4.0	12.0
			bellro01	callaal01	2007	1	ARI	NL	56	144	10	31	8	...	7.0	1.0	1.0	9	14.0	0.0	1.0	1.0	1.0	8.0
			birkiku01	choatra01	2007	1	ARI	NL	2	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 22 columns

In [41]:

df_mi.tail()

Out[41]:

				playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	RBI	SB	CS	BB	SO	IBB	HBP	SF	GIDP
2016	NL	WAS	taylomi02	taylomi02	2016	1	WAS	NL	76	221	28	51	11	...	16.0	14.0	3.0	14	77.0	0.0	1.0	1.0	2.0
			treinbl01	treinbl01	2016	1	WAS	NL	73	0	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0
			turnetr01	turnetr01	2016	1	WAS	NL	73	307	53	105	14	...	40.0	33.0	6.0	14	59.0	0.0	1.0	2.0	1.0
			werthja01	werthja01	2016	1	WAS	NL	143	525	84	128	28	...	69.0	5.0	1.0	71	139.0	0.0	4.0	6.0	17.0
			zimmery01	zimmery01	2016	1	WAS	NL	115	427	60	93	18	...	46.0	4.0	1.0	29	104.0	1.0	5.0	6.0	12.0

5 rows × 22 columns

Now we can use this multi-index to out advantage, using the tuple of the index values we want and restricting the columns to just the data of interest.

In [42]:

df_mi.loc[(2007, 'AL', 'TOR'), ['G', 'AB']].head()

Out[42]:

	G	AB
accarje01	152	509
adamsru01	62	1
banksjo01	26	5
burneaj01	8	14
chacigu01	65	0

196 KiB Raw Blame History

Table of Contents¶

Manipulating DataFrames¶

More Selecting¶

The convenient [] operator (again)¶

Selecting data by . selector on column and index name¶

Boolean selecting¶

Sorting¶

DataFrame manipulation¶

Adding and dropping columns¶

Adding and dropping rows¶

Advanced indexing¶

196 KiB

Raw Blame History

The convenient `[]` operator (again)¶

Selecting data by `.` selector on column and index name¶