diff --git a/nb/2_importing_data.ipynb b/nb/2_importing_data.ipynb new file mode 100644 index 0000000..946af15 --- /dev/null +++ b/nb/2_importing_data.ipynb @@ -0,0 +1,5578 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**NAVIGATION**\n", + "\n", + "**Got Pandas? _Practical Data Wrangling with Pandas_**\n", + "\n", + "* [Introduction](./0_introduction.ipynb)\n", + "1. [Data Structures](./1_data_structures.ipynb)\n", + "2. **Importing Data**\n", + "3. [Manipulating DataFrames](./3_dataframe_operations.ipynb)\n", + "4. [Wrap Up](./4_wrapping_up.ipynb)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "toc": "true" + }, + "source": [ + "# Table of Contents\n", + "

1  Importing Data in Pandas
1.1  Importing Pandas
1.2  Loading CSV and Excel
1.2.1  CSV
1.2.2  Accessing column data by label
1.3  Excel
1.4  JSON
1.5  SQL
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Importing Data in Pandas\n", + "\n", + "Pandas supports a number of data formats out of the box, including:\n", + "\n", + "* CSV, Excel\n", + "* JSON\n", + "* HDF5\n", + "* SQL databases\n", + "* and others\n", + "\n", + "The major benefit of using Pandas to load these data is that it provides a simple, consistent mechanism for each format and loads the data directly into a Pandas DataFrame in a single operation, so there is less need for separate tools that do the same job with more code or overhead.\n", + "\n", + "Pandas I/O can load these formats directly from local storage or from a URL that serves such data. Conveniently, the same resource string can be either a local/network file path or a URL.\n", + "\n", + "**NOTEBOOK OBJECTIVES**\n", + "\n", + "In this notebook we'll:\n", + "\n", + "* load a local and a remote CSV file,\n", + "* load an Excel data file,\n", + "* load JSON data,\n", + "* load data via SQL queries." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Importing Pandas\n", + "\n", + "You will most often load the Pandas library with the following line:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading CSV and Excel\n", + "\n", + "### CSV\n", + "\n", + "CSV files are still a staple among data file formats. They're portable, flexible, flat, usually easy to parse, and ubiquitous. 
We will begin by showing how to use Pandas to load CSV data directly into a DataFrame.\n", + "\n", + "**DATA SOURCE**\n", + "\n", + "US Baseball Statistics Archive by Sean Lahman (CC BY-SA 3.0):\n", + "\n", + "* [http://seanlahman.com/baseball-archive/statistics/](http://seanlahman.com/baseball-archive/statistics/)\n", + "* [https://github.com/chadwickbureau/baseballdatabank](https://github.com/chadwickbureau/baseballdatabank)\n", + "\n", + "We have put the [batting dataset](./datasets/Batting.csv) into our local `datasets` folder.\n", + "\n", + "To load this into a Pandas DataFrame, we use the [`read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv) function, which attempts to parse the CSV data directly into a DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df = pd.read_csv(\"./datasets/Batting.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we inspect this DataFrame, we get exactly what we expect: each row corresponds to a line in the file. __NOTE__: where values are missing, Pandas automatically fills them with `NaN`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
0abercda0118711TRONaN14000...0.00.00.000.0NaNNaNNaNNaNNaN
1addybo0118711RC1NaN2511830326...13.08.01.040.0NaNNaNNaNNaNNaN
2allisar0118711CL1NaN2913728404...19.03.01.025.0NaNNaNNaNNaNNaN
3allisdo0118711WS3NaN27133284410...27.01.01.002.0NaNNaNNaNNaNNaN
4ansonca0118711RC1NaN25120293911...16.06.02.021.0NaNNaNNaNNaNNaN
5armstbo0118711FW1NaN12499112...5.00.01.001.0NaNNaNNaNNaNNaN
6barkeal0118711RC1NaN14010...2.00.00.010.0NaNNaNNaNNaNNaN
7barnero0118711BS1NaN31157666310...34.011.06.0131.0NaNNaNNaNNaNNaN
8barrebi0118711FW1NaN15111...1.00.00.000.0NaNNaNNaNNaNNaN
9barrofr0118711BS1NaN188613132...11.01.00.000.0NaNNaNNaNNaNNaN
10bassjo0118711CL1NaN228918271...18.00.01.034.0NaNNaNNaNNaNNaN
11battijo0118711CL1NaN13000...0.00.00.010.0NaNNaNNaNNaNNaN
12bealsto0118711WS3NaN1036670...1.02.00.020.0NaNNaNNaNNaNNaN
13beaveed0118711TRONaN315760...5.02.00.000.0NaNNaNNaNNaNNaN
14bechtge0118711PH1NaN209424339...21.04.00.022.0NaNNaNNaNNaNNaN
15bellast0118711TRONaN2912826323...23.04.04.092.0NaNNaNNaNNaNNaN
16berkena0118711PH1NaN14000...0.00.00.003.0NaNNaNNaNNaNNaN
17berryto0118711PH1NaN14010...0.00.00.000.0NaNNaNNaNNaNNaN
18berthha0118711WS3NaN177317171...8.03.01.042.0NaNNaNNaNNaNNaN
19biermch0118711FW1NaN12000...0.00.00.010.0NaNNaNNaNNaNNaN
20birdge0118711RC1NaN2510619282...13.01.00.032.0NaNNaNNaNNaNNaN
21birdsda0118711BS1NaN2915251463...24.06.00.044.0NaNNaNNaNNaNNaN
22brainas0118711WS3NaN3013424304...21.04.00.072.0NaNNaNNaNNaNNaN
23brannmi0118711CH1NaN314210...0.00.00.000.0NaNNaNNaNNaNNaN
24burrohe0118711WS3NaN126311152...14.00.00.011.0NaNNaNNaNNaNNaN
25careyto0118711FW1NaN198716202...10.05.00.021.0NaNNaNNaNNaNNaN
26carleji0118711CL1NaN2912731328...18.02.01.083.0NaNNaNNaNNaNNaN
27conefr0118711BS1NaN197717203...16.012.01.082.0NaNNaNNaNNaNNaN
28connone0118711TRONaN733670...2.00.00.000.0NaNNaNNaNNaNNaN
29cravebi0118711TRONaN2711826388...26.06.03.030.0NaNNaNNaNNaNNaN
..................................................................
102786wittgni0120161MIANL480000...0.00.00.000.00.00.00.00.00.0
102787wolteto0120161COLNL71205275315...30.04.01.02153.02.00.04.00.01.0
102788wongko0120161SLNNL12131339757...23.07.00.03452.02.09.00.05.03.0
102789woodal0220161LANNL1516240...2.00.00.019.00.00.02.00.00.0
102790woodbl0120161CINNL702000...0.00.00.002.00.00.00.00.00.0
102791woodtr0120161CHNNL8111020...1.00.00.015.00.00.00.00.00.0
102792worleva0120161BALAL350000...0.00.00.000.00.00.00.00.00.0
102793worthda0120161HOUAL1639472...1.00.00.016.00.00.00.00.01.0
102794wrighda0320161NYNNL3713718318...14.03.02.02655.00.00.00.00.00.0
102795wrighda0420161CINNL45000...0.00.00.002.00.00.01.00.00.0
102796wrighda0420162LAAAL50000...0.00.00.000.00.00.00.00.00.0
102797wrighmi0120161BALAL180000...0.00.00.000.00.00.00.00.00.0
102798wrighst0120161BOSAL254000...0.00.00.003.00.00.00.00.00.0
102799yateski0120161NYAAL410000...0.00.00.000.00.00.00.00.00.0
102800yelicch0120161MIANL1555787817238...98.09.04.072138.04.04.00.05.020.0
102801ynoaga0120161NYNNL103000...0.00.00.000.00.00.00.00.00.0
102802ynoami0120161CHAAL230000...0.00.00.000.00.00.00.00.00.0
102803ynoara0120161COLNL35000...0.00.00.002.00.00.00.00.00.0
102804youngch0320161KCAAL341000...0.00.00.000.00.00.00.00.00.0
102805youngch0420161BOSAL76203295618...24.04.02.02150.00.03.00.00.04.0
102806younger0320161NYAAL61200...0.01.00.000.00.00.00.00.00.0
102807youngma0320161ATLNL80000...0.00.00.000.00.00.00.00.00.0
102808zastrro0120161CHNNL83000...0.00.00.002.00.00.00.00.00.0
102809zieglbr0120161ARINL360000...0.00.00.000.00.00.00.00.00.0
102810zieglbr0120162BOSAL330000...0.00.00.000.00.00.00.00.00.0
102811zimmejo0220161DETAL194010...0.00.00.002.00.00.01.00.00.0
102812zimmery0120161WASNL115427609318...46.04.01.029104.01.05.00.06.012.0
102813zobribe0120161CHNNL1475239414231...76.06.04.09682.06.04.04.04.017.0
102814zuninmi0120161SEAAL5516416347...31.00.00.02165.00.06.00.01.00.0
102815zychto0120161SEAAL120000...0.00.00.000.00.00.00.00.00.0
\n", + "

102816 rows × 22 columns

\n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... \\\n", + "0 abercda01 1871 1 TRO NaN 1 4 0 0 0 ... \n", + "1 addybo01 1871 1 RC1 NaN 25 118 30 32 6 ... \n", + "2 allisar01 1871 1 CL1 NaN 29 137 28 40 4 ... \n", + "3 allisdo01 1871 1 WS3 NaN 27 133 28 44 10 ... \n", + "4 ansonca01 1871 1 RC1 NaN 25 120 29 39 11 ... \n", + "5 armstbo01 1871 1 FW1 NaN 12 49 9 11 2 ... \n", + "6 barkeal01 1871 1 RC1 NaN 1 4 0 1 0 ... \n", + "7 barnero01 1871 1 BS1 NaN 31 157 66 63 10 ... \n", + "8 barrebi01 1871 1 FW1 NaN 1 5 1 1 1 ... \n", + "9 barrofr01 1871 1 BS1 NaN 18 86 13 13 2 ... \n", + "10 bassjo01 1871 1 CL1 NaN 22 89 18 27 1 ... \n", + "11 battijo01 1871 1 CL1 NaN 1 3 0 0 0 ... \n", + "12 bealsto01 1871 1 WS3 NaN 10 36 6 7 0 ... \n", + "13 beaveed01 1871 1 TRO NaN 3 15 7 6 0 ... \n", + "14 bechtge01 1871 1 PH1 NaN 20 94 24 33 9 ... \n", + "15 bellast01 1871 1 TRO NaN 29 128 26 32 3 ... \n", + "16 berkena01 1871 1 PH1 NaN 1 4 0 0 0 ... \n", + "17 berryto01 1871 1 PH1 NaN 1 4 0 1 0 ... \n", + "18 berthha01 1871 1 WS3 NaN 17 73 17 17 1 ... \n", + "19 biermch01 1871 1 FW1 NaN 1 2 0 0 0 ... \n", + "20 birdge01 1871 1 RC1 NaN 25 106 19 28 2 ... \n", + "21 birdsda01 1871 1 BS1 NaN 29 152 51 46 3 ... \n", + "22 brainas01 1871 1 WS3 NaN 30 134 24 30 4 ... \n", + "23 brannmi01 1871 1 CH1 NaN 3 14 2 1 0 ... \n", + "24 burrohe01 1871 1 WS3 NaN 12 63 11 15 2 ... \n", + "25 careyto01 1871 1 FW1 NaN 19 87 16 20 2 ... \n", + "26 carleji01 1871 1 CL1 NaN 29 127 31 32 8 ... \n", + "27 conefr01 1871 1 BS1 NaN 19 77 17 20 3 ... \n", + "28 connone01 1871 1 TRO NaN 7 33 6 7 0 ... \n", + "29 cravebi01 1871 1 TRO NaN 27 118 26 38 8 ... \n", + "... ... ... ... ... ... ... ... .. ... .. ... \n", + "102786 wittgni01 2016 1 MIA NL 48 0 0 0 0 ... \n", + "102787 wolteto01 2016 1 COL NL 71 205 27 53 15 ... \n", + "102788 wongko01 2016 1 SLN NL 121 313 39 75 7 ... \n", + "102789 woodal02 2016 1 LAN NL 15 16 2 4 0 ... \n", + "102790 woodbl01 2016 1 CIN NL 70 2 0 0 0 ... 
\n", + "102791 woodtr01 2016 1 CHN NL 81 11 0 2 0 ... \n", + "102792 worleva01 2016 1 BAL AL 35 0 0 0 0 ... \n", + "102793 worthda01 2016 1 HOU AL 16 39 4 7 2 ... \n", + "102794 wrighda03 2016 1 NYN NL 37 137 18 31 8 ... \n", + "102795 wrighda04 2016 1 CIN NL 4 5 0 0 0 ... \n", + "102796 wrighda04 2016 2 LAA AL 5 0 0 0 0 ... \n", + "102797 wrighmi01 2016 1 BAL AL 18 0 0 0 0 ... \n", + "102798 wrighst01 2016 1 BOS AL 25 4 0 0 0 ... \n", + "102799 yateski01 2016 1 NYA AL 41 0 0 0 0 ... \n", + "102800 yelicch01 2016 1 MIA NL 155 578 78 172 38 ... \n", + "102801 ynoaga01 2016 1 NYN NL 10 3 0 0 0 ... \n", + "102802 ynoami01 2016 1 CHA AL 23 0 0 0 0 ... \n", + "102803 ynoara01 2016 1 COL NL 3 5 0 0 0 ... \n", + "102804 youngch03 2016 1 KCA AL 34 1 0 0 0 ... \n", + "102805 youngch04 2016 1 BOS AL 76 203 29 56 18 ... \n", + "102806 younger03 2016 1 NYA AL 6 1 2 0 0 ... \n", + "102807 youngma03 2016 1 ATL NL 8 0 0 0 0 ... \n", + "102808 zastrro01 2016 1 CHN NL 8 3 0 0 0 ... \n", + "102809 zieglbr01 2016 1 ARI NL 36 0 0 0 0 ... \n", + "102810 zieglbr01 2016 2 BOS AL 33 0 0 0 0 ... \n", + "102811 zimmejo02 2016 1 DET AL 19 4 0 1 0 ... \n", + "102812 zimmery01 2016 1 WAS NL 115 427 60 93 18 ... \n", + "102813 zobribe01 2016 1 CHN NL 147 523 94 142 31 ... \n", + "102814 zuninmi01 2016 1 SEA AL 55 164 16 34 7 ... \n", + "102815 zychto01 2016 1 SEA AL 12 0 0 0 0 ... 
\n", + "\n", + " RBI SB CS BB SO IBB HBP SH SF GIDP \n", + "0 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "1 13.0 8.0 1.0 4 0.0 NaN NaN NaN NaN NaN \n", + "2 19.0 3.0 1.0 2 5.0 NaN NaN NaN NaN NaN \n", + "3 27.0 1.0 1.0 0 2.0 NaN NaN NaN NaN NaN \n", + "4 16.0 6.0 2.0 2 1.0 NaN NaN NaN NaN NaN \n", + "5 5.0 0.0 1.0 0 1.0 NaN NaN NaN NaN NaN \n", + "6 2.0 0.0 0.0 1 0.0 NaN NaN NaN NaN NaN \n", + "7 34.0 11.0 6.0 13 1.0 NaN NaN NaN NaN NaN \n", + "8 1.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "9 11.0 1.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "10 18.0 0.0 1.0 3 4.0 NaN NaN NaN NaN NaN \n", + "11 0.0 0.0 0.0 1 0.0 NaN NaN NaN NaN NaN \n", + "12 1.0 2.0 0.0 2 0.0 NaN NaN NaN NaN NaN \n", + "13 5.0 2.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "14 21.0 4.0 0.0 2 2.0 NaN NaN NaN NaN NaN \n", + "15 23.0 4.0 4.0 9 2.0 NaN NaN NaN NaN NaN \n", + "16 0.0 0.0 0.0 0 3.0 NaN NaN NaN NaN NaN \n", + "17 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "18 8.0 3.0 1.0 4 2.0 NaN NaN NaN NaN NaN \n", + "19 0.0 0.0 0.0 1 0.0 NaN NaN NaN NaN NaN \n", + "20 13.0 1.0 0.0 3 2.0 NaN NaN NaN NaN NaN \n", + "21 24.0 6.0 0.0 4 4.0 NaN NaN NaN NaN NaN \n", + "22 21.0 4.0 0.0 7 2.0 NaN NaN NaN NaN NaN \n", + "23 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "24 14.0 0.0 0.0 1 1.0 NaN NaN NaN NaN NaN \n", + "25 10.0 5.0 0.0 2 1.0 NaN NaN NaN NaN NaN \n", + "26 18.0 2.0 1.0 8 3.0 NaN NaN NaN NaN NaN \n", + "27 16.0 12.0 1.0 8 2.0 NaN NaN NaN NaN NaN \n", + "28 2.0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "29 26.0 6.0 3.0 3 0.0 NaN NaN NaN NaN NaN \n", + "... ... ... ... .. ... ... ... ... ... ... 
\n", + "102786 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102787 30.0 4.0 1.0 21 53.0 2.0 0.0 4.0 0.0 1.0 \n", + "102788 23.0 7.0 0.0 34 52.0 2.0 9.0 0.0 5.0 3.0 \n", + "102789 2.0 0.0 0.0 1 9.0 0.0 0.0 2.0 0.0 0.0 \n", + "102790 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", + "102791 1.0 0.0 0.0 1 5.0 0.0 0.0 0.0 0.0 0.0 \n", + "102792 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102793 1.0 0.0 0.0 1 6.0 0.0 0.0 0.0 0.0 1.0 \n", + "102794 14.0 3.0 2.0 26 55.0 0.0 0.0 0.0 0.0 0.0 \n", + "102795 0.0 0.0 0.0 0 2.0 0.0 0.0 1.0 0.0 0.0 \n", + "102796 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102797 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102798 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0 \n", + "102799 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102800 98.0 9.0 4.0 72 138.0 4.0 4.0 0.0 5.0 20.0 \n", + "102801 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102802 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102803 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", + "102804 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102805 24.0 4.0 2.0 21 50.0 0.0 3.0 0.0 0.0 4.0 \n", + "102806 0.0 1.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102807 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102808 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", + "102809 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102810 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102811 0.0 0.0 0.0 0 2.0 0.0 0.0 1.0 0.0 0.0 \n", + "102812 46.0 4.0 1.0 29 104.0 1.0 5.0 0.0 6.0 12.0 \n", + "102813 76.0 6.0 4.0 96 82.0 6.0 4.0 4.0 4.0 17.0 \n", + "102814 31.0 0.0 0.0 21 65.0 0.0 6.0 0.0 1.0 0.0 \n", + "102815 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "\n", + "[102816 rows x 22 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we will soon see, Pandas supports some typical \"Pythonic\" idioms for accessing data. The first we will encounter is `len()`. 
We can get the size of this dataset (in rows) with the standard Python `len()` function, which returns the number of rows." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "102816" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Every DataFrame has a `columns` attribute, which holds the _column index_ for our dataset. Taking the length of that attribute therefore gives the number of columns." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['playerID', 'yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H',\n", + " '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH',\n", + " 'SF', 'GIDP'],\n", + " dtype='object')" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "22" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(df.columns)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we want both the row and column counts, [`DataFrame.shape`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shape.html#pandas.DataFrame.shape) returns them as a `(rows, columns)` tuple:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(102816, 22)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Which returns what we 
expect (yet again).\n", + "\n", + "Much like slicing a Python list, if we want the first _n_ rows of data, we can use the shorthand:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... RBI \\\n", + "0 abercda01 1871 1 TRO NaN 1 4 0 0 0 ... 0.0 \n", + "1 addybo01 1871 1 RC1 NaN 25 118 30 32 6 ... 13.0 \n", + "2 allisar01 1871 1 CL1 NaN 29 137 28 40 4 ... 19.0 \n", + "3 allisdo01 1871 1 WS3 NaN 27 133 28 44 10 ... 27.0 \n", + "4 ansonca01 1871 1 RC1 NaN 25 120 29 39 11 ... 16.0 \n", + "5 armstbo01 1871 1 FW1 NaN 12 49 9 11 2 ... 5.0 \n", + "6 barkeal01 1871 1 RC1 NaN 1 4 0 1 0 ... 2.0 \n", + "7 barnero01 1871 1 BS1 NaN 31 157 66 63 10 ... 34.0 \n", + "8 barrebi01 1871 1 FW1 NaN 1 5 1 1 1 ... 1.0 \n", + "9 barrofr01 1871 1 BS1 NaN 18 86 13 13 2 ... 11.0 \n", + "\n", + " SB CS BB SO IBB HBP SH SF GIDP \n", + "0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "1 8.0 1.0 4 0.0 NaN NaN NaN NaN NaN \n", + "2 3.0 1.0 2 5.0 NaN NaN NaN NaN NaN \n", + "3 1.0 1.0 0 2.0 NaN NaN NaN NaN NaN \n", + "4 6.0 2.0 2 1.0 NaN NaN NaN NaN NaN \n", + "5 0.0 1.0 0 1.0 NaN NaN NaN NaN NaN \n", + "6 0.0 0.0 1 0.0 NaN NaN NaN NaN NaN \n", + "7 11.0 6.0 13 1.0 NaN NaN NaN NaN NaN \n", + "8 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "9 1.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "\n", + "[10 rows x 22 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or just like slicing a list, we can do more complex slicing:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... RBI \\\n", + "0 abercda01 1871 1 TRO NaN 1 4 0 0 0 ... 0.0 \n", + "5 armstbo01 1871 1 FW1 NaN 12 49 9 11 2 ... 5.0 \n", + "10 bassjo01 1871 1 CL1 NaN 22 89 18 27 1 ... 18.0 \n", + "15 bellast01 1871 1 TRO NaN 29 128 26 32 3 ... 23.0 \n", + "20 birdge01 1871 1 RC1 NaN 25 106 19 28 2 ... 13.0 \n", + "25 careyto01 1871 1 FW1 NaN 19 87 16 20 2 ... 10.0 \n", + "30 cuthbne01 1871 1 PH1 NaN 28 150 47 37 7 ... 30.0 \n", + "35 ewellge01 1871 1 CL1 NaN 1 3 0 0 0 ... 0.0 \n", + "40 flowedi01 1871 1 TRO NaN 21 105 39 33 5 ... 18.0 \n", + "45 fulmech01 1871 1 RC1 NaN 16 63 11 17 1 ... 3.0 \n", + "\n", + " SB CS BB SO IBB HBP SH SF GIDP \n", + "0 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "5 0.0 1.0 0 1.0 NaN NaN NaN NaN NaN \n", + "10 0.0 1.0 3 4.0 NaN NaN NaN NaN NaN \n", + "15 4.0 4.0 9 2.0 NaN NaN NaN NaN NaN \n", + "20 1.0 0.0 3 2.0 NaN NaN NaN NaN NaN \n", + "25 5.0 0.0 2 1.0 NaN NaN NaN NaN NaN \n", + "30 16.0 2.0 10 2.0 NaN NaN NaN NaN NaN \n", + "35 0.0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "40 8.0 2.0 4 0.0 NaN NaN NaN NaN NaN \n", + "45 0.0 0.0 5 1.0 NaN NaN NaN NaN NaN \n", + "\n", + "[10 rows x 22 columns]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[:50:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Accessing column data by label\n", + "\n", + "One of the nice things about Pandas is that we can reference the columns of data by their names (or labels). For example, we have a `yearID` label, a `teamID` label, a `G` label for game counts, and so on. To learn what each label in our dataset means, see the documentation at the provided links."
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 1871\n", + "1 1871\n", + "2 1871\n", + "3 1871\n", + "4 1871\n", + "5 1871\n", + "6 1871\n", + "7 1871\n", + "8 1871\n", + "9 1871\n", + "Name: yearID, dtype: int64" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.yearID[:10]" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "102806 6\n", + "102807 8\n", + "102808 8\n", + "102809 36\n", + "102810 33\n", + "102811 19\n", + "102812 115\n", + "102813 147\n", + "102814 55\n", + "102815 12\n", + "Name: G, dtype: int64" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.G[-10:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's say we want all the player data for the [Washington Nationals](https://www.mlb.com/nationals) from 2015 and 2016 where a player played in 100 or more games:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... \\\n", + "100193 desmoia01 2015 1 WAS NL 156 583 69 136 27 ... \n", + "100250 escobyu01 2015 1 WAS NL 139 535 75 168 25 ... \n", + "100251 espinda01 2015 1 WAS NL 118 367 59 88 21 ... \n", + "100422 harpebr03 2015 1 WAS NL 153 521 118 172 38 ... \n", + "100950 ramoswi01 2015 1 WAS NL 128 475 41 109 16 ... \n", + "100993 robincl01 2015 1 WAS NL 126 309 44 84 15 ... \n", + "101176 taylomi02 2015 1 WAS NL 138 472 49 108 15 ... \n", + "101725 espinda01 2016 1 WAS NL 157 516 66 108 15 ... \n", + "101895 harpebr03 2016 1 WAS NL 147 506 84 123 24 ... \n", + "102245 murphda08 2016 1 WAS NL 142 531 88 184 47 ... \n", + "102429 ramoswi01 2016 1 WAS NL 131 482 58 148 25 ... \n", + "102449 rendoan01 2016 1 WAS NL 156 567 91 153 38 ... \n", + "102451 reverbe01 2016 1 WAS NL 103 350 44 76 9 ... \n", + "102472 robincl01 2016 1 WAS NL 104 196 16 46 4 ... \n", + "102763 werthja01 2016 1 WAS NL 143 525 84 128 28 ... \n", + "102812 zimmery01 2016 1 WAS NL 115 427 60 93 18 ... 
\n", + "\n", + " RBI SB CS BB SO IBB HBP SH SF GIDP \n", + "100193 62.0 13.0 5.0 45 187.0 0.0 3.0 6.0 4.0 9.0 \n", + "100250 56.0 2.0 2.0 45 70.0 0.0 8.0 1.0 2.0 24.0 \n", + "100251 37.0 5.0 2.0 33 106.0 5.0 6.0 3.0 3.0 6.0 \n", + "100422 99.0 6.0 4.0 124 131.0 15.0 5.0 0.0 4.0 15.0 \n", + "100950 68.0 0.0 0.0 21 101.0 2.0 0.0 0.0 8.0 16.0 \n", + "100993 34.0 0.0 0.0 37 52.0 4.0 5.0 0.0 1.0 6.0 \n", + "101176 63.0 16.0 3.0 35 158.0 9.0 1.0 1.0 2.0 5.0 \n", + "101725 72.0 9.0 2.0 54 174.0 12.0 20.0 7.0 4.0 4.0 \n", + "101895 86.0 21.0 10.0 108 117.0 20.0 3.0 0.0 10.0 11.0 \n", + "102245 104.0 5.0 3.0 35 57.0 10.0 8.0 0.0 8.0 4.0 \n", + "102429 80.0 0.0 0.0 35 79.0 2.0 2.0 0.0 4.0 17.0 \n", + "102449 85.0 12.0 6.0 65 117.0 2.0 7.0 0.0 8.0 5.0 \n", + "102451 24.0 14.0 5.0 18 34.0 0.0 3.0 2.0 2.0 12.0 \n", + "102472 26.0 0.0 0.0 20 38.0 0.0 2.0 1.0 5.0 4.0 \n", + "102763 69.0 5.0 1.0 71 139.0 0.0 4.0 0.0 6.0 17.0 \n", + "102812 46.0 4.0 1.0 29 104.0 1.0 5.0 0.0 6.0 12.0 \n", + "\n", + "[16 rows x 22 columns]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_was = df[(df.yearID > 2014) & (df.teamID == 'WAS') & (df.G > 99)]\n", + "df_was" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll put all these things in motion later, but for now put a pin in this for future reference. __NOTE__: we'll need to access the dataset that crosswalks the `playerID` with the actual player name and vitals, but we'll leave that as an exercise for interested readers (hint: take a look [in this dataset](https://github.com/chadwickbureau/baseballdatabank/blob/master/core/People.csv))." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Excel\n", + "\n", + "Loading Excel data is nearly as easy as loading CSV data. This time we'll use a different data source and show how to access it in a slightly different manner. 
Instead of the _local_ file source, we will use a _remote URL_ for the resource. This will show us exactly how easy it is to seamlessly interchange various data resources. \n", + "\n", + "**DATA SOURCES**\n", + "\n", + "* [US Bureau of Transportation Statistics | Airline Employment Data](https://www.bts.gov/newsroom/may-2017-passenger-airline-employment-data) which includes data for year-over-year percentage change in employment for workers in the passenger airline industry\n", + "\n", + "To read the data set, we will access it by URL using the [`pandas.read_excel()` function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html#pandas.read_excel). Note that we pass `sheet_name=None` (spelled `sheetname` in pandas releases before 0.21), which reads every sheet and returns a dictionary of DataFrames keyed by sheet name for easy lookup." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "xl = pd.read_excel(\n", + " \"https://www.bts.gov/sites/bts.dot.gov/files/docs/newsroom/206581/airline-employment-press-tables-web.xlsx\",\n", + " sheet_name=None)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice now, if we want to access the _sheet_ called `Table1` we can easily do this in a Pythonic way, much like any other dictionary. The result is the DataFrame representation of that _sheet_." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " Table 1: Yearly Change in Scheduled Passenger Airline Full-time Equivalent Employees* by Airline Group \\\n", + "0 Most recent 13 months - percent change from sa... \n", + "1 NaN \n", + "2 May 2015 - May 2016 \n", + "3 Jun 2015 - Jun 2016 \n", + "4 Jul 2015 - Jul 2016 \n", + "5 Aug 2015 - Aug 2016 \n", + "6 Sep 2015 - Sep 2016 \n", + "7 Oct 2015 - Oct 2016 \n", + "8 Nov 2015 - Nov 2016 \n", + "9 Dec 2015 - Dec 2016 \n", + "10 Jan 2016 - Jan 2017 \n", + "11 Feb 2016 - Feb 2017 \n", + "12 Mar 2016 - Mar 2017 \n", + "13 Apr 2016 - Apr 2017 \n", + "14 May 2016 - May 2017 \n", + "15 Source: Bureau of Transportation Statistics \n", + "16 * Full-time Equivalent Employee (FTE) calculat... \n", + "17 ** Includes network, low-cost, regional and ot... \n", + "18 Note: Percent changes based on numbers prior t... \n", + "19 Note: See Table 2 for all passenger airlines, ... \n", + "\n", + " Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 \\\n", + "0 NaN NaN NaN NaN \n", + "1 Network Airlines Low-Cost Airlines Regional Airlines Other Airlines \n", + "2 2.3 10.7 0.2 9.3 \n", + "3 2.3 11 0.9 10.6 \n", + "4 2.4 11.3 3.3 11.2 \n", + "5 2.5 11 3.3 11.9 \n", + "6 2.6 10.6 2.9 13 \n", + "7 2.7 10.3 0.3 12.7 \n", + "8 2.3 9.8 0.2 13.5 \n", + "9 2.4 9.5 0.2 13.7 \n", + "10 2.3 9.7 1.9 12.7 \n", + "11 2.4 9.4 2.4 11.8 \n", + "12 2.7 9.1 2 11.7 \n", + "13 2.6 8.5 2.1 10.7 \n", + "14 2.4 8.3 2.5 4.2 \n", + "15 NaN NaN NaN NaN \n", + "16 NaN NaN NaN NaN \n", + "17 NaN NaN NaN NaN \n", + "18 NaN NaN NaN NaN \n", + "19 NaN NaN NaN NaN \n", + "\n", + " Unnamed: 5 \n", + "0 NaN \n", + "1 All Passenger Airlines ** \n", + "2 3.7 \n", + "3 3.9 \n", + "4 4.3 \n", + "5 4.3 \n", + "6 4.3 \n", + "7 4 \n", + "8 3.7 \n", + "9 3.7 \n", + "10 3.9 \n", + "11 3.9 \n", + "12 4 \n", + "13 3.9 \n", + "14 3.6 \n", + "15 NaN \n", + "16 NaN \n", + "17 NaN \n", + "18 NaN \n", + "19 NaN " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + 
"source": [ + "xl_tbl1 = xl['Table1']\n", + "xl_tbl1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One problem we have here is that the data is not exactly as clean as we want it to be. We'll spend more time talking about the `iloc` indexer in the next section, but for now, let's get a flavor for how we might clean this up so it is more usable." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Table 1: Yearly Change in Scheduled Passenger Airline Full-time Equivalent Employees* by Airline Group\n", + "2 May 2015 - May 2016 \n", + "3 Jun 2015 - Jun 2016 \n", + "4 Jul 2015 - Jul 2016 \n", + "5 Aug 2015 - Aug 2016 \n", + "6 Sep 2015 - Sep 2016 \n", + "7 Oct 2015 - Oct 2016 \n", + "8 Nov 2015 - Nov 2016 \n", + "9 Dec 2015 - Dec 2016 \n", + "10 Jan 2016 - Jan 2017 \n", + "11 Feb 2016 - Feb 2017 \n", + "12 Mar 2016 - Mar 2017 \n", + "13 Apr 2016 - Apr 2017 \n", + "14 May 2016 - May 2017 \n", + "Unnamed: 1 Network Airlines\n", + "Unnamed: 2 Low-Cost Airlines\n", + "Unnamed: 3 Regional Airlines\n", + "Unnamed: 4 Other Airlines\n", + "Unnamed: 5 All Passenger Airlines **\n", + "Name: 1, dtype: object\n" + ] + } + ], + "source": [ + "# select the row labels: rows 2-14 of the first column\n", + "idx = xl_tbl1.iloc[2:15, 0:1]\n", + "\n", + "# select the column labels: row 1, every column after the first\n", + "col = xl_tbl1.iloc[1,1:]\n", + "\n", + "print(idx)\n", + "print(col)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['May 2015 - May 2016', 'Jun 2015 - Jun 2016', 'Jul 2015 - Jul 2016',\n", + " 'Aug 2015 - Aug 2016', 'Sep 2015 - Sep 2016', 'Oct 2015 - Oct 2016',\n", + " 'Nov 2015 - Nov 2016', 'Dec 2015 - Dec 2016', 'Jan 2016 - Jan 2017',\n", + " 'Feb 2016 - Feb 2017', 'Mar 2016 - Mar 2017', 'Apr 2016 - Apr 2017',\n", + " 'May 2016 - May 2017'],\n", + " 
dtype='object')" + ] + }, + "execution_count": 
16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# we'll create the index object\n", + "idxs = pd.Index([v[0] for v in idx.values])\n", + "idxs" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Network Airlines',\n", + " 'Low-Cost Airlines',\n", + " 'Regional Airlines',\n", + " 'Other Airlines',\n", + " 'All Passenger Airlines **']" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set the columns\n", + "cols = [v for v in col.values]\n", + "cols" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[2.3, 10.7, 0.2, 9.3, 3.7],\n", + " [2.3, 11, 0.9, 10.6, 3.9],\n", + " [2.4, 11.3, 3.3, 11.2, 4.3],\n", + " [2.5, 11, 3.3, 11.9, 4.3],\n", + " [2.6, 10.6, 2.9, 13, 4.3],\n", + " [2.7, 10.3, 0.3, 12.7, 4],\n", + " [2.3, 9.8, 0.2, 13.5, 3.7],\n", + " [2.4, 9.5, 0.2, 13.7, 3.7],\n", + " [2.3, 9.7, 1.9, 12.7, 3.9],\n", + " [2.4, 9.4, 2.4, 11.8, 3.9],\n", + " [2.7, 9.1, 2, 11.7, 4],\n", + " [2.6, 8.5, 2.1, 10.7, 3.9],\n", + " [2.4, 8.3, 2.5, 4.2, 3.6]], dtype=object)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# now for the data\n", + "data = xl_tbl1.iloc[2:15,1:].values\n", + "data" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Network Airlines Low-Cost Airlines Regional Airlines \\\n", + "May 2015 - May 2016 2.3 10.7 0.2 \n", + "Jun 2015 - Jun 2016 2.3 11 0.9 \n", + "Jul 2015 - Jul 2016 2.4 11.3 3.3 \n", + "Aug 2015 - Aug 2016 2.5 11 3.3 \n", + "Sep 2015 - Sep 2016 2.6 10.6 2.9 \n", + "Oct 2015 - Oct 2016 2.7 10.3 0.3 \n", + "Nov 2015 - Nov 2016 2.3 9.8 0.2 \n", + "Dec 2015 - Dec 2016 2.4 9.5 0.2 \n", + "Jan 2016 - Jan 2017 2.3 9.7 1.9 \n", + "Feb 2016 - Feb 2017 2.4 9.4 2.4 \n", + "Mar 2016 - Mar 2017 2.7 9.1 2 \n", + "Apr 2016 - Apr 2017 2.6 8.5 2.1 \n", + "May 2016 - May 2017 2.4 8.3 2.5 \n", + "\n", + " Other Airlines All Passenger Airlines ** \n", + "May 2015 - May 2016 9.3 3.7 \n", + "Jun 2015 - Jun 2016 10.6 3.9 \n", + "Jul 2015 - Jul 2016 11.2 4.3 \n", + "Aug 2015 - Aug 2016 11.9 4.3 \n", + "Sep 2015 - Sep 2016 13 4.3 \n", + "Oct 2015 - Oct 2016 12.7 4 \n", + "Nov 2015 - Nov 2016 13.5 3.7 \n", + "Dec 2015 - Dec 2016 13.7 3.7 \n", + "Jan 2016 - Jan 2017 12.7 3.9 \n", + "Feb 2016 - Feb 2017 11.8 3.9 \n", + "Mar 2016 - Mar 2017 11.7 4 \n", + "Apr 2016 - Apr 2017 10.7 3.9 \n", + "May 2016 - May 2017 4.2 3.6 " + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# putting it all together ...\n", + "df_tbl1 = pd.DataFrame(data=xl_tbl1.iloc[2:15,1:].values,\n", + " columns=[v for v in col.values], \n", + " index=pd.Index([v[0] for v in idx.values]))\n", + "df_tbl1" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "May 2015 - May 2016 2.3\n", + "Jun 2015 - Jun 2016 2.3\n", + "Jul 2015 - Jul 2016 2.4\n", + "Aug 2015 - Aug 2016 2.5\n", + "Sep 2015 - Sep 2016 2.6\n", + "Oct 2015 - Oct 2016 2.7\n", + "Nov 2015 - Nov 2016 2.3\n", + "Dec 2015 - Dec 2016 2.4\n", + "Jan 2016 - Jan 2017 2.3\n", + "Feb 2016 - Feb 2017 2.4\n", + "Mar 2016 - Mar 2017 2.7\n", + "Apr 2016 - Apr 2017 2.6\n", + "May 2016 - May 2017 2.4\n", + "Name: 
Network Airlines, dtype: object" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_tbl1['Network Airlines']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## JSON\n", + "\n", + "JSON has become a standard format for many web data sources. It is succinct, readable and very portable -- there are libraries in nearly every modern language that can parse JSON, Python being no exception. We'll load a remote JSON data source to demonstrate remote access as well as the capabilities of using Pandas to load such a source.\n", + "\n", + "**JSON DATA SOURCE**\n", + "\n", + "* [Quotes for developers](https://github.com/fortrabbit/quotes) by _fortrabbit_\n", + "\n", + "If you've noticed the pattern by now, it will come as no surprise that we load JSON data with [`pandas.read_json()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html#pandas.read_json).\n", + "\n", + "With JSON data you may get the best results with relatively _flat_ JSON objects. If you need to obtain different results (or you're getting results that are not as expected), you might instead look into the `orient` parameter, which tells Pandas what layout to expect in the JSON and so changes the resulting DataFrame. We'll load the data as-is and reshape our DataFrame for some extra practice." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " author \\\n", + "0 Martin Golding \n", + "1 Unknown \n", + "2 Unknown \n", + "3 Unknown \n", + "4 Unknown \n", + "5 Unknown \n", + "6 Arthur Godfrey \n", + "7 Unknown \n", + "8 Unknown \n", + "9 Unknown \n", + "10 Unknown \n", + "11 Unknown \n", + "12 Unknown \n", + "13 Ted Nelson \n", + "14 Unknown \n", + "15 Martin Luther King Junior \n", + "16 Unknown \n", + "17 Unknown \n", + "18 John Johnson \n", + "19 Oscar Wilde \n", + "20 Miguel de Icaza \n", + "21 Unknown \n", + "22 Unknown \n", + "23 Unknown \n", + "24 Unknown \n", + "25 Unknown \n", + "26 Unknown \n", + "27 Unknown \n", + "28 Unknown \n", + "29 Unknown \n", + ".. ... \n", + "159 Paul Graham \n", + "160 Nikita Popov \n", + "161 Douglas Adams \n", + "162 Cicero \n", + "163 Jeff Atwood \n", + "164 Jeff Atwood \n", + "165 Douglas Crockford \n", + "166 Anna Debenham \n", + "167 Melvin Conway \n", + "168 Melvin Conway \n", + "169 Jeffrey Zeldman \n", + "170 Rick Lemons \n", + "171 Donald E. Knuth \n", + "172 Anna Nachesa \n", + "173 Edsger W. Dijkstra \n", + "174 Jordi Boggiano \n", + "175 Andrei Herasimchuk \n", + "176 Barry Boehm \n", + "177 Daniel Bryant \n", + "178 Daniel Bryant \n", + "179 Jeff Atwood \n", + "180 Robert C. Martin \n", + "181 Cory House \n", + "182 Steve Maguire \n", + "183 David Heinemeier Hansson \n", + "184 Linus Torvalds \n", + "185 Ron Fein \n", + "186 Steve Jobs \n", + "187 Ron Jeffries \n", + "188 Unknown \n", + "\n", + " text \n", + "0 Always code as if the guy who ends up maintain... \n", + "1 All computers wait at the same speed. \n", + "2 A misplaced decimal point will always end up w... \n", + "3 A good programmer looks both ways before cross... \n", + "4 A computer program does what you tell it to do... \n", + "5 \"Intel Inside\" is a Government Warning require... \n", + "6 Common sense gets a lot of credit that belongs... \n", + "7 Chuck Norris doesn’t go hunting. Chuck Norris ... \n", + "8 Chuck Norris counted to infinity... twice. 
\n", + "9 C is quirky, flawed, and an enormous success. \n", + "10 Beta is Latin for still doesn’t work. \n", + "11 ASCII stupid question, get a stupid ANSI! \n", + "12 Artificial Intelligence usually beats natural ... \n", + "13 Any fool can use a computer. Many do. \n", + "14 Hey! It compiles! Ship it! \n", + "15 Hate cannot drive out hate; only love can do t... \n", + "16 Guns don’t kill people. Chuck Norris kills peo... \n", + "17 God is real, unless declared integer. \n", + "18 First, solve the problem. Then, write the code. \n", + "19 Experience is the name everyone gives to their... \n", + "20 Every piece of software written today is likel... \n", + "21 Computers make very fast, very accurate mistakes. \n", + "22 Computers do not solve problems, they execute ... \n", + "23 I have NOT lost my mind—I have it backed up on... \n", + "24 If brute force doesn’t solve your problems, th... \n", + "25 It works on my machine. \n", + "26 Java is, in many ways, C++??. \n", + "27 Keyboard not found...Press any key to continue. \n", + "28 Life would be so much easier if we only had th... \n", + "29 Mac users swear by their Mac, PC users swear a... \n", + ".. ... \n", + "159 OO programming offers a sustainable way to wri... \n", + "160 Ruby is rubbish! PHP is phpantastic! \n", + "161 So long and thanks for all the fish! \n", + "162 If I had more time, I would have written a sho... \n", + "163 The best reaction to \"this is confusing, where... \n", + "164 The older I get, the more I believe that the o... \n", + "165 \"That hardly ever happens\" is another way of s... \n", + "166 Hello, PHP, my old friend. \n", + "167 Organizations which design systems are constra... \n", + "168 In design, complexity is toxic. \n", + "169 Good is the enemy of great, but great is the e... \n", + "170 Don't make the user provide information that t... \n", + "171 You're bound to be unhappy if you optimize eve... \n", + "172 If the programmers like each other, they play ... 
\n", + "173 Simplicity is prerequisite for reliability. \n", + "174 Focus on WHY instead of WHAT in your code will... \n", + "175 The best engineers I know are artists at heart... \n", + "176 Poor management can increase software costs mo... \n", + "177 If you can't deploy your services independentl... \n", + "178 If you can't deploy your services independentl... \n", + "179 No one hates software more than software devel... \n", + "180 The proper use of comments is to compensate fo... \n", + "181 Code is like humor. When you have to explain i... \n", + "182 Fix the cause, not the symptom. \n", + "183 Programmers are constantly making things more ... \n", + "184 People will realize that software is not a pro... \n", + "185 Design is choosing how you will fail. \n", + "186 Focus is saying no to 1000 good ideas. \n", + "187 Code never lies, comments sometimes do. \n", + "188 Be careful with each other, so you can be dang... \n", + "\n", + "[189 rows x 2 columns]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.read_json(\n", + " \"https://raw.githubusercontent.com/fortrabbit/quotes/master/quotes.json\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Though not a best practice, say we wanted to set the author as the index and the text of the quote as the value. In this dataset, that gives us repeated index values. It might make sense if we wanted to access the data this way, but be _very careful doing this in practice_." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " text\n", + "author \n", + "Martin Golding Always code as if the guy who ends up maintain...\n", + "Unknown All computers wait at the same speed.\n", + "Unknown A misplaced decimal point will always end up w...\n", + "Unknown A good programmer looks both ways before cross...\n", + "Unknown A computer program does what you tell it to do...\n", + "Unknown \"Intel Inside\" is a Government Warning require...\n", + "Arthur Godfrey Common sense gets a lot of credit that belongs...\n", + "Unknown Chuck Norris doesn’t go hunting. Chuck Norris ...\n", + "Unknown Chuck Norris counted to infinity... twice.\n", + "Unknown C is quirky, flawed, and an enormous success.\n", + "Unknown Beta is Latin for still doesn’t work.\n", + "Unknown ASCII stupid question, get a stupid ANSI!\n", + "Unknown Artificial Intelligence usually beats natural ...\n", + "Ted Nelson Any fool can use a computer. Many do.\n", + "Unknown Hey! It compiles! Ship it!\n", + "Martin Luther King Junior Hate cannot drive out hate; only love can do t...\n", + "Unknown Guns don’t kill people. Chuck Norris kills peo...\n", + "Unknown God is real, unless declared integer.\n", + "John Johnson First, solve the problem. Then, write the code.\n", + "Oscar Wilde Experience is the name everyone gives to their...\n", + "Miguel de Icaza Every piece of software written today is likel...\n", + "Unknown Computers make very fast, very accurate mistakes.\n", + "Unknown Computers do not solve problems, they execute ...\n", + "Unknown I have NOT lost my mind—I have it backed up on...\n", + "Unknown If brute force doesn’t solve your problems, th...\n", + "Unknown It works on my machine.\n", + "Unknown Java is, in many ways, C++??.\n", + "Unknown Keyboard not found...Press any key to continue.\n", + "Unknown Life would be so much easier if we only had th...\n", + "Unknown Mac users swear by their Mac, PC users swear a...\n", + "... 
...\n", + "Paul Graham OO programming offers a sustainable way to wri...\n", + "Nikita Popov Ruby is rubbish! PHP is phpantastic!\n", + "Douglas Adams So long and thanks for all the fish!\n", + "Cicero If I had more time, I would have written a sho...\n", + "Jeff Atwood The best reaction to \"this is confusing, where...\n", + "Jeff Atwood The older I get, the more I believe that the o...\n", + "Douglas Crockford \"That hardly ever happens\" is another way of s...\n", + "Anna Debenham Hello, PHP, my old friend.\n", + "Melvin Conway Organizations which design systems are constra...\n", + "Melvin Conway In design, complexity is toxic.\n", + "Jeffrey Zeldman Good is the enemy of great, but great is the e...\n", + "Rick Lemons Don't make the user provide information that t...\n", + "Donald E. Knuth You're bound to be unhappy if you optimize eve...\n", + "Anna Nachesa If the programmers like each other, they play ...\n", + "Edsger W. Dijkstra Simplicity is prerequisite for reliability.\n", + "Jordi Boggiano Focus on WHY instead of WHAT in your code will...\n", + "Andrei Herasimchuk The best engineers I know are artists at heart...\n", + "Barry Boehm Poor management can increase software costs mo...\n", + "Daniel Bryant If you can't deploy your services independentl...\n", + "Daniel Bryant If you can't deploy your services independentl...\n", + "Jeff Atwood No one hates software more than software devel...\n", + "Robert C. Martin The proper use of comments is to compensate fo...\n", + "Cory House Code is like humor. 
When you have to explain i...\n", + "Steve Maguire Fix the cause, not the symptom.\n", + "David Heinemeier Hansson Programmers are constantly making things more ...\n", + "Linus Torvalds People will realize that software is not a pro...\n", + "Ron Fein Design is choosing how you will fail.\n", + "Steve Jobs Focus is saying no to 1000 good ideas.\n", + "Ron Jeffries Code never lies, comments sometimes do.\n", + "Unknown Be careful with each other, so you can be dang...\n", + "\n", + "[189 rows x 1 columns]" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set_index('author') moves the column into the index and drops it from the columns\n", + "df1 = df.set_index('author')\n", + "df1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Though we haven't talked about it yet, the [`apply()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply) method gives us a very useful mechanism for filtering data. In this case, we're going to write a cute anonymous function that finds all the quotes by the author `Unknown` whose text contains `jav` (compared in lowercase, so it matches \"Java\")." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " text\n", + "author \n", + "Unknown Java is, in many ways, C++??." + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.loc[\"Unknown\"][df1.loc[\"Unknown\"][\"text\"]\n", + " .apply(lambda v: \"jav\" in v.lower())]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## SQL" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "Loading SQL data into a DataFrame is also supported by Pandas. You might need to take a look at [SQLAlchemy](http://www.sqlalchemy.org/) and its [documentation on creating database engines](http://docs.sqlalchemy.org/en/latest/core/engines.html), as this is the framework Pandas supports directly.\n", + "\n", + "**SQL DATA SOURCE**\n", + "\n", + "* [Jeopardy! Data Analysis](https://github.com/cmohamma/jeopardy) - a SQLite database by _cmohamma_\n", + "\n", + "This file contains a number of tables holding the Jeopardy! game clues, players, wins, categories, etc. We will only use a fraction of the data to demonstrate the SQL capabilities.\n", + "\n", + "Our example will use a [SQLite database](https://sqlite.org/) so that it is self-contained. We'll show how to read a table in full using [`read_sql_table()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html#pandas.read_sql_table) and then how to do ad hoc queries using [`read_sql_query()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html#pandas.read_sql_query)."
+ ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from sqlalchemy import create_engine\n", + "engine = create_engine('sqlite:///datasets/database.sqlite')\n", + "\n", + "with engine.connect() as conn, conn.begin():\n", + " data = pd.read_sql_table('final', conn)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " game_id clue_id value category \\\n", + "0 280 16720 100 HIGH ROLLERS \n", + "1 429 25403 100 OH, CRAPS! \n", + "2 866 51549 100 ROCK & POP \n", + "3 1018 60582 100 LET'S HAVE A BALL \n", + "4 1069 63644 100 WHAT A YEAR! \n", + "5 1473 84364 100 EUROPEAN HISTORY \n", + "6 1635 93864 100 CHRISTIANITY \n", + "7 4166 242419 100 NAME THE DECADE \n", + "8 112 6679 200 ODD ALPHABETS \n", + "9 354 20984 200 SPORTS \n", + "\n", + " clue \\\n", + "0 After an 1891 roulette run, Charles Wells was ... \n", + "1 The combo that totals one shy of \"boxcars\" \n", + "2 It was the last decade in which Cher didn't ha... \n", + "3 Sink it & you've scratched \n", + "4 Dewaele won the Tour de France, Coco Chanel wa... \n", + "5 A former Socialist, he formed the anti-Communi... \n", + "6 According to tradition, Dismas & Gestas were t... \n", + "7 Paul Revere & William Dawes warn colonists tha... \n", + "8 In alphabet radio code, \"B\" is Bravo and \"F\" s... \n", + "9 A filly becomes a mare at this age \n", + "\n", + " strike1 strike2 \\\n", + "0 What is Atlantic City? What is Las Vegas? \n", + "1 What is 11? What is 10? \n", + "2 What are the 1980s? What are the 1970s? \n", + "3 Um... What is the pinball? \n", + "4 What is 1933? What is 1987? \n", + "5 Who was Lenin? Who was Franco? \n", + "6 Who are the thieves? What is Cavalry? \n", + "7 What is the 16th century? What is the 18th century? \n", + "8 What's the Flamingo? What's a Fandango? \n", + "9 What is 3? What is 1? \n", + "\n", + " strike3 answer \n", + "0 What is Monaco? Monte Carlo \n", + "1 What is 9? 5 & 6 \n", + "2 What are the 1990s? 1950s \n", + "3 What is the 8-ball? the cue ball \n", + "4 What is 1927? 1929 \n", + "5 Who was Hitler? Benito Mussolini \n", + "6 What is Mt. Olive? Calvary \n", + "7 What is the 18th century? the 1770s \n", + "8 What's the Flamenco? - you have it written the... Foxtrot \n", + "9 What is 2? 
4 " + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now say we want to find the distribution of players' occupations over the years. Looking at the `players` table, we can write a query that aggregates these occupations easily. \n", + "\n", + "Using [`read_sql_query()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html#pandas.read_sql_query) we can get the job done and dump the result into a DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "query = \"\"\"\n", + " SELECT occupation, count(occupation) as freq FROM players\n", + " WHERE occupation != ''\n", + " GROUP BY occupation \n", + " ORDER BY count(occupation) DESC \n", + " \"\"\"\n", + "\n", + "with engine.connect() as conn, conn.begin():\n", + " occupation_data = pd.read_sql_query(query, conn)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " occupation freq\n", + "0 attorney 380\n", + "1 senior 228\n", + "2 graduate student 212\n", + "3 writer 176\n", + "4 teacher 159\n", + "5 junior 158\n", + "6 law student 120\n", + "7 lawyer 112\n", + "8 homemaker 101\n", + "9 actor 97" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "occupation_data[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we look closely, we can see that there are many occupations that are the same, but labeled differently. For example, \"attorney\" and \"lawyer\", or the various kinds of \"teachers\". Thus, if we just look at the frequencies above, we might be misled into thinking they are correct, when the groupings that make sense sit at a slightly coarser level of granularity than the raw labels capture.\n", + "\n", + "So let's do some data munging with Pandas and see how we might group all the \"teachers\" together.\n", + "\n", + "To do this we'll need to do a few things:\n", + "\n", + "* find all occupations that have `\"teach\"` in them (or `\"teacher\"` if you'd like)\n", + "* remove all of those from the data frame\n", + "* add just the aggregate and apply the generic label \"teacher\"\n", + "* as a bonus, we'll generate the percentages as an additional column\n", + "\n", + "Let's get going!\n", + "\n", + "We are going to make use of a nice convenience attribute, `str`, of the `Series` object. It operates much like Python's built-in `str` type and has a [`contains()`](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling) method, which lets us test whether each value of the Series contains the substring we're looking for. These methods are indeed very useful to have!"
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "freq_all_occupations = occupation_data.freq.sum()\n", + "\n", + "combined_teacher_freq = \\\n", + " occupation_data[\n", + " occupation_data['occupation']\n", + " .str.contains('teach')]\\\n", + " .sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "occupation teacherhigh school teacherhigh school English ...\n", + "freq 830\n", + "dtype: object" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "combined_teacher_freq" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the occupation value is the concatenation of all the matching occupation labels. We want to change that to a single label `\"teacher\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "combined_teacher_freq['occupation'] = 'teacher'" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "occupation teacher\n", + "freq 830\n", + "dtype: object" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "combined_teacher_freq" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now need only drop the individual teacher rows and append the combined row to our original DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# DataFrame.append() was removed in pandas 2.0; pd.concat() is its replacement\n", + "occupation_data = pd.concat(\n", + " [occupation_data[\n", + " ~occupation_data['occupation']\n", + " .str.contains('teach')],\n", + " pd.DataFrame([combined_teacher_freq])],\n", + " ignore_index=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
occupationfreq
4205writer for an online magazine1
4206writer's assistant1
4207writer-producer1
4208writing instructor1
4209yoga instructor1
4210yogurt franchise operator1
4211youth ministry consultant1
4212zoo docent1
4213zoo educator1
4214teacher830
\n", + "
" + ], + "text/plain": [ + " occupation freq\n", + "4205 writer for an online magazine 1\n", + "4206 writer's assistant 1\n", + "4207 writer-producer 1\n", + "4208 writing instructor 1\n", + "4209 yoga instructor 1\n", + "4210 yogurt franchise operator 1\n", + "4211 youth ministry consultant 1\n", + "4212 zoo docent 1\n", + "4213 zoo educator 1\n", + "4214 teacher 830" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "occupation_data[-10:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's add the percentage column and call it `pct`:" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "occupation_data['pct'] = occupation_data['freq']/occupation_data.freq.sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
occupationfreqpct
4214teacher8300.078905
0attorney3800.036125
1senior2280.021675
2graduate student2120.020154
3writer1760.016732
4junior1580.015020
5law student1200.011408
6lawyer1120.010647
7homemaker1010.009602
8actor970.009221
\n", + "
" + ], + "text/plain": [ + " occupation freq pct\n", + "4214 teacher 830 0.078905\n", + "0 attorney 380 0.036125\n", + "1 senior 228 0.021675\n", + "2 graduate student 212 0.020154\n", + "3 writer 176 0.016732\n", + "4 junior 158 0.015020\n", + "5 law student 120 0.011408\n", + "6 lawyer 112 0.010647\n", + "7 homemaker 101 0.009602\n", + "8 actor 97 0.009221" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "occupation_data.sort_values(by='pct', ascending=False)[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can explore how you might make a more complex filter by looking at [`apply`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply), [`applymap`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html#pandas.DataFrame.applymap) and [`aggregate`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html#pandas.DataFrame.aggregate). 
Ξ" + ] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.1" + }, + "toc": { + "colors": { + "hover_highlight": "#DAA520", + "navigate_num": "#000000", + "navigate_text": "#333333", + "running_highlight": "#FF0000", + "selected_highlight": "#FFD700", + "sidebar_border": "#EEEEEE", + "wrapper_background": "#FFFFFF" + }, + "moveMenuLeft": true, + "nav_menu": { + "height": "160px", + "width": "251px" + }, + "navigate_menu": true, + "number_sections": false, + "sideBar": true, + "threshold": 4, + "toc_cell": true, + "toc_section_display": "block", + "toc_window_display": false, + "widenNotebook": false + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/nb/3_dataframe_operations.ipynb b/nb/3_dataframe_operations.ipynb new file mode 100644 index 0000000..9628c5c --- /dev/null +++ b/nb/3_dataframe_operations.ipynb @@ -0,0 +1,6528 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "** NAVIGATION **\n", + "\n", + "**Got Pandas? _Practical Data Wrangling with Pandas_**\n", + "\n", + "* [Introduction](./0_introduction.ipynb)\n", + "1. [Data Structures](./1_data_structures.ipynb)\n", + "2. [Importing Data](./2_importing_data.ipynb)\n", + "3. **Manipulating DataFrames**\n", + "4. [Wrap Up](./4_wrapping_up.ipynb)\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "toc": "true" + }, + "source": [ + "# Table of Contents\n", + "

1  Manipulating DataFrames
1.1  More Selecting
1.1.1  The convenient [] operator (again)
1.1.2  Selecting data by . selector on column and index name
1.1.3  Boolean selecting
1.2  Sorting
1.3  DataFrame manipulation
1.3.1  Adding and dropping columns
1.3.2  Adding and dropping rows
1.4  Advanced indexing
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Manipulating DataFrames" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will review our terminology for a quick moment:\n", + "* **index** : the column and row indices of your Series or DataFrame, the index for each of these may be hiearchical\n", + " * row index : the index along the horizontal dimension, and typically used as the primary index\n", + " * column index : the index along the vertical dimension\n", + " \n", + " \n", + "* **axis** : the numeric designation for the _column_ or _row_ indices; typically `0` is the _column-axis_ and `1` is the _row-axis_. When dealing with multi-indices, the hierarchy within the axis are referred to as _levels_ and accessed similarly \n", + " \n", + " \n", + "**NOTEBOOK OBJECTIVES**\n", + "\n", + "In this notebook we'll:\n", + "\n", + "* explore more complex slicing and selecting, \n", + "* look at DataFrame concatenation and appending,\n", + "* explore Multi-Indices / hierarchical indexing in Pandas." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## More Selecting\n", + "In the example for this section, we're going to go back to our Baseball data set and load the batting statistics into a DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# get the data for players in 2015-16 who played in 100 or more games\n", + "df = pd.read_csv(\"./datasets/Batting.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The convenient `[]` operator (_again_)\n", + "\n", + "As before basic slice selections can be made with the syntax similar to that found in lists using the convenience of the `[]` operator. 
For example, obtaining the first 5 rows of our data, or the last 15." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
0abercda0118711TRONaN14000...0.00.00.000.0NaNNaNNaNNaNNaN
1addybo0118711RC1NaN2511830326...13.08.01.040.0NaNNaNNaNNaNNaN
2allisar0118711CL1NaN2913728404...19.03.01.025.0NaNNaNNaNNaNNaN
3allisdo0118711WS3NaN27133284410...27.01.01.002.0NaNNaNNaNNaNNaN
4ansonca0118711RC1NaN25120293911...16.06.02.021.0NaNNaNNaNNaNNaN
\n", + "

5 rows × 22 columns

\n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... RBI SB \\\n", + "0 abercda01 1871 1 TRO NaN 1 4 0 0 0 ... 0.0 0.0 \n", + "1 addybo01 1871 1 RC1 NaN 25 118 30 32 6 ... 13.0 8.0 \n", + "2 allisar01 1871 1 CL1 NaN 29 137 28 40 4 ... 19.0 3.0 \n", + "3 allisdo01 1871 1 WS3 NaN 27 133 28 44 10 ... 27.0 1.0 \n", + "4 ansonca01 1871 1 RC1 NaN 25 120 29 39 11 ... 16.0 6.0 \n", + "\n", + " CS BB SO IBB HBP SH SF GIDP \n", + "0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "1 1.0 4 0.0 NaN NaN NaN NaN NaN \n", + "2 1.0 2 5.0 NaN NaN NaN NaN NaN \n", + "3 1.0 0 2.0 NaN NaN NaN NaN NaN \n", + "4 2.0 2 1.0 NaN NaN NaN NaN NaN \n", + "\n", + "[5 rows x 22 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
102801ynoaga0120161NYNNL103000...0.00.00.000.00.00.00.00.00.0
102802ynoami0120161CHAAL230000...0.00.00.000.00.00.00.00.00.0
102803ynoara0120161COLNL35000...0.00.00.002.00.00.00.00.00.0
102804youngch0320161KCAAL341000...0.00.00.000.00.00.00.00.00.0
102805youngch0420161BOSAL76203295618...24.04.02.02150.00.03.00.00.04.0
102806younger0320161NYAAL61200...0.01.00.000.00.00.00.00.00.0
102807youngma0320161ATLNL80000...0.00.00.000.00.00.00.00.00.0
102808zastrro0120161CHNNL83000...0.00.00.002.00.00.00.00.00.0
102809zieglbr0120161ARINL360000...0.00.00.000.00.00.00.00.00.0
102810zieglbr0120162BOSAL330000...0.00.00.000.00.00.00.00.00.0
102811zimmejo0220161DETAL194010...0.00.00.002.00.00.01.00.00.0
102812zimmery0120161WASNL115427609318...46.04.01.029104.01.05.00.06.012.0
102813zobribe0120161CHNNL1475239414231...76.06.04.09682.06.04.04.04.017.0
102814zuninmi0120161SEAAL5516416347...31.00.00.02165.00.06.00.01.00.0
102815zychto0120161SEAAL120000...0.00.00.000.00.00.00.00.00.0
\n", + "

15 rows × 22 columns

\n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... \\\n", + "102801 ynoaga01 2016 1 NYN NL 10 3 0 0 0 ... \n", + "102802 ynoami01 2016 1 CHA AL 23 0 0 0 0 ... \n", + "102803 ynoara01 2016 1 COL NL 3 5 0 0 0 ... \n", + "102804 youngch03 2016 1 KCA AL 34 1 0 0 0 ... \n", + "102805 youngch04 2016 1 BOS AL 76 203 29 56 18 ... \n", + "102806 younger03 2016 1 NYA AL 6 1 2 0 0 ... \n", + "102807 youngma03 2016 1 ATL NL 8 0 0 0 0 ... \n", + "102808 zastrro01 2016 1 CHN NL 8 3 0 0 0 ... \n", + "102809 zieglbr01 2016 1 ARI NL 36 0 0 0 0 ... \n", + "102810 zieglbr01 2016 2 BOS AL 33 0 0 0 0 ... \n", + "102811 zimmejo02 2016 1 DET AL 19 4 0 1 0 ... \n", + "102812 zimmery01 2016 1 WAS NL 115 427 60 93 18 ... \n", + "102813 zobribe01 2016 1 CHN NL 147 523 94 142 31 ... \n", + "102814 zuninmi01 2016 1 SEA AL 55 164 16 34 7 ... \n", + "102815 zychto01 2016 1 SEA AL 12 0 0 0 0 ... \n", + "\n", + " RBI SB CS BB SO IBB HBP SH SF GIDP \n", + "102801 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102802 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102803 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", + "102804 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102805 24.0 4.0 2.0 21 50.0 0.0 3.0 0.0 0.0 4.0 \n", + "102806 0.0 1.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102807 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102808 0.0 0.0 0.0 0 2.0 0.0 0.0 0.0 0.0 0.0 \n", + "102809 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102810 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "102811 0.0 0.0 0.0 0 2.0 0.0 0.0 1.0 0.0 0.0 \n", + "102812 46.0 4.0 1.0 29 104.0 1.0 5.0 0.0 6.0 12.0 \n", + "102813 76.0 6.0 4.0 96 82.0 6.0 4.0 4.0 4.0 17.0 \n", + "102814 31.0 0.0 0.0 21 65.0 0.0 6.0 0.0 1.0 0.0 \n", + "102815 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "\n", + "[15 rows x 22 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[-15:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ 
+ "We mostly worked on _row slicing_ with the `[]` selector, but if we pass a _column label_ or **list** of the columns we'd like, say the `RBI` and `G` (games played) data, we get mostly what we'd expect:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df[\"RBI\"][:5]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df[[\"RBI\", \"G\"]][:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Selecting data by `.` selector on column and index name\n", + "\n", + "We can obtain _column_ data by column labels (note that the column index was loaded for us when we read the file into the DataFrame). For example to get all the `RBI` data:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.0\n", + "1 13.0\n", + "2 19.0\n", + "3 27.0\n", + "4 16.0\n", + "5 5.0\n", + "6 2.0\n", + "7 34.0\n", + "8 1.0\n", + "9 11.0\n", + "Name: RBI, dtype: float64" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.RBI[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similarly, we can pass a **list** of the columns we'd like, so let's get the `RBI` and `G` (games played) data:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
RBIG
00.01
113.025
219.029
327.027
416.025
55.012
62.01
734.031
81.01
911.018
\n", + "
" + ], + "text/plain": [ + " RBI G\n", + "0 0.0 1\n", + "1 13.0 25\n", + "2 19.0 29\n", + "3 27.0 27\n", + "4 16.0 25\n", + "5 5.0 12\n", + "6 2.0 1\n", + "7 34.0 31\n", + "8 1.0 1\n", + "9 11.0 18" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[[\"RBI\", \"G\"]][:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Boolean selecting\n", + "We have yet to make more complex selections beyond index values. Now we're ready to introduce selecting by boolean value. With this kinds of selection, we're going to as Pandas to give us the Series or DataFrame that represents the _boolean_ values of what we want, then we will allow `iloc` to reduce the resulting Series or DataFrame to what we're looking for. Let's see this in action.\n", + "\n", + "Say we want to find all items in our DataFrame where `yearID` is `2015` or\n", + "\n", + "```\n", + "df.yearID == 2015\n", + "```\n", + "\n", + "Let's first see what this does." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "3 False\n", + "4 False\n", + "5 False\n", + "6 False\n", + "7 False\n", + "8 False\n", + "9 False\n", + "10 False\n", + "11 False\n", + "12 False\n", + "13 False\n", + "14 False\n", + "15 False\n", + "16 False\n", + "17 False\n", + "18 False\n", + "19 False\n", + "20 False\n", + "21 False\n", + "22 False\n", + "23 False\n", + "24 False\n", + "25 False\n", + "26 False\n", + "27 False\n", + "28 False\n", + "29 False\n", + " ... 
\n", + "102786 False\n", + "102787 False\n", + "102788 False\n", + "102789 False\n", + "102790 False\n", + "102791 False\n", + "102792 False\n", + "102793 False\n", + "102794 False\n", + "102795 False\n", + "102796 False\n", + "102797 False\n", + "102798 False\n", + "102799 False\n", + "102800 False\n", + "102801 False\n", + "102802 False\n", + "102803 False\n", + "102804 False\n", + "102805 False\n", + "102806 False\n", + "102807 False\n", + "102808 False\n", + "102809 False\n", + "102810 False\n", + "102811 False\n", + "102812 False\n", + "102813 False\n", + "102814 False\n", + "102815 False\n", + "Name: yearID, Length: 102816, dtype: bool" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.yearID == 2015" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We get back a Series that holds `True` or `False` for each row, according to our _boolean_ query. We now need to pass this _boolean_ Series into `loc` to see the outcome." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
99847aardsda0120151ATLNL331000...0.00.00.001.00.00.00.00.00.0
99848abadfe0120151OAKAL620000...0.00.00.000.00.00.00.00.00.0
99849abreujo0220151CHAAL1546138817834...101.00.00.039140.011.015.00.01.016.0
99850achteaj0120151MINAL110000...0.00.00.000.00.00.00.00.00.0
99851ackledu0120151SEAAL8518622408...19.02.02.01438.00.01.03.03.03.0
99852ackledu0120152NYAAL23526153...11.00.00.047.00.00.00.01.00.0
99853adamecr0120151COLNL26534131...3.00.01.0311.01.01.01.00.00.0
99854adamsau0120151CLEAL281000...0.00.00.000.00.00.00.00.01.0
99855adamsma0120151SLNNL6017514429...24.01.00.01041.01.00.00.01.01.0
99856adcocna0120151CINNL130000...0.00.00.000.00.00.00.00.00.0
\n", + "

10 rows × 22 columns

\n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... \\\n", + "99847 aardsda01 2015 1 ATL NL 33 1 0 0 0 ... \n", + "99848 abadfe01 2015 1 OAK AL 62 0 0 0 0 ... \n", + "99849 abreujo02 2015 1 CHA AL 154 613 88 178 34 ... \n", + "99850 achteaj01 2015 1 MIN AL 11 0 0 0 0 ... \n", + "99851 ackledu01 2015 1 SEA AL 85 186 22 40 8 ... \n", + "99852 ackledu01 2015 2 NYA AL 23 52 6 15 3 ... \n", + "99853 adamecr01 2015 1 COL NL 26 53 4 13 1 ... \n", + "99854 adamsau01 2015 1 CLE AL 28 1 0 0 0 ... \n", + "99855 adamsma01 2015 1 SLN NL 60 175 14 42 9 ... \n", + "99856 adcocna01 2015 1 CIN NL 13 0 0 0 0 ... \n", + "\n", + " RBI SB CS BB SO IBB HBP SH SF GIDP \n", + "99847 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0 \n", + "99848 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "99849 101.0 0.0 0.0 39 140.0 11.0 15.0 0.0 1.0 16.0 \n", + "99850 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "99851 19.0 2.0 2.0 14 38.0 0.0 1.0 3.0 3.0 3.0 \n", + "99852 11.0 0.0 0.0 4 7.0 0.0 0.0 0.0 1.0 0.0 \n", + "99853 3.0 0.0 1.0 3 11.0 1.0 1.0 1.0 0.0 0.0 \n", + "99854 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 1.0 \n", + "99855 24.0 1.0 0.0 10 41.0 1.0 0.0 0.0 1.0 1.0 \n", + "99856 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "\n", + "[10 rows x 22 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.loc[df.yearID == 2015][:10] # note we're restricting the return to just the first 10 values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now what if we wanted the restrict this further by team. Say we wanted to see only the [Minesota Twins](https://www.mlb.com/twins) player data for 2015. That is\n", + "\n", + "```\n", + "df.yearID == 2015\n", + "AND\n", + "df.teamID == \"MIN\"\n", + "```\n", + "\n", + "We simply put these in parethesis and use the `&` operator." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
99850achteaj0120151MINAL110000...0.00.00.000.00.00.00.00.00.0
99891arciaos0120151MINAL19586160...8.00.00.0415.04.02.00.01.02.0
99954bernido0120151MINAL45111...2.00.00.013.00.00.00.00.00.0
99988boyerbl0120151MINAL680000...0.00.00.000.00.00.00.00.00.0
100030buxtoby0120151MINAL4612916277...6.02.02.0644.00.01.02.00.01.0
100139cottsne0120152MINAL170000...0.00.00.000.00.00.00.00.00.0
100215doziebr0120151MINAL15762810114839...77.012.04.061148.02.07.00.08.010.0
100221duensbr0120151MINAL551000...0.00.00.000.00.00.00.00.00.0
100222duffety0120151MINAL100000...0.00.00.000.00.00.00.00.00.0
100249escobed0120151MINAL1274094810731...58.02.03.02886.01.02.02.05.07.0
\n", + "

10 rows × 22 columns

\n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... \\\n", + "99850 achteaj01 2015 1 MIN AL 11 0 0 0 0 ... \n", + "99891 arciaos01 2015 1 MIN AL 19 58 6 16 0 ... \n", + "99954 bernido01 2015 1 MIN AL 4 5 1 1 1 ... \n", + "99988 boyerbl01 2015 1 MIN AL 68 0 0 0 0 ... \n", + "100030 buxtoby01 2015 1 MIN AL 46 129 16 27 7 ... \n", + "100139 cottsne01 2015 2 MIN AL 17 0 0 0 0 ... \n", + "100215 doziebr01 2015 1 MIN AL 157 628 101 148 39 ... \n", + "100221 duensbr01 2015 1 MIN AL 55 1 0 0 0 ... \n", + "100222 duffety01 2015 1 MIN AL 10 0 0 0 0 ... \n", + "100249 escobed01 2015 1 MIN AL 127 409 48 107 31 ... \n", + "\n", + " RBI SB CS BB SO IBB HBP SH SF GIDP \n", + "99850 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "99891 8.0 0.0 0.0 4 15.0 4.0 2.0 0.0 1.0 2.0 \n", + "99954 2.0 0.0 0.0 1 3.0 0.0 0.0 0.0 0.0 0.0 \n", + "99988 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "100030 6.0 2.0 2.0 6 44.0 0.0 1.0 2.0 0.0 1.0 \n", + "100139 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "100215 77.0 12.0 4.0 61 148.0 2.0 7.0 0.0 8.0 10.0 \n", + "100221 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "100222 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 \n", + "100249 58.0 2.0 3.0 28 86.0 1.0 2.0 2.0 5.0 7.0 \n", + "\n", + "[10 rows x 22 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.loc[(df.yearID == 2015) & (df.teamID == \"MIN\")].head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now what if we wanted to restrict a subset of columns. This is easy with `iloc[]` ... we will just use our boolean expression as above for the _row selection_ and then the list of columns for our _column selection_ (in this case a much smaller subset of data)." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
99850achteaj01110000.0
99891arciaos0119581628.0
99954bernido0145102.0
99988boyerbl01680000.0
100030buxtoby01461292726.0
100139cottsne01170000.0
100215doziebr011576281482877.0
100221duensbr01551000.0
100222duffety01100000.0
100249escobed011274091071258.0
100270fienca01620000.0
100302fryerer011522502.0
100333gibsoky01325100.0
100373grahajr01390000.0
100455herrmch014510315210.0
100459hicksaa0197352901133.0
100486hugheph01273000.0
100488hunteto011395211252281.0
100521jepseke01290000.0
100564keplema0137100.0
100696mauerjo011585921571066.0
100701maytr01483000.0
100729meyeral0120000.0
100737milonto01242000.0
100807nolasri0193000.0
100816nunezed027218853420.0
100837orourry01280000.0
100872pelfrmi01303200.0
100895perkigl01600000.0
100915plouftr011525731402286.0
100917polanjo01410301.0
100925pressry01270000.0
100994robinsh018318045016.0
101023rosared011224531211350.0
101067sanomi0180279751852.0
101069santada019126156021.0
101072santaer01170000.0
101079schafjo0227691505.0
101144staufti01130000.0
101164suzukku01131433104550.0
101189thielca0160000.0
101193thompaa01410000.0
101203tonkimi01260000.0
101240vargake015817542517.0
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "99850 achteaj01 11 0 0 0 0.0\n", + "99891 arciaos01 19 58 16 2 8.0\n", + "99954 bernido01 4 5 1 0 2.0\n", + "99988 boyerbl01 68 0 0 0 0.0\n", + "100030 buxtoby01 46 129 27 2 6.0\n", + "100139 cottsne01 17 0 0 0 0.0\n", + "100215 doziebr01 157 628 148 28 77.0\n", + "100221 duensbr01 55 1 0 0 0.0\n", + "100222 duffety01 10 0 0 0 0.0\n", + "100249 escobed01 127 409 107 12 58.0\n", + "100270 fienca01 62 0 0 0 0.0\n", + "100302 fryerer01 15 22 5 0 2.0\n", + "100333 gibsoky01 32 5 1 0 0.0\n", + "100373 grahajr01 39 0 0 0 0.0\n", + "100455 herrmch01 45 103 15 2 10.0\n", + "100459 hicksaa01 97 352 90 11 33.0\n", + "100486 hugheph01 27 3 0 0 0.0\n", + "100488 hunteto01 139 521 125 22 81.0\n", + "100521 jepseke01 29 0 0 0 0.0\n", + "100564 keplema01 3 7 1 0 0.0\n", + "100696 mauerjo01 158 592 157 10 66.0\n", + "100701 maytr01 48 3 0 0 0.0\n", + "100729 meyeral01 2 0 0 0 0.0\n", + "100737 milonto01 24 2 0 0 0.0\n", + "100807 nolasri01 9 3 0 0 0.0\n", + "100816 nunezed02 72 188 53 4 20.0\n", + "100837 orourry01 28 0 0 0 0.0\n", + "100872 pelfrmi01 30 3 2 0 0.0\n", + "100895 perkigl01 60 0 0 0 0.0\n", + "100915 plouftr01 152 573 140 22 86.0\n", + "100917 polanjo01 4 10 3 0 1.0\n", + "100925 pressry01 27 0 0 0 0.0\n", + "100994 robinsh01 83 180 45 0 16.0\n", + "101023 rosared01 122 453 121 13 50.0\n", + "101067 sanomi01 80 279 75 18 52.0\n", + "101069 santada01 91 261 56 0 21.0\n", + "101072 santaer01 17 0 0 0 0.0\n", + "101079 schafjo02 27 69 15 0 5.0\n", + "101144 staufti01 13 0 0 0 0.0\n", + "101164 suzukku01 131 433 104 5 50.0\n", + "101189 thielca01 6 0 0 0 0.0\n", + "101193 thompaa01 41 0 0 0 0.0\n", + "101203 tonkimi01 26 0 0 0 0.0\n", + "101240 vargake01 58 175 42 5 17.0" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.loc[(df.yearID == 2015) & (df.teamID == \"MIN\"),\\\n", + " ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "## Sorting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sorting is facilitated by the [`sort_values()` method](). By default, sorting is done in _ascending order_, specify the parameter `ascending=False` to get descending order." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
100696mauerjo011585921571066.0
100215doziebr011576281482877.0
100915plouftr011525731402286.0
100488hunteto011395211252281.0
101164suzukku01131433104550.0
100249escobed011274091071258.0
101023rosared011224531211350.0
100459hicksaa0197352901133.0
101069santada019126156021.0
100994robinsh018318045016.0
101067sanomi0180279751852.0
100816nunezed027218853420.0
99988boyerbl01680000.0
100270fienca01620000.0
100895perkigl01600000.0
101240vargake015817542517.0
100221duensbr01551000.0
100701maytr01483000.0
100030buxtoby01461292726.0
100455herrmch014510315210.0
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "100696 mauerjo01 158 592 157 10 66.0\n", + "100215 doziebr01 157 628 148 28 77.0\n", + "100915 plouftr01 152 573 140 22 86.0\n", + "100488 hunteto01 139 521 125 22 81.0\n", + "101164 suzukku01 131 433 104 5 50.0\n", + "100249 escobed01 127 409 107 12 58.0\n", + "101023 rosared01 122 453 121 13 50.0\n", + "100459 hicksaa01 97 352 90 11 33.0\n", + "101069 santada01 91 261 56 0 21.0\n", + "100994 robinsh01 83 180 45 0 16.0\n", + "101067 sanomi01 80 279 75 18 52.0\n", + "100816 nunezed02 72 188 53 4 20.0\n", + "99988 boyerbl01 68 0 0 0 0.0\n", + "100270 fienca01 62 0 0 0 0.0\n", + "100895 perkigl01 60 0 0 0 0.0\n", + "101240 vargake01 58 175 42 5 17.0\n", + "100221 duensbr01 55 1 0 0 0.0\n", + "100701 maytr01 48 3 0 0 0.0\n", + "100030 buxtoby01 46 129 27 2 6.0\n", + "100455 herrmch01 45 103 15 2 10.0" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015 = df.loc[(df.yearID == 2015) & (df.teamID == \"MIN\"),\\\n", + " ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\\\n", + " .sort_values('G', ascending=False)\n", + "df_min_2015.head(20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We may also do a _multi-sort_ by passing in the list of _columns_ we want sorted. This will sort in the order of the columns provided. For example," + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
101189thielca0160000.0
99954bernido0145102.0
100917polanjo01410301.0
100564keplema0137100.0
100729meyeral0120000.0
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "101189 thielca01 6 0 0 0 0.0\n", + "99954 bernido01 4 5 1 0 2.0\n", + "100917 polanjo01 4 10 3 0 1.0\n", + "100564 keplema01 3 7 1 0 0.0\n", + "100729 meyeral01 2 0 0 0 0.0" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.loc[(df.yearID == 2015) & (df.teamID == \"MIN\"),\\\n", + " ['playerID', 'G', 'AB', 'H', 'HR', 'RBI']]\\\n", + " .sort_values(['G', 'HR'], ascending=False).tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## DataFrame manipulation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Adding and dropping columns" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBIHtoAB
100696mauerjo011585921571066.00
100215doziebr011576281482877.00
100915plouftr011525731402286.00
100488hunteto011395211252281.00
101164suzukku01131433104550.00
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI HtoAB\n", + "100696 mauerjo01 158 592 157 10 66.0 0\n", + "100215 doziebr01 157 628 148 28 77.0 0\n", + "100915 plouftr01 152 573 140 22 86.0 0\n", + "100488 hunteto01 139 521 125 22 81.0 0\n", + "101164 suzukku01 131 433 104 5 50.0 0" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015.loc[:,'HtoAB'] = 0\n", + "df_min_2015.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
100696mauerjo011585921571066.0
100215doziebr011576281482877.0
100915plouftr011525731402286.0
100488hunteto011395211252281.0
101164suzukku01131433104550.0
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "100696 mauerjo01 158 592 157 10 66.0\n", + "100215 doziebr01 157 628 148 28 77.0\n", + "100915 plouftr01 152 573 140 22 86.0\n", + "100488 hunteto01 139 521 125 22 81.0\n", + "101164 suzukku01 131 433 104 5 50.0" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015 = df_min_2015.drop('HtoAB', axis=1)\n", + "df_min_2015.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "100696 157\n", + "100215 148\n", + "100915 140\n", + "100488 125\n", + "101164 104\n", + "100249 107\n", + "101023 121\n", + "100459 90\n", + "101069 56\n", + "100994 45\n", + "Name: H, dtype: int64" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015.H.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df_min_2015.loc[:,'HtoAB'] = 0\n", + "df_min_2015.loc[:,'HtoAB'] = [v.H/v.AB \n", + " if v.AB > 0 else 0 \n", + " for r, v in df_min_2015.iterrows()]" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBIHtoAB
100696mauerjo011585921571066.00.265203
100215doziebr011576281482877.00.235669
100915plouftr011525731402286.00.244328
100488hunteto011395211252281.00.239923
101164suzukku01131433104550.00.240185
100249escobed011274091071258.00.261614
101023rosared011224531211350.00.267108
100459hicksaa0197352901133.00.255682
101069santada019126156021.00.214559
100994robinsh018318045016.00.250000
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI HtoAB\n", + "100696 mauerjo01 158 592 157 10 66.0 0.265203\n", + "100215 doziebr01 157 628 148 28 77.0 0.235669\n", + "100915 plouftr01 152 573 140 22 86.0 0.244328\n", + "100488 hunteto01 139 521 125 22 81.0 0.239923\n", + "101164 suzukku01 131 433 104 5 50.0 0.240185\n", + "100249 escobed01 127 409 107 12 58.0 0.261614\n", + "101023 rosared01 122 453 121 13 50.0 0.267108\n", + "100459 hicksaa01 97 352 90 11 33.0 0.255682\n", + "101069 santada01 91 261 56 0 21.0 0.214559\n", + "100994 robinsh01 83 180 45 0 16.0 0.250000" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBIHtoAB
101023rosared011224531211350.00.267108
100696mauerjo011585921571066.00.265203
100249escobed011274091071258.00.261614
100459hicksaa0197352901133.00.255682
100994robinsh018318045016.00.250000
100915plouftr011525731402286.00.244328
101164suzukku01131433104550.00.240185
100488hunteto011395211252281.00.239923
100215doziebr011576281482877.00.235669
101069santada019126156021.00.214559
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI HtoAB\n", + "101023 rosared01 122 453 121 13 50.0 0.267108\n", + "100696 mauerjo01 158 592 157 10 66.0 0.265203\n", + "100249 escobed01 127 409 107 12 58.0 0.261614\n", + "100459 hicksaa01 97 352 90 11 33.0 0.255682\n", + "100994 robinsh01 83 180 45 0 16.0 0.250000\n", + "100915 plouftr01 152 573 140 22 86.0 0.244328\n", + "101164 suzukku01 131 433 104 5 50.0 0.240185\n", + "100488 hunteto01 139 521 125 22 81.0 0.239923\n", + "100215 doziebr01 157 628 148 28 77.0 0.235669\n", + "101069 santada01 91 261 56 0 21.0 0.214559" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015[df_min_2015.G>80].sort_values('HtoAB', ascending=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDHtoABABHHRRBIG
100696mauerjo010.2652035921571066.0158
100215doziebr010.2356696281482877.0157
100915plouftr010.2443285731402286.0152
100488hunteto010.2399235211252281.0139
101164suzukku010.240185433104550.0131
\n", + "
" + ], + "text/plain": [ + " playerID HtoAB AB H HR RBI G\n", + "100696 mauerjo01 0.265203 592 157 10 66.0 158\n", + "100215 doziebr01 0.235669 628 148 28 77.0 157\n", + "100915 plouftr01 0.244328 573 140 22 86.0 152\n", + "100488 hunteto01 0.239923 521 125 22 81.0 139\n", + "101164 suzukku01 0.240185 433 104 5 50.0 131" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015 = df_min_2015.reindex(columns=['playerID', 'HtoAB', 'AB', 'H', 'HR', 'RBI', 'G'])\n", + "df_min_2015.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we can return our DataFrame back to its original columns (and order) by reindexing again. Notice, also that we can effectively perform a `drop()` by doing this, though the syntax with `reindex()` is more verbose." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
100696mauerjo011585921571066.0
100215doziebr011576281482877.0
100915plouftr011525731402286.0
100488hunteto011395211252281.0
101164suzukku01131433104550.0
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "100696 mauerjo01 158 592 157 10 66.0\n", + "100215 doziebr01 157 628 148 28 77.0\n", + "100915 plouftr01 152 573 140 22 86.0\n", + "100488 hunteto01 139 521 125 22 81.0\n", + "101164 suzukku01 131 433 104 5 50.0" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015 = df_min_2015.reindex(columns=['playerID', 'G', 'AB', 'H', 'HR', 'RBI'])\n", + "df_min_2015.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Adding and dropping rows" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Adding rows can be achieved using `loc[]` and setting the new index to a dictionary of values using the column labels as keys." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
100917polanjo01410301
99954bernido0145102
100564keplema0137100
100729meyeral0120000
200000keith0100000
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "100917 polanjo01 4 10 3 0 1\n", + "99954 bernido01 4 5 1 0 2\n", + "100564 keplema01 3 7 1 0 0\n", + "100729 meyeral01 2 0 0 0 0\n", + "200000 keith01 0 0 0 0 0" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015.loc[200000] = \\\n", + " { 'playerID': 'keith01',\n", + " 'RBI': '0',\n", + " 'G': '0',\n", + " 'H': '0',\n", + " 'HR': '0',\n", + " 'AB': '0' }\n", + " \n", + "df_min_2015.tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is also the same with lists and tuples." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
99954bernido0145102
100564keplema0137100
100729meyeral0120000
200000keith0111111
200001keith0211111
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "99954 bernido01 4 5 1 0 2\n", + "100564 keplema01 3 7 1 0 0\n", + "100729 meyeral01 2 0 0 0 0\n", + "200000 keith01 1 1 1 1 1\n", + "200001 keith02 1 1 1 1 1" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015.loc[200000] = ('keith01', 1, 1, 1, 1, 1)\n", + "df_min_2015.loc[200001] = ['keith02', 1, 1, 1, 1, 1]\n", + "\n", + "df_min_2015.tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that we can drop a number of rows at a time by passing a list of the indices we'd like dropped." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
101189thielca0160000
100917polanjo01410301
99954bernido0145102
100564keplema0137100
100729meyeral0120000
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "101189 thielca01 6 0 0 0 0\n", + "100917 polanjo01 4 10 3 0 1\n", + "99954 bernido01 4 5 1 0 2\n", + "100564 keplema01 3 7 1 0 0\n", + "100729 meyeral01 2 0 0 0 0" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015 = df_min_2015.drop([200000, 200001], axis=0)\n", + "df_min_2015.tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similar results can be achieved using [`append()`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.append.html#pandas.DataFrame.append). With append, you can append, Series, DataFrames and/or a list of these." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
100917polanjo01410301
99954bernido0145102
100564keplema0137100
100729meyeral0120000
200000keith0100000
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "100917 polanjo01 4 10 3 0 1\n", + "99954 bernido01 4 5 1 0 2\n", + "100564 keplema01 3 7 1 0 0\n", + "100729 meyeral01 2 0 0 0 0\n", + "200000 keith01 0 0 0 0 0" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015.append(\n", + " pd.Series( \n", + " {'playerID': 'keith01', \n", + " 'G': 0, \n", + " 'AB': 0, \n", + " 'H':0, \n", + " 'HR': 0, \n", + " 'RBI': 0}, name='200000')).tail()" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
100696mauerjo011585921571066
100215doziebr011576281482877
100915plouftr011525731402286
100488hunteto011395211252281
101164suzukku01131433104550
101189thielca0160000
100917polanjo01410301
99954bernido0145102
100564keplema0137100
100729meyeral0120000
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "100696 mauerjo01 158 592 157 10 66\n", + "100215 doziebr01 157 628 148 28 77\n", + "100915 plouftr01 152 573 140 22 86\n", + "100488 hunteto01 139 521 125 22 81\n", + "101164 suzukku01 131 433 104 5 50\n", + "101189 thielca01 6 0 0 0 0\n", + "100917 polanjo01 4 10 3 0 1\n", + "99954 bernido01 4 5 1 0 2\n", + "100564 keplema01 3 7 1 0 0\n", + "100729 meyeral01 2 0 0 0 0" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015[:5].append(df_min_2015[-5:])" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
100696mauerjo011585921571066
100215doziebr011576281482877
100915plouftr011525731402286
100488hunteto011395211252281
101164suzukku01131433104550
101067sanomi0180279751852
100816nunezed027218853420
101189thielca0160000
100917polanjo01410301
99954bernido0145102
100564keplema0137100
100729meyeral0120000
 \n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "100696 mauerjo01 158 592 157 10 66\n", + "100215 doziebr01 157 628 148 28 77\n", + "100915 plouftr01 152 573 140 22 86\n", + "100488 hunteto01 139 521 125 22 81\n", + "101164 suzukku01 131 433 104 5 50\n", + "101067 sanomi01 80 279 75 18 52\n", + "100816 nunezed02 72 188 53 4 20\n", + "101189 thielca01 6 0 0 0 0\n", + "100917 polanjo01 4 10 3 0 1\n", + "99954 bernido01 4 5 1 0 2\n", + "100564 keplema01 3 7 1 0 0\n", + "100729 meyeral01 2 0 0 0 0" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_min_2015[:5].append([df_min_2015[10:12], df_min_2015[-5:]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The same result can be achieved with [`pd.concat()`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.concat.html#pandas.concat), where the default `axis` is `0`." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBI
100696mauerjo011585921571066
100215doziebr011576281482877
100915plouftr011525731402286
100488hunteto011395211252281
101164suzukku01131433104550
101189thielca0160000
100917polanjo01410301
99954bernido0145102
100564keplema0137100
100729meyeral0120000
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI\n", + "100696 mauerjo01 158 592 157 10 66\n", + "100215 doziebr01 157 628 148 28 77\n", + "100915 plouftr01 152 573 140 22 86\n", + "100488 hunteto01 139 521 125 22 81\n", + "101164 suzukku01 131 433 104 5 50\n", + "101189 thielca01 6 0 0 0 0\n", + "100917 polanjo01 4 10 3 0 1\n", + "99954 bernido01 4 5 1 0 2\n", + "100564 keplema01 3 7 1 0 0\n", + "100729 meyeral01 2 0 0 0 0" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([df_min_2015[:5], \n", + " df_min_2015[-5:]], axis=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But we can use `concat()` to make a _column-wise_ concatenation using `axis=1` (columns). " + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDGABHHRRBIplayerIDGABHHRRBI
99954NaNNaNNaNNaNNaNNaNbernido0145102
100215doziebr011576281482877NaNNaNNaNNaNNaNNaN
100488hunteto011395211252281NaNNaNNaNNaNNaNNaN
100564NaNNaNNaNNaNNaNNaNkeplema0137100
100696mauerjo011585921571066NaNNaNNaNNaNNaNNaN
100729NaNNaNNaNNaNNaNNaNmeyeral0120000
100915plouftr011525731402286NaNNaNNaNNaNNaNNaN
100917NaNNaNNaNNaNNaNNaNpolanjo01410301
101164suzukku01131433104550NaNNaNNaNNaNNaNNaN
101189NaNNaNNaNNaNNaNNaNthielca0160000
\n", + "
" + ], + "text/plain": [ + " playerID G AB H HR RBI playerID G AB H HR RBI\n", + "99954 NaN NaN NaN NaN NaN NaN bernido01 4 5 1 0 2\n", + "100215 doziebr01 157 628 148 28 77 NaN NaN NaN NaN NaN NaN\n", + "100488 hunteto01 139 521 125 22 81 NaN NaN NaN NaN NaN NaN\n", + "100564 NaN NaN NaN NaN NaN NaN keplema01 3 7 1 0 0\n", + "100696 mauerjo01 158 592 157 10 66 NaN NaN NaN NaN NaN NaN\n", + "100729 NaN NaN NaN NaN NaN NaN meyeral01 2 0 0 0 0\n", + "100915 plouftr01 152 573 140 22 86 NaN NaN NaN NaN NaN NaN\n", + "100917 NaN NaN NaN NaN NaN NaN polanjo01 4 10 3 0 1\n", + "101164 suzukku01 131 433 104 5 50 NaN NaN NaN NaN NaN NaN\n", + "101189 NaN NaN NaN NaN NaN NaN thielca01 6 0 0 0 0" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([df_min_2015[:5], \n", + " df_min_2015[-5:]], axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the indices are being considered in the concatenation and row indices are being joined. This behavior can be controlled via the `join` parameter, which we'll leave [for the reader to explore](http://pandas.pydata.org/pandas-docs/version/0.17.0/merging.html#concatenating-objects).\n", + "\n", + "One last thing we might want to do in an operation like this is to reset the index. To do so, we might start with ignoring the column index using the `ignore_index=True` so we can set it later to something more appropriate after the concatenation." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01234567891011
99954NaNNaNNaNNaNNaNNaNbernido0145102
100215doziebr011576281482877NaNNaNNaNNaNNaNNaN
100488hunteto011395211252281NaNNaNNaNNaNNaNNaN
100564NaNNaNNaNNaNNaNNaNkeplema0137100
100696mauerjo011585921571066NaNNaNNaNNaNNaNNaN
100729NaNNaNNaNNaNNaNNaNmeyeral0120000
100915plouftr011525731402286NaNNaNNaNNaNNaNNaN
100917NaNNaNNaNNaNNaNNaNpolanjo01410301
101164suzukku01131433104550NaNNaNNaNNaNNaNNaN
101189NaNNaNNaNNaNNaNNaNthielca0160000
 \n", + "
" + ], + "text/plain": [ + " 0 1 2 3 4 5 6 7 8 9 10 11\n", + "99954 NaN NaN NaN NaN NaN NaN bernido01 4 5 1 0 2\n", + "100215 doziebr01 157 628 148 28 77 NaN NaN NaN NaN NaN NaN\n", + "100488 hunteto01 139 521 125 22 81 NaN NaN NaN NaN NaN NaN\n", + "100564 NaN NaN NaN NaN NaN NaN keplema01 3 7 1 0 0\n", + "100696 mauerjo01 158 592 157 10 66 NaN NaN NaN NaN NaN NaN\n", + "100729 NaN NaN NaN NaN NaN NaN meyeral01 2 0 0 0 0\n", + "100915 plouftr01 152 573 140 22 86 NaN NaN NaN NaN NaN NaN\n", + "100917 NaN NaN NaN NaN NaN NaN polanjo01 4 10 3 0 1\n", + "101164 suzukku01 131 433 104 5 50 NaN NaN NaN NaN NaN NaN\n", + "101189 NaN NaN NaN NaN NaN NaN thielca01 6 0 0 0 0" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([df_min_2015[:5], \n", + " df_min_2015[-5:]], axis=1, ignore_index=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Advanced indexing\n", + "Pandas provides the ability to build more complex indices, allowing for highly flexible and natural data access.\n", + "\n", + "We will cover the basics through the [`MultiIndex`](http://pandas.pydata.org/pandas-docs/version/0.17.0/advanced.html#hierarchical-indexing-multiindex) object and leave the remaining exploration to the reader." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's get the players on the Washington Nationals who played 100 or more games in 2015 or 2016." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df_was = df[(df.yearID > 2014) & (df.teamID=='WAS') & (df.G > 99)]" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
100193desmoia0120151WASNL1565836913627...62.013.05.045187.00.03.06.04.09.0
100250escobyu0120151WASNL1395357516825...56.02.02.04570.00.08.01.02.024.0
100251espinda0120151WASNL118367598821...37.05.02.033106.05.06.03.03.06.0
100422harpebr0320151WASNL15352111817238...99.06.04.0124131.015.05.00.04.015.0
100950ramoswi0120151WASNL1284754110916...68.00.00.021101.02.00.00.08.016.0
\n", + "

5 rows × 22 columns

 \n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... \\\n", + "100193 desmoia01 2015 1 WAS NL 156 583 69 136 27 ... \n", + "100250 escobyu01 2015 1 WAS NL 139 535 75 168 25 ... \n", + "100251 espinda01 2015 1 WAS NL 118 367 59 88 21 ... \n", + "100422 harpebr03 2015 1 WAS NL 153 521 118 172 38 ... \n", + "100950 ramoswi01 2015 1 WAS NL 128 475 41 109 16 ... \n", + "\n", + " RBI SB CS BB SO IBB HBP SH SF GIDP \n", + "100193 62.0 13.0 5.0 45 187.0 0.0 3.0 6.0 4.0 9.0 \n", + "100250 56.0 2.0 2.0 45 70.0 0.0 8.0 1.0 2.0 24.0 \n", + "100251 37.0 5.0 2.0 33 106.0 5.0 6.0 3.0 3.0 6.0 \n", + "100422 99.0 6.0 4.0 124 131.0 15.0 5.0 0.0 4.0 15.0 \n", + "100950 68.0 0.0 0.0 21 101.0 2.0 0.0 0.0 8.0 16.0 \n", + "\n", + "[5 rows x 22 columns]" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_was.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One obvious problem: to access the data here by player and year, we would have to build a much more involved query, and even more so if we needed to ignore data.\n", + "\n", + "We are going to create a _hierarchical index_ or _MultiIndex_ to solve this problem. We'll take the liberty of dropping columns we don't need (`teamID`, `lgID`, `stint`) and reorganize the index hierarchically.\n", + "\n", + "We will build the `MultiIndex` from a _tuple_ of the data we need, indexing first by _player_, then by _year_. To do this we'll just grab all the player IDs and `zip` them with the year. 
This will look something like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(('desmoia01', 2015),\n", + " ('escobyu01', 2015),\n", + " ('espinda01', 2015),\n", + " ('espinda01', 2016),\n", + " ('harpebr03', 2015),\n", + " ('harpebr03', 2016),\n", + " ('murphda08', 2016),\n", + " ('ramoswi01', 2015),\n", + " ('ramoswi01', 2016),\n", + " ('rendoan01', 2016),\n", + " ('reverbe01', 2016),\n", + " ('robincl01', 2015),\n", + " ('robincl01', 2016),\n", + " ('taylomi02', 2015),\n", + " ('werthja01', 2016),\n", + " ('zimmery01', 2016))" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tuple(\n", + "zip(\n", + " df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'],\n", + " df_was[['playerID','yearID']].sort_values(by='playerID')['yearID']\n", + ")\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "MultiIndex(levels=[['desmoia01', 'escobyu01', 'espinda01', 'harpebr03', 'murphda08', 'ramoswi01', 'rendoan01', 'reverbe01', 'robincl01', 'taylomi02', 'werthja01', 'zimmery01'], [2015, 2016]],\n", + " labels=[[0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10, 11], [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]])" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# create an index to be used over the data we're interested in\n", + "idx = \\\n", + " pd.MultiIndex.from_tuples(\n", + " tuple(\n", + " zip(\n", + " df_was[['playerID','yearID']].sort_values(by='playerID')['playerID'],\n", + " df_was[['playerID','yearID']].sort_values(by='playerID')['yearID']))\n", + " )\n", + "idx" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice now that we have two _levels_ in our _row axis_ (axis 0) and we will now use that index to build the hierarchically 
indexed DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
GABRH2B3BHRRBISBCSBBSOIBBHBPSHSFGIDP
desmoia012015156583691362721962.013.05.045187.00.03.06.04.09.0
escobyu01201513953575168251956.02.02.04570.00.08.01.02.024.0
espinda01201511836759882111337.05.02.033106.05.06.03.03.06.0
2016157516661081502472.09.02.054174.012.020.07.04.04.0
harpebr0320151535211181723814299.06.04.0124131.015.05.00.04.015.0
2016147506841232422486.021.010.0108117.020.03.00.010.011.0
murphda0820161425318818447525104.05.03.03557.010.08.00.08.04.0
ramoswi012015128475411091601568.00.00.021101.02.00.00.08.016.0
2016131482581482502280.00.00.03579.02.02.00.04.017.0
rendoan012016156567911533822085.012.06.065117.02.07.00.08.05.0
reverbe012016103350447697224.014.05.01834.00.03.02.02.012.0
robincl01201512630944841511034.00.00.03752.04.05.00.01.06.0
2016104196164640526.00.00.02038.00.02.01.05.04.0
taylomi022015138472491081521463.016.03.035158.09.01.01.02.05.0
werthja012016143525841282802169.05.01.071139.00.04.00.06.017.0
zimmery01201611542760931811546.04.01.029104.01.05.00.06.012.0
\n", + "
" + ], + "text/plain": [ + " G AB R H 2B 3B HR RBI SB CS BB SO \\\n", + "desmoia01 2015 156 583 69 136 27 2 19 62.0 13.0 5.0 45 187.0 \n", + "escobyu01 2015 139 535 75 168 25 1 9 56.0 2.0 2.0 45 70.0 \n", + "espinda01 2015 118 367 59 88 21 1 13 37.0 5.0 2.0 33 106.0 \n", + " 2016 157 516 66 108 15 0 24 72.0 9.0 2.0 54 174.0 \n", + "harpebr03 2015 153 521 118 172 38 1 42 99.0 6.0 4.0 124 131.0 \n", + " 2016 147 506 84 123 24 2 24 86.0 21.0 10.0 108 117.0 \n", + "murphda08 2016 142 531 88 184 47 5 25 104.0 5.0 3.0 35 57.0 \n", + "ramoswi01 2015 128 475 41 109 16 0 15 68.0 0.0 0.0 21 101.0 \n", + " 2016 131 482 58 148 25 0 22 80.0 0.0 0.0 35 79.0 \n", + "rendoan01 2016 156 567 91 153 38 2 20 85.0 12.0 6.0 65 117.0 \n", + "reverbe01 2016 103 350 44 76 9 7 2 24.0 14.0 5.0 18 34.0 \n", + "robincl01 2015 126 309 44 84 15 1 10 34.0 0.0 0.0 37 52.0 \n", + " 2016 104 196 16 46 4 0 5 26.0 0.0 0.0 20 38.0 \n", + "taylomi02 2015 138 472 49 108 15 2 14 63.0 16.0 3.0 35 158.0 \n", + "werthja01 2016 143 525 84 128 28 0 21 69.0 5.0 1.0 71 139.0 \n", + "zimmery01 2016 115 427 60 93 18 1 15 46.0 4.0 1.0 29 104.0 \n", + "\n", + " IBB HBP SH SF GIDP \n", + "desmoia01 2015 0.0 3.0 6.0 4.0 9.0 \n", + "escobyu01 2015 0.0 8.0 1.0 2.0 24.0 \n", + "espinda01 2015 5.0 6.0 3.0 3.0 6.0 \n", + " 2016 12.0 20.0 7.0 4.0 4.0 \n", + "harpebr03 2015 15.0 5.0 0.0 4.0 15.0 \n", + " 2016 20.0 3.0 0.0 10.0 11.0 \n", + "murphda08 2016 10.0 8.0 0.0 8.0 4.0 \n", + "ramoswi01 2015 2.0 0.0 0.0 8.0 16.0 \n", + " 2016 2.0 2.0 0.0 4.0 17.0 \n", + "rendoan01 2016 2.0 7.0 0.0 8.0 5.0 \n", + "reverbe01 2016 0.0 3.0 2.0 2.0 12.0 \n", + "robincl01 2015 4.0 5.0 0.0 1.0 6.0 \n", + " 2016 0.0 2.0 1.0 5.0 4.0 \n", + "taylomi02 2015 9.0 1.0 1.0 2.0 5.0 \n", + "werthja01 2016 0.0 4.0 0.0 6.0 17.0 \n", + "zimmery01 2016 1.0 5.0 0.0 6.0 12.0 " + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# sorting the indices is critical for lining up the data in the 
tuples\n", + "df_was = df_was.sort_values(by=['playerID']).\\\n", + " set_index(idx).\\\n", + " drop(['playerID', 'yearID', 'teamID', 'lgID', 'stint'], axis=1)\n", + "df_was" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
GABHSO
20151263098452.0
20161041964638.0
 \n", + "
" + ], + "text/plain": [ + " G AB H SO\n", + "2015 126 309 84 52.0\n", + "2016 104 196 46 38.0" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_was.loc[('robincl01', ),['G', 'AB', 'H', 'SO']]" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "G 104.0\n", + "AB 196.0\n", + "H 46.0\n", + "SO 38.0\n", + "Name: (robincl01, 2016), dtype: float64" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_was.loc[('robincl01', 2016),['G', 'AB', 'H', 'SO']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the sake of the example, let's take the DataFrame for all rows of data after 2006 and create a multi-index using year, league, team and player as the groupings of the index." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
0abercda0118711TRONaN14000...0.00.00.000.0NaNNaNNaNNaNNaN
1addybo0118711RC1NaN2511830326...13.08.01.040.0NaNNaNNaNNaNNaN
2allisar0118711CL1NaN2913728404...19.03.01.025.0NaNNaNNaNNaNNaN
3allisdo0118711WS3NaN27133284410...27.01.01.002.0NaNNaNNaNNaNNaN
4ansonca0118711RC1NaN25120293911...16.06.02.021.0NaNNaNNaNNaNNaN
\n", + "

5 rows × 22 columns

 \n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R H 2B ... RBI SB \\\n", + "0 abercda01 1871 1 TRO NaN 1 4 0 0 0 ... 0.0 0.0 \n", + "1 addybo01 1871 1 RC1 NaN 25 118 30 32 6 ... 13.0 8.0 \n", + "2 allisar01 1871 1 CL1 NaN 29 137 28 40 4 ... 19.0 3.0 \n", + "3 allisdo01 1871 1 WS3 NaN 27 133 28 44 10 ... 27.0 1.0 \n", + "4 ansonca01 1871 1 RC1 NaN 25 120 29 39 11 ... 16.0 6.0 \n", + "\n", + " CS BB SO IBB HBP SH SF GIDP \n", + "0 0.0 0 0.0 NaN NaN NaN NaN NaN \n", + "1 1.0 4 0.0 NaN NaN NaN NaN NaN \n", + "2 1.0 2 5.0 NaN NaN NaN NaN NaN \n", + "3 1.0 0 2.0 NaN NaN NaN NaN NaN \n", + "4 2.0 2 1.0 NaN NaN NaN NaN NaN \n", + "\n", + "[5 rows x 22 columns]" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "((2016, 'NL', 'WAS', 'rzepcma01'),\n", + " (2016, 'NL', 'WAS', 'scherma01'),\n", + " (2016, 'NL', 'WAS', 'severpe01'),\n", + " (2016, 'NL', 'WAS', 'solissa01'),\n", + " (2016, 'NL', 'WAS', 'strasst01'),\n", + " (2016, 'NL', 'WAS', 'taylomi02'),\n", + " (2016, 'NL', 'WAS', 'treinbl01'),\n", + " (2016, 'NL', 'WAS', 'turnetr01'),\n", + " (2016, 'NL', 'WAS', 'werthja01'),\n", + " (2016, 'NL', 'WAS', 'zimmery01'))" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mi = df[df.yearID>2006].copy()\n", + "idx_labels = ['yearID', 'lgID', 'teamID', 'playerID']\n", + "\n", + "tuple(\n", + " zip(\n", + " df_mi[idx_labels]\\\n", + " .sort_values(idx_labels)['yearID'],\n", + "\n", + " df_mi[idx_labels]\\\n", + " .sort_values(idx_labels)['lgID'],\n", + "\n", + " df_mi[idx_labels]\\\n", + " .sort_values(idx_labels)['teamID'],\n", + "\n", + " df_mi[idx_labels]\\\n", + " .sort_values(idx_labels)['playerID']))[-10:]" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + 
"collapsed": true + }, + "outputs": [], + "source": [ + "idx = \\\n", + " pd.MultiIndex.from_tuples(\n", + " tuple(\n", + " zip(\n", + " df_mi[idx_labels]\\\n", + " .sort_values(idx_labels)['yearID'],\n", + " \n", + " df_mi[idx_labels]\\\n", + " .sort_values(idx_labels)['lgID'],\n", + " \n", + " df_mi[idx_labels]\\\n", + " .sort_values(idx_labels)['teamID'],\n", + " \n", + " df_mi[idx_labels]\\\n", + " .sort_values(idx_labels)['playerID']))\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# sort the rows by the same keys used to build idx so the index labels line up with the data\n", + "df_mi = df_mi.sort_values(idx_labels).set_index(idx)#.drop(['playerID', 'yearID', 'teamID', 'stint'], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
2007ALBALbaezda01bardebr0120071ARINL812010...0.00.00.003.00.00.00.00.00.0
bakopa01bonifem0120071ARINL1123251...2.00.01.043.00.00.00.00.00.0
bedarer01byrneer0120071ARINL16062610317930...83.050.07.05798.05.010.01.04.012.0
bellro01callaal0120071ARINL5614410318...7.01.01.0914.00.01.01.01.08.0
birkiku01choatra0120071ARINL20000...0.00.00.000.00.00.00.00.00.0
\n", + "

5 rows × 22 columns

\n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R \\\n", + "2007 AL BAL baezda01 bardebr01 2007 1 ARI NL 8 12 0 \n", + " bakopa01 bonifem01 2007 1 ARI NL 11 23 2 \n", + " bedarer01 byrneer01 2007 1 ARI NL 160 626 103 \n", + " bellro01 callaal01 2007 1 ARI NL 56 144 10 \n", + " birkiku01 choatra01 2007 1 ARI NL 2 0 0 \n", + "\n", + " H 2B ... RBI SB CS BB SO IBB HBP \\\n", + "2007 AL BAL baezda01 1 0 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 \n", + " bakopa01 5 1 ... 2.0 0.0 1.0 4 3.0 0.0 0.0 \n", + " bedarer01 179 30 ... 83.0 50.0 7.0 57 98.0 5.0 10.0 \n", + " bellro01 31 8 ... 7.0 1.0 1.0 9 14.0 0.0 1.0 \n", + " birkiku01 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 \n", + "\n", + " SH SF GIDP \n", + "2007 AL BAL baezda01 0.0 0.0 0.0 \n", + " bakopa01 0.0 0.0 0.0 \n", + " bedarer01 1.0 4.0 12.0 \n", + " bellro01 1.0 1.0 8.0 \n", + " birkiku01 0.0 0.0 0.0 \n", + "\n", + "[5 rows x 22 columns]" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mi.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
playerIDyearIDstintteamIDlgIDGABRH2B...RBISBCSBBSOIBBHBPSHSFGIDP
2016NLWAStaylomi02taylomi0220161WASNL76221285111...16.014.03.01477.00.01.00.01.02.0
treinbl01treinbl0120161WASNL730000...0.00.00.000.00.00.00.00.00.0
turnetr01turnetr0120161WASNL733075310514...40.033.06.01459.00.01.00.02.01.0
werthja01werthja0120161WASNL1435258412828...69.05.01.071139.00.04.00.06.017.0
zimmery01zimmery0120161WASNL115427609318...46.04.01.029104.01.05.00.06.012.0
\n", + "

5 rows × 22 columns

\n", + "
" + ], + "text/plain": [ + " playerID yearID stint teamID lgID G AB R \\\n", + "2016 NL WAS taylomi02 taylomi02 2016 1 WAS NL 76 221 28 \n", + " treinbl01 treinbl01 2016 1 WAS NL 73 0 0 \n", + " turnetr01 turnetr01 2016 1 WAS NL 73 307 53 \n", + " werthja01 werthja01 2016 1 WAS NL 143 525 84 \n", + " zimmery01 zimmery01 2016 1 WAS NL 115 427 60 \n", + "\n", + " H 2B ... RBI SB CS BB SO IBB HBP \\\n", + "2016 NL WAS taylomi02 51 11 ... 16.0 14.0 3.0 14 77.0 0.0 1.0 \n", + " treinbl01 0 0 ... 0.0 0.0 0.0 0 0.0 0.0 0.0 \n", + " turnetr01 105 14 ... 40.0 33.0 6.0 14 59.0 0.0 1.0 \n", + " werthja01 128 28 ... 69.0 5.0 1.0 71 139.0 0.0 4.0 \n", + " zimmery01 93 18 ... 46.0 4.0 1.0 29 104.0 1.0 5.0 \n", + "\n", + " SH SF GIDP \n", + "2016 NL WAS taylomi02 0.0 1.0 2.0 \n", + " treinbl01 0.0 0.0 0.0 \n", + " turnetr01 0.0 2.0 1.0 \n", + " werthja01 0.0 6.0 17.0 \n", + " zimmery01 0.0 6.0 12.0 \n", + "\n", + "[5 rows x 22 columns]" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mi.tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can use this multi-index to out advantage, using the tuple of the index values we want and restricting the columns to just the data of interest." + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
GAB
accarje01152509
adamsru01621
banksjo01265
burneaj01814
chacigu01650
\n", + "
" + ], + "text/plain": [ + " G AB\n", + "accarje01 152 509\n", + "adamsru01 62 1\n", + "banksjo01 26 5\n", + "burneaj01 8 14\n", + "chacigu01 65 0" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mi.loc[(2007, 'AL', 'TOR'), ['G', 'AB']].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ξ" + ] + } + ], + "metadata": { + "anaconda-cloud": {}, + "gist": { + "data": { + "description": "nb/2_dataframe_operations.ipynb", + "public": false + }, + "id": "" + }, + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.1" + }, + "toc": { + "colors": { + "hover_highlight": "#DAA520", + "navigate_num": "#000000", + "navigate_text": "#333333", + "running_highlight": "#FF0000", + "selected_highlight": "#FFD700", + "sidebar_border": "#EEEEEE", + "wrapper_background": "#FFFFFF" + }, + "moveMenuLeft": true, + "nav_menu": { + "height": "211px", + "width": "252px" + }, + "navigate_menu": true, + "number_sections": false, + "sideBar": true, + "threshold": 4, + "toc_cell": true, + "toc_section_display": "block", + "toc_window_display": true, + "widenNotebook": false + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}