{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NAVIGATION**\n",
    "\n",
    "**Got Pandas? _Practical Data Wrangling with Pandas_**\n",
    "\n",
    "* [Introduction](./0_introduction.ipynb)\n",
    "1. **Data Structures**\n",
    "2. [Importing Data](./2_importing_data.ipynb)\n",
    "3. [Manipulating DataFrames](./3_dataframe_operations.ipynb)\n",
    "4. [Wrap Up](./3_wrapping_up.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": "true"
   },
   "source": [
    "# Table of Contents\n",
    " <p><div class=\"lev1 toc-item\"><a href=\"#Core-Pandas-Data-Structures\" data-toc-modified-id=\"Core-Pandas-Data-Structures-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Core Pandas Data Structures</a></div><div class=\"lev2 toc-item\"><a href=\"#Series\" data-toc-modified-id=\"Series-11\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Series</a></div><div class=\"lev2 toc-item\"><a href=\"#DataFrames\" data-toc-modified-id=\"DataFrames-12\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>DataFrames</a></div><div class=\"lev3 toc-item\"><a href=\"#[]-operator-for-basic-slicing\" data-toc-modified-id=\"[]-operator-for-basic-slicing-121\"><span class=\"toc-item-num\">1.2.1&nbsp;&nbsp;</span><code>[]</code> operator for basic slicing</a></div><div class=\"lev3 toc-item\"><a href=\"#iloc[]\" data-toc-modified-id=\"iloc[]-122\"><span class=\"toc-item-num\">1.2.2&nbsp;&nbsp;</span><code>iloc[]</code></a></div><div class=\"lev3 toc-item\"><a href=\"#More-sophisticated-slicing\" data-toc-modified-id=\"More-sophisticated-slicing-123\"><span class=\"toc-item-num\">1.2.3&nbsp;&nbsp;</span>More sophisticated slicing</a></div><div class=\"lev3 toc-item\"><a href=\"#loc()\" data-toc-modified-id=\"loc()-124\"><span class=\"toc-item-num\">1.2.4&nbsp;&nbsp;</span><code>loc()</code></a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTEBOOK OBJECTIVES**\n",
    "\n",
    "In this notebook we'll:\n",
    "\n",
    "* explore the Series and DataFrame data structures,\n",
    "* understand the basic selection and slicing operations on each."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Core Pandas Data Structures\n",
    "\n",
    "As we move through the content, we will be working with two of Pandas' primary data structures.  There are more, but we will focus on only these two:\n",
    "\n",
    "* Series\n",
    "* DataFrames\n",
    "\n",
    "One structure we won't have time to explore is the [Panel](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#panel), a three-dimensional container.  Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so on recent versions a DataFrame with a multi-level index fills that role.\n",
    "\n",
    "There are some key ideas behind the data structures provided by Pandas:\n",
    "\n",
    "* data may be heterogeneous,\n",
    "* when data is numeric, convenience functions exist to provide aggregate statistical operations (`min()`, `max()`, `cumsum()`, `median()`, `mode()`, etc.),\n",
    "* data structures are decomposable and composable, that is, making DataFrames from Series or Series from DataFrames is supported natively,\n",
    "* data structures are translatable, that is, you can convert them to NumPy [`ndarray`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray) objects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Series\n",
    "\n",
    "The Pandas [Series](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) data structure is a one-dimensional structure, much like a vector, that carries axis labels.  Series objects can be initialized from an array-like object, a dictionary, or a scalar value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    43\n",
       "1     3\n",
       "2    15\n",
       "3    20\n",
       "4     3\n",
       "5    45\n",
       "6    44\n",
       "7    30\n",
       "8    25\n",
       "9    48\n",
       "dtype: int32"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np_ints = np.random.randint(0,51,10)\n",
    "pd.Series(np_ints)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data in a series do not have to be numeric:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    WhTIhhttjy\n",
       "1    ubtcrWeUrE\n",
       "2    pPzpwOsKpm\n",
       "3    pQLCcUiotK\n",
       "4    AiOwCuildy\n",
       "5    DCkniyiWqp\n",
       "6    TOCDhTYkFw\n",
       "7    ziJpNNTbRo\n",
       "8    UjveUhQFFm\n",
       "9    EDEqSQpCKV\n",
       "dtype: object"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import string\n",
    "random_letters = [''\n",
    "                  .join([string.ascii_letters[c] for c in np.random.randint(0, len(string.ascii_letters), 10)])\n",
    "                  for i in range(10)]\n",
    "pd.Series(random_letters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also specify an index, which gives us meaningful labels for accessing the data in the Series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "city             Denver\n",
       "state                CO\n",
       "zip               80023\n",
       "neighborhood    Furhman\n",
       "area_code           303\n",
       "dtype: object"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index = ['city', 'state', 'zip', 'neighborhood', 'area_code']\n",
    "data  = ['Denver', 'CO', '80023', 'Furhman', '303']\n",
    "s = pd.Series(data, index)\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can access the data by its index label ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Denver', 'CO')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s['city'], s['state']"
   ]
  },
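  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accessing data in a Series by position is much like accessing data in a Python list: the usual slicing operator is available.  Note that recent pandas versions deprecate integer positional lookups such as `s[1]` on a label-indexed Series; `s.iloc[1]` is the forward-compatible form."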
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CO'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "city     Denver\n",
       "state        CO\n",
       "zip       80023\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s[0:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "city    Denver\n",
       "zip      80023\n",
       "dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s[0:3:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More sophisticated slicing by index using integers can be achieved with [`iloc`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.iloc.html#pandas.Series.iloc).  Here are some simple examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "city     Denver\n",
       "state        CO\n",
       "zip       80023\n",
       "dtype: object"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.iloc[0:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "city    Denver\n",
       "zip      80023\n",
       "dtype: object"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.iloc[0:3:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can pass a list of the indices we'd like just as easily ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "state           CO\n",
       "zip          80023\n",
       "area_code      303\n",
       "dtype: object"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.iloc[[1,2,4]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get all the values of the Series as a NumPy [`ndarray`](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html), use the `values` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Denver', 'CO', '80023', 'Furhman', '303'], dtype=object)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which in turn can be converted to a plain Python list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Denver', 'CO', '80023', 'Furhman', '303']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.values.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DataFrames\n",
    "\n",
    "[DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame) are a natural extension of Series to two dimensions, much as matrices extend vectors. They have many of the same operations (extended to 2 dimensions), but also have some additional properties and operations.  We'll cover the basics here.\n",
    "\n",
    "* a DataFrame has a _row_ **axis** and a _column_ **axis**, referred to numerically as 0 and 1, respectively,\n",
    "* a DataFrame axis can be multi-level, that is, multi-level indices can be created for _row_, _column_ or **both**,\n",
    "* DataFrames can be converted to a NumPy `ndarray` and thus to lists and dictionaries,\n",
    "* indexing is achieved either by integer position or by index label (both where applicable),\n",
    "* DataFrame values may be heterogeneous."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first begin by building a DataFrame from a 2D Numpy array ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>58</td>\n",
       "      <td>6</td>\n",
       "      <td>91</td>\n",
       "      <td>11</td>\n",
       "      <td>22</td>\n",
       "      <td>29</td>\n",
       "      <td>25</td>\n",
       "      <td>36</td>\n",
       "      <td>55</td>\n",
       "      <td>87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>83</td>\n",
       "      <td>30</td>\n",
       "      <td>8</td>\n",
       "      <td>55</td>\n",
       "      <td>43</td>\n",
       "      <td>62</td>\n",
       "      <td>82</td>\n",
       "      <td>74</td>\n",
       "      <td>12</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>83</td>\n",
       "      <td>95</td>\n",
       "      <td>28</td>\n",
       "      <td>33</td>\n",
       "      <td>95</td>\n",
       "      <td>28</td>\n",
       "      <td>7</td>\n",
       "      <td>32</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>17</td>\n",
       "      <td>25</td>\n",
       "      <td>82</td>\n",
       "      <td>3</td>\n",
       "      <td>65</td>\n",
       "      <td>39</td>\n",
       "      <td>73</td>\n",
       "      <td>63</td>\n",
       "      <td>6</td>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>13</td>\n",
       "      <td>63</td>\n",
       "      <td>18</td>\n",
       "      <td>86</td>\n",
       "      <td>29</td>\n",
       "      <td>35</td>\n",
       "      <td>97</td>\n",
       "      <td>24</td>\n",
       "      <td>71</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>33</td>\n",
       "      <td>39</td>\n",
       "      <td>84</td>\n",
       "      <td>85</td>\n",
       "      <td>15</td>\n",
       "      <td>42</td>\n",
       "      <td>68</td>\n",
       "      <td>45</td>\n",
       "      <td>26</td>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>55</td>\n",
       "      <td>27</td>\n",
       "      <td>17</td>\n",
       "      <td>44</td>\n",
       "      <td>78</td>\n",
       "      <td>19</td>\n",
       "      <td>38</td>\n",
       "      <td>63</td>\n",
       "      <td>31</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>41</td>\n",
       "      <td>19</td>\n",
       "      <td>48</td>\n",
       "      <td>36</td>\n",
       "      <td>92</td>\n",
       "      <td>35</td>\n",
       "      <td>41</td>\n",
       "      <td>97</td>\n",
       "      <td>98</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>59</td>\n",
       "      <td>65</td>\n",
       "      <td>67</td>\n",
       "      <td>58</td>\n",
       "      <td>36</td>\n",
       "      <td>84</td>\n",
       "      <td>8</td>\n",
       "      <td>45</td>\n",
       "      <td>16</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>56</td>\n",
       "      <td>36</td>\n",
       "      <td>98</td>\n",
       "      <td>63</td>\n",
       "      <td>73</td>\n",
       "      <td>54</td>\n",
       "      <td>36</td>\n",
       "      <td>61</td>\n",
       "      <td>56</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2   3   4   5   6   7   8   9\n",
       "0  58   6  91  11  22  29  25  36  55  87\n",
       "1  83  30   8  55  43  62  82  74  12   5\n",
       "2  83  95  28  33  95  28   7  32   6   2\n",
       "3  17  25  82   3  65  39  73  63   6  49\n",
       "4  13  63  18  86  29  35  97  24  71  50\n",
       "5  33  39  84  85  15  42  68  45  26  69\n",
       "6  55  27  17  44  78  19  38  63  31  60\n",
       "7  41  19  48  36  92  35  41  97  98  10\n",
       "8  59  65  67  58  36  84   8  45  16  76\n",
       "9  56  36  98  63  73  54  36  61  56  73"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(np.random.randint(1,100,100).reshape(10,10))\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `[]` operator for basic slicing\n",
    "\n",
    "The slicing operator `[]` works on DataFrames over **row slices**.  Getting at data in the DataFrame is otherwise done with the [`iloc[]`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) and [`loc[]`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.loc.html) selectors.\n",
    "\n",
    "Use this operator sparingly: it is not consistent with `loc` and `iloc`, and may produce confusing code if mixed arbitrarily with those selectors.  Let's see a few basic cases for `[]` ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>83</td>\n",
       "      <td>30</td>\n",
       "      <td>8</td>\n",
       "      <td>55</td>\n",
       "      <td>43</td>\n",
       "      <td>62</td>\n",
       "      <td>82</td>\n",
       "      <td>74</td>\n",
       "      <td>12</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>83</td>\n",
       "      <td>95</td>\n",
       "      <td>28</td>\n",
       "      <td>33</td>\n",
       "      <td>95</td>\n",
       "      <td>28</td>\n",
       "      <td>7</td>\n",
       "      <td>32</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>17</td>\n",
       "      <td>25</td>\n",
       "      <td>82</td>\n",
       "      <td>3</td>\n",
       "      <td>65</td>\n",
       "      <td>39</td>\n",
       "      <td>73</td>\n",
       "      <td>63</td>\n",
       "      <td>6</td>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2   3   4   5   6   7   8   9\n",
       "1  83  30   8  55  43  62  82  74  12   5\n",
       "2  83  95  28  33  95  28   7  32   6   2\n",
       "3  17  25  82   3  65  39  73  63   6  49"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[1:4] # selecting rows at positions 1 through 3 (the slice end is exclusive)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     6\n",
       "1    30\n",
       "2    95\n",
       "3    25\n",
       "Name: 1, dtype: int32"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[0:4][1] # selecting rows at positions 0 through 3, then the column labeled 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     6\n",
       "1    30\n",
       "2    95\n",
       "3    25\n",
       "4    63\n",
       "5    39\n",
       "6    27\n",
       "7    19\n",
       "8    65\n",
       "9    36\n",
       "Name: 1, dtype: int32"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[1] # selecting column 1, rows 0 .. n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "63"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[1][4] # selecting the value at column index 1, row index 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `[]` operator is merely a convenience and does not provide the full functionality of the `iloc[]` and `loc[]` selectors discussed below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `iloc[]`\n",
    "\n",
    "`iloc` is an integer-based selector.  As such, you will need to know the integer values of the _row_ or _column_ indices as necessary.  You may see some correspondence with the `[]` selector, but it provides much more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    83\n",
       "1    30\n",
       "2     8\n",
       "3    55\n",
       "4    43\n",
       "5    62\n",
       "6    82\n",
       "7    74\n",
       "8    12\n",
       "9     5\n",
       "Name: 1, dtype: int32"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[1] # row index 1, returns the full ROW"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[1,2] # row index 1, column index 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    83\n",
       "1    30\n",
       "2     8\n",
       "3    55\n",
       "4    43\n",
       "5    62\n",
       "6    82\n",
       "7    74\n",
       "8    12\n",
       "9     5\n",
       "Name: 1, dtype: int32"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[1,:] # row index 1, as above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2     8\n",
       "3    55\n",
       "Name: 1, dtype: int32"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[1,2:4] # row index 1, column indices 2 and 3 (the slice end is exclusive)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>83</td>\n",
       "      <td>30</td>\n",
       "      <td>8</td>\n",
       "      <td>55</td>\n",
       "      <td>43</td>\n",
       "      <td>62</td>\n",
       "      <td>82</td>\n",
       "      <td>74</td>\n",
       "      <td>12</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1  2   3   4   5   6   7   8  9\n",
       "1  83  30  8  55  43  62  82  74  12  5"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[1:2,:] # row slice 1:2, all columns; same row as above, but returned as a DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### More sophisticated slicing\n",
    "\n",
    "\n",
    "More sophisticated slicing can be done over integer indices.  For example, if we want to slice specific rows and columns, we can do something like the following.  Remember that the first argument to the selector is the _row_ and the second is the _column_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8</td>\n",
       "      <td>55</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   2   3   4\n",
       "1  8  55  43"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[1:2,2:5] # row index 1:2, column index 2:5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>2</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8</td>\n",
       "      <td>62</td>\n",
       "      <td>82</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>82</td>\n",
       "      <td>39</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>48</td>\n",
       "      <td>35</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    2   5   6\n",
       "1   8  62  82\n",
       "3  82  39  73\n",
       "7  48  35  41"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[[1,3,7], [2,5,6]] # row indices 1,3,7 column indices 2,5,6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `loc()`\n",
    "\n",
    "`loc[]` is a label-based selector and provides a much richer experience in selecting data.  It improves the overall readability of analysis code and also makes multi-indices easier to understand, since complex multi-indices that are purely numeric become difficult to follow as complexity increases.\n",
    "\n",
    "We will create a DataFrame with explicit index and column labels, filling in the values of our previous DataFrame `df` above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_si = pd.DataFrame(df.values, \n",
    "                     index=['r{}'.format(i) for i in range(0,10)],\n",
    "                     columns=['c{}'.format(i) for i in range(0,10)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>c0</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c3</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>r0</th>\n",
       "      <td>58</td>\n",
       "      <td>6</td>\n",
       "      <td>91</td>\n",
       "      <td>11</td>\n",
       "      <td>22</td>\n",
       "      <td>29</td>\n",
       "      <td>25</td>\n",
       "      <td>36</td>\n",
       "      <td>55</td>\n",
       "      <td>87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r1</th>\n",
       "      <td>83</td>\n",
       "      <td>30</td>\n",
       "      <td>8</td>\n",
       "      <td>55</td>\n",
       "      <td>43</td>\n",
       "      <td>62</td>\n",
       "      <td>82</td>\n",
       "      <td>74</td>\n",
       "      <td>12</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r2</th>\n",
       "      <td>83</td>\n",
       "      <td>95</td>\n",
       "      <td>28</td>\n",
       "      <td>33</td>\n",
       "      <td>95</td>\n",
       "      <td>28</td>\n",
       "      <td>7</td>\n",
       "      <td>32</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r3</th>\n",
       "      <td>17</td>\n",
       "      <td>25</td>\n",
       "      <td>82</td>\n",
       "      <td>3</td>\n",
       "      <td>65</td>\n",
       "      <td>39</td>\n",
       "      <td>73</td>\n",
       "      <td>63</td>\n",
       "      <td>6</td>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r4</th>\n",
       "      <td>13</td>\n",
       "      <td>63</td>\n",
       "      <td>18</td>\n",
       "      <td>86</td>\n",
       "      <td>29</td>\n",
       "      <td>35</td>\n",
       "      <td>97</td>\n",
       "      <td>24</td>\n",
       "      <td>71</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r5</th>\n",
       "      <td>33</td>\n",
       "      <td>39</td>\n",
       "      <td>84</td>\n",
       "      <td>85</td>\n",
       "      <td>15</td>\n",
       "      <td>42</td>\n",
       "      <td>68</td>\n",
       "      <td>45</td>\n",
       "      <td>26</td>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r6</th>\n",
       "      <td>55</td>\n",
       "      <td>27</td>\n",
       "      <td>17</td>\n",
       "      <td>44</td>\n",
       "      <td>78</td>\n",
       "      <td>19</td>\n",
       "      <td>38</td>\n",
       "      <td>63</td>\n",
       "      <td>31</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r7</th>\n",
       "      <td>41</td>\n",
       "      <td>19</td>\n",
       "      <td>48</td>\n",
       "      <td>36</td>\n",
       "      <td>92</td>\n",
       "      <td>35</td>\n",
       "      <td>41</td>\n",
       "      <td>97</td>\n",
       "      <td>98</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r8</th>\n",
       "      <td>59</td>\n",
       "      <td>65</td>\n",
       "      <td>67</td>\n",
       "      <td>58</td>\n",
       "      <td>36</td>\n",
       "      <td>84</td>\n",
       "      <td>8</td>\n",
       "      <td>45</td>\n",
       "      <td>16</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r9</th>\n",
       "      <td>56</td>\n",
       "      <td>36</td>\n",
       "      <td>98</td>\n",
       "      <td>63</td>\n",
       "      <td>73</td>\n",
       "      <td>54</td>\n",
       "      <td>36</td>\n",
       "      <td>61</td>\n",
       "      <td>56</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    c0  c1  c2  c3  c4  c5  c6  c7  c8  c9\n",
       "r0  58   6  91  11  22  29  25  36  55  87\n",
       "r1  83  30   8  55  43  62  82  74  12   5\n",
       "r2  83  95  28  33  95  28   7  32   6   2\n",
       "r3  17  25  82   3  65  39  73  63   6  49\n",
       "r4  13  63  18  86  29  35  97  24  71  50\n",
       "r5  33  39  84  85  15  42  68  45  26  69\n",
       "r6  55  27  17  44  78  19  38  63  31  60\n",
       "r7  41  19  48  36  92  35  41  97  98  10\n",
       "r8  59  65  67  58  36  84   8  45  16  76\n",
       "r9  56  36  98  63  73  54  36  61  56  73"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_si"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a convenience, the `[]` selector is capable of also dealing with labels, though we will not go any further than this basic example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "r3    65\n",
       "r4    29\n",
       "r5    15\n",
       "r6    78\n",
       "Name: c4, dtype: int32"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_si['r3':'r6']['c4']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting contiguous slices (indices are sorted), is very straightforward."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>c0</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c3</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>r3</th>\n",
       "      <td>17</td>\n",
       "      <td>25</td>\n",
       "      <td>82</td>\n",
       "      <td>3</td>\n",
       "      <td>65</td>\n",
       "      <td>39</td>\n",
       "      <td>73</td>\n",
       "      <td>63</td>\n",
       "      <td>6</td>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r4</th>\n",
       "      <td>13</td>\n",
       "      <td>63</td>\n",
       "      <td>18</td>\n",
       "      <td>86</td>\n",
       "      <td>29</td>\n",
       "      <td>35</td>\n",
       "      <td>97</td>\n",
       "      <td>24</td>\n",
       "      <td>71</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    c0  c1  c2  c3  c4  c5  c6  c7  c8  c9\n",
       "r3  17  25  82   3  65  39  73  63   6  49\n",
       "r4  13  63  18  86  29  35  97  24  71  50"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_si.loc['r3':'r4',]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected we can slice the columns as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>c2</th>\n",
       "      <th>c3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>r3</th>\n",
       "      <td>82</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r4</th>\n",
       "      <td>18</td>\n",
       "      <td>86</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    c2  c3\n",
       "r3  82   3\n",
       "r4  18  86"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_si.loc['r3':'r4', 'c2':'c3']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This wraps up our basic discussion of selection, we will return to this discussion in part 3, when we talk about more complex boolean slicing and multi-index slicing. &Xi;"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "143px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": false,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": true,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}