Talk given at RMACC August 17, 2017 titled "Practical Data Wrangling in Pandas".

4_wrapping_up.ipynb 8.4 KiB

  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "**NAVIGATION**\n",
  8. "\n",
  9. "**Got Pandas? _Practical Data Wrangling with Pandas_**\n",
  10. "\n",
  11. "* [Introduction](./0_introduction.ipynb)\n",
  12. "1. [Data Structures](./1_data_structures.ipynb)\n",
  13. "2. [Importing Data](./2_importing_data.ipynb)\n",
  14. "3. [Manipulating DataFrames](./3_dataframe_operations.ipynb)\n",
  15. "4. **Wrap Up**\n",
  16. "\n",
  17. "---"
  18. ]
  19. },
  20. {
  21. "cell_type": "markdown",
  22. "metadata": {},
  23. "source": [
  24. "**NOTEBOOK OBJECTIVES**\n",
  25. "\n",
  26. "In this notebook we'll:\n",
  27. "\n",
  28. "* explore exporting DataFrames,\n",
  29. "* explore basic visualization capabilities."
  30. ]
  31. },
  32. {
  33. "cell_type": "markdown",
  34. "metadata": {},
  35. "source": [
  36. "# Exporting Data\n",
  37. "\n",
  38. "We've seen that [importing data](./2_importing_data.ipynb) is very easy to do for a variety of formats.\n",
  39. "\n",
  40. "We're going to show exporting of CSV data to Excel, SQL and JSON. There are a few other exporters that may be of interest to the reader:\n",
  41. "\n",
  42. "* [to_xarray()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_xarray.html#pandas.DataFrame.to_xarray): a method for converting to xarrays\n",
  43. "* [to_latex()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_latex.html#pandas.DataFrame.to_latex): a convenience method for making pretty $\\LaTeX$ from data\n",
  44. "* [to_pickle()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html#pandas.DataFrame.to_pickle): a method for pickling (serializing) data to file\n",
  45. "\n",
  46. "All the basic examples here call export methods directly on a DataFrame.\n",
  47. "\n",
  48. "We're going to go back to our baseball batting data file and learn how to convert this CSV file into something perhaps more interesting - in particular, Excel, SQL and JSON."
  49. ]
  50. },
  51. {
  52. "cell_type": "code",
  53. "execution_count": 1,
  54. "metadata": {
  55. "collapsed": true
  56. },
  57. "outputs": [],
  58. "source": [
  59. "import pandas as pd \n",
  60. "df = pd.read_csv(\"./datasets/Batting.csv\")"
  61. ]
  62. },
  63. {
  64. "cell_type": "markdown",
  65. "metadata": {},
  66. "source": [
  67. "## Excel\n",
  68. "\n",
  69. "An `ExcelWriter` object is required to perform the export to Microsoft Excel, but once it is created, writing to the file is a cinch."
  70. ]
  71. },
  72. {
  73. "cell_type": "code",
  74. "execution_count": 2,
  75. "metadata": {
  76. "collapsed": true
  77. },
  78. "outputs": [],
  79. "source": [
  80. "# the context manager saves and closes the workbook on exit\n",
  81. "with pd.ExcelWriter('export/batting.xlsx') as writer:\n",
  82. "    df.to_excel(writer)"
  83. ]
  84. },
  85. {
  86. "cell_type": "markdown",
  87. "metadata": {},
  88. "source": [
  89. "## SQL\n",
  90. "\n",
  91. "Dumping data to a database is nearly as easy as reading it. We need to use a SQLAlchemy engine [as before](./2_importing_data.ipynb#SQL)."
  92. ]
  93. },
  94. {
  95. "cell_type": "code",
  96. "execution_count": null,
  97. "metadata": {
  98. "collapsed": true
  99. },
  100. "outputs": [],
  101. "source": [
  102. "from sqlalchemy import create_engine\n",
  103. "engine = create_engine('sqlite:///export/demo.sqlite')\n",
  104. "\n",
  105. "with engine.connect() as conn, conn.begin():\n",
  106. "    try:\n",
  107. "        df.to_sql('batting', conn)\n",
  108. "    except ValueError:  # table already exists\n",
  109. "        pass"
  110. ]
  111. },
  112. {
  113. "cell_type": "markdown",
  114. "metadata": {},
  115. "source": [
  116. "## JSON\n",
  117. "\n",
  118. "Exporting to JSON is also very straightforward, but because of the more intricate structure that can be communicated in JSON, we have several options regarding how the data is organized."
  119. ]
  120. },
  121. {
  122. "cell_type": "markdown",
  123. "metadata": {},
  124. "source": [
  125. "By default, [`to_json()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html#pandas.DataFrame.to_json) produces an object whose keys are the _column_ labels; each column maps to a nested object keyed by index label, holding the value for that (index, column) pair."
  126. ]
  127. },
  128. {
  129. "cell_type": "code",
  130. "execution_count": null,
  131. "metadata": {
  132. "collapsed": true
  133. },
  134. "outputs": [],
  135. "source": [
  136. "import json\n",
  137. "json.loads(df[:5].to_json())"
  138. ]
  139. },
  140. {
  141. "cell_type": "markdown",
  142. "metadata": {},
  143. "source": [
  144. "You are encouraged to [read the documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html#pandas.DataFrame.to_json) further, but you can also orient the output along _records_ or _index_ (rows) to obtain a slightly different layout that may be more suitable for your JSON needs."
  145. ]
  146. },
  147. {
  148. "cell_type": "code",
  149. "execution_count": null,
  150. "metadata": {
  151. "collapsed": true
  152. },
  153. "outputs": [],
  154. "source": [
  155. "json.loads(df[:5].to_json(orient='records'))"
  156. ]
  157. },
  158. {
  159. "cell_type": "markdown",
  160. "metadata": {},
  161. "source": [
  162. "## Basic visualization\n",
  163. "\n",
  164. "By now you are probably very happy to have been acquainted with Pandas, but in this tutorial, we've just scratched the surface.\n",
  165. "\n",
  166. "As if all the other data wrangling features of Pandas weren't enough, it also provides basic plotting built on top of matplotlib."
  167. ]
  168. },
  169. {
  170. "cell_type": "markdown",
  172. "metadata": {},
  176. "source": [
  177. "Making basic scatter plots is very easy with the [`plot()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) method, called with `kind='scatter'`. As a convenience we can also use `plot.scatter()`, which is equivalent."
  178. ]
  179. },
  180. {
  181. "cell_type": "code",
  182. "execution_count": null,
  183. "metadata": {
  184. "collapsed": true
  185. },
  186. "outputs": [],
  187. "source": [
  188. "%matplotlib inline\n",
  189. "df[(df.teamID=='WAS') & (df.yearID==2016)][['G', 'AB']].plot.scatter(x='G', y='AB')"
  190. ]
  191. },
  192. {
  193. "cell_type": "markdown",
  194. "metadata": {},
  195. "source": [
  196. "Making basic line plots is similarly easy with `plot.line()`:"
  197. ]
  198. },
  199. {
  200. "cell_type": "code",
  201. "execution_count": null,
  202. "metadata": {
  203. "collapsed": true
  204. },
  205. "outputs": [],
  206. "source": [
  207. "df[(df.teamID=='WAS') & (df.yearID==2016)][['H', 'AB']].plot.line(x='H', y='AB')"
  208. ]
  209. },
  210. {
  211. "cell_type": "code",
  212. "execution_count": null,
  213. "metadata": {
  214. "collapsed": true
  215. },
  216. "outputs": [],
  217. "source": [
  218. "# plot the total number of HR hit by WAS in each year after 2006\n",
  219. "df_was_hr = df[(df.teamID=='WAS') & (df.yearID>2006)][['yearID', 'HR']].groupby('yearID').sum()\n",
  220. "df_was_hr.plot.bar()"
  221. ]
  222. },
  223. {
  224. "cell_type": "markdown",
  225. "metadata": {},
  226. "source": [
  227. "\n",
  228. "* Pandas plotting provides the basics ...\n",
  229. "* for more, you won't escape the grips of [matplotlib](http://matplotlib.org/)!"
  230. ]
  231. },
  232. {
  233. "cell_type": "markdown",
  234. "metadata": {},
  235. "source": [
  236. "# Additional resources\n",
  237. "\n",
  238. "As can be imagined, there are many great resources to use to learn more about Pandas:\n",
  239. "\n",
  240. "* the [pydata documentation](http://pandas.pydata.org) is complete, if a bit overwhelming for the beginner\n",
  241. "* [Pandas Cookbook](https://github.com/jvns/pandas-cookbook) on GitHub by Julia Evans (also on pydata.org)\n",
  242. "* [Data Wrangling with Pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) by pydata.org\n",
  243. "* [Pandas for Data Science cheat sheet](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet) by DataCamp.com"
  244. ]
  245. }
  246. ],
  247. "metadata": {
  248. "anaconda-cloud": {},
  249. "kernelspec": {
  250. "display_name": "Python [conda root]",
  251. "language": "python",
  252. "name": "conda-root-py"
  253. },
  254. "language_info": {
  255. "codemirror_mode": {
  256. "name": "ipython",
  257. "version": 3
  258. },
  259. "file_extension": ".py",
  260. "mimetype": "text/x-python",
  261. "name": "python",
  262. "nbconvert_exporter": "python",
  263. "pygments_lexer": "ipython3",
  264. "version": "3.6.1"
  265. },
  266. "toc": {
  267. "colors": {
  268. "hover_highlight": "#DAA520",
  269. "navigate_num": "#000000",
  270. "navigate_text": "#333333",
  271. "running_highlight": "#FF0000",
  272. "selected_highlight": "#FFD700",
  273. "sidebar_border": "#EEEEEE",
  274. "wrapper_background": "#FFFFFF"
  275. },
  276. "moveMenuLeft": true,
  277. "nav_menu": {
  278. "height": "123px",
  279. "width": "251px"
  280. },
  281. "navigate_menu": true,
  282. "number_sections": false,
  283. "sideBar": true,
  284. "threshold": 4,
  285. "toc_cell": false,
  286. "toc_section_display": "block",
  287. "toc_window_display": false,
  288. "widenNotebook": false
  289. }
  290. },
  291. "nbformat": 4,
  292. "nbformat_minor": 2
  293. }