Talk given at RMACC August 17, 2017 titled "Practical Data Wrangling in Pandas".
# Practical Data Wrangling with Pandas

## ABSTRACT
Hacking Python? Need to import some Excel data and run a detailed data analysis? Got Pandas? Pandas has become a staple of the Python data science stack, with strengths in data manipulation and analysis. In this workshop, we will focus on real-world data analysis scenarios that show the strengths of this library. We’ll cover basic Pandas data structures, core import/export and I/O functionality, manipulation of data in Pandas, and the basics of Pandas data visualization. We will focus on the practical so that you can leave ready to apply your skills. We assume a basic working knowledge of Python and exposure to Jupyter Notebooks.

## ABOUT
This talk was originally given on August 17, 2017 at the Rocky Mountain Advanced Computing Consortium (RMACC) 2017 Symposium.

There are two parts to this repository:
1. **the slides**, which are best [viewed here](http://keithmaull.com/talks/20170817/slides), though the HTML source is [here](./slides); NOTE: _slides prepared using the [RISE plugin](https://github.com/damianavila/RISE) for Jupyter Notebooks and [NBExtensions](http://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html)_
2. **the notebooks**, which are supplemental to the slides (and also the basis for their content); they are [best viewed with NBViewer starting here](http://nbviewer.jupyter.org/urls/code.keithmaull.net/kmaull/talk_2017_08_RMACC_GotPandas/raw/master/nb/0_introduction.ipynb), but you are free to clone the repo and work on the notebooks from it or from NB

## SECTION 0: INTRODUCTION
| ~10m | [notebook](http://nbviewer.jupyter.org/urls/code.keithmaull.net/kmaull/talk_2017_08_RMACC_GotPandas/raw/master/nb/0_introduction.ipynb) &#124; [slides](http://keithmaull.com/talks/20170817/slides/0_introduction.slides.html) |
|-------------:|:-------------------------------------------------------------------|
| **Content** | what is pandas; why pandas; pandas vs. NumPy; installing pandas |
| **Expected<br/>Outcomes** | &#8226; basic introduction to the Pandas ecosystem<br/> |
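
One way to picture the "pandas vs. NumPy" item is label alignment. The snippet below is a minimal sketch, not taken from the talk notebooks; the array values and index labels are made up for illustration. NumPy adds arrays position by position, while pandas matches values by index label before adding.

```python
import numpy as np
import pandas as pd

# NumPy: element-wise addition is purely positional.
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
positional = a + b

# pandas: the same numbers, but each value carries a label (made-up labels).
s1 = pd.Series([1, 2, 3], index=["x", "y", "z"])
s2 = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Addition aligns on the index labels, not on position:
# "x" pairs 1 with 30, "y" pairs 2 with 20, "z" pairs 3 with 10.
aligned = s1 + s2

print(positional)
print(aligned)
```

Because `s2` lists its labels in reverse order, the label-aligned sum differs from the positional one even though the underlying numbers are the same.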
<br/><br/>

## SECTION 1: PANDAS DATA STRUCTURES
| ~20m | [notebook](http://nbviewer.jupyter.org/urls/code.keithmaull.net/kmaull/talk_2017_08_RMACC_GotPandas/raw/master/nb/1_data_structures.ipynb) &#124; [slides](http://keithmaull.com/talks/20170817/slides/1_data_structures.slides.html) |
|-------------:|:-------------------------------------------------------------------|
| **Content** | core pandas data structures; series, dataframe, (optionally panel); basic concepts of data structures and manipulation strategies |
| **Expected<br/>Outcomes** | &#8226; identify and utilize series and dataframe structures<br/>&#8226; perform basic manipulation operations<br/>&#8226; understand basic Pythonic manipulation concepts<br/> |
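
As a rough illustration of the two core structures named above, here is a minimal sketch. It is not drawn from the section notebook, and the column names and values are invented. A `Series` is a labeled one-dimensional array, and a `DataFrame` is a table whose columns are `Series` sharing one index.

```python
import pandas as pd

# A Series: one-dimensional values with an index (made-up temperatures).
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"], name="temp_c")

# A DataFrame: several columns that share the same index.
df = pd.DataFrame(
    {"temp_c": [21.5, 19.0, 23.1], "rain_mm": [0.0, 4.2, 0.3]},
    index=["mon", "tue", "wed"],
)

# Selecting a single column gives back a Series.
col = df["temp_c"]

# A basic manipulation: vectorized arithmetic stored as a new column.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

print(type(col).__name__)
print(df)
```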
<br/><br/>

## SECTION 2: IMPORTING DATA
| ~20m | [notebook](http://nbviewer.jupyter.org/urls/code.keithmaull.net/kmaull/talk_2017_08_RMACC_GotPandas/raw/master/nb/2_importing_data.ipynb) &#124; [slides](http://keithmaull.com/talks/20170817/slides/2_importing_data.slides.html) |
|-------------:|:-------------------------------------------------------------------|
| **Content** | importing data; csv and excel; json; sql; other supported data formats |
| **Expected<br/>Outcomes** | &#8226; import data of various formats<br/>&#8226; perform data imports into dataframes<br/>&#8226; perform various conversions in Pandas<br/> |
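
The import formats listed above can be sketched as follows. This is an illustrative example under stated assumptions, not the notebook's code: the data is made up and held in memory so the snippet runs without any files, and the table name `cities` is hypothetical. Excel files are read in a similar way with `pd.read_excel`, which needs an optional engine such as `openpyxl` installed, so it is left out here.

```python
import io
import sqlite3

import pandas as pd

# CSV: read from an in-memory buffer standing in for a file (made-up data).
csv_text = "city,population\nDenver,700000\nBoulder,108000\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record.
json_text = (
    '[{"city": "Denver", "population": 700000},'
    ' {"city": "Boulder", "population": 108000}]'
)
df_json = pd.read_json(io.StringIO(json_text))

# SQL: write the frame to an in-memory SQLite table, then query it back.
conn = sqlite3.connect(":memory:")
df_csv.to_sql("cities", conn, index=False)
df_sql = pd.read_sql_query(
    "SELECT city, population FROM cities ORDER BY population", conn
)
conn.close()

print(df_csv)
print(df_json)
print(df_sql)
```

Each reader returns a `DataFrame`, so the later steps do not depend on where the data came from.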
<br/><br/>

## SECTION 3: MANIPULATING DATA
| ~20m | [notebook](http://nbviewer.jupyter.org/urls/code.keithmaull.net/kmaull/talk_2017_08_RMACC_GotPandas/raw/master/nb/3_dataframe_operations.ipynb) &#124; [slides](http://keithmaull.com/talks/20170817/slides/3_dataframe_operations.slides.html) |
|-------------:|:-------------------------------------------------------------------|
| **Content** | basic terminology; selecting data; slicing dataframes; setting and assigning operations; built-in summary statistics |
| **Expected<br/>Outcomes** | &#8226; understand the basic terminology<br/>&#8226; select data by row and column<br/>&#8226; select data by label/index and with boolean selections<br/>&#8226; perform slicing, merging and subsetting<br/>&#8226; perform multi-indexing<br/>&#8226; access basic stats and summaries<br/> |
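
The operations in this section can be sketched on a small made-up table. This is an illustrative example only; the column names `site`, `year`, `value` and the lookup table `sites` are hypothetical and not taken from the notebook.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "site": ["a", "a", "b", "b"],
        "year": [2016, 2017, 2016, 2017],
        "value": [1.0, 2.0, 3.0, 5.0],
    }
)

# Selection by column, by position (slicing), and by label.
values = df["value"]
first_two = df.iloc[:2]
row0_site = df.loc[0, "site"]

# Boolean selection: keep only the rows where the condition holds.
recent = df[df["year"] == 2017]

# Setting and assigning: add a derived column.
df["doubled"] = df["value"] * 2

# Merging: attach a region to each row from a lookup table.
sites = pd.DataFrame({"site": ["a", "b"], "region": ["north", "south"]})
merged = df.merge(sites, on="site", how="left")

# Multi-indexing: index by (site, year) and look up one cell.
indexed = df.set_index(["site", "year"])
b_2017 = indexed.loc[("b", 2017), "value"]

# Built-in summary statistics.
mean_value = df["value"].mean()
summary = df["value"].describe()

print(recent)
print(merged)
print(summary)
```

`loc` selects by label and `iloc` by integer position; keeping the two apart avoids most selection surprises.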
<br/><br/>

## SECTION 4: WRAPPING UP
| ~15m | [notebook](http://nbviewer.jupyter.org/urls/code.keithmaull.net/kmaull/talk_2017_08_RMACC_GotPandas/raw/master/nb/4_wrapping_up.ipynb) &#124; [slides](http://keithmaull.com/talks/20170817/slides/4_wrapping_up.slides.html) |
|-------------:|:-------------------------------------------------------------------|
| **Content** | putting it all together; finding the need for Pandas; integrating Pandas into data engineering workflows |
| **Expected<br/>Outcomes** | &#8226; identify real-world use cases for Pandas<br/>&#8226; navigate and utilize key online resources for further study<br/> |
<br/><br/>

## RESOURCES
Resources for learning more about Pandas:
* the [pydata documentation](http://pandas.pydata.org) is comprehensive, if somewhat overwhelming for the beginner
* [Pandas Cookbook](https://github.com/jvns/pandas-cookbook) on GitHub by Julia Evans (also on pydata.org)
* [Data Wrangling with Pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) by pydata.org
* [Pandas for Data Science cheat sheet](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet) by DataCamp.com

## LICENSE
Originally created by Keith E. Maull, 2017.
CC-BY-4.0
![](https://i.creativecommons.org/l/by/4.0/88x31.png)
This work is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).