Talk given at RMACC August 17, 2017 titled "Practical Data Wrangling in Pandas".

NAVIGATION

Got Pandas? Practical Data Wrangling with Pandas

  • Introduction
  1. Data Structures
  2. Importing Data
  3. Manipulating DataFrames
  4. Wrap Up

NOTEBOOK OBJECTIVES

In this notebook we'll:

  • explore the purpose of Pandas,
  • understand where Pandas fits in the scientific data analysis ecosystem,
  • understand installation options.

Pandas

Pandas is a fantastic library, and if you haven't got Pandas yet, perhaps it is time you did.

Pandas is fast and built on top of NumPy, and it works well alongside libraries such as statsmodels. If you have familiarity with NumPy, Pandas might be what you've always wanted and never knew you did!

For readers who are familiar with R and considering Python, Pandas may be the right tool to make the transition smoothly as the core DataFrame structure in Pandas is modeled after that of R's data.frame.

Pandas has many strengths, but here are a few that might pique your interest:

  • flexible, consistent data import and export from a wide array of sources, including SQL, CSV, Excel, etc.
  • tabular / matrix data representation with heterogeneous labeled or unlabeled columns
  • intuitive handling of missing data
  • import and conversion of data to / from NumPy
  • sophisticated slicing, indexing and subsetting of data
  • support for hierarchical labeling of data
  • support for time series data, including time/date conversion, moving windows, etc.
  • and much more ...
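
A few of these strengths can be sketched in a handful of lines. This is only a minimal illustration, not code from the talk: the column names, dates and values below are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: heterogeneous, labeled columns on a daily date index,
# with one missing temperature reading.
df = pd.DataFrame(
    {"site": ["A", "B", "C"], "temp_c": [21.5, np.nan, 19.0], "count": [3, 7, 5]},
    index=pd.date_range("2017-08-15", periods=3, freq="D"),
)

# Missing data: count the NaNs, then fill them with the column mean
# (mean() skips NaN by default).
n_missing = int(df["temp_c"].isna().sum())
filled = df["temp_c"].fillna(df["temp_c"].mean())

# Time series support: a two-day moving average over the "count" column.
# The first entry is NaN because the window is not yet full.
rolling_mean = df["count"].rolling(window=2).mean()

print(df.dtypes)
print(n_missing)
print(filled)
print(rolling_mean)
```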

Why Pandas?

Pandas has become known as the go-to library in the Python data science stack. With its strong support for importing various data formats, it can be the first tool you might use to work with, manipulate, convert, reorganize and prepare data for analysis.

Pandas is not a replacement for NumPy, but rather a supplement to it. With its sophisticated indexing, it becomes a more powerful way to access and prepare data for analysis in NumPy, and in many cases it will become a necessary complement to the features already provided by NumPy.
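
As a rough sketch of that division of labor, the example below selects data by label in Pandas and then hands the underlying values to NumPy for the numerical work. The dates and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements indexed by date labels.
idx = pd.date_range("2017-08-14", periods=4, freq="D")
s = pd.Series([1.0, 2.0, 4.0, 8.0], index=idx)

# Select by label with .loc (label slices include both endpoints),
# then pass the plain ndarray to a NumPy function.
window = s.loc["2017-08-15":"2017-08-16"]
log_values = np.log2(window.to_numpy())

print(log_values)
```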

Pandas brings the fun back into data engineering, and once mastered is one of many tools that will be required for doing high quality data analysis in Python.

Everything you'd ever want to know about Python can be found:

There are also many great tutorials around the web and in the blogosphere.

How Pandas?

Pandas can be installed in Python 2 and Python 3, though it is recommended to use Python 3 as Python 2 will soon lose support and updates.

Pandas can be installed through a variety of mechanisms.

If you've installed Anaconda then you need do nothing -- Pandas is installed by default in the conda stack.

If you want, you can install Pandas via binaries from PyPI or you can install via pip:

pip install pandas

should get you going.
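
To confirm that the installation worked, a quick check along these lines can be run in a Python session. It simply imports the library under its conventional alias and reports the installed version.

```python
import pandas as pd

# Confirm the import works and report which version was installed.
version = pd.__version__
print("pandas version:", version)
```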

For more about installation, please see: