We've seen that importing data is easy and works for a variety of formats.
We will now show how to export CSV data to Excel, SQL and JSON. Pandas also provides a few other exporters that may be of interest to the reader.
In all our basic examples here, we will be using DataFrame methods.
We're going to go back to our baseball batting data file and learn how to convert this CSV file into something perhaps more interesting - in particular, Excel, SQL and JSON.
import pandas as pd
df = pd.read_csv("./datasets/Batting.csv")
An ExcelWriter object handles the export to Microsoft Excel; once it is created, writing to the file is a cinch.
# the with block saves and closes the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('export/batting.xlsx') as writer:
    df.to_excel(writer)
Exporting to SQL goes through to_sql(), here with a SQLAlchemy engine that points at a SQLite file.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///export/demo.sqlite')
with engine.connect() as conn, conn.begin():
    try:
        df.to_sql('batting', conn)
    except ValueError:
        pass  # may already exist
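The ValueError caught above is what to_sql() raises when the table already exists and if_exists is left at its default of 'fail'. The following is a minimal sketch of a repeatable export using the if_exists parameter. It makes two assumptions that are not in the tutorial: an in-memory SQLite database opened with the standard-library sqlite3 module (instead of the SQLAlchemy engine), and a small hand-typed sample frame with a few of the batting columns.

```python
import sqlite3

import pandas as pd

# Small hand-typed sample; the column names follow the batting data.
sample = pd.DataFrame({
    "playerID": ["abercda01", "addybo01", "allisar01", "allisdo01", "ansonca01"],
    "teamID": ["TRO", "RC1", "CL1", "WS3", "RC1"],
    "yearID": [1871] * 5,
    "H": [0, 32, 40, 44, 39],
})

# Assumption: an in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# if_exists="replace" drops and recreates the table instead of raising
# ValueError when it is already there; index=False skips the index column.
sample.to_sql("batting", conn, if_exists="replace", index=False)
sample.to_sql("batting", conn, if_exists="replace", index=False)  # safe to rerun

# Read some rows back to check the export.
round_trip = pd.read_sql(
    "SELECT playerID, H FROM batting WHERE H > 35 ORDER BY playerID", conn
)
print(round_trip)
conn.close()
```

Passing if_exists="append" instead would add the rows to the existing table, which is usually not what you want when rerunning the same export.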
Exporting to JSON is also very straightforward, but because of the more intricate structure that can be communicated in JSON, we have several options regarding how the data is organized.
The default to_json() structures the object so that the column labels are the keys, and the value for each column is an object that maps each index label to the value for that (index, column) pair.
import json
json.loads(df[:5].to_json())
{'2B': {'0': 0, '1': 6, '2': 4, '3': 10, '4': 11}, '3B': {'0': 0, '1': 0, '2': 5, '3': 2, '4': 3}, 'AB': {'0': 4, '1': 118, '2': 137, '3': 133, '4': 120}, 'BB': {'0': 0, '1': 4, '2': 2, '3': 0, '4': 2}, 'CS': {'0': 0.0, '1': 1.0, '2': 1.0, '3': 1.0, '4': 2.0}, 'G': {'0': 1, '1': 25, '2': 29, '3': 27, '4': 25}, 'GIDP': {'0': None, '1': None, '2': None, '3': None, '4': None}, 'H': {'0': 0, '1': 32, '2': 40, '3': 44, '4': 39}, 'HBP': {'0': None, '1': None, '2': None, '3': None, '4': None}, 'HR': {'0': 0, '1': 0, '2': 0, '3': 2, '4': 0}, 'IBB': {'0': None, '1': None, '2': None, '3': None, '4': None}, 'R': {'0': 0, '1': 30, '2': 28, '3': 28, '4': 29}, 'RBI': {'0': 0.0, '1': 13.0, '2': 19.0, '3': 27.0, '4': 16.0}, 'SB': {'0': 0.0, '1': 8.0, '2': 3.0, '3': 1.0, '4': 6.0}, 'SF': {'0': None, '1': None, '2': None, '3': None, '4': None}, 'SH': {'0': None, '1': None, '2': None, '3': None, '4': None}, 'SO': {'0': 0.0, '1': 0.0, '2': 5.0, '3': 2.0, '4': 1.0}, 'lgID': {'0': None, '1': None, '2': None, '3': None, '4': None}, 'playerID': {'0': 'abercda01', '1': 'addybo01', '2': 'allisar01', '3': 'allisdo01', '4': 'ansonca01'}, 'stint': {'0': 1, '1': 1, '2': 1, '3': 1, '4': 1}, 'teamID': {'0': 'TRO', '1': 'RC1', '2': 'CL1', '3': 'WS3', '4': 'RC1'}, 'yearID': {'0': 1871, '1': 1871, '2': 1871, '3': 1871, '4': 1871}}
json.loads(df[:5].to_json(orient='records'))
[{'2B': 0, '3B': 0, 'AB': 4, 'BB': 0, 'CS': 0.0, 'G': 1, 'GIDP': None, 'H': 0, 'HBP': None, 'HR': 0, 'IBB': None, 'R': 0, 'RBI': 0.0, 'SB': 0.0, 'SF': None, 'SH': None, 'SO': 0.0, 'lgID': None, 'playerID': 'abercda01', 'stint': 1, 'teamID': 'TRO', 'yearID': 1871}, {'2B': 6, '3B': 0, 'AB': 118, 'BB': 4, 'CS': 1.0, 'G': 25, 'GIDP': None, 'H': 32, 'HBP': None, 'HR': 0, 'IBB': None, 'R': 30, 'RBI': 13.0, 'SB': 8.0, 'SF': None, 'SH': None, 'SO': 0.0, 'lgID': None, 'playerID': 'addybo01', 'stint': 1, 'teamID': 'RC1', 'yearID': 1871}, {'2B': 4, '3B': 5, 'AB': 137, 'BB': 2, 'CS': 1.0, 'G': 29, 'GIDP': None, 'H': 40, 'HBP': None, 'HR': 0, 'IBB': None, 'R': 28, 'RBI': 19.0, 'SB': 3.0, 'SF': None, 'SH': None, 'SO': 5.0, 'lgID': None, 'playerID': 'allisar01', 'stint': 1, 'teamID': 'CL1', 'yearID': 1871}, {'2B': 10, '3B': 2, 'AB': 133, 'BB': 0, 'CS': 1.0, 'G': 27, 'GIDP': None, 'H': 44, 'HBP': None, 'HR': 2, 'IBB': None, 'R': 28, 'RBI': 27.0, 'SB': 1.0, 'SF': None, 'SH': None, 'SO': 2.0, 'lgID': None, 'playerID': 'allisdo01', 'stint': 1, 'teamID': 'WS3', 'yearID': 1871}, {'2B': 11, '3B': 3, 'AB': 120, 'BB': 2, 'CS': 2.0, 'G': 25, 'GIDP': None, 'H': 39, 'HBP': None, 'HR': 0, 'IBB': None, 'R': 29, 'RBI': 16.0, 'SB': 6.0, 'SF': None, 'SH': None, 'SO': 1.0, 'lgID': None, 'playerID': 'ansonca01', 'stint': 1, 'teamID': 'RC1', 'yearID': 1871}]
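The difference between the layouts is easier to see on a tiny frame. The sketch below uses a hand-typed two-row sample (an assumption for illustration, with only two of the batting columns) and compares the default layout with orient='records' and orient='index'.

```python
import json

import pandas as pd

# Two-row sample with only two of the batting columns.
small = pd.DataFrame({
    "playerID": ["abercda01", "allisdo01"],
    "HR": [0, 2],
})

# Default (orient="columns"): {column -> {index -> value}}
by_column = json.loads(small.to_json())

# orient="records": a list with one {column -> value} object per row
by_record = json.loads(small.to_json(orient="records"))

# orient="index": {index -> {column -> value}}
by_index = json.loads(small.to_json(orient="index"))

for name, obj in [("columns", by_column), ("records", by_record), ("index", by_index)]:
    print(name, obj)
```

Note that orient='records' drops the index entirely, so it fits best when the index carries no information of its own.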
By now you are probably very happy to have been acquainted with Pandas, but in this tutorial, we've just scratched the surface.
As if all the other data wrangling features of Pandas weren't enough, it can also plot directly from a DataFrame.
Making basic scatter plots is very easy with the plot() method. As a convenience we can also use plot.scatter(), which is equivalent.
%matplotlib inline
df[(df.teamID=='WAS') & (df.yearID==2016)][['G', 'AB']].plot.scatter(x='G', y='AB')
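To make the equivalence concrete, here is a minimal sketch that draws the same scatter plot both ways. It assumes matplotlib is installed, uses the non-interactive Agg backend so it also runs outside a notebook, and works on a small hand-typed sample of G and AB values. The output file name is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed

import pandas as pd

# Small hand-typed sample of games played (G) and at bats (AB).
sample = pd.DataFrame({
    "G": [1, 25, 29, 27, 25],
    "AB": [4, 118, 137, 133, 120],
})

# Two spellings of the same scatter plot; each call returns a matplotlib Axes.
ax_kind = sample.plot(kind="scatter", x="G", y="AB")
ax_accessor = sample.plot.scatter(x="G", y="AB")

# Hypothetical output file name; outside a notebook the figure must be saved
# (or shown) explicitly.
ax_accessor.get_figure().savefig("scatter_g_ab.png")
```

Because both calls return an Axes object, the usual matplotlib methods such as set_title() can be applied to the result.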
Making basic line plots is similarly easy with plot.line():
df[(df.teamID=='WAS') & (df.yearID==2016)][['H', 'AB']].plot.line(x='H', y='AB')
As are bar plots ...
# plot the total number of HR per year for WAS, from 2007 onward
df_was_hr = df[(df.teamID=='WAS') & (df.yearID>2006)][['yearID', 'HR']].groupby('yearID').sum()
df_was_hr.plot.bar()
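The bar plot depends on the filter, groupby and sum chain that comes before it. As a rough sketch of what that chain produces, the example below runs the same steps on a toy frame. The numbers are made up for illustration and are not actual batting records.

```python
import pandas as pd

# Made-up rows for illustration only; not actual batting records.
toy = pd.DataFrame({
    "teamID": ["WAS", "WAS", "WAS", "WAS", "NYA"],
    "yearID": [2015, 2015, 2016, 2016, 2016],
    "HR": [10, 5, 7, 9, 20],
})

# Keep one team, then add up HR within each year.
was_hr = toy[toy.teamID == "WAS"][["yearID", "HR"]].groupby("yearID").sum()
print(was_hr)
```

The result has one row per year, indexed by yearID (here 15 for 2015 and 16 for 2016), which is why plot.bar() draws one bar per year.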
GET PANDAS!
Thank you, have fun and stay in touch!
kmaull@ucar.edu