Reproducible Data Analysis in Jupyter

Jake VanderPlas, March 2017

Jupyter notebooks provide a useful environment for interactive exploration of data. A common question, though, is how you can progress from this nonlinear, interactive, trial-and-error style of analysis to a more linear and reproducible analysis based on organized, well-tested code. This series of videos shows an example of how I approach reproducible data analysis within the Jupyter notebook.

Each video is approximately 5-8 minutes long; the videos are available in a YouTube playlist. Alternatively, you can find the videos below, each with a short description and a list of relevant resources.

In [1]:

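# Quick utility to embed the videos below
from IPython.display import YouTubeVideo

def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):
    # `index` is 1-based here (Part 1, Part 2, ...); it is shifted by one
    # because the playlist index passed to the embed counts from 0.
    return YouTubeVideo('', index=index - 1, list=playlist, width=600, height=350)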

Part 1: Loading and Visualizing Data

In this video, I introduce the dataset and use the Jupyter notebook to download and visualize it.

In [2]:

embed_video(1)

Out[2]:

Relevant Resources:

Fremont Bridge Bike Counter: the website where you can explore the data
A Whirlwind Tour of Python: my book introducing the Python programming language, aimed at scientists and engineers
Python Data Science Handbook: my book introducing Python's data science tools, including an introduction to the IPython, Pandas, and Matplotlib tools used here
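
The loading step described in this part might look roughly like the sketch below. It is not the exact code from the video: the small inline CSV is a made-up stand-in for the downloaded file, and the column names (Date, West, East) and the load_counts helper are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the downloaded CSV. The column names
# and values are assumptions, not the real Fremont Bridge export.
SAMPLE_CSV = """Date,West,East
2017-03-01 00:00:00,4,6
2017-03-01 01:00:00,2,3
2017-03-01 08:00:00,250,120
2017-03-02 08:00:00,240,130
"""


def load_counts(source):
    """Read hourly counts from a CSV path, URL, or file-like object."""
    # Use the timestamp column as the index and parse it into datetimes.
    data = pd.read_csv(source, index_col="Date", parse_dates=True)
    # Add a combined column so both directions can be viewed together.
    data["Total"] = data["West"] + data["East"]
    return data


data = load_counts(io.StringIO(SAMPLE_CSV))

# Aggregate the hourly rows into one row per day.
daily = data.resample("D").sum()
print(daily)
```

Because pandas.read_csv also accepts a URL, the same helper could be pointed at a downloaded file or a download link instead of the inline sample. With Matplotlib installed, calling daily.plot() in a notebook cell would draw the daily totals.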

Part 2: Further Data Exploration

In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.

In [3]:

embed_video(2)

Out[3]:

Relevant Resources:

Pivot Tables Section from the Python Data Science Handbook
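
As one possible illustration of the pivot-table approach referenced above, the sketch below reshapes hourly counts so that each row is a time of day and each column is a calendar date. The data here is synthetic (made-up numbers on an hourly index) and the Total column name is an assumption; only the pivoting pattern is the point.

```python
import pandas as pd

# Synthetic hourly totals covering two full days (made-up numbers):
# the value is the hour of day, plus 100 on the second day.
index = pd.date_range("2017-03-01", periods=48, freq=pd.Timedelta(hours=1))
totals = [h % 24 + 100 * (h // 24) for h in range(48)]
data = pd.DataFrame({"Total": totals}, index=index)

# Rows: time of day; columns: calendar date; cells: mean of Total.
pivot = data.pivot_table("Total", index=data.index.time, columns=data.index.date)

print(pivot.shape)
print(pivot.iloc[:3])
```

Each column of the resulting table is one day's hourly profile, so with Matplotlib installed, pivot.plot(legend=False, alpha=0.1) would overlay one faint line per day, which makes recurring daily patterns easy to spot.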

Part 3: Version Control with Git & GitHub

In this video, I set up a repository on GitHub and commit the notebook into version control.

In [4]:

embed_video(3)

Out[4]:

Relevant Resources:

Version Control With Git: excellent novice-level tutorial from Software Carpentry
GitHub Guides: set of tutorials on using GitHub
The Whys and Hows of Licensing Scientific Code: my 2014 blog post on AstroBetter