Jake VanderPlas, March 2017
Jupyter notebooks provide a useful environment for interactive exploration of data. A common question, though, is how you can progress from this nonlinear, interactive, trial-and-error style of analysis to a more linear and reproducible analysis based on organized, well-tested code. This series of videos shows an example of how I approach reproducible data analysis within the Jupyter notebook.
Each video is approximately 5-8 minutes long; the videos are available in a YouTube Playlist. Alternatively, below you can find each video with a short description and a list of relevant resources.
# Quick utility to embed the videos below
from IPython.display import YouTubeVideo

def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):
    return YouTubeVideo('', index=index - 1, list=playlist, width=600, height=350)
In this video, I introduce the dataset, and use the Jupyter notebook to download and visualize it.
embed_video(1)
Relevant resources:
Fremont Bridge Bike Counter: the website where you can explore the data
A Whirlwind Tour of Python: my book introducing the Python programming language, aimed at scientists and engineers.
Python Data Science Handbook: my book introducing Python's data science tools, including an introduction to the IPython, Pandas, and Matplotlib tools used here.
In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.
embed_video(2)
In this video, I set up a repository on GitHub and commit the notebook into version control.
embed_video(3)
In this video, I refactor the data download script so that it only downloads the data when needed.
embed_video(4)
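A minimal sketch of the download-only-when-needed idea, using just the standard library. This is not the code from the video: the function name `fetch_if_missing` and its `force_download` parameter are hypothetical, chosen for illustration.

```python
import os
from urllib.request import urlretrieve


def fetch_if_missing(url, filename, force_download=False):
    """Download `url` to `filename` unless a local copy already exists.

    Hypothetical helper for illustration. Returns True if a download
    happened and False if the cached local file was reused.
    """
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
        return True
    return False
```

Calling the function twice with the same `filename` only hits the network the first time; passing `force_download=True` refreshes the local copy.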
In this video, I move the data download utility into its own separate package.
embed_video(5)
In this video, I add unit tests for the data download utility.
embed_video(6)
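As a rough idea of what a unit test for a data-loading utility can look like, here is a small sketch written in the plain-`assert` style that pytest collects. The loader `load_counts`, the column names `Date` and `Count`, and the sample rows are all hypothetical stand-ins, not the package or data from the video.

```python
import csv
import io
from datetime import datetime


def load_counts(fileobj):
    """Hypothetical loader: read 'Date,Count' rows into (datetime, int) pairs."""
    reader = csv.DictReader(fileobj)
    return [(datetime.fromisoformat(row["Date"]), int(row["Count"]))
            for row in reader]


def test_load_counts():
    """Check the size and types of the loaded data on a tiny made-up sample."""
    sample = io.StringIO(
        "Date,Count\n"
        "2012-10-03T00:00:00,13\n"
        "2012-10-03T01:00:00,10\n"
    )
    rows = load_counts(sample)
    assert len(rows) == 2
    assert all(isinstance(when, datetime) for when, _ in rows)
    assert all(isinstance(count, int) for _, count in rows)
    # The two sample timestamps are one hour apart.
    assert (rows[1][0] - rows[0][0]).total_seconds() == 3600
```

Tests like this check properties of the result (size, types, ordering) rather than exact values, so they keep passing when the underlying dataset grows.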
In this video, I refactor the data download function to be a bit faster.
embed_video(7)
Relevant Resources:
strftime reference
In this video, I discover that my refactoring has caused a bug. I debug it and fix it.
embed_video(8)
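The description above does not say what the bug was. As an illustration of the kind of mistake an explicit date format can introduce, the sketch below shows a well-known pitfall in the strftime directives: `%p` (AM/PM) only affects the parsed hour when it is paired with `%I` (12-hour clock), not with `%H` (24-hour clock). The timestamp string and its layout are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical timestamp in a 12-hour format with an AM/PM marker.
stamp = "10/03/2012 01:00:00 PM"

# %H reads the hour on a 24-hour clock; %p is matched but does not shift it.
wrong = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S %p")

# %I reads the hour on a 12-hour clock and is combined with %p.
right = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")

print(wrong.hour)  # 1
print(right.hour)  # 13
```

Both calls succeed without raising an error, which is why a unit test on the parsed values is the most reliable way to notice this kind of slip.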
In this video, I discover a bug in the scikit-learn codebase and go through the process of submitting a GitHub Pull Request to fix it.
embed_video(9)
In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it.
embed_video(10)
In this video, I clean up the unsupervised learning analysis to make it more reproducible and presentable.
embed_video(11)