#!/usr/bin/env python
# coding: utf-8

# # Reproducible Data Analysis in Jupyter
#
# *Jake VanderPlas, March 2017*
#
# Jupyter notebooks provide a useful environment for interactive exploration of data. A common question, though, is how you can progress from this nonlinear, interactive, trial-and-error style of analysis to a more linear and reproducible analysis based on organized, well-tested code. This series of videos shows an example of how I approach reproducible data analysis within the Jupyter notebook.
#
# Each video is approximately 5-8 minutes; the videos are
# available in a [YouTube Playlist](https://www.youtube.com/playlist?list=PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ).
# Alternatively, below you can find the videos with some description and lists of relevant resources.

# In[1]:


# Quick utility to embed the videos below
from IPython.display import YouTubeVideo


def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):
    """Embed the video at 1-based position `index` of the given playlist."""
    return YouTubeVideo('', index=index - 1, list=playlist, width=600, height=350)


# ## Part 1: Loading and Visualizing Data
#
# *In this video, I introduce the dataset, and use the Jupyter notebook to download and visualize it.*

# In[2]:


embed_video(1)


# Relevant Resources:
#
# - [Fremont Bridge Bike Counter](http://www.seattle.gov/transportation/bikecounter_fremont.htm): the website where you can explore the data
#
# - [A Whirlwind Tour of Python](https://github.com/jakevdp/WhirlwindTourOfPython): my book introducing the Python programming language, aimed at scientists and engineers.
#
# - [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook): my book introducing Python's data science tools, including an introduction to the IPython, Pandas, and Matplotlib tools used here.

# ## Part 2: Further Data Exploration
#
# *In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.*

# In[3]:


embed_video(2)


# Relevant Resources:
#
# - [Pivot Tables Section](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.09-Pivot-Tables.ipynb) from the Python Data Science Handbook

# ## Part 3: Version Control with Git & GitHub
#
# *In this video, I set up a repository on GitHub and commit the notebook into version control.*

# In[4]:


embed_video(3)


# Relevant Resources:
#
# - [Version Control With Git](https://swcarpentry.github.io/git-novice/): excellent novice-level tutorial from Software Carpentry
# - [GitHub Guides](https://guides.github.com/): set of tutorials on using GitHub
# - [The Whys and Hows of Licensing Scientific Code](http://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/): my 2014 blog post on AstroBetter

# ## Part 4: Working with Data and GitHub
#
# *In this video, I refactor the data download script so that it only downloads the data when needed.*

# In[5]:


embed_video(4)


# ## Part 5: Creating a Python Package
#
# *In this video, I move the data download utility into its own separate package.*

# In[6]:


embed_video(5)


# Relevant Resources:
#
# - [How To Package Your Python Code](https://python-packaging.readthedocs.io/): broad tutorial on Python packaging.

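# The "download only when needed" idea from Part 4 can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the function name `fetch_if_missing`, its parameters, and the injectable `fetch` argument are all hypothetical, and the caller is assumed to supply the data URL.

```python
import os
from urllib.request import urlretrieve


def fetch_if_missing(url, filename="data.csv", force_download=False,
                     fetch=urlretrieve):
    """Download `url` to `filename` unless a local copy already exists.

    Hypothetical helper for illustration. `fetch` defaults to the standard
    library's `urlretrieve`; passing a different callable makes it possible
    to exercise the caching logic without network access.
    """
    # Only hit the network when forced, or when no cached file is present.
    if force_download or not os.path.exists(filename):
        fetch(url, filename)
    return filename
```

# Keeping the cache check in one small function like this is also what makes the utility easy to move into a package and to unit test later.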
# ## Part 6: Unit Testing with pytest
#
# *In this video, I add unit tests for the data download utility.*

# In[7]:


embed_video(6)


# Relevant Resources:
#
# - [Pytest Documentation](http://doc.pytest.org/)
# - [Getting Started with Pytest](https://jacobian.org/writing/getting-started-with-pytest/): a nice tutorial by Jacob Kaplan-Moss

# ## Part 7: Refactoring for Speed
#
# *In this video, I refactor the data download function to be a bit faster.*

# In[8]:


embed_video(7)


# Relevant Resources:
#
# - [Python ``strftime`` reference](http://strftime.org/)
# - [Pandas Datetime Section](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.11-Working-with-Time-Series.ipynb) from the Python Data Science Handbook

# ## Part 8: Debugging a Broken Function
#
# *In this video, I discover that my refactoring has caused a bug. I debug it and fix it.*

# In[9]:


embed_video(8)


# ## Part 8.5: Finding and Fixing a scikit-learn bug
#
# *In this video, I discover a bug in the scikit-learn codebase, and go through the process of submitting a GitHub Pull Request fixing the bug.*

# In[10]:


embed_video(9)


# ## Part 9: Further Data Exploration: PCA and GMM
#
# *In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it.*

# In[11]:


embed_video(10)


# Relevant Resources:
#
# - [Principal Component Analysis In-Depth](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.09-Principal-Component-Analysis.ipynb) from the Python Data Science Handbook
# - [Gaussian Mixture Models In-Depth](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb) from the Python Data Science Handbook

# ## Part 10: Cleaning Up the Notebook
#
# *In this video, I clean up the unsupervised learning analysis to make it more reproducible and presentable.*

# In[12]:


embed_video(11)


# Relevant Resources:
#
# - [Learning Seattle's Work Habits from Bicycle Counts](https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/): my 2015 blog post using Fremont Bridge data
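
# As a footnote to Part 7 and its ``strftime`` reference: one common way to speed up timestamp parsing is to state the format explicitly rather than having it inferred for every value. The sketch below uses only the standard library; the format string, the sample values, and the name `parse_timestamps` are assumptions chosen for illustration, not taken from the dataset or the video. Pandas' `to_datetime` accepts a comparable `format` argument.

```python
from datetime import datetime

# Assumed layout for illustration: month/day/year with a 12-hour clock.
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_timestamps(strings, fmt=TIMESTAMP_FORMAT):
    """Parse each string with one explicit format (hypothetical helper)."""
    return [datetime.strptime(s, fmt) for s in strings]


# Made-up sample values in the assumed layout.
sample = ["10/03/2012 12:00:00 AM", "10/03/2012 01:00:00 PM"]
parsed = parse_timestamps(sample)
```

# An explicit format also fails loudly on unexpected input, which can surface data problems earlier than a permissive parser would.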