#!/usr/bin/env python
# coding: utf-8

# # Reproducible Data Analysis in Jupyter
# 
# *Jake VanderPlas, March 2017*
# 
# Jupyter notebooks provide a useful environment for interactive exploration of data. A common question, though, is how you can progress from this nonlinear, interactive, trial-and-error style of analysis to a more linear and reproducible analysis based on organized, well-tested code. This series of videos shows an example of how I approach reproducible data analysis within the Jupyter notebook.
# 
# Each video is approximately 5-8 minutes; the videos are
# available in a [YouTube Playlist](https://www.youtube.com/playlist?list=PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ).
# Alternatively, you can find the videos below, each with a short description and a list of relevant resources.

# In[1]:


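# Quick utility to embed the videos below
from IPython.display import YouTubeVideo


def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):
    """Embed video number `index` (counting from 1) from the given playlist."""
    # Callers count videos from 1; the index passed to the embed is one lower
    return YouTubeVideo('', index=index - 1, list=playlist, width=600, height=350)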


# ## Part 1: Loading and Visualizing Data
# 
# *In this video, I introduce the dataset, and use the Jupyter notebook to download and visualize it.*

# In[2]:


embed_video(1)


# Relevant Resources:
# 
# - [Fremont Bridge Bike Counter](http://www.seattle.gov/transportation/bikecounter_fremont.htm): the website where you can explore the data
# 
# - [A Whirlwind Tour of Python](https://github.com/jakevdp/WhirlwindTourOfPython): my book introducing the Python programming language, aimed at scientists and engineers.
# 
# - [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook): my book introducing Python's data science tools, including an introduction to the IPython, Pandas, and Matplotlib tools used here.

# ## Part 2: Further Data Exploration
# 
# *In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.*

# In[3]:


embed_video(2)


# Relevant Resources:
# 
# - [Pivot Tables Section](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.09-Pivot-Tables.ipynb) from the Python Data Science Handbook

# ## Part 3: Version Control with Git & GitHub
# 
# *In this video, I set up a repository on GitHub and commit the notebook into version control.*

# In[4]:


embed_video(3)


# Relevant Resources:
# 
# - [Version Control With Git](https://swcarpentry.github.io/git-novice/): excellent novice-level tutorial from Software Carpentry
# - [GitHub Guides](https://guides.github.com/): set of tutorials on using GitHub
# - [The Whys and Hows of Licensing Scientific Code](http://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/): my 2014 blog post on AstroBetter

# ## Part 4: Working with Data and GitHub
# 
# *In this video, I refactor the data download script so that it only downloads the data when needed.*

# In[5]:


embed_video(4)
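

# The idea of downloading only when needed can be sketched as follows. This is a minimal sketch and not the code written in the video: the name `fetch_if_missing`, its parameters, and the choice of `urlretrieve` are assumptions made for illustration.

# In[ ]:


import os
from urllib.request import urlretrieve


def fetch_if_missing(url, filename, force_download=False):
    """Download `url` to `filename` unless a local copy already exists.

    Hypothetical helper for illustration; pass force_download=True to
    refresh the local copy.
    """
    # Skip the network request when a cached file is already on disk
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    return filename


# Calling the helper twice with the same `filename` fetches the file only once; the second call returns immediately because the local copy exists.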


# ## Part 5: Creating a Python Package
# 
# *In this video, I move the data download utility into its own separate package.*

# In[6]:


embed_video(5)


# Relevant Resources:
# 
# - [How To Package Your Python Code](https://python-packaging.readthedocs.io/): broad tutorial on Python packaging.

# ## Part 6: Unit Testing with PyTest
# 
# *In this video, I add unit tests for the data download utility.*

# In[7]:


embed_video(6)


# Relevant Resources:
# 
# - [Pytest Documentation](http://doc.pytest.org/)
# - [Getting Started with Pytest](https://jacobian.org/writing/getting-started-with-pytest/): a nice tutorial by Jacob Kaplan-Moss
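
# A pytest-style unit test is a plain function whose name starts with `test_` and which uses bare `assert` statements. The sketch below shows the shape of such a test. The loader `load_counts`, the column names, and the sample row are hypothetical stand-ins, not the package or data from the video.

# In[ ]:


import csv
import io


def load_counts(csv_text):
    """Hypothetical stand-in for a data loader: parse CSV text into row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def test_load_counts():
    # Made-up sample data, used only to exercise the loader
    rows = load_counts("Date,West,East\n10/03/2012 12:00:00 AM,4,9\n")
    assert len(rows) == 1
    assert set(rows[0]) == {"Date", "West", "East"}
    assert rows[0]["West"] == "4"


# Saved in a file named like `test_data.py` (an assumed name), a function of this form is collected and run by the `pytest` command.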

# ## Part 7: Refactoring for Speed
# 
# *In this video, I refactor the data download function to be a bit faster.*

# In[8]:


embed_video(7)


# Relevant Resources:
# 
# - [Python ``strftime`` reference](http://strftime.org/)
# - [Pandas Datetime Section](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.11-Working-with-Time-Series.ipynb) from the Python Data Science Handbook
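
# The format codes in the `strftime` reference above describe a timestamp layout explicitly, so a parser does not have to guess it for each value. The sketch below uses only the standard library. The sample layout `MM/DD/YYYY HH:MM:SS AM/PM` and the helper `parse_timestamp` are assumptions for illustration, and this is not necessarily the change made in the video.

# In[ ]:


from datetime import datetime

# Assumed layout: month/day/year, then a 12-hour clock time with AM/PM
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_timestamp(text):
    """Parse one timestamp string using the explicit format above."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)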

# ## Part 8: Debugging a Broken Function
# 
# *In this video, I discover that my refactoring has caused a bug. I debug it and fix it.*

# In[9]:


embed_video(8)


# ## Part 8.5: Finding and Fixing a scikit-learn Bug
# 
# *In this video, I discover a bug in the scikit-learn codebase and go through the process of submitting a GitHub pull request that fixes it.*

# In[10]:


embed_video(9)


# ## Part 9: Further Data Exploration: PCA and GMM
# 
# *In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it.*

# In[11]:


embed_video(10)


# Relevant Resources:
# 
# - [Principal Component Analysis In-Depth](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.09-Principal-Component-Analysis.ipynb) from the Python Data Science Handbook
# - [Gaussian Mixture Models In-Depth](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb) from the Python Data Science Handbook

# ## Part 10: Cleaning Up the Notebook
# 
# *In this video, I clean up the unsupervised learning analysis to make it more reproducible and presentable.*

# In[12]:


embed_video(11)


# Relevant Resources:
# 
# - [Learning Seattle's Work Habits from Bicycle Counts](https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/): my 2015 blog post using Fremont Bridge data