#!/usr/bin/env python
# coding: utf-8
#
#
#
# *This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
#
# *The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
#
# < [Pivot Tables](03.09-Pivot-Tables.ipynb) | [Contents](Index.ipynb) | [Working with Time Series](03.11-Working-with-Time-Series.ipynb) >
#
#
#
# # Vectorized String Operations
# One strength of Python is its relative ease in handling and manipulating string data.
# Pandas builds on this and provides a comprehensive set of *vectorized string operations* that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data.
# In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.
# ## Introducing Pandas String Operations
#
# We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:
# In[1]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
# This *vectorization* of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done.
# For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax:
# In[2]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
# This is perhaps sufficient to work with some data, but it will break if there are any missing values.
# For example:
# In[3]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]  # raises AttributeError: 'NoneType' object has no attribute 'capitalize'
# Pandas addresses both of these needs, vectorized string operations and correct handling of missing data, via the ``str`` attribute of Pandas ``Series`` and ``Index`` objects containing strings.
# So, for example, suppose we create a Pandas Series with this data:
# In[4]:
import pandas as pd
names = pd.Series(data)
names
# We can now call a single method that will capitalize all the entries, while skipping over any missing values:
# In[5]:
names.str.capitalize()
# Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.
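# If you are not working in an environment with tab completion, one rough way to get a similar listing is to inspect the accessor directly (a quick sketch; the exact set of methods depends on your Pandas version):
# In[ ]:
# list the first ten public methods available on the str accessor
[method for method in dir(names.str) if not method.startswith('_')][:10]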
# ## Tables of Pandas String Methods
#
# If you have a good understanding of string manipulation in Python, most of the Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties.
# The examples in this section use the following series of names:
# In[6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])
# ### Methods similar to Python string methods
# Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:
#
# | | | | |
# |-------------|------------------|------------------|------------------|
# |``len()`` | ``lower()`` | ``translate()`` | ``islower()`` |
# |``ljust()`` | ``upper()`` | ``startswith()`` | ``isupper()`` |
# |``rjust()`` | ``find()`` | ``endswith()`` | ``isnumeric()`` |
# |``center()`` | ``rfind()`` | ``isalnum()`` | ``isdecimal()`` |
# |``zfill()`` | ``index()`` | ``isalpha()`` | ``split()`` |
# |``strip()`` | ``rindex()`` | ``isdigit()`` | ``rsplit()`` |
# |``rstrip()`` | ``capitalize()`` | ``isspace()`` | ``partition()`` |
# |``lstrip()`` | ``swapcase()`` | ``istitle()`` | ``rpartition()`` |
#
# Notice that these have various return values. Some, like ``lower()``, return a series of strings:
# In[7]:
monte.str.lower()
# But some others return numbers:
# In[8]:
monte.str.len()
# Or Boolean values:
# In[9]:
monte.str.startswith('T')
# Still others return lists or other compound values for each element:
# In[10]:
monte.str.split()
# We'll see further manipulations of this kind of series-of-lists object as we continue our discussion.
# ### Methods using regular expressions
#
# In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:
#
# | Method | Description |
# |--------|-------------|
# | ``match()`` | Call ``re.match()`` on each element, returning a boolean |
# | ``extract()`` | Call ``re.search()`` on each element, returning matched groups as strings |
# | ``findall()`` | Call ``re.findall()`` on each element |
# | ``replace()`` | Replace occurrences of pattern with some other string |
# | ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
# | ``count()`` | Count occurrences of pattern |
# | ``split()`` | Equivalent to ``str.split()``, but accepts regexps |
# | ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |
# With these, you can do a wide range of interesting operations.
# For example, we can extract the first name from each entry by asking for a contiguous group of characters at the beginning of each element:
# In[11]:
monte.str.extract('([A-Za-z]+)', expand=False)
# Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (``^``) and end-of-string (``$``) regular expression characters:
# In[12]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
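# The other regex-aware methods behave analogously; here is a quick sketch of two of them, using the same ``monte`` series:
# In[ ]:
# count() treats its argument as a regular expression: count the vowels in each name
monte.str.count('[AEIOUaeiou]')
# In[ ]:
# replace runs of whitespace using a regex; regex=True requests pattern matching
# (recent Pandas versions require this to be stated explicitly)
monte.str.replace(r'\s+', '_', regex=True)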
# The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries opens up many possibilities for analysis and cleaning of data.
# ### Miscellaneous methods
# Finally, there are some miscellaneous methods that enable other convenient operations:
#
# | Method | Description |
# |--------|-------------|
# | ``get()`` | Index each element |
# | ``slice()`` | Slice each element|
# | ``slice_replace()`` | Replace slice in each element with passed value|
# | ``cat()`` | Concatenate strings|
# | ``repeat()`` | Repeat values |
# | ``normalize()`` | Return Unicode form of string |
# | ``pad()`` | Add whitespace to left, right, or both sides of strings|
# | ``wrap()`` | Split long strings into lines with length less than a given width|
# | ``join()`` | Join strings in each element of the Series with passed separator|
# | ``get_dummies()`` | Extract dummy variables as a ``DataFrame`` |
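# For instance, here is a quick sketch of two of these, using the same ``monte`` series as above:
# In[ ]:
# left-pad each name to a total width of 20 characters
monte.str.pad(20, side='left')
# In[ ]:
# concatenate all entries into a single comma-separated string
monte.str.cat(sep=', ')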
# #### Vectorized item access and slicing
#
# The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access within each string.
# For example, we can get a slice of the first three characters of each entry using ``str.slice(0, 3)``.
# Note that this behavior is also available through Python's normal indexing syntax; for example, ``monte.str.slice(0, 3)`` is equivalent to ``monte.str[0:3]``:
# In[13]:
monte.str[0:3]
# Indexing via ``monte.str.get(i)`` and ``monte.str[i]`` works the same way.
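# For instance, we can check this equivalence directly (a quick sanity check using the ``monte`` series):
# In[ ]:
# the two forms of indexing produce identical results
monte.str.get(0).equals(monte.str[0])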
#
# These ``get()`` and ``slice()`` methods also let you access elements of the lists returned by ``split()``.
# For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:
# In[14]:
monte.str.split().str.get(-1)
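# As an aside, if you instead want all of the split pieces at once, ``split()`` also accepts an ``expand=True`` keyword that spreads the pieces across the columns of a ``DataFrame``:
# In[ ]:
# expand=True returns the split pieces as DataFrame columns rather than lists
monte.str.split(expand=True)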
# #### Indicator variables
#
# Another method that requires a bit of extra explanation is the ``get_dummies()`` method.
# This is useful when your data has a column containing some sort of coded indicator.
# For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":
# In[15]:
full_monte = pd.DataFrame({'name': monte,
'info': ['B|C|D', 'B|D', 'A|C',
'B|D', 'B|C', 'B|C|D']})
full_monte
# The ``get_dummies()`` routine lets you quickly split out these indicator variables into a ``DataFrame``:
# In[16]:
full_monte['info'].str.get_dummies('|')
# With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.
#
# We won't dive further into these methods here, but I encourage you to read through ["Working with Text Data"](http://pandas.pydata.org/pandas-docs/stable/text.html) in the Pandas online documentation, or to refer to the resources listed in [Further Resources](03.13-Further-Resources.ipynb).
# ## Example: Recipe Database
#
# These vectorized string operations become most useful in the process of cleaning up messy, real-world data.
# Here I'll walk through an example of that, using an open recipe database compiled from various sources on the Web.
# Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.
#
# The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.
#
# As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:
# In[17]:
# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
# !gunzip recipeitems-latest.json.gz
# The database is in JSON format, so we will try ``pd.read_json`` to read it:
# In[18]:
try:
recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
print("ValueError:", e)
# Oops! We get a ``ValueError`` mentioning that there is "trailing data."
# Searching for the text of this error on the Internet suggests that it appears when *each line* of a file is itself valid JSON, but the full file is not.
# Let's check if this interpretation is true:
# In[19]:
with open('recipeitems-latest.json') as f:
line = f.readline()
pd.read_json(line).shape
# Yes, apparently each line is valid JSON, so we'll need to string them together.
# One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with ``pd.read_json``:
# In[20]:
# read the entire file, combining the lines into one valid JSON list
with open('recipeitems-latest.json', 'r') as f:
    # strip trailing whitespace from each line
    data = (line.strip() for line in f)
    # join the lines with commas and wrap the result in brackets,
    # so that the whole string parses as a single JSON array
    data_json = "[{0}]".format(','.join(data))
# read the result as JSON
recipes = pd.read_json(data_json)
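# Note that recent versions of Pandas can read line-delimited JSON directly via the ``lines`` keyword, which avoids building the intermediate string by hand:
# In[ ]:
# equivalent shortcut in newer Pandas versions: read line-delimited JSON directly
recipes = pd.read_json('recipeitems-latest.json', lines=True)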
# In[21]:
recipes.shape
# We see there are nearly 200,000 recipes, and 17 columns.
# Let's take a look at one row to see what we have:
# In[22]:
recipes.iloc[0]
# There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.
# In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.
# Let's start by taking a closer look at the ingredients:
# In[23]:
recipes.ingredients.str.len().describe()
# The ingredient lists average about 250 characters in length, with a minimum of 0 and a maximum of nearly 10,000 characters!
#
# Just out of curiosity, let's see which recipe has the longest ingredient list:
# In[24]:
recipes.name[recipes.ingredients.str.len().idxmax()]
# That certainly looks like an involved recipe.
#
# We can do other aggregate explorations; for example, let's see how many of the recipes are for breakfast food:
# In[25]:
recipes.description.str.contains('[Bb]reakfast').sum()
# Or how many of the recipes list cinnamon as an ingredient:
# In[26]:
recipes.ingredients.str.contains('[Cc]innamon').sum()
# We could even look to see whether any recipes misspell the ingredient as "cinamon":
# In[27]:
recipes.ingredients.str.contains('[Cc]inamon').sum()
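# If we wanted to, we could even normalize this misspelling in place with the regex-aware ``replace()`` method (a sketch; nothing later in this section depends on it):
# In[ ]:
# fold the misspelled variants into the correct spelling
recipes['ingredients'] = recipes.ingredients.str.replace('[Cc]inamon', 'cinnamon', regex=True)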
# This is the type of essential data exploration that is possible with Pandas string tools.
# It is data munging like this that Python really excels at.
# ### A simple recipe recommender
#
# Let's go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.
# While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.
# So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list.
# For simplicity, let's just stick with herbs and spices for the time being:
# In[28]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
# We can then build a Boolean ``DataFrame`` consisting of True and False values, indicating whether each of these ingredients appears in each recipe's ingredient list:
# In[29]:
import re
# note: re.IGNORECASE must be passed via the flags keyword, since the
# second positional argument of str.contains() is case, not flags
spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, flags=re.IGNORECASE)
                         for spice in spice_list})
spice_df.head()
# Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and tarragon.
# We can compute this very quickly using the ``query()`` method of ``DataFrame``s, discussed in [High-Performance Pandas: ``eval()`` and ``query()``](03.12-Performance-Eval-and-Query.ipynb):
# In[30]:
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)
# We find only a handful of recipes with this combination; let's use the index returned by this selection to discover the names of those recipes:
# In[31]:
recipes.name[selection.index]
# Now that we have narrowed our nearly 200,000 recipes down to a short list, we are in a position to make a more informed decision about what we'd like to cook for dinner.
# ### Going further with recipes
#
# Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods.
# Of course, building a very robust recipe recommendation system would require a *lot* more work!
# Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process.
# This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently.
#
# < [Pivot Tables](03.09-Pivot-Tables.ipynb) | [Contents](Index.ipynb) | [Working with Time Series](03.11-Working-with-Time-Series.ipynb) >
#
#
#