%pylab inline
import numpy as np
import pylab as pl
In this section we study how different estimators may be chained.
For some types of data, for instance text data, a feature extraction step must be applied to convert it to numerical features.
from sklearn import datasets, feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
news = datasets.fetch_20newsgroups()
X, y = news.data, news.target
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
vector_X = vectorizer.transform(X)
print vector_X.shape
(11314, 56436)
The vectorizer, like the feature selection object we will use below, is a "transformer": it has a "fit" method and a "transform" method.
Importantly, the "fit" method of the transformer is applied on the training set, but the transform method can be applied on any data, including the test set.
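As a small sketch (the manual split below is hypothetical and only for illustration), a transformer is fitted on the training documents and the fitted state is then reused to transform unseen test documents:
from sklearn.feature_extraction.text import TfidfVectorizer
docs_train, docs_test = X[:8000], X[8000:]   # hypothetical train/test split of the raw text
y_train, y_test = y[:8000], y[8000:]
vec = TfidfVectorizer()
vec.fit(docs_train)                          # the vocabulary is learned on the training set only
X_test_vec = vec.transform(docs_test)        # the same vocabulary is applied to the test set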
We can see that the vectorized data has a very large number of features, as it lists the words appearing in the documents. Many of these are not relevant for the classification problem.
Supervised feature selection can select features that seem relevant for a learning task based on a simple test. It is often a computationally cheap way of reducing the dimensionality.
Scikit-learn has a variety of feature selection strategies. The univariate feature selection strategies (FDR, FPR, FWER, k-best, percentile) apply a simple function to compute a test statistic on each feature. The choice of this function (the score_func parameter) is important:
from sklearn import feature_selection
selector = feature_selection.SelectPercentile(percentile=5, score_func=feature_selection.chi2)
X_red = selector.fit_transform(vector_X, y)
print "Original data shape %s, reduced data shape %s" % (vector_X.shape, X_red.shape)
Original data shape (11314, 56436), reduced data shape (11314, 2821)
/usr/local/lib/python2.7/site-packages/sklearn/feature_selection/univariate_selection.py:327: UserWarning: Duplicate scores. Result may depend on feature ordering. There are probably duplicate features, or you used a classification score for a regression task.
A transformer and a predictor can be combined to form a new predictor using the Pipeline object.
The constructor of the Pipeline object takes a list of (name, estimator) pairs, which are applied to the data in the order of the list. The Pipeline object exposes fit, transform, predict and score methods that apply the transforms (fitting them first, in the case of fit) one after the other to the data, and then call the corresponding method of the last estimator.
Using a pipeline we can combine our feature extraction, selection and final SVC in one step. This is convenient, as it enables us to do clean cross-validation.
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.cross_validation import cross_val_score
svc = LinearSVC()
pipeline = Pipeline([('vectorize', vectorizer), ('select', selector), ('svc', svc)])
cross_val_score(pipeline, X, y, verbose=3)
score: 0.865589 score: 0.859984 score: 0.861045
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 9.7s [Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 28.1s finished
array([ 0.86558855, 0.85998409, 0.86104482])
The resulting pipelined predictor object implicitly has many parameters. How do we set them in a principled way?
As a reminder, the GridSearchCV object can be used to set the parameters of an estimator. We just need to know the name of the parameters to set.
The pipeline object exposes the parameters of the estimators it wraps with the following convention: first the name of the estimator, as given in the constructor list, then the name of the parameter, separated by a double underscore. For instance, to set the SVC's 'C' parameter:
pipeline.set_params(svc__C=10)
Pipeline(steps=[('vectorize', TfidfVectorizer(analyzer='word', binary=False, charset='utf-8',
charset_error='strict', dtype=<type 'long'>, input='content',
lowercase=True, max_df=1.0, max_features=None, max_n=None,
min_df=2, min_n=None, ngram_range=(1, 1), norm='l2',
preproces...ling=1, loss='l2', multi_class='ovr', penalty='l2',
random_state=None, tol=0.0001, verbose=0))])
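If we do not remember the exact spelling of a parameter, the pipeline, like any estimator, lists all the '<step name>__<parameter>' keys it accepts through its get_params method (a quick sketch, output not shown):
print sorted(pipeline.get_params().keys())   # contains e.g. 'svc__C', 'select__percentile', 'vectorize__ngram_range'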
We can then use the grid search to choose the best C among 3 values.
Performance tip: choosing parameters by cross-validation can involve running the transformers many times on the same data with the same parameters. One way to avoid part of this overhead is to use memoization. In particular, we can use the version of joblib that is embedded in scikit-learn:
from sklearn.externals import joblib
memory = joblib.Memory(cachedir='.')
memory.clear()
selector.score_func = memory.cache(selector.score_func)
WARNING:root:[Memory(cachedir='./joblib')]: Flushing completely the cache
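As a rough check of the memoization (this snippet is an illustration, not part of the original session), calling the cached scoring function twice with the same arguments computes the chi2 statistics only once; the second call is read back from the './joblib' cache:
scores, pvalues = selector.score_func(vector_X, y)   # computed and written to the cache
scores, pvalues = selector.score_func(vector_X, y)   # same arguments: loaded from the cache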
Now we can proceed to run the grid search:
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(estimator=pipeline, param_grid=dict(svc__C=[1e-2, 1, 1e2]))
grid.fit(X, y)
print grid.best_estimator_.named_steps['svc']
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.chi2...
chi2(<7542x44103 sparse matrix of type '<type 'numpy.float64'>'
with 1127049 stored elements in Compressed Sparse Row format>,
array([7, ..., 8]))
_____________________________________________________________chi2 - 0.1s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.chi2...
chi2(<7543x43233 sparse matrix of type '<type 'numpy.float64'>'
with 1131660 stored elements in Compressed Sparse Row format>,
array([4, ..., 8]))
_____________________________________________________________chi2 - 0.1s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.chi2...
chi2(<7543x44119 sparse matrix of type '<type 'numpy.float64'>'
with 1137767 stored elements in Compressed Sparse Row format>,
array([7, ..., 1]))
_____________________________________________________________chi2 - 0.1s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.chi2...
chi2(<11314x56436 sparse matrix of type '<type 'numpy.float64'>'
with 1713894 stored elements in Compressed Sparse Row format>,
array([7, ..., 8]))
_____________________________________________________________chi2 - 0.2s, 0.0min
LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss=l2, multi_class=ovr, penalty=l2,
random_state=None, tol=0.0001, verbose=0)
On the 'Labeled Faces in the Wild' dataset (datasets.fetch_lfw_people), chain a randomized PCA with an SVC for prediction.
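A possible sketch of a solution (the number of PCA components, the min_faces_per_person filter and the SVC parameters are illustrative guesses, not tuned values):
from sklearn import datasets
from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score

lfw = datasets.fetch_lfw_people(min_faces_per_person=70)
faces_pipeline = Pipeline([
    ('pca', RandomizedPCA(n_components=150, whiten=True)),   # project the pixel features to a low-dimensional space
    ('svc', SVC(C=1., kernel='rbf')),                         # classify in the reduced space
])
print cross_val_score(faces_pipeline, lfw.data, lfw.target)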