%pylab inline
import numpy as np
import pylab as pl
In this section we study how different estimators may be chained.
For some types of data, for instance text data, a feature extraction step must be applied to convert it to numerical features.
from sklearn import datasets, feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
news = datasets.fetch_20newsgroups()
X, y = news.data, news.target
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
vector_X = vectorizer.transform(X)
print vector_X.shape
(11314, 56436)
The vectorizer, like the feature selection object we will use below, is a "transformer": it has a "fit" method and a "transform" method.
Importantly, the "fit" method of the transformer is applied on the training set, but the transform method can be applied on any data, including the test set.
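As a small sketch (the manual split below is hypothetical and only for illustration), a transformer is fitted on the training documents and the fitted state is then reused to transform unseen test documents:
from sklearn.feature_extraction.text import TfidfVectorizer
docs_train, docs_test = X[:8000], X[8000:]   # hypothetical train/test split of the raw text
y_train, y_test = y[:8000], y[8000:]
vec = TfidfVectorizer()
vec.fit(docs_train)                          # the vocabulary is learned on the training set only
X_test_vec = vec.transform(docs_test)        # the same vocabulary is applied to the test set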
We can see that the vectorized data has a very large number of features, as it lists the words appearing in the documents. Many of these are not relevant for the classification problem.
Supervised feature selection can select features that seem relevant for a learning task based on a simple test. It is often a computationally cheap way of reducing the dimensionality.
Scikit-learn has a variety of feature selection strategies. The univariate feature selection strategies (FDR, FPR, FWER, k-best, percentile) apply a simple function to compute a test statistic on each feature. The choice of this function (the score_func parameter) is important:
from sklearn import feature_selection
selector = feature_selection.SelectPercentile(percentile=5, score_func=feature_selection.chi2)
X_red = selector.fit_transform(vector_X, y)
print "Original data shape %s, reduced data shape %s" % (vector_X.shape, X_red.shape)
Original data shape (11314, 56436), reduced data shape (11314, 2821)
/usr/local/lib/python2.7/site-packages/sklearn/feature_selection/univariate_selection.py:327: UserWarning: Duplicate scores. Result may depend on feature ordering. There are probably duplicate features, or you used a classification score for a regression task.
A transformer and a predictor can be combined to form a new predictor using the Pipeline object.
The constructor of the Pipeline object takes a list of (name, estimator) pairs, which are applied to the data in the order of the list. The Pipeline object exposes fit, transform, predict and score methods that apply the transforms (fitting them first, in the case of fit) one after the other to the data, and then call the corresponding method of the last estimator.
Using a pipeline we can combine our feature extraction, selection and final SVC in one step. This is convenient, as it enables us to do clean cross-validation.
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.cross_validation import cross_val_score
svc = LinearSVC()
pipeline = Pipeline([('vectorize', vectorizer), ('select', selector), ('svc', svc)])
cross_val_score(pipeline, X, y, verbose=3)
score: 0.865589 score: 0.859984 score: 0.861045
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 9.7s [Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 28.1s finished
array([ 0.86558855, 0.85998409, 0.86104482])
The resulting pipelined predictor object implicitly has many parameters. How do we set them in a principled way?
As a reminder, the GridSearchCV object can be used to set the parameters of an estimator. We just need to know the name of the parameters to set.
The pipeline object exposes the parameters of the estimators it wraps with the following convention: first the name of the estimator, as given in the constructor list, then the name of the parameter, separated by a double underscore. For instance, to set the SVC's 'C' parameter:
pipeline.set_params(svc__C=10)
Pipeline(steps=[('vectorize', TfidfVectorizer(analyzer='word', binary=False, charset='utf-8',
charset_error='strict', dtype=<type 'long'>, input='content',
lowercase=True, max_df=1.0, max_features=None, max_n=None,
min_df=2, min_n=None, ngram_range=(1, 1), norm='l2',
preproces...ling=1, loss='l2', multi_class='ovr', penalty='l2',
random_state=None, tol=0.0001, verbose=0))])
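If we do not remember the exact spelling of a parameter, the pipeline, like any estimator, lists all the '<step name>__<parameter>' keys it accepts through its get_params method (a quick sketch, output not shown):
print sorted(pipeline.get_params().keys())   # contains e.g. 'svc__C', 'select__percentile', 'vectorize__ngram_range'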
We can then use the grid search to choose the best C among 3 values.
Performance tip: choosing parameters by cross-validation can involve running the transformers many times on the same data with the same parameters. One way to avoid part of this overhead is to use memoization. In particular, we can use the version of joblib that is embedded in scikit-learn:
from sklearn.externals import joblib
memory = joblib.Memory(cachedir='.')
memory.clear()
selector.score_func = memory.cache(selector.score_func)
WARNING:root:[Memory(cachedir='./joblib')]: Flushing completely the cache
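As a rough check of the memoization (this snippet is an illustration, not part of the original session), calling the cached scoring function twice with the same arguments computes the chi2 statistics only once; the second call is read back from the './joblib' cache:
scores, pvalues = selector.score_func(vector_X, y)   # computed and written to the cache
scores, pvalues = selector.score_func(vector_X, y)   # same arguments: loaded from the cache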
Now we can proceed to run the grid search:
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(estimator=pipeline, param_grid=dict(svc__C=[1e-2, 1, 1e2]))
grid.fit(X, y)
print grid.best_estimator_.named_steps['svc']
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.chi2...
chi2(<7542x44103 sparse matrix of type '<type 'numpy.float64'>'
with 1127049 stored elements in Compressed Sparse Row format>,
array([7, ..., 8]))
_____________________________________________________________chi2 - 0.1s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.chi2...
chi2(<7543x43233 sparse matrix of type '<type 'numpy.float64'>'
with 1131660 stored elements in Compressed Sparse Row format>,
array([4, ..., 8]))
_____________________________________________________________chi2 - 0.1s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.chi2...
chi2(<7543x44119 sparse matrix of type '<type 'numpy.float64'>'
with 1137767 stored elements in Compressed Sparse Row format>,
array([7, ..., 1]))
_____________________________________________________________chi2 - 0.1s, 0.0min
________________________________________________________________________________
[Memory] Calling sklearn.feature_selection.univariate_selection.chi2...
chi2(<11314x56436 sparse matrix of type '<type 'numpy.float64'>'
with 1713894 stored elements in Compressed Sparse Row format>,
array([7, ..., 8]))
_____________________________________________________________chi2 - 0.2s, 0.0min
LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss=l2, multi_class=ovr, penalty=l2,
random_state=None, tol=0.0001, verbose=0)
On the 'Labeled Faces in the Wild' dataset (datasets.fetch_lfw_people), chain a randomized PCA with an SVC for prediction.
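A possible sketch of a solution (the number of PCA components, the min_faces_per_person filter and the SVC parameters are illustrative guesses, not tuned values):
from sklearn import datasets
from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score

lfw = datasets.fetch_lfw_people(min_faces_per_person=70)
faces_pipeline = Pipeline([
    ('pca', RandomizedPCA(n_components=150, whiten=True)),   # project the pixel features to a low-dimensional space
    ('svc', SVC(C=1., kernel='rbf')),                         # classify in the reduced space
])
print cross_val_score(faces_pipeline, lfw.data, lfw.target)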