Unsupervised learning addresses situations in which X is available, but not y: data without labels.
A typical use case is to find hidden structure in the data.
Previously we worked on visualizing the iris data by plotting pairs of dimensions by trial and error, until we arrived at the best pair of dimensions for our dataset. Here we will use an unsupervised dimensionality reduction algorithm to accomplish this more automatically.
By the end of this section you will be able to use PCA to derive a smaller set of features from a dataset and to visualize that dataset in two dimensions.
Dimensionality reduction is the task of deriving a set of new artificial features that is smaller than the original feature set while retaining most of the variance of the original data. Here we'll use a common but powerful dimensionality reduction technique called Principal Component Analysis (PCA). We'll perform PCA on the iris dataset that we saw before:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
PCA is performed using linear combinations of the original features via a truncated Singular Value Decomposition of the matrix X, so as to project the data onto a basis of the top singular vectors. If the number of retained components is 2 or 3, PCA can be used to visualize the dataset.
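To make the idea concrete, here is a rough numpy sketch of that projection (an illustration only with throwaway variable names, not scikit-learn's exact implementation): center the data, take its SVD, and project onto the top singular vectors.
import numpy as np

# Center the data, compute its SVD, and keep the projection onto the
# top-2 right singular vectors: the essence of a truncated SVD / PCA.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_proj = np.dot(Xc, Vt[:2].T)
In scikit-learn this is done with the PCA estimator: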
from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True)
pca.fit(X)
PCA(copy=True, n_components=2, whiten=True)
Once fitted, the pca model exposes the singular vectors in the components_ attribute:
pca.components_
array([[ 0.17650757, -0.04015901,  0.41812992,  0.17516725],
       [-1.33840478, -1.48757227,  0.35831476,  0.15229463]])
Other attributes are available as well:
pca.explained_variance_ratio_
array([ 0.92461621, 0.05301557])
pca.explained_variance_ratio_.sum()
0.97763177502480336
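The explained variance ratios also give a simple way to decide how many components to keep. For example, a small sketch (assuming we are happy retaining roughly 95% of the variance):
import numpy as np

# Fit PCA keeping all components, then inspect the cumulative explained
# variance; the point where it crosses 0.95 suggests how many components
# are needed to retain about 95% of the variance.
pca_full = PCA().fit(X)
print(np.cumsum(pca_full.explained_variance_ratio_))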
Let us project the iris dataset along those first two dimensions:
X_pca = pca.transform(X)
PCA normalizes and whitens the data, which means that the data
is now centered on both components with unit variance:
import numpy as np
np.round(X_pca.mean(axis=0), decimals=5)
array([-0., 0.])
np.round(X_pca.std(axis=0), decimals=5)
array([ 1., 1.])
Furthermore, the projected samples no longer carry any linear correlation between components:
np.corrcoef(X_pca.T)
array([[ 1.00000000e+00, -3.90798505e-16],
       [ -4.02640884e-16, 1.00000000e+00]])
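To see what the transform is doing under the hood, here is a small sketch; it assumes a non-whitened PCA (for which the scaling is simpler), and the variable names are just for illustration. The projection is centering followed by a dot product with the principal axes:
# Reproduce PCA's transform by hand for a non-whitened model.
pca_plain = PCA(n_components=2)
X_plain = pca_plain.fit_transform(X)
X_manual = np.dot(X - pca_plain.mean_, pca_plain.components_.T)
print(np.allclose(X_plain, X_manual))   # expected to print True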
We can visualize the projection using pylab, but first let's make sure our IPython notebook is in pylab inline mode:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.
Now we can visualize the results using the following utility function:
import pylab as pl
from itertools import cycle

def plot_PCA_2D(data, target, target_names):
    # Cycle through colors so that each class gets its own color.
    colors = cycle('rgbcmykw')
    target_ids = range(len(target_names))
    pl.figure()
    # Scatter the first two components, one class at a time, with a label.
    for i, c, label in zip(target_ids, colors, target_names):
        pl.scatter(data[target == i, 0], data[target == i, 1],
                   c=c, label=label)
    pl.legend()
Now calling this function for our data, we see the plot:
plot_PCA_2D(X_pca, iris.target, iris.target_names)
Note that this projection was determined without any information about the labels (represented by the colors): this is the sense in which the learning is unsupervised. Nevertheless, we see that the projection gives us insight into the distribution of the different flowers in parameter space: notably, iris setosa is much more distinct than the other two species.
Note also that the default implementation of PCA computes the
singular value decomposition (SVD) of the full
data matrix, which is not scalable when both n_samples and
n_features are big (more than a few thousand).
If you are interested in a number of components that is much
smaller than both n_samples and n_features, consider using
sklearn.decomposition.RandomizedPCA instead.
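For instance, a minimal sketch, assuming RandomizedPCA is available in your scikit-learn version (it exposes the same fit/transform interface as PCA; the suggestions below explore it further):
from sklearn.decomposition import RandomizedPCA

# Randomized (approximate) PCA with two components, whitened as before.
rpca = RandomizedPCA(n_components=2, whiten=True)
X_rpca = rpca.fit_transform(X)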
Other dimensionality reduction techniques that are useful to know about include dictionary learning and manifold learning methods such as Isomap; the suggestions below give you a chance to try them out.
Repeat the above dimensionality reduction with
sklearn.decomposition.RandomizedPCA.
You can re-use the plot_PCA_2D function from above.
Are the results similar to those from standard PCA?
Take the digits data and apply PCA (or RandomizedPCA) to it.
Also try dictionary learning, as well as Isomap.
Note that such data reduction can be used as a first step before a supervised problem. We will see more about this soon.
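For example, here is a rough sketch of that idea (the classifier choice is just for illustration): reduce the data with PCA, then fit a simple classifier on the reduced features.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Project onto two principal components, then fit a k-nearest-neighbors
# classifier on the reduced representation. The training-set score below
# is only a sanity check; a proper evaluation would use a held-out test set.
X_reduced = PCA(n_components=2, whiten=True).fit_transform(X)
clf = KNeighborsClassifier()
clf.fit(X_reduced, y)
print(clf.score(X_reduced, y))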