Clustering techniques are unsupervised learning algorithms that try to group unlabelled data into "clusters", using the (typically spatial) structure of the data itself.
The easiest way to demonstrate how clustering works is to simply generate some data and show them in action. We'll start off by importing the libraries we'll be using today.
import math, matplotlib.pyplot as plt, operator, torch
torch.manual_seed(1);
n_clusters=6
n_samples =250
To generate our data, we're going to pick 6 random points, which we'll call centroids, and for each point we're going to generate 250 random points about it.
centroids = torch.randint(-35, 35, (n_clusters, 2)).float()
from torch.distributions.multivariate_normal import MultivariateNormal
from torch import tensor
def sample(m): return MultivariateNormal(m, torch.diag(tensor([5.,5.]))).sample((n_samples,))
slices = [sample(c) for c in centroids]
data = torch.cat(slices)
Below we can see each centroid marked with an X, and the color associated with each respective cluster.
def plot_data(centroids, data, n_samples):
    for i, centroid in enumerate(centroids):
        samples = data[i*n_samples:(i+1)*n_samples]
        plt.scatter(samples[:,0], samples[:,1], s=1)
        plt.plot(centroid[0], centroid[1], markersize=10, marker="x", color='k', mew=5)
        plt.plot(centroid[0], centroid[1], markersize=5, marker="x", color='m', mew=2)
plot_data(centroids, data, n_samples)
Most people who have come across clustering algorithms have learnt about k-means. Mean shift clustering is a newer and less well-known approach, but it has some important advantages:
- It doesn't require selecting the number of clusters in advance; instead it just needs a bandwidth, which can be chosen fairly easily.
- It can handle clusters of any shape, whereas k-means (without special extensions) requires that clusters be roughly ball shaped.
The algorithm is as follows:
- For each data point x in the sample X, find the distance between that point x and every other point in X.
- Create weights for each point in X by applying the Gaussian kernel to that point's distance to x. This penalizes points that are further away from x; the rate at which the weights fall to zero is determined by the bandwidth, which is the standard deviation of the Gaussian.
- Update x as the weighted average of all points in X, using those weights.
This will iteratively push points that are close together even closer, until they are next to each other.
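One compact way to write that update (a restatement of the steps above, not something from the notebook itself), with K the Gaussian kernel and b the bandwidth, is that each point x moves to the kernel-weighted mean of all points:
$$x \;\leftarrow\; \frac{\sum_j K(\lVert x - x_j\rVert,\ b)\, x_j}{\sum_j K(\lVert x - x_j\rVert,\ b)}$$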
So here's the definition of the Gaussian kernel, which you may remember from high school...
def gaussian(d, bw): return torch.exp(-0.5*((d/bw))**2) / (bw*math.sqrt(2*math.pi))
x = torch.linspace(0,10,100)
plt.plot(x, gaussian(x,2.5));
This person at the science march certainly remembered!
In our implementation, we choose the bandwidth to be 2.5.
One easy way to choose bandwidth is to find which bandwidth covers one third of the data.
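One way you might turn that rule of thumb into code is sketched below. This helper, guess_bw, is just an illustrative assumption (it isn't used anywhere later): for a handful of reference points it finds the distance that encloses roughly a third of the data, and averages those distances.
# Hypothetical bandwidth heuristic: distance enclosing ~1/3 of the data,
# averaged over a few randomly chosen reference points. n_ref is arbitrary.
def guess_bw(data, frac=1/3, n_ref=20):
    refs = data[torch.randperm(len(data))[:n_ref]]
    dists = torch.cdist(refs, data)            # (n_ref, n) pairwise distances
    return dists.quantile(frac, dim=1).mean()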
X = data.clone()
x = data[0]
dist = torch.sqrt(((x-X)**2).sum(1))
dist[:8]
tensor([0.0000, 1.4130, 3.2164, 2.8909, 4.5990, 3.1394, 3.9166, 5.3368])
weight = gaussian(dist, 2.5)
weight
tensor([1.5958e-01, 1.3602e-01, 6.9749e-02, ..., 3.5634e-09, 1.7959e-10,
2.7274e-16])
weight.shape,X.shape
(torch.Size([1500]), torch.Size([1500, 2]))
(weight[:,None]*X)
tensor([[-2.1094e-01, 3.9275e+00],
[-2.9346e-01, 3.1928e+00],
[-1.0870e-01, 1.4929e+00],
...,
[-5.0652e-08, 1.1388e-07],
[-2.6280e-09, 6.0299e-09],
[-4.7550e-15, 1.0221e-14]])
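To finish the update for this single point, we sum those weighted values and divide by the total weight; this is exactly the step the loop below performs for every point:
# Kernel-weighted mean of all points: the new position for x
(weight[:,None]*X).sum(0)/weight.sum()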
def meanshift(data):
    X = data.clone()
    for it in range(5):
        for i, x in enumerate(X):
            # distance from x to every point, then Gaussian weights
            dist = torch.sqrt(((x-X)**2).sum(1))
            weight = gaussian(dist, 2.5)
            # move x to the weighted mean of all points
            X[i] = (weight[:,None]*X).sum(0)/weight.sum()
    return X
%time X=meanshift(data)
CPU times: user 567 ms, sys: 0 ns, total: 567 ms Wall time: 567 ms
We can see that mean shift clustering has almost reproduced our original clustering. The one exception is the clusters that started out very close together, which get merged; if we really wanted to differentiate them we could lower the bandwidth.
What is impressive is that this algorithm nearly reproduced the original clusters without being told how many clusters there should be.
plot_data(centroids+2, X, n_samples)
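A rough numerical check (an extra step, not part of the walkthrough above) is to round the converged points and count the distinct values, which should come out close to n_clusters, give or take a cluster whose points straddle a rounding boundary:
# Nearby converged points collapse onto the same value after rounding
X.round().unique(dim=0).shape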
All the computation is happening in the Python for loop over individual points, which isn't accelerated by PyTorch. If we ran this on a GPU, each iteration would launch new CUDA kernels, which takes time and slows the algorithm down as a whole. Furthermore, each iteration doesn't have enough processing to do to fill up all of the threads of the GPU. But at least the results are correct...
We should be able to accelerate this algorithm with a GPU.
To truly accelerate the algorithm, we need to perform updates on a batch of points per iteration, instead of just one at a time as we were doing.
def dist_b(a,b): return torch.sqrt(((a[None]-b[:,None])**2).sum(2))
X=torch.rand(8,2)
x=torch.rand(5,2)
dist_b(X, x)
tensor([[0.6161, 0.7434, 0.7351, 0.9002, 0.5875, 0.5845, 0.2929, 0.1938],
[1.1132, 0.2402, 0.9845, 0.4507, 1.0699, 0.6556, 0.6886, 0.7938],
[0.0261, 0.9847, 0.3508, 1.0109, 0.0595, 0.5418, 0.4450, 0.4208],
[0.4530, 0.8696, 0.1635, 0.7858, 0.4366, 0.4354, 0.6171, 0.7125],
[0.6937, 0.4877, 0.4028, 0.3711, 0.6562, 0.2454, 0.5457, 0.6985]])
bs=5
X = data.clone()
x = X[:bs]
weight = gaussian(dist_b(X, x), 2)
weight
tensor([[1.9947e-01, 1.5541e-01, 5.4735e-02, ..., 2.2132e-13, 2.0775e-15,
1.6831e-24],
[1.5541e-01, 1.9947e-01, 1.1173e-01, ..., 3.1427e-13, 2.0317e-15,
9.7350e-25],
[5.4735e-02, 1.1173e-01, 1.9947e-01, ..., 3.5939e-16, 9.4595e-19,
3.9723e-29],
[7.0175e-02, 3.1377e-02, 1.8560e-02, ..., 5.2249e-18, 3.3886e-20,
3.0925e-30],
[1.4180e-02, 3.1085e-03, 1.1476e-03, ..., 8.5815e-20, 8.2003e-22,
9.4978e-32]])
weight.shape,X.shape
(torch.Size([5, 1500]), torch.Size([1500, 2]))
num = (weight[...,None]*X[None]).sum(1)
num.shape
torch.Size([5, 2])
div = weight.sum(1, keepdim=True)
div.shape
torch.Size([5, 1])
num/div
tensor([[-0.5274, 24.3688],
[-1.0042, 23.6233],
[-0.7959, 22.5723],
[ 1.0092, 24.4519],
[ 1.8587, 25.1916]])
from fastcore.all import chunked
def meanshift(data, bs=500):
    n = len(data)
    X = data.clone()
    for it in range(5):
        for i in range(0, n, bs):
            s = slice(i, min(i+bs,n))
            weight = gaussian(dist_b(X, X[s]), 2)
            num = (weight[...,None]*X[None]).sum(1)
            div = weight.sum(1, keepdim=True)
            X[s] = num/div
    return X
Although each batch still launches new CUDA kernels, there are now far fewer launches overall, and the speed-up from updating a whole batch of points at once more than makes up for it.
data = data.cuda()
X = meanshift(data).cpu()
%timeit -n 1 X = meanshift(data).cpu()
3.25 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
plot_data(centroids+2, X, n_samples)
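If you're curious how much the batch size matters, a quick sketch like the following times a few values (this assumes a CUDA device is available and data is still on the GPU; the batch sizes are arbitrary):
import time
# Rough timing sweep: smaller batches mean more kernel launches per iteration,
# larger ones mean fewer but bigger launches.
for bs in (64, 256, 1024):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    meanshift(data, bs=bs)
    torch.cuda.synchronize()
    print(f'bs={bs}: {time.perf_counter()-t0:.4f}s')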