Clustering techniques are unsupervised learning algorithms that try to group unlabelled data into "clusters", using the (typically spatial) structure of the data itself.
The easiest way to demonstrate how clustering works is to simply generate some data and show them in action. We'll start off by importing the libraries we'll be using today.
import math, matplotlib.pyplot as plt, operator, torch
torch.manual_seed(1);
n_clusters=6
n_samples =250
To generate our data, we're going to pick 6 random points, which we'll call centroids, and for each point we're going to generate 250 random points about it.
centroids = torch.randint(-35, 35, (n_clusters, 2)).float()
from torch.distributions.multivariate_normal import MultivariateNormal
from torch import tensor
def sample(m): return MultivariateNormal(m, torch.diag(tensor([5.,5.]))).sample((n_samples,))
slices = [sample(c) for c in centroids]
data = torch.cat(slices)
Below we can see each centroid marked with an X, and the color associated with each respective cluster.
def plot_data(centroids, data, n_samples):
    for i, centroid in enumerate(centroids):
        samples = data[i*n_samples:(i+1)*n_samples]
        plt.scatter(samples[:,0], samples[:,1], s=1)
        plt.plot(centroid[0], centroid[1], markersize=10, marker="x", color='k', mew=5)
        plt.plot(centroid[0], centroid[1], markersize=5, marker="x", color='m', mew=2)
plot_data(centroids, data, n_samples)
Most people who have come across clustering algorithms have learnt about k-means. Mean shift clustering is a newer and less well-known approach, but it has some important advantages:
- It doesn't require selecting the number of clusters in advance; instead it just needs a bandwidth, which can be chosen fairly easily.
- It can handle clusters of any shape, whereas k-means (without special extensions) requires that clusters be roughly ball shaped.
The algorithm is as follows:
- For each data point x in the sample X, find the distance between that point x and every other point in X.
- Create weights for each point in X by applying the Gaussian kernel to that point's distance to x. This penalizes points that are further away from x; the rate at which the weights fall to zero is determined by the bandwidth, which is the standard deviation of the Gaussian.
- Update x as the weighted average of all points in X, using those weights.
This will iteratively push points that are close together even closer, until they are next to each other.
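One compact way to write that update (a restatement of the steps above, not something from the notebook itself), with K the Gaussian kernel and b the bandwidth, is that each point x moves to the kernel-weighted mean of all points:
$$x \;\leftarrow\; \frac{\sum_j K(\lVert x - x_j\rVert,\ b)\, x_j}{\sum_j K(\lVert x - x_j\rVert,\ b)}$$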
So here's the definition of the Gaussian kernel, which you may remember from high school...
def gaussian(d, bw): return torch.exp(-0.5*((d/bw))**2) / (bw*math.sqrt(2*math.pi))
x = torch.linspace(0,10,100)
plt.plot(x, gaussian(x,2.5));
This person at the science march certainly remembered!
In our implementation, we choose the bandwidth to be 2.5.
One easy way to choose bandwidth is to find which bandwidth covers one third of the data.
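One way you might turn that rule of thumb into code is sketched below. This helper, guess_bw, is just an illustrative assumption (it isn't used anywhere later): for a handful of reference points it finds the distance that encloses roughly a third of the data, and averages those distances.
# Hypothetical bandwidth heuristic: distance enclosing ~1/3 of the data,
# averaged over a few randomly chosen reference points. n_ref is arbitrary.
def guess_bw(data, frac=1/3, n_ref=20):
    refs = data[torch.randperm(len(data))[:n_ref]]
    dists = torch.cdist(refs, data)            # (n_ref, n) pairwise distances
    return dists.quantile(frac, dim=1).mean()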
X = data.clone()
x = data[0]
dist = torch.sqrt(((x-X)**2).sum(1))
dist[:8]
tensor([0.0000, 1.4130, 3.2164, 2.8909, 4.5990, 3.1394, 3.9166, 5.3368])
weight = gaussian(dist, 2.5)
weight
tensor([1.5958e-01, 1.3602e-01, 6.9749e-02, ..., 3.5634e-09, 1.7959e-10,
2.7274e-16])
weight.shape,X.shape
(torch.Size([1500]), torch.Size([1500, 2]))
(weight[:,None]*X)
tensor([[-2.1094e-01, 3.9275e+00],
[-2.9346e-01, 3.1928e+00],
[-1.0870e-01, 1.4929e+00],
...,
[-5.0652e-08, 1.1388e-07],
[-2.6280e-09, 6.0299e-09],
[-4.7550e-15, 1.0221e-14]])
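To finish the update for this single point, we sum those weighted values and divide by the total weight; this is exactly the step the loop below performs for every point:
# Kernel-weighted mean of all points: the new position for x
(weight[:,None]*X).sum(0)/weight.sum()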
def meanshift(data):
    X = data.clone()
    for it in range(5):
        for i, x in enumerate(X):
            # distance from x to every point, then Gaussian weights
            dist = torch.sqrt(((x-X)**2).sum(1))
            weight = gaussian(dist, 2.5)
            # move x to the weighted mean of all points
            X[i] = (weight[:,None]*X).sum(0)/weight.sum()
    return X
%time X=meanshift(data)
CPU times: user 567 ms, sys: 0 ns, total: 567 ms Wall time: 567 ms
We can see that mean shift clustering has almost reproduced our original clustering. The one exception is the clusters that started out very close together, which get merged; if we really wanted to differentiate them we could lower the bandwidth.
What is impressive is that this algorithm nearly reproduced the original clusters without being told how many clusters there should be.
plot_data(centroids+2, X, n_samples)
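A rough numerical check (an extra step, not part of the walkthrough above) is to round the converged points and count the distinct values, which should come out close to n_clusters, give or take a cluster whose points straddle a rounding boundary:
# Nearby converged points collapse onto the same value after rounding
X.round().unique(dim=0).shape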
All the computation is happening in the Python for loop over individual points, which isn't accelerated by PyTorch. If we ran this on a GPU, each iteration would launch new CUDA kernels, which takes time and slows the algorithm down as a whole. Furthermore, each iteration doesn't have enough processing to do to fill up all of the threads of the GPU. But at least the results are correct...
We should be able to accelerate this algorithm with a GPU.
To truly accelerate the algorithm, we need to perform updates on a batch of points per iteration, instead of just one at a time as we were doing.
def dist_b(a,b): return torch.sqrt(((a[None]-b[:,None])**2).sum(2))
X=torch.rand(8,2)
x=torch.rand(5,2)
dist_b(X, x)
tensor([[0.6161, 0.7434, 0.7351, 0.9002, 0.5875, 0.5845, 0.2929, 0.1938],
[1.1132, 0.2402, 0.9845, 0.4507, 1.0699, 0.6556, 0.6886, 0.7938],
[0.0261, 0.9847, 0.3508, 1.0109, 0.0595, 0.5418, 0.4450, 0.4208],
[0.4530, 0.8696, 0.1635, 0.7858, 0.4366, 0.4354, 0.6171, 0.7125],
[0.6937, 0.4877, 0.4028, 0.3711, 0.6562, 0.2454, 0.5457, 0.6985]])
bs=5
X = data.clone()
x = X[:bs]
weight = gaussian(dist_b(X, x), 2)
weight
tensor([[1.9947e-01, 1.5541e-01, 5.4735e-02, ..., 2.2132e-13, 2.0775e-15,
1.6831e-24],
[1.5541e-01, 1.9947e-01, 1.1173e-01, ..., 3.1427e-13, 2.0317e-15,
9.7350e-25],
[5.4735e-02, 1.1173e-01, 1.9947e-01, ..., 3.5939e-16, 9.4595e-19,
3.9723e-29],
[7.0175e-02, 3.1377e-02, 1.8560e-02, ..., 5.2249e-18, 3.3886e-20,
3.0925e-30],
[1.4180e-02, 3.1085e-03, 1.1476e-03, ..., 8.5815e-20, 8.2003e-22,
9.4978e-32]])
weight.shape,X.shape
(torch.Size([5, 1500]), torch.Size([1500, 2]))
num = (weight[...,None]*X[None]).sum(1)
num.shape
torch.Size([5, 2])
div = weight.sum(1, keepdim=True)
div.shape
torch.Size([5, 1])
num/div
tensor([[-0.5274, 24.3688],
[-1.0042, 23.6233],
[-0.7959, 22.5723],
[ 1.0092, 24.4519],
[ 1.8587, 25.1916]])
from fastcore.all import chunked
def meanshift(data, bs=500):
    n = len(data)
    X = data.clone()
    for it in range(5):
        for i in range(0, n, bs):
            s = slice(i, min(i+bs,n))
            weight = gaussian(dist_b(X, X[s]), 2)
            num = (weight[...,None]*X[None]).sum(1)
            div = weight.sum(1, keepdim=True)
            X[s] = num/div
    return X
Although each batch still launches new CUDA kernels, there are now far fewer launches overall, and the speed-up from updating a whole batch of points at once more than makes up for it.
data = data.cuda()
X = meanshift(data).cpu()
%timeit -n 1 X = meanshift(data).cpu()
3.25 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
plot_data(centroids+2, X, n_samples)
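If you're curious how much the batch size matters, a quick sketch like the following times a few values (this assumes a CUDA device is available and data is still on the GPU; the batch sizes are arbitrary):
import time
# Rough timing sweep: smaller batches mean more kernel launches per iteration,
# larger ones mean fewer but bigger launches.
for bs in (64, 256, 1024):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    meanshift(data, bs=bs)
    torch.cuda.synchronize()
    print(f'bs={bs}: {time.perf_counter()-t0:.4f}s')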