K-means Clustering and Vector Quantization Module
Provides routines for k-means clustering, generating code books from k-means models, and quantizing vectors by comparing them with centroids in a code book.
The k-means algorithm takes as input the number of clusters to generate, k, and a set of observation vectors to cluster. It returns a set of centroids, one for each of the k clusters. An observation vector is classified with the cluster number or centroid index of the centroid closest to it.
A vector v belongs to cluster i if it is closer to centroid i than to any other centroid. If v belongs to cluster i, we say centroid i is the dominating centroid of v. The k-means algorithm tries to minimize distortion, which is defined as the sum of the squared distances between each observation vector and its dominating centroid. Each step of the k-means algorithm refines the choices of centroids to reduce distortion. The change in distortion is used as a stopping criterion: when the change is lower than a threshold, the k-means algorithm is not making sufficient progress and terminates. One can also define a maximum number of iterations.
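As an illustration of one refinement step, here is a minimal NumPy sketch (the function name kmeans_step is ours for this example, and empty clusters are not handled):

import numpy as np

def kmeans_step(obs, centroids):
    # Distance from each of the M observations to each of the k centroids.
    dists = np.linalg.norm(obs[:, None, :] - centroids[None, :, :], axis=2)
    # Assign each observation to its dominating (closest) centroid.
    labels = dists.argmin(axis=1)
    # Distortion: sum of squared distances to the dominating centroids.
    distortion = (dists[np.arange(len(obs)), labels] ** 2).sum()
    # Refinement: move each centroid to the mean of its assigned observations.
    new_centroids = np.array([obs[labels == i].mean(axis=0)
                              for i in range(len(centroids))])
    return new_centroids, distortion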
Since vector quantization is a natural application for k-means, information theory terminology is often used. The centroid index or cluster index is also referred to as a "code", and the table mapping codes to centroids and vice versa is often referred to as a "code book". The result of k-means, a set of centroids, can be used to quantize vectors. Quantization aims to find an encoding of vectors that reduces the expected distortion.
All routines expect obs to be an M by N array, where the rows are the observation vectors. The codebook is a k by N array, where the i'th row is the centroid of code word i. The observation vectors and centroids have the same feature dimension.
As an example, suppose we wish to compress a 24-bit color image (each pixel is represented by one byte for red, one for blue, and one for green) before sending it over the web. By using a smaller 8-bit encoding, we can reduce the amount of data by two thirds. Ideally, the colors for each of the 256 possible 8-bit encoding values should be chosen to minimize distortion of the color. Running k-means with k=256 generates a code book of 256 codes, which fills up all possible 8-bit sequences. Instead of sending a 3-byte value for each pixel, the 8-bit centroid index (or code word) of the dominating centroid is transmitted. The code book is also sent over the wire so each 8-bit code can be translated back to a 24-bit pixel value representation. If the image of interest were of an ocean, we would expect many 24-bit blues to be represented by 8-bit codes. If it were an image of a human face, more flesh-tone colors would be represented in the code book.
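A hedged sketch of this scheme (the random array stands in for a real image, and the variable names are ours):

import numpy as np
from scipy.cluster.vq import kmeans, vq

pixels = 255 * np.random.rand(10000, 3)  # stand-in for an image's RGB pixels
code_book, _ = kmeans(pixels, 256)       # build a 256-entry code book (k=256)
codes, _ = vq(pixels, code_book)         # one 8-bit code per pixel
restored = code_book[codes]              # map codes back to 24-bit colors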
Normalize a group of observations on a per-feature basis.
Before running k-means, it is beneficial to rescale each feature dimension of the observation set with whitening. Each feature is divided by its standard deviation across all observations to give it unit variance.
Parameters
obs : ndarray
    Each row of the array is an observation. The columns are the features seen during each observation.

Returns
result : ndarray
    Contains the values in obs scaled by the standard deviation of each column.
Examples
>>> from numpy import array
>>> from scipy.cluster.vq import whiten
>>> features = array([[ 1.9, 2.3, 1.7],
...                    [ 1.5, 2.5, 2.2],
...                    [ 0.8, 0.6, 1.7]])
>>> whiten(features)
array([[ 3.41250074, 2.20300046, 5.88897275],
[ 2.69407953, 2.39456571, 7.62102355],
[ 1.43684242, 0.57469577, 5.88897275]])
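For reference, the same scaling can be written directly in NumPy; the output above corresponds to dividing by the sample standard deviation (ddof=1), though some versions of whiten divide by the population standard deviation instead:

>>> import numpy as np
>>> features / np.std(features, axis=0, ddof=1)
array([[ 3.41250074, 2.20300046, 5.88897275],
       [ 2.69407953, 2.39456571, 7.62102355],
       [ 1.43684242, 0.57469577, 5.88897275]])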
Vector Quantization: assign codes from a code book to observations.
Assigns a code from a code book to each observation. Each observation vector in the M by N obs array is compared with the centroids in the code book and assigned the code of the closest centroid.
The features in obs should have unit variance, which can be achieved by passing them through the whiten function. The code book can be created with the k-means algorithm or a different encoding algorithm.
Parameters
obs : ndarray
    Each row of the M by N array is an observation. The columns are the features seen during each observation. The features must be whitened first using the whiten function or something equivalent.
code_book : ndarray
    The code book is usually generated using the k-means algorithm. Each row of the array holds a different code, and the columns are the features of the code.

Returns
code : ndarray
    A length M array holding the code book index for each observation.
dist : ndarray
    The distortion (distance) between each observation and its nearest code.
Notes
This currently forces 32-bit math precision for speed. It is not known whether there are situations where this undermines the accuracy of the algorithm.
Examples
>>> from numpy import array
>>> from scipy.cluster.vq import vq
>>> code_book = array([[ 1., 1., 1.],
...                     [ 2., 2., 2.]])
>>> features = array([[ 1.9, 2.3, 1.7],
...                    [ 1.5, 2.5, 2.2],
...                    [ 0.8, 0.6, 1.7]])
>>> vq(features, code_book)
(array([1, 1, 0],'i'), array([ 0.43588989, 0.73484692, 0.83066239]))
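The assignment can be reproduced in plain NumPy, which may clarify what vq computes (a sketch reusing features and code_book from above; vq itself is implemented more efficiently):

>>> import numpy as np
>>> dists = np.linalg.norm(features[:, None, :] - code_book[None, :, :], axis=2)
>>> dists.argmin(axis=1)  # index of the closest centroid per observation
array([1, 1, 0])
>>> dists.min(axis=1)     # distance to that centroid
array([ 0.43588989, 0.73484692, 0.83066239])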
Performs k-means on a set of observation vectors forming k clusters.
The k-means algorithm adjusts the centroids until sufficient progress cannot be made, i.e. the change in distortion since the last iteration is less than some threshold. This yields a code book mapping centroids to codes and vice versa.
Distortion is defined as the sum of the squared distances between the observations and their corresponding centroids.
Parameters
obs : ndarray
    Each row of the M by N array is an observation vector. The columns are the features seen during each observation. The features must be whitened first with the whiten function.
k_or_guess : int or ndarray
    The number of centroids to generate, k, or an initial k by N array of centroids to refine.
iter : int, optional
    The number of times to run k-means, returning the codebook with the lowest distortion. This argument is ignored if initial centroids are specified with an array for k_or_guess.
thresh : float, optional
    Terminates the k-means algorithm if the change in distortion since the last iteration is less than or equal to thresh.

Returns
codebook : ndarray
    A k by N array of k centroids. The i'th centroid codebook[i] is represented with the code i. The centroids and codes generated represent the lowest distortion seen, not necessarily the globally minimal distortion.
distortion : float
    The distortion between the observations passed and the centroids generated.
See also
kmeans2 : a different implementation of k-means clustering with more methods for generating initial centroids, but without using a distortion change threshold as a stopping criterion.
whiten : must be called prior to passing an observation matrix to kmeans.
Examples
>>> from numpy import array
>>> from scipy.cluster.vq import vq, kmeans, whiten
>>> features = array([[ 1.9, 2.3],
...                    [ 1.5, 2.5],
...                    [ 0.8, 0.6],
...                    [ 0.4, 1.8],
...                    [ 0.1, 0.1],
...                    [ 0.2, 1.8],
...                    [ 2.0, 0.5],
...                    [ 0.3, 1.5],
...                    [ 1.0, 1.0]])
>>> whitened = whiten(features)
>>> book = array((whitened[0], whitened[2]))
>>> kmeans(whitened, book)
(array([[ 2.3110306 , 2.86287398],
[ 0.93218041, 1.24398691]]), 0.85684700941625547)
>>> from numpy import random
>>> random.seed((1000,2000))
>>> codes = 3
>>> kmeans(whitened, codes)
(array([[ 2.3110306 , 2.86287398],
[ 1.32544402, 0.65607529],
[ 0.40782893, 2.02786907]]), 0.5196582527686241)
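To label each observation with its cluster, the code book returned by kmeans can be passed back through vq (a usage sketch; the exact labels depend on the random initialization above):

>>> codebook, distortion = kmeans(whitened, codes)
>>> cluster_ids, dists = vq(whitened, codebook)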
Classify a set of observations into k clusters using the k-means algorithm. The algorithm attempts to minimize the Euclidean distance between observations and centroids. Several initialization methods are included.
Parameters
data : ndarray
    An M by N array of M observations in N dimensions.
k : int or ndarray
    The number of clusters to form, or an initial array of k centroids.
iter : int, optional
    Number of iterations of the k-means algorithm to run.
thresh : float, optional
    (not used yet)
minit : str, optional
    Method for choosing the initial centroids.
missing : str, optional
    Method to deal with empty clusters: 'warn' or 'raise'.

Returns
centroid : ndarray
    A k by N array of centroids found at the last iteration of k-means.
label : ndarray
    label[i] is the code or index of the centroid the i'th observation is closest to.
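This description matches scipy.cluster.vq.kmeans2; assuming that is the function documented here, a minimal usage sketch selecting the 'points' initialization (k observations chosen at random as initial centroids), reusing whitened from the kmeans example:

>>> from scipy.cluster.vq import kmeans2
>>> centroid, label = kmeans2(whitened, 3, minit='points')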