SciPy

scipy.cluster.vq.kmeans2

scipy.cluster.vq.kmeans2(data, k, iter=10, thresh=1e-05, minit='random', missing='warn')

Classify a set of observations into k clusters using the k-means algorithm.

The algorithm attempts to minimize the Euclidean distance between observations and centroids. Several initialization methods are included.

Parameters:

data : ndarray

An ‘M’ by ‘N’ array of ‘M’ observations in ‘N’ dimensions, or a length ‘M’ array of ‘M’ one-dimensional observations.
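As an illustration of the two accepted input shapes, the following minimal sketch clusters the same six values once as a length ‘M’ array and once as an ‘M’ by 1 array. The sample values and the starting centroids are made up for this example, and ‘matrix’ initialization is used only to keep the run deterministic.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Six made-up one-dimensional observations forming two obvious groups.
obs_1d = np.array([1.0, 1.2, 0.8, 9.0, 9.2, 8.8])
# The same observations as an M by N array with N = 1.
obs_2d = obs_1d.reshape(-1, 1)

# Illustrative starting centroids, shaped to match each input.
init_1d = np.array([0.0, 10.0])
init_2d = init_1d.reshape(-1, 1)

centroid_1d, label_1d = kmeans2(obs_1d, init_1d, minit='matrix')
centroid_2d, label_2d = kmeans2(obs_2d, init_2d, minit='matrix')

print(np.ravel(centroid_1d), label_1d)
print(np.ravel(centroid_2d), label_2d)
```

Both calls should describe the same two clusters; only the shape of the arrays passed in differs.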

k : int or ndarray

The number of clusters to form as well as the number of centroids to generate. If minit is ‘matrix’, or if an ndarray is given instead of an int, it is interpreted as the initial centroids to use.

iter : int

Number of iterations of the k-means algorithm to run. Note that this differs in meaning from the iter parameter of the kmeans function.

thresh : float

(not used yet)

minit : string

Method for initialization. Available methods are ‘random’, ‘points’, ‘uniform’, and ‘matrix’:

‘random’: generate k centroids from a Gaussian with mean and variance estimated from the data.

‘points’: choose k observations (rows) at random from data for the initial centroids.

‘uniform’: generate k centroids from a uniform distribution defined by the data set (unsupported).

‘matrix’: interpret the k parameter as a k by N (or length k array for one-dimensional data) array of initial centroids.
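The difference between passing an int and passing an array for k can be sketched as follows. The data and the starting centroids here are made up for illustration; only the ‘points’ and ‘matrix’ methods described above are used.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Made-up data: four points near (0, 0) and four near (10, 10).
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])

# 'points': k is an int and the starting centroids are rows drawn at
# random from data, so the result can differ from run to run.
centroid_pts, label_pts = kmeans2(data, 2, minit='points')

# 'matrix': k is itself the array of starting centroids (one row each),
# so no random draw is involved.
start = np.array([[0.0, 0.0], [10.0, 10.0]])
centroid_mat, label_mat = kmeans2(data, start, minit='matrix')

print(centroid_pts.shape, label_pts.shape)
print(centroid_mat)
print(label_mat)
```

Supplying the starting centroids explicitly is a convenient way to get reproducible results, since no random initialization takes place.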

Returns:

centroid : ndarray

A ‘k’ by ‘N’ array of centroids found at the last iteration of k-means.

label : ndarray

label[i] is the code or index of the centroid the i’th observation is closest to.
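A minimal sketch of how the two return values relate, using made-up data and made-up starting centroids. The nearest centroid is recomputed by hand with NumPy and compared with label. On this well-separated data the run converges, so the two agree; nothing is claimed here about runs that stop before converging.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Made-up data: four points near (0, 0) and four near (10, 10).
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
start = np.array([[0.0, 0.0], [10.0, 10.0]])  # illustrative starting centroids

centroid, label = kmeans2(data, start, minit='matrix')

# Recompute the nearest centroid for each observation by hand:
# dist[i, j] is the Euclidean distance from observation i to centroid j.
dist = np.linalg.norm(data[:, None, :] - centroid[None, :, :], axis=2)
nearest = dist.argmin(axis=1)

# Group the observations by cluster and count the members of each cluster.
members = [data[label == j] for j in range(len(centroid))]
counts = np.bincount(label, minlength=len(centroid))

print(np.array_equal(nearest, label))
print(counts)
```

Because label holds row indices into centroid, centroid[label] gives the centroid assigned to each observation.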