scipy.cluster.hierarchy.centroid

scipy.cluster.hierarchy.centroid(y)

Perform centroid/UPGMC linkage.

See linkage for more information on the input matrix, return structure, and algorithm.

The following are common calling conventions:

  1. Z = centroid(y)

    Performs centroid/UPGMC linkage on the condensed distance matrix y.

  2. Z = centroid(X)

    Performs centroid/UPGMC linkage on the observation matrix X using Euclidean distance as the distance metric.
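The two calling conventions above can be compared side by side. The following is a minimal sketch, assuming the condensed matrix holds Euclidean distances (the default metric of pdist); the small array X is made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import centroid
from scipy.spatial.distance import pdist

# Hypothetical toy observations: 5 points in 2 dimensions.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [4.0, 4.0], [3.0, 4.0]])

# Convention 1: pass a condensed Euclidean distance matrix.
Z_from_condensed = centroid(pdist(X))

# Convention 2: pass the observation matrix directly.
Z_from_observations = centroid(X)

# With Euclidean distances, both inputs should describe the same hierarchy.
same = np.allclose(Z_from_condensed, Z_from_observations)
print(same)
```

Passing X directly is shorter, while passing y is useful when the pairwise distances have already been computed.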

Parameters:
y : ndarray

A condensed distance matrix, that is, a flat array containing the upper triangle of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.
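To make the condensed form concrete, the sketch below builds one with pdist and expands it with scipy.spatial.distance.squareform. The four points are a hypothetical example, not part of the original documentation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy observations: m = 4 points in n = 2 dimensions.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [4.0, 4.0]])
m = X.shape[0]

# Condensed form: one entry per unordered pair, m * (m - 1) / 2 in total.
y = pdist(X)
print(y.shape)

# Expand to the full symmetric m by m matrix with a zero diagonal.
D = squareform(y)
print(D.shape)

# Reading the strict upper triangle row by row recovers the condensed array.
upper = D[np.triu_indices(m, k=1)]
print(np.allclose(upper, y))
```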

Returns:
Z : ndarray

A linkage matrix containing the hierarchical clustering. See the linkage function documentation for more information on its structure.
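As a rough illustration of that structure, the sketch below prints each row of Z for a tiny, made-up one-dimensional dataset. It relies on the layout described in the linkage documentation: columns 0 and 1 hold the indices of the merged clusters, column 2 the distance between them, and column 3 the size of the new cluster, which receives index m + i for row i.

```python
import numpy as np
from scipy.cluster.hierarchy import centroid

# Hypothetical toy data: four points on a line, forming two obvious pairs.
X = np.array([[0.0], [1.0], [5.0], [6.0]])
m = X.shape[0]

Z = centroid(X)  # m - 1 rows, one per merge

for i, (a, b, dist, size) in enumerate(Z):
    # Indices below m are original observations; m + i names the new cluster.
    print(f"row {i}: merge {int(a)} and {int(b)} at distance {dist:.3f} "
          f"-> cluster {m + i} with {int(size)} points")
```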

See also

linkage
for advanced creation of hierarchical clusterings.
scipy.spatial.distance.pdist
pairwise distance metrics

Examples

>>> from scipy.cluster.hierarchy import centroid, fcluster
>>> from scipy.spatial.distance import pdist

First we need a toy dataset to play with:

x x    x x
x        x

x        x
x x    x x

>>> X = [[0, 0], [0, 1], [1, 0],
...      [0, 4], [0, 3], [1, 4],
...      [4, 0], [3, 0], [4, 1],
...      [4, 4], [3, 4], [4, 3]]

Then we get a condensed distance matrix from this dataset:

>>> y = pdist(X)

Finally, we can perform the clustering:

>>> Z = centroid(y)
>>> Z
array([[ 0.        ,  1.        ,  1.        ,  2.        ],
       [ 3.        ,  4.        ,  1.        ,  2.        ],
       [ 9.        , 10.        ,  1.        ,  2.        ],
       [ 6.        ,  7.        ,  1.        ,  2.        ],
       [ 2.        , 12.        ,  1.11803399,  3.        ],
       [ 5.        , 13.        ,  1.11803399,  3.        ],
       [ 8.        , 15.        ,  1.11803399,  3.        ],
       [11.        , 14.        ,  1.11803399,  3.        ],
       [18.        , 19.        ,  3.33333333,  6.        ],
       [16.        , 17.        ,  3.33333333,  6.        ],
       [20.        , 21.        ,  3.33333333, 12.        ]])

The linkage matrix Z represents a dendrogram; see scipy.cluster.hierarchy.linkage for a detailed explanation of its contents.
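One way to sanity-check the distances in the third column is to recompute them from the data. The sketch below is an independent check, not SciPy's internal algorithm: it walks through Z, tracks the members of every cluster, and measures the Euclidean distance between the centroids of the two clusters merged in each row. The members dictionary and the loop are illustrative helpers.

```python
import numpy as np
from scipy.cluster.hierarchy import centroid

X = np.array([[0, 0], [0, 1], [1, 0],
              [0, 4], [0, 3], [1, 4],
              [4, 0], [3, 0], [4, 1],
              [4, 4], [3, 4], [4, 3]], dtype=float)
m = X.shape[0]
Z = centroid(X)

# Map each cluster index to the original observations it contains.
members = {i: [i] for i in range(m)}
recomputed = []
for i, (a, b, dist, size) in enumerate(Z):
    a, b = int(a), int(b)
    # Centroid of a cluster = mean of its member points.
    centroid_a = X[members[a]].mean(axis=0)
    centroid_b = X[members[b]].mean(axis=0)
    recomputed.append(np.linalg.norm(centroid_a - centroid_b))
    # Row i creates the cluster with index m + i.
    members[m + i] = members[a] + members[b]

recomputed = np.array(recomputed)
print(np.allclose(recomputed, Z[:, 2]))
```

For instance, merging the pair [0, 0], [0, 1] (centroid [0, 0.5]) with the point [1, 0] gives sqrt(1 + 0.25), which is the 1.11803399 shown in Z.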

We can use scipy.cluster.hierarchy.fcluster to see which cluster each initial point would belong to, given a distance threshold:

>>> fcluster(Z, 0.9, criterion='distance')
array([ 7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6], dtype=int32)
>>> fcluster(Z, 1.1, criterion='distance')
array([5, 5, 6, 7, 7, 8, 1, 1, 2, 3, 3, 4], dtype=int32)
>>> fcluster(Z, 2, criterion='distance')
array([3, 3, 3, 4, 4, 4, 1, 1, 1, 2, 2, 2], dtype=int32)
>>> fcluster(Z, 4, criterion='distance')
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
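The effect of the threshold is easier to see by counting the flat clusters at each cut. The following sketch reuses the same dataset and thresholds; the counts dictionary is just an illustrative helper. The results should be consistent with the label arrays above: 12, 8, 4, and 1 clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import centroid, fcluster
from scipy.spatial.distance import pdist

X = [[0, 0], [0, 1], [1, 0],
     [0, 4], [0, 3], [1, 4],
     [4, 0], [3, 0], [4, 1],
     [4, 4], [3, 4], [4, 3]]
Z = centroid(pdist(X))

# Number of distinct flat clusters obtained at each distance threshold.
counts = {}
for t in (0.9, 1.1, 2, 4):
    labels = fcluster(Z, t, criterion='distance')
    counts[t] = len(np.unique(labels))

print(counts)
```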

scipy.cluster.hierarchy.dendrogram can also be used to plot the dendrogram.
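A minimal sketch of calling dendrogram on this example follows. It passes no_plot=True so that only the layout data is computed and no plotting backend is needed; this choice is an assumption made to keep the example self-contained.

```python
from scipy.cluster.hierarchy import centroid, dendrogram
from scipy.spatial.distance import pdist

X = [[0, 0], [0, 1], [1, 0],
     [0, 4], [0, 3], [1, 4],
     [4, 0], [3, 0], [4, 1],
     [4, 4], [3, 4], [4, 3]]
Z = centroid(pdist(X))

# no_plot=True returns the dendrogram layout as a dict without drawing it.
info = dendrogram(Z, no_plot=True)

# 'leaves' lists the observation indices in left-to-right plotting order.
print(info['leaves'])
# 'icoord' and 'dcoord' hold the x and y coordinates of each U-shaped link.
print(len(info['icoord']), len(info['dcoord']))
```

With Matplotlib installed, dropping no_plot=True draws the tree on the current axes instead.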
