Processing math: 100%
SciPy

This is documentation for an old release of SciPy (version 0.18.1). Read this page in the documentation of the latest stable release (version 1.15.0).

scipy.cluster.hierarchy.linkage

scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean')[source]

Performs hierarchical/agglomerative clustering on the condensed distance matrix y.

y must be a (n2) sized vector where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB linkage function.

An (n1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

The following linkage methods are used to compute the distance d(s,t) between two clusters s and t. The algorithm begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters s and t from this forest are combined into a single cluster u, s and t are removed from the forest, and u is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root.

A distance matrix is maintained at each iteration. The d[i,j] entry corresponds to the distance between cluster i and j in the original forest.

At each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster u with the remaining clusters in the forest.

Suppose there are |u| original observations u[0],,u[|u|1] in cluster u and |v| original objects v[0],,v[|v|1] in cluster v. Recall s and t are combined to form cluster u. Let v be any remaining cluster in the forest that is not u.

The following are methods for calculating the distance between the newly formed cluster u and each v.

  • method=’single’ assigns

    d(u,v)=min(dist(u[i],v[j]))

    for all points i in cluster u and j in cluster v. This is also known as the Nearest Point Algorithm.

  • method=’complete’ assigns

    d(u,v)=max(dist(u[i],v[j]))

    for all points i in cluster u and j in cluster v. This is also known by the Farthest Point Algorithm or Voor Hees Algorithm.

  • method=’average’ assigns

    d(u,v)=ijd(u[i],v[j])(|u||v|)

    for all points i and j where |u| and |v| are the cardinalities of clusters u and v, respectively. This is also called the UPGMA algorithm.

  • method=’weighted’ assigns

    d(u,v)=(dist(s,v)+dist(t,v))/2

    where cluster u was formed with cluster s and t and v is a remaining cluster in the forest. (also called WPGMA)

  • method=’centroid’ assigns

    dist(s,t)=||csct||2

    where cs and ct are the centroids of clusters s and t, respectively. When two clusters s and t are combined into a new cluster u, the new centroid is computed over all the original objects in clusters s and t. The distance then becomes the Euclidean distance between the centroid of u and the centroid of a remaining cluster v in the forest. This is also known as the UPGMC algorithm.

  • method=’median’ assigns d(s,t) like the centroid method. When two clusters s and t are combined into a new cluster u, the average of centroids s and t give the new centroid u. This is also known as the WPGMC algorithm.

  • method=’ward’ uses the Ward variance minimization algorithm. The new entry d(u,v) is computed as follows,

    d(u,v)=|v|+|s|Td(v,s)2+|v|+|t|Td(v,t)2|v|Td(s,t)2

    where u is the newly joined cluster consisting of clusters s and t, v is an unused cluster in the forest, T=|v|+|s|+|t|, and || is the cardinality of its argument. This is also known as the incremental algorithm.

Warning: When the minimum distance pair in the forest is chosen, there may be two or more pairs with the same minimum distance. This implementation may chose a different minimum than the MATLAB version.

Parameters:

y : ndarray

A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.

method : str, optional

The linkage algorithm to use. See the Linkage Methods section below for full descriptions.

metric : str or function, optional

The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the distance.pdist function for a list of valid distance metrics. A custom distance function can also be used. See the distance.pdist function for details.

Returns:

Z : ndarray

The hierarchical clustering encoded as a linkage matrix.

Notes

  1. For method ‘single’ an optimized algorithm called SLINK is implemented, which has O(n2) time complexity. For methods ‘complete’, ‘average’, ‘weighted’ and ‘ward’ an algorithm called nearest-neighbors chain is implemented, which too has time complexity O(n2). For other methods a naive algorithm is implemented with O(n3) time complexity. All algorithms use O(n2) memory. Refer to [R38] for details about the algorithms.
  2. Methods ‘centroid’, ‘median’ and ‘ward’ are correctly defined only if Euclidean pairwise metric is used. If y is passed as precomputed pairwise distances, then it is a user responsibility to assure that these distances are in fact Euclidean, otherwise the produced result will be incorrect.

References

[R38](1, 2) Daniel Mullner, “Modern hierarchical, agglomerative clustering algorithms”, arXiv:1109.2378v1 , 2011.