scipy.cluster.hierarchy.linkage
- scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)
Perform hierarchical/agglomerative clustering.
The input y may be either a 1-D condensed distance matrix or a 2-D array of observation vectors.
If y is a 1-D condensed distance matrix, then y must be a $\binom{n}{2}$ sized vector, where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB linkage function.
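For instance, the condensed output of pdist has exactly $\binom{n}{2}$ entries (a quick check; the random 5-by-3 array is purely illustrative):

>>> import numpy as np
>>> from scipy.spatial.distance import pdist
>>> from scipy.special import comb
>>> rng = np.random.default_rng(0)
>>> y = pdist(rng.random((5, 3)))     # 5 observations in 3 dimensions
>>> len(y) == comb(5, 2, exact=True)  # binom(5, 2) = 10 pairwise distances
True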
An (n - 1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
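To make this encoding concrete, here is a minimal sketch on three 1-D points (the data are arbitrary; the comments read off each row):

>>> from scipy.cluster.hierarchy import linkage
>>> X = [[0.], [1.], [5.]]   # n = 3 original observations, indices 0, 1, 2
>>> Z = linkage(X, 'single')
>>> Z[0]   # clusters 0 and 1 merge at distance 1.0 into cluster 3 (2 observations)
array([0., 1., 1., 2.])
>>> Z[1]   # cluster 2 joins cluster 3 at distance 4.0 (3 observations in total)
array([2., 3., 4., 3.])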
The following linkage methods are used to compute the distance d(s, t) between two clusters s and t. The algorithm begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters s and t from this forest are combined into a single cluster u, s and t are removed from the forest, and u is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root.

A distance matrix is maintained at each iteration. The d[i, j] entry corresponds to the distance between clusters i and j in the original forest.

At each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster u with the remaining clusters in the forest.
Suppose there are |u| original observations u[0], ..., u[|u|-1] in cluster u and |v| original objects v[0], ..., v[|v|-1] in cluster v. Recall that s and t are combined to form cluster u. Let v be any remaining cluster in the forest that is not u.

The following are methods for calculating the distance between the newly formed cluster u and each remaining cluster v (a numerical check of two of these rules appears after the method descriptions below).
method='single' assigns

$$d(u, v) = \min(\operatorname{dist}(u[i], v[j]))$$

for all points i in cluster u and j in cluster v. This is also known as the Nearest Point Algorithm.
method='complete' assigns

$$d(u, v) = \max(\operatorname{dist}(u[i], v[j]))$$

for all points i in cluster u and j in cluster v. This is also known as the Farthest Point Algorithm or Voor Hees Algorithm.
method='average' assigns

$$d(u, v) = \sum_{ij} \frac{d(u[i], v[j])}{|u| \cdot |v|}$$

for all points i and j, where |u| and |v| are the cardinalities of clusters u and v, respectively. This is also called the UPGMA algorithm.
method='weighted' assigns

$$d(u, v) = \frac{\operatorname{dist}(s, v) + \operatorname{dist}(t, v)}{2}$$

where cluster u was formed from clusters s and t, and v is a remaining cluster in the forest. This is also called the WPGMA algorithm.
method='centroid' assigns

$$\operatorname{dist}(s, t) = \lVert c_s - c_t \rVert_2$$

where c_s and c_t are the centroids of clusters s and t, respectively. When two clusters s and t are combined into a new cluster u, the new centroid is computed over all the original objects in clusters s and t. The distance then becomes the Euclidean distance between the centroid of u and the centroid of a remaining cluster v in the forest. This is also known as the UPGMC algorithm.
method='median' assigns d(s, t) like the centroid method. When two clusters s and t are combined into a new cluster u, the average of the centroids of s and t gives the new centroid u. This is also known as the WPGMC algorithm.

method='ward' uses the Ward variance minimization algorithm. The new entry d(u, v) is computed as follows,

$$d(u, v) = \sqrt{\frac{|v| + |s|}{T}\, d(v, s)^2 + \frac{|v| + |t|}{T}\, d(v, t)^2 - \frac{|v|}{T}\, d(s, t)^2}$$

where u is the newly joined cluster consisting of clusters s and t, v is an unused cluster in the forest, T = |v| + |s| + |t|, and |*| is the cardinality of its argument. This is also known as the incremental algorithm.
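As a quick numerical check of the 'single' and 'complete' update rules (a sketch; the four 1-D points are chosen so that the pairs {0, 1} and {2, 3} merge first, and the final merge joins the two pairs):

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage
>>> from scipy.spatial.distance import cdist
>>> X = np.array([[0.], [1.], [5.], [6.]])
>>> d = cdist(X, X)                      # full pairwise distance matrix
>>> cross = d[np.ix_([0, 1], [2, 3])]    # distances between the two pairs
>>> bool(linkage(X, 'single')[-1, 2] == cross.min())    # nearest cross-pair point
True
>>> bool(linkage(X, 'complete')[-1, 2] == cross.max())  # farthest cross-pair point
True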
Warning: When the minimum distance pair in the forest is chosen, there may be two or more pairs with the same minimum distance. This implementation may choose a different minimum than the MATLAB version.
- Parameters:
- y : ndarray
A condensed distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs. (Both input forms give the same result; see the sketch after the Returns section below.)
- method : str, optional
The linkage algorithm to use. See the linkage method descriptions above for full details.
- metric : str or function, optional
The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.
- optimal_ordering : bool, optional
If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal. This results in a more intuitive tree structure when the data are visualized. Defaults to False, because this algorithm can be slow, particularly on large datasets [2]. See also the optimal_leaf_ordering function.
New in version 1.0.0.
- Returns:
- Z : ndarray
The hierarchical clustering encoded as a linkage matrix.
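As noted above, both ways of passing y produce the same linkage matrix; a minimal sketch (the random data and seed are purely illustrative):

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage
>>> from scipy.spatial.distance import pdist
>>> rng = np.random.default_rng(1)
>>> X = rng.random((6, 2))                        # 6 observation vectors in 2 dimensions
>>> Z_obs = linkage(X, method='average', metric='euclidean')
>>> Z_cond = linkage(pdist(X), method='average')  # precomputed condensed distances
>>> np.allclose(Z_obs, Z_cond)
True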
See also
scipy.spatial.distance.pdist
pairwise distance metrics
Notes
For method 'single', an optimized algorithm based on minimum spanning tree is implemented. It has time complexity $O(n^2)$. For methods 'complete', 'average', 'weighted' and 'ward', an algorithm called nearest-neighbors chain is implemented. It also has time complexity $O(n^2)$. For other methods, a naive algorithm is implemented with $O(n^3)$ time complexity. All algorithms use $O(n^2)$ memory. Refer to [1] for details about the algorithms.
Methods 'centroid', 'median', and 'ward' are correctly defined only if the Euclidean pairwise metric is used. If y is passed as precomputed pairwise distances, then it is the user's responsibility to ensure that these distances are in fact Euclidean; otherwise the produced result will be incorrect.
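For example, a condensed matrix computed by pdist with its default Euclidean metric can safely be combined with these methods (a minimal sketch; the data are illustrative). In current SciPy, passing observation vectors together with a non-Euclidean metric to one of these methods raises a ValueError instead:

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage
>>> from scipy.spatial.distance import pdist
>>> rng = np.random.default_rng(2)
>>> X = rng.random((5, 3))
>>> y = pdist(X)                 # Euclidean by default, so 'centroid' is well defined
>>> Z = linkage(y, method='centroid')
>>> Z.shape                      # (n - 1) merges, 4 columns
(4, 4)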
References
[1] Daniel Mullner, "Modern hierarchical, agglomerative clustering algorithms", arXiv:1109.2378v1.
[2] Ziv Bar-Joseph, David K. Gifford, Tommi S. Jaakkola, "Fast optimal leaf ordering for hierarchical clustering", Bioinformatics, 2001. DOI:10.1093/bioinformatics/17.suppl_1.S22
Examples
>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> from matplotlib import pyplot as plt
>>> X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]
>>> Z = linkage(X, 'ward')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)
>>> Z = linkage(X, 'single')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)
>>> plt.show()