scipy.cluster.hierarchy.linkage¶

scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean')[source]¶

Performs hierarchical/agglomerative clustering on the condensed distance matrix y.

y must be a ${n \choose 2}$ sized vector where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB linkage function.

A 4 by $(n-1)$ matrix Z is returned. At the $i$ -th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster $n + i$ . A cluster with an index less than $n$ corresponds to one of the $n$ original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

The following linkage methods are used to compute the distance $d(s, t)$ between two clusters $s$ and $t$ . The algorithm begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters $s$ and $t$ from this forest are combined into a single cluster $u$ , $s$ and $t$ are removed from the forest, and $u$ is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root.

A distance matrix is maintained at each iteration. The d[i,j] entry corresponds to the distance between cluster $i$ and $j$ in the original forest.

At each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster u with the remaining clusters in the forest.

Suppose there are $|u|$ original observations $u[0], \ldots, u[|u|-1]$ in cluster $u$ and $|v|$ original objects $v[0], \ldots, v[|v|-1]$ in cluster $v$ . Recall $s$ and $t$ are combined to form cluster $u$ . Let $v$ be any remaining cluster in the forest that is not $u$ .

The following are methods for calculating the distance between the newly formed cluster $u$ and each $v$ .

method=’single’ assigns

$d(u,v) = \min(dist(u[i],v[j]))$

for all points $i$ in cluster $u$ and $j$ in cluster $v$ . This is also known as the Nearest Point Algorithm.

method=’complete’ assigns

$d(u, v) = \max(dist(u[i],v[j]))$

for all points $i$ in cluster u and $j$ in cluster $v$ . This is also known by the Farthest Point Algorithm or Voor Hees Algorithm.

method=’average’ assigns

$d(u,v) = \sum_{ij} \frac{d(u[i], v[j])} {(|u|*|v|)}$

for all points $i$ and $j$ where $|u|$ and $|v|$ are the cardinalities of clusters $u$ and $v$ , respectively. This is also called the UPGMA algorithm. This is called UPGMA.

method=’weighted’ assigns

$d(u,v) = (dist(s,v) + dist(t,v))/2$

where cluster u was formed with cluster s and t and v is a remaining cluster in the forest. (also called WPGMA)

method=’centroid’ assigns

$dist(s,t) = ||c_s-c_t||_2$

where $c_s$ and $c_t$ are the centroids of clusters $s$ and $t$ , respectively. When two clusters $s$ and $t$ are combined into a new cluster $u$ , the new centroid is computed over all the original objects in clusters $s$ and $t$ . The distance then becomes the Euclidean distance between the centroid of $u$ and the centroid of a remaining cluster $v$ in the forest. This is also known as the UPGMC algorithm.

method=’median’ assigns math:d(s,t) like the centroid method. When two clusters $s$ and $t$ are combined into a new cluster $u$ , the average of centroids s and t give the new centroid $u$ . This is also known as the WPGMC algorithm.

method=’ward’ uses the Ward variance minimization algorithm. The new entry $d(u,v)$ is computed as follows,

$d(u,v) = \sqrt{\frac{|v|+|s|} {T}d(v,s)^2 + \frac{|v|+|t|} {T}d(v,t)^2 + \frac{|v|} {T}d(s,t)^2}$

where $u$ is the newly joined cluster consisting of clusters $s$ and $t$ , $v$ is an unused cluster in the forest, $T=|v|+|s|+|t|$ , and $|*|$ is the cardinality of its argument. This is also known as the incremental algorithm.

Warning: When the minimum distance pair in the forest is chosen, there may be two or more pairs with the same minimum distance. This implementation may chose a different minimum than the MATLAB version.

Parameters:

Parameters:	y : ndarray A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that `pdist` returns. Alternatively, a collection of $m$ observation vectors in n dimensions may be passed as an $m$ by $n$ array. method : str, optional The linkage algorithm to use. See the `Linkage Methods` section below for full descriptions. metric : str, optional The distance metric to use. See the `distance.pdist` function for a list of valid distance metrics.
Returns:	Z : ndarray The hierarchical clustering encoded as a linkage matrix.

y : ndarray

A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of $m$ observation vectors in n dimensions may be passed as an $m$ by $n$ array.

method : str, optional

The linkage algorithm to use. See the Linkage Methods section below for full descriptions.

metric : str, optional

The distance metric to use. See the distance.pdist function for a list of valid distance metrics.

Returns:

Z : ndarray

The hierarchical clustering encoded as a linkage matrix.

Previous topic

scipy.cluster.hierarchy.leaders

Next topic

scipy.cluster.hierarchy.single