Hierarchical clustering

We describe bottom-up (a.k.a. agglomerative) hierarchical clustering:

  1. Initialization: Compute all the n(n−1)/2 pairwise dissimilarities between the n observations. Treat each observation as its own cluster. A typical dissimilarity measure is the Euclidean distance; other dissimilarities, such as 1 − correlation or the Manhattan distance, can also be used.

  2. For i = n, n−1, ..., 2:

  • Fuse the two clusters that are least dissimilar. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.

  • Compute the new pairwise inter-cluster dissimilarities among the i−1 remaining clusters using the linkage rule (a minimal code sketch of these steps follows this list).
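
To make the steps concrete, here is a minimal sketch of the procedure in Python, using complete linkage by default (single linkage as an alternative). NumPy is assumed to be available; the function name `agglomerative` and all variable names are illustrative, not part of the notes.

```python
# Minimal sketch of bottom-up (agglomerative) hierarchical clustering.
# Assumes NumPy; names are illustrative.
import numpy as np

def agglomerative(X, linkage="complete"):
    n = len(X)
    # Step 1: all n(n-1)/2 pairwise Euclidean dissimilarities,
    # stored here as a full symmetric matrix for simplicity.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    clusters = {i: [i] for i in range(n)}   # each observation starts as its own cluster
    merges = []                              # (cluster_a, cluster_b, fusion height)
    next_id = n
    while len(clusters) > 1:                 # i = n, n-1, ..., 2
        # Step 2a: find the pair of clusters that is least dissimilar.
        best = None
        ids = list(clusters)
        for a_pos, a in enumerate(ids):
            for b in ids[a_pos + 1:]:
                pair = D[np.ix_(clusters[a], clusters[b])]
                # complete linkage: largest pairwise dissimilarity;
                # single linkage: smallest pairwise dissimilarity.
                d = pair.max() if linkage == "complete" else pair.min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        height, a, b = best
        merges.append((a, b, height))        # height = level of the fusion in the dendrogram
        # Step 2b: fuse the two clusters; inter-cluster dissimilarities are
        # recomputed from D on the next pass via the linkage rule.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 2))
    for a, b, h in agglomerative(X):
        print(f"fuse {a} and {b} at height {h:.3f}")
```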

A linkage rule defines the dissimilarity between two clusters. Here are four popular linkage rules (a short usage sketch with SciPy follows the list):

  • Complete: The dissimilarity between cluster A and cluster B is the largest dissimilarity between any element of A and any element of B.

  • Single: The dissimilarity between cluster A and cluster B is the smallest dissimilarity between any element of A and any element of B. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time.

  • Average: The dissimilarity between cluster A and cluster B is the average of all pairwise dissimilarities between elements of A and elements of B.

  • Centroid: The dissimilarity between cluster A and cluster B is the dissimilarity between the centroids (mean vector) of A and B. Centroid linkage can result in undesirable inversions.
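
In practice the same four linkage rules are available in SciPy's scipy.cluster.hierarchy.linkage. The snippet below is a short usage sketch on made-up data, assuming SciPy and NumPy are installed; the random data and printed summary are illustrative only.

```python
# Usage sketch: the four linkage rules via SciPy on illustrative data.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

d = pdist(X, metric="euclidean")            # the n(n-1)/2 pairwise dissimilarities
for method in ("complete", "single", "average", "centroid"):
    # Each row of Z records one fusion: (cluster i, cluster j, height, size).
    # Note: "centroid" assumes the dissimilarities are Euclidean distances.
    Z = linkage(d, method=method)
    print(method, "first fusion at height", round(Z[0, 2], 3))
```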

