Stata Help

Cluster Analysis: Agglomerative Methods

As alluded to on the main cluster analysis page, Stata offers seven agglomerative clustering commands. Each method uses a different criterion to merge clusters as the hierarchy progresses. Below is a very brief overview of the seven methods.

One method is single-linkage clustering (single). In single-linkage clustering (also known as nearest-neighbor clustering), the distance between two clusters is defined as the distance between their two closest elements, one from each cluster. This can cause clusters that should be separate to be merged because a single element of one happens to lie close to a single element of the other. The opposite method is complete-linkage (complete), which measures the distance between two clusters using their two furthest-apart elements.
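The difference between the two methods can be seen by running both on the same data. The following is a minimal sketch, assuming Stata's bundled auto dataset as stand-in data; the variables price, mpg, and weight and the names sl, cl, sl3, and cl3 are chosen only for illustration.

```stata
* Assumption: the bundled auto dataset stands in for real data
sysuse auto, clear

* Single linkage and complete linkage on the same three variables
cluster singlelinkage price mpg weight, name(sl)
cluster completelinkage price mpg weight, name(cl)

* Cut each hierarchy into three groups and cross-tabulate the assignments
cluster generate sl3 = groups(3), name(sl)
cluster generate cl3 = groups(3), name(cl)
tabulate sl3 cl3
```

If the two methods agreed perfectly, each row of the table would have a single nonzero cell. Single linkage often produces one large group plus a few stragglers, which is the chaining behavior described above.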

A conceptually related method is average-linkage (average). In average-linkage (also known as UPGMA), the distance between two clusters is calculated as the average of the distances between every pair of elements, one from each cluster. Weighted-average linkage (waverage) is similar, but it gives each cluster equal weight regardless of how many elements it contains, whereas average-linkage gives each element equal weight. The choice between the two therefore matters most when clusters are not of approximately equal size.

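Centroid-linkage (centroid) calculates the distance between two clusters as the distance between their group means (centroids), so every element is equally valued in the calculation. Median-linkage (median) is a variation on centroid-linkage that instead gives each cluster equal weight when two clusters are merged, so with clusters of unequal size the elements of the smaller cluster count for more than those of the larger one.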

Ward's linkage method (wards) is the most conceptually complex. In this method, clusters are created using the idea of minimizing information loss. The measure for this is known as the error sum-of-squares (ESS) criterion: at each step, the two clusters whose merger produces the smallest increase in ESS are joined. The link below offers a very clean example of what that looks like in practice. Research suggests this method does not work well when groups contain unequal numbers of observations.
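As a rough sketch of Ward's method in practice, the example below again assumes the bundled auto dataset; the standardized variables zprice, zmpg, and zweight and the name wl are illustrative choices, not part of the original example. Standardizing first is a common precaution, not a requirement of the command.

```stata
* Assumption: the bundled auto dataset stands in for real data
sysuse auto, clear

* Standardize so that no single variable dominates the distance calculation
egen zprice  = std(price)
egen zmpg    = std(mpg)
egen zweight = std(weight)

* Ward's linkage; squared Euclidean distance is requested explicitly here
cluster wardslinkage zprice zmpg zweight, measure(L2squared) name(wl)

* Calinski-Harabasz pseudo-F for 2 to 6 groups; larger values suggest
* more distinct clustering
cluster stop wl, rule(calinski) groups(2/6)
```

The output of cluster stop can help decide how many groups to keep before generating a grouping variable.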

For a nice example of many of the above with useful illustrations see here.

To perform a cluster analysis using the above methods, type cluster, followed by the subcommand indicated in parentheses above (each is an abbreviation of the full subcommand name, such as singlelinkage), followed by the variables on which to cluster. For example, to perform a single-linkage cluster analysis of three variables called weight, height, and health, the command would be:

cluster single weight height health
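A fuller workflow might look like the sketch below. The six observations are made up purely for illustration, and the names sl and grp are hypothetical; only the cluster single command itself comes from the example above.

```stata
* Hypothetical data: six made-up observations, for illustration only
clear
input weight height health
120 62 4
126 63 5
185 70 2
190 72 3
150 66 4
158 67 3
end

* The command from the text, with a name so later commands can refer to it
cluster single weight height health, name(sl)

* Draw the dendrogram, then save a two-group solution as a new variable
cluster dendrogram sl
cluster generate grp = groups(2), name(sl)
list weight height health grp
```

Naming the analysis with name() is optional, but it makes it easier to keep several cluster analyses in memory and refer to each one later.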
