Cluster Analysis: Partition Methods

Stata offers two commands for partitioning observations into k clusters: cluster kmeans and cluster kmedians, which use group means and group medians, respectively, to form the partitions. Both require the k([number of groups]) option. Any further options depend on the details of your situation.

The measure option determines which similarity (or dissimilarity) measure is used (see help measure_option within Stata or the measure_option entry in Stata's Multivariate Statistics manual). The choice is particularly relevant because different measures are appropriate for continuous and for binary data.
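As a minimal sketch of the measure option, the commands below request absolute-value (L1) distance instead of the default. The auto data set and the variable choices are assumptions made only for illustration; substitute your own variables and whichever measure suits your data.

```stata
* Illustrative data only: the auto data set that ships with Stata
sysuse auto, clear

* k-medians with absolute-value (L1) distance on three continuous variables
cluster kmedians price mpg weight, k(3) measure(L1) name(kmedL1) generate(grpL1)

* Group sizes under this measure
tabulate grpL1
```

For binary variables, a binary similarity measure such as measure(matching) would take the place of measure(L1).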

Usefully, you can also give the cluster analysis a name via the name([name of cluster]) option. This can be a good way to differentiate between iterations of the command if you try multiple k values.
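One way to keep several runs apart is to build the name from the value of k. The following is a sketch under assumed data (auto) and assumed names (km2, km3, km4); the seed inside krandom() is arbitrary and is there only so the runs can be reproduced.

```stata
* Illustrative data only
sysuse auto, clear

* Run k-means for k = 2, 3, 4 and store each analysis under its own name
forvalues k = 2/4 {
    cluster kmeans price mpg weight, k(`k') name(km`k') start(krandom(12345))
}

* List the cluster analyses currently stored with the data
cluster dir
```

Because no generate() option is given here, each grouping variable takes the same name as its analysis (km2, km3, km4).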

Additionally, you can select the method by which the initial group centers are determined using the start([option]) option. There are eight start options. Three of them (krandom, random, and prandom) choose the starting centers by various random methods. One (firstk) uses the first k observations as the starting centers, and one (lastk) uses the last k observations. The remaining three form initial groups deterministically: everykth assigns every kth observation to the same group, segments splits the data into k roughly equal consecutive blocks, and group(varname) takes the initial groups from an existing variable. For more information, see help cluster kmeans, which explains each start option.
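To see how the starting centers can affect the result, the sketch below runs the same analysis with two different start options and cross-tabulates the resulting groups. The data set, variables, names, and seed are all assumptions for illustration.

```stata
* Illustrative data only
sysuse auto, clear

* Starting centers: k distinct observations chosen at random (seed fixed for reproducibility)
cluster kmeans price mpg weight, k(3) start(krandom(2024)) name(km_rand) generate(g_rand)

* Starting centers: the first k observations in the data set
cluster kmeans price mpg weight, k(3) start(firstk) name(km_first) generate(g_first)

* Cross-tabulate the two solutions; group numbers are arbitrary labels,
* so look at whether observations are grouped together, not at matching numbers
tabulate g_rand g_first
```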

The keepcenters option tells Stata to retain the group means (or medians, depending on which command you use) and append them to the data set (i.e., your last k observations in the data set are now the means or medians from your k groups).
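A short sketch of keepcenters follows, again using the auto data set purely as an assumed example. Counting observations before and after shows the k appended rows.

```stata
* Illustrative data only
sysuse auto, clear

* Number of observations before clustering
count

* Keep the final group means as extra observations at the end of the data
cluster kmeans price mpg weight, k(3) name(km3) keepcenters

* The count is now larger by k (here, 3)
count

* Inspect the appended centers: the last three observations
list price mpg weight in -3/l
```

Since the appended rows are not real observations, you may want to drop them before running further analyses on the data.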

There are two advanced options as well. The first is generate([groupvar]), which creates a new variable in the data set recording the group to which the cluster analysis assigned each observation. The second is iterate([value]), which limits the number of iterations the clustering algorithm may perform. The default is 10,000.
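The two options can be combined as in the sketch below. The variable name grp and the iteration cap of 50 are arbitrary choices for illustration, as is the auto data set.

```stata
* Illustrative data only
sysuse auto, clear

* Store group membership in grp and stop after at most 50 iterations
cluster kmeans price mpg weight, k(3) name(km3) generate(grp) iterate(50)

* Group sizes
tabulate grp

* Means of the clustering variables within each group
tabstat price mpg weight, by(grp) statistics(mean)
```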

Note: All of the above applies to kmeans and kmedians.

The basic syntax is simply cluster kmeans [variables for clustering], k([# of groups]) [additional options]. For examples, see help cluster kmeans or the [MV] cluster kmeans and kmedians manual entry.
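A minimal sketch of this syntax, for both commands, might look like the following. The auto data set, the three variables, and the analysis names are assumptions chosen only to make the example runnable.

```stata
* Illustrative data only
sysuse auto, clear

* Partition the observations into 3 groups using group means
cluster kmeans price mpg weight, k(3) name(kmean3)

* The same partitioning using group medians
cluster kmedians price mpg weight, k(3) name(kmed3)

* Compare the two solutions (grouping variables are named after the analyses)
tabulate kmean3 kmed3
```

Note that these variables are on very different scales, so the one with the largest spread (price) will tend to dominate the distance calculation; standardizing the variables first is a common remedy.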
