
To implement K-means clustering, a user needs to specify the number of clusters \(K\) in advance.
However, it is not easy to accurately estimate the true number of clusters in the data.
Hierarchical clustering does not require the number of clusters to be specified in advance.
However, users need to determine which level of the hierarchy is sensible for obtaining the eventual clusters.
There are two modes of hierarchical clustering: “bottom-up” (or “agglomerative”) and “top-down” (or “divisive”).
We will focus on “bottom-up” (or “agglomerative”) HC.
Note: distinct class labels will be indicated by distinct colors

Suppose we do not know the class labels and perform hierarchical clustering using some dissimilarity measure.
We obtain a dendrogram that represents hierarchical clustering results as a tree-like structure

A dendrogram is an upside-down tree with a height bar, where each leaf represents an observation and the height at which two branches fuse indicates the dissimilarity between the corresponding groups.

When hierarchical clustering is represented by a dendrogram, cutting the dendrogram at a given height yields a set of clusters, and the choice of cut height controls how many clusters are obtained.
Hierarchical clustering often produces nested clusters: when a single criterion determines the groups in the data, the clusters obtained by cutting the dendrogram at a given height are necessarily nested within the clusters obtained by cutting it at any greater height.
However, nested clusters will not necessarily arise when different sets of clusters are produced by different criteria.
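The nesting property can be checked directly in R. The sketch below uses small synthetic data (an assumption for illustration): cutting the same tree into 3 groups refines the 2-group cut, so each 3-group cluster lies entirely inside one 2-group cluster.

```r
# Sketch on synthetic data: clusters from cutting a dendrogram into
# k = 3 groups are nested within the clusters from k = 2.
set.seed(1)
x <- matrix(rnorm(30 * 2), ncol = 2)
hc <- hclust(dist(x), method = "complete")
g2 <- cutree(hc, k = 2)   # coarser clustering (greater cut height)
g3 <- cutree(hc, k = 3)   # finer clustering (smaller cut height)
# Nesting: cross-tabulating the two labelings gives exactly one
# nonzero entry per row, i.e., each g3 cluster maps to a single g2 cluster.
table(g3, g2)
```

The gender/nationality example above is precisely a situation where no single tree can produce both labelings this way.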
Suppose we have observations from a group of people of both genders and several nationalities, where the best split into two groups is by gender and the best split into three groups is by nationality.
So, the two sets of groups are not nested, in the sense that the best 3-group clustering cannot be obtained by splitting one group of the best 2-group clustering (since a group of people with the same nationality must contain people of both genders), and hence the data cannot be well represented by hierarchical clustering.
Using some dissimilarity measure, hierarchical clustering will split the \(n\) observations \(\boldsymbol{x}_i\) into sets of clusters at different dissimilarity levels (i.e., at different heights)

The choice of a dissimilarity measure is crucial to hierarchical clustering, and there are many choices for this measure:
Note: inter-cluster dissimilarity and pairwise dissimilarity should be compatible with each other in order to obtain hierarchical clustering
Agglomerative HC with Euclidean distance and complete linkage:


Single linkage, i.e., minimal intercluster dissimilarity: compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time
Centroid: dissimilarity between the centroid for cluster A and the centroid for cluster B. Centroid linkage may induce inversion in clustering, where two clusters are fused at a height below either of the individual clusters in the dendrogram
Average linkage, i.e., mean intercluster dissimilarity: compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities
Complete linkage, i.e., maximal intercluster dissimilarity: compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities
Note: Average and complete linkages are generally preferred over single linkage, since they tend to yield more balanced dendrograms
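The linkages above can be compared on the same data and the same Euclidean distances. This is a sketch on synthetic data (an assumption); note that the very first merge is identical under every linkage, since min, mean, and max coincide for a pair of singleton clusters.

```r
# Sketch: same data and Euclidean distances, three different linkages.
set.seed(2)
x <- matrix(rnorm(40 * 2), ncol = 2)
d <- dist(x)                              # pairwise Euclidean distances
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")
hc_complete <- hclust(d, method = "complete")
# The first merge fuses the two closest observations regardless of
# linkage, because all linkages agree on pairs of singletons.
c(hc_single$height[1], hc_average$height[1], hc_complete$height[1])
```

Plotting the three `hclust` objects side by side is an easy way to see the trailing clusters that single linkage tends to produce.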

Some practical issues on hierarchical clustering:
The choice of dissimilarity measure

Consider an online retailer interested in clustering shoppers based on their past shopping histories. The goal is to identify subgroups of similar shoppers, so that shoppers within each subgroup can be shown items and advertisements that are particularly likely to interest them.
Suppose the data takes the form of a matrix where the rows are the shoppers and the columns are the items available for purchase; the elements of the data matrix indicate the number of times a given shopper has purchased a given item (i.e. a 0 if the shopper has never purchased this item, a 1 if the shopper has purchased it once, etc.)
What type of dissimilarity measure should be used to cluster the shoppers?
If Euclidean distance is used, then shoppers who have bought very few items overall (i.e. infrequent users of the online shopping site) will be clustered together. This may not be desirable.
If correlation-based distance is used, then shoppers with similar preferences (e.g. shoppers who have bought items A and B but never items C or D) will be clustered together, even if some shoppers with these preferences are higher-volume shoppers than others.
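Correlation-based distance can be induced with `as.dist`. The sketch below uses a hypothetical shopper-by-item matrix constructed so that shoppers 6–10 have the same preference profiles as shoppers 1–5 but at ten times the purchase volume; under correlation-based distance these pairs are (numerically) identical.

```r
# Sketch: correlation-based distance ignores purchase volume.
set.seed(3)
profile <- matrix(runif(5 * 8), nrow = 5)   # 5 base preference profiles
x <- rbind(profile, 10 * profile)           # same profiles, 10x the volume
# cor() works column-wise, so transpose to correlate shoppers (rows).
d_cor <- as.dist(1 - cor(t(x)))
hc <- hclust(d_cor, method = "complete")
# Shopper i and shopper i + 5 have identical profiles up to scale,
# so their correlation-based distance is (numerically) zero.
round(as.matrix(d_cor)[1, 6], 8)
```

Under Euclidean distance, by contrast, each low-volume shopper would be far from its high-volume counterpart.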
Consider an online retailer interested in clustering shoppers based on their past shopping histories. The goal is to identify subgroups of similar shoppers, so that shoppers within each subgroup can be shown items and advertisements that are particularly likely to interest them.
Suppose the data takes the form of a matrix where the rows are the shoppers and the columns are the items available for purchase; the elements of the data matrix indicate the number of times a given shopper has purchased a given item (i.e. a 0 if the shopper has never purchased this item, a 1 if the shopper has purchased it once, etc.)
If we choose a dissimilarity measure that is easily affected by the magnitudes of the observations, then high-frequency purchases like socks tend to have a much larger effect on the inter-shopper dissimilarities, and hence on the clustering ultimately obtained, than rare purchases like computers. This may not be desirable.
If the variables are scaled to have standard deviation one before the inter-observation dissimilarities are computed, then each variable will in effect be given equal importance in the hierarchical clustering performed.
Left: raw data on purchase frequency; Center: scaled data on purchase frequency; Right: raw data on purchase expense

We might also want to scale the variables to have standard deviation one if they are measured on different scales; otherwise, the choice of units (e.g. centimeters versus kilometers) for a particular variable will greatly affect the dissimilarity measure obtained.
Whether or not it is a good decision to scale the variables before computing the dissimilarity measure depends on the application at hand.
The issue of whether or not to scale the variables before performing clustering applies to K-means clustering as well.
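In R, scaling to standard deviation one is done with `scale` before computing distances. The sketch below uses hypothetical height/weight variables (an assumption) measured on very different scales.

```r
# Sketch: standardize each variable before computing dissimilarities,
# so no variable dominates purely through its choice of units.
set.seed(4)
height_cm <- rnorm(10, mean = 170, sd = 10)
weight_kg <- rnorm(10, mean = 70, sd = 5)
x <- cbind(height_cm, weight_kg)
x_scaled <- scale(x)                  # center and scale each column
d_raw    <- dist(x)                   # dominated by the larger-scale variable
d_scaled <- dist(x_scaled)            # each variable given equal importance
# After scaling, every column has standard deviation 1.
apply(x_scaled, 2, sd)
```

The same `scale(x)` step can precede `kmeans` as well, since the scaling issue applies to K-means clustering too.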
Suppose that most of the observations truly belong to a small number of (unknown) subgroups, and a small subset of the observations are quite different from each other and from all other observations.
Since both K-means and hierarchical clustering will assign each observation to a cluster, the clusters found may be heavily distorted due to the presence of outliers that do not belong to any cluster.
Mixture models are an attractive approach for accommodating the presence of such outliers.
Suppose that we cluster \(n\) observations, and then cluster the observations again after removing a subset of the \(n\) observations at random.
We would hope that the two sets of clusters obtained would be quite similar. But often this is not the case!
Clustering methods generally are not very robust to perturbations to the data.
Perform clustering with different choices of standardization, dissimilarity measures, and linkages, and look at the full set of results in order to see what patterns consistently emerge.
Clustering subsets of the data in order to get a sense of the robustness of the clusters obtained.
Clustering results should not be taken as the absolute truth about a data set. Rather, they should constitute a starting point for the development of a scientific hypothesis and further study, preferably on an independent data set.
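A simple robustness check along these lines can be sketched as follows (synthetic data, 20% of observations dropped at random, both numbers are assumptions for illustration):

```r
# Sketch: re-cluster after dropping a random subset of observations and
# compare the labelings on the retained observations.
set.seed(5)
x <- matrix(rnorm(50 * 2), ncol = 2)
full <- cutree(hclust(dist(x), method = "complete"), k = 3)
keep <- sort(sample(1:50, 40))        # drop 10 observations at random
sub  <- cutree(hclust(dist(x[keep, ]), method = "complete"), k = 3)
# Cross-tabulate: if the clustering were perfectly stable, each
# subsample cluster would map onto a single full-data cluster.
table(full[keep], sub)
```

In practice the cross-tabulation is often far from one-to-one, which is exactly the lack of robustness described above.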
R commands and libraries needed:
dist or as.dist to compute pairwise dissimilarities
hclust applies to pairwise dissimilarities to implement HC
cutree applies to the output of hclust to obtain clusters
ggdendro and the command ggdendrogram{ggdendro} create a dendrogram from the output of hclust

Workflow for software implementation:
dist
dist computes and returns the distance matrix obtained by using the specified distance measure to compute the distances between the rows of a data matrix. Its basic syntax is
dist(x, method = "euclidean")
x: a numeric matrix, data frame or “dist” object.
method: the distance measure to be used. It must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”.

as.dist
as.dist converts an arbitrary square symmetric matrix into a form that hclust recognizes as a distance matrix. It can be used to induce correlation-based distance. Its basic syntax is
as.dist(x, ...)
x: a numeric matrix, data frame or “dist” object.

hclust
hclust implements hierarchical cluster analysis on a set of dissimilarities using a specified linkage. Its basic syntax is
hclust(d, method = "complete")
d: a dissimilarity structure as produced by dist.
method: the agglomeration method to be used. It can be “single”, “complete”, “average” or “centroid”.

When clustering \(n\) observations, hclust returns:
height: a set of \(n-1\) real values called “clustering heights”. Each height is the value of the dissimilarity measure at the corresponding agglomeration.
labels: labels for each of the objects being clustered.

cutree
cutree cuts a tree, e.g., as resulting from hclust, into several groups either by specifying the desired number(s) of groups or the cut height(s). Its basic syntax is
cutree(tree, k = NULL, h = NULL)
tree: a tree as produced by hclust.
k: an integer scalar or vector with the desired number of groups.
h: numeric scalar or vector with heights where the tree should be cut.

ggdendrogram
ggdendrogram{ggdendro} creates a dendrogram using the output of hclust. Its basic syntax is
library(ggdendro)
ggdendrogram(data, leaf_labels = TRUE, rotate = FALSE)
data: an object of class dendrogram, hclust or tree.
leaf_labels: if TRUE, shows leaf labels.
rotate: if TRUE, rotates the plot by 90 degrees.

Note: the command plot can be applied to an hclust object to create a dendrogram.
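The full workflow above can be sketched end-to-end on synthetic data (the data and the choice of k = 4 are assumptions for illustration):

```r
# Sketch of the workflow: dist -> hclust -> cutree -> dendrogram.
set.seed(6)
x <- matrix(rnorm(20 * 3), ncol = 3)
rownames(x) <- paste0("obs", 1:20)          # labels used as dendrogram leaves
d  <- dist(x, method = "euclidean")         # pairwise dissimilarities
hc <- hclust(d, method = "complete")        # agglomerative HC
clusters <- cutree(hc, k = 4)               # cut the tree into 4 groups
table(clusters)                             # cluster sizes
# Dendrogram via ggdendro if installed; plot(hc) is the base-R fallback.
if (requireNamespace("ggdendro", quietly = TRUE)) {
  ggdendro::ggdendrogram(hc, leaf_labels = TRUE, rotate = FALSE)
} else {
  plot(hc)
}
```

`cutree(hc, h = ...)` could be used instead of `k = 4` to cut at a chosen height rather than a chosen number of groups.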
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12