
Randomly generate observations from 2 bivariate Gaussian distributions with identity covariance matrix, so that \(x_i \sim N(\mu_1, I_2)\) for \(i = 1, \ldots, 10\) with \(\mu_1 = (3, 0)^T\), and \(x_i \sim N(\mu_2, I_2)\) for \(i = 11, \ldots, 20\) with \(\mu_2 = (0, -4)^T\).
Namely, the 20 observations form 2 distinct groups according to their mean vectors, and there are 2 features.
> set.seed(2)
> x=matrix(rnorm(20*2), ncol=2)
> x[1:10,1]=x[1:10,1]+3; x[11:20,2]=x[1:10,2]-4
Each column of x contains the observations for one feature; each row of x contains one observation of the 2 features. (Note that x[11:20,2] reuses the draws x[1:10,2] shifted by \(-4\); the group-2 mean vector is still \((0,-4)^T\).) The 20 observations form 2 clusters:
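A quick sanity check (a sketch; exact values depend on the RNG state) confirms that the empirical group means are near \((3,0)\) and \((0,-4)\):

```r
set.seed(2)
x <- matrix(rnorm(20 * 2), ncol = 2)   # 20 observations, 2 features
x[1:10, 1]  <- x[1:10, 1] + 3          # group 1: mean vector (3, 0)
x[11:20, 2] <- x[1:10, 2] - 4          # group 2: mean vector (0, -4)
colMeans(x[1:10, ])                    # close to (3, 0)
colMeans(x[11:20, ])                   # close to (0, -4)
```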

Hierarchical clustering (with Euclidean distance as pairwise dissimilarity):
> hc.complete=hclust(dist(x), method="complete")
> hc.average=hclust(dist(x), method="average")
> hc.single=hclust(dist(x), method="single")
> par(mfrow=c(1,3))
> plot(hc.complete,main="Complete Linkage",xlab="",sub="",cex=.9)
> abline(h=2.5, col="red")
> plot(hc.average, main="Average Linkage",xlab="",sub="",cex=.9)
> abline(h=2.5, col="red")
> plot(hc.single, main="Single Linkage",xlab="",sub="",cex=.9)
> abline(h=2.5, col="red")
xlab="",sub="": set the x-axis label and the subtitle to the empty string. cex: ratio by which plotting text and symbols are scaled relative to the default of 1. Note the abline drawn at a given "height" h. All three linkages give good clustering results, but cutting the dendrograms at the same height yields differing clusters:

height from “complete linkage” versus dist(x):
> hc.complete$height
[1] 0.2702697 0.2906542 0.3423736 0.7377621 0.7428389
[6] 0.7678718 0.8906271 0.9188777 1.1219922 1.3796578
[11] 1.4636674 1.7568482 1.9923715 2.9849541 3.3246916
[16] 3.7466658 4.6576892 5.2990990 10.5951990
> min(dist(x)) == min(hc.complete$height)
[1] TRUE
> max(hc.complete$height) == max(dist(x))
[1] TRUE
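These two identities are not specific to this dataset. For any linkage, the first merge joins the nearest pair, so the smallest height equals the smallest pairwise dissimilarity; for complete linkage, the last merge height also equals the largest pairwise dissimilarity. A sketch on fresh simulated data (the matrix y is an illustration, not from the notes):

```r
set.seed(1)
y  <- matrix(rnorm(15 * 2), ncol = 2)       # 15 illustrative observations
hc <- hclust(dist(y), method = "complete")
min(dist(y)) == min(hc$height)  # TRUE: first merge joins the nearest pair
max(dist(y)) == max(hc$height)  # TRUE for complete linkage
```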
min(dist(x)) = min(hc.complete$height), i.e., the minimal pairwise dissimilarity equals the minimal height. max(hc.complete$height) is the value of the linkage at which the last 2 clusters are merged; for complete linkage it equals max(dist(x)).
Height from "average linkage" versus dist(x):
> hc.average$height
[1] 0.2702697 0.2906542 0.3423736 0.6260169 0.7377621
[6] 0.7428389 0.8906271 0.8930763 1.1219922 1.1666896
[11] 1.2082378 1.2435332 1.5734885 2.2700984 2.4041542
[16] 2.5747045 2.8959725 4.2777927 5.2820763
> min(dist(x)) == min(hc.average$height)
[1] TRUE
> max(hc.average$height)
[1] 5.282076
> max(dist(x))
[1] 10.5952
min(dist(x)) = min(hc.average$height), i.e., the minimal pairwise dissimilarity equals the minimal height. max(hc.average$height) is the value of the linkage at which the last 2 clusters are merged; for average linkage it is at most max(dist(x)), since an average of dissimilarities never exceeds their maximum. The maximum of height is where the last 2 clusters are merged:
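For average linkage, only the minimum identity survives in general: the last merge height is an average of cross-cluster dissimilarities, so it cannot exceed max(dist(x)). A sketch on fresh simulated data (the matrix y is an illustration, not from the notes):

```r
set.seed(1)
y    <- matrix(rnorm(15 * 2), ncol = 2)     # 15 illustrative observations
hc.a <- hclust(dist(y), method = "average")
min(dist(y)) == min(hc.a$height)  # TRUE: holds for any linkage
max(hc.a$height) <= max(dist(y))  # TRUE: an average never exceeds the maximum
```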

Cut the dendrogram to obtain 2 clusters:
> cutree(hc.complete,2)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
> cutree(hc.average,2)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
> cutree(hc.single,2)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
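The single-linkage cut above illustrates "chaining": instead of the two true groups of 10, it splits off a single observation. A sketch reproducing the cluster sizes from the notes' data:

```r
set.seed(2)
x <- matrix(rnorm(20 * 2), ncol = 2)
x[1:10, 1]  <- x[1:10, 1] + 3
x[11:20, 2] <- x[1:10, 2] - 4
hc.single <- hclust(dist(x), method = "single")
table(cutree(hc.single, 2))  # sizes 19 and 1: single linkage chains the two groups
```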
Cut tree to obtain 3 clusters and find the height of cut:
> hc.average$height
[1] 0.2702697 0.2906542 0.3423736 0.6260169 0.7377621
[6] 0.7428389 0.8906271 0.8930763 1.1219922 1.1666896
[11] 1.2082378 1.2435332 1.5734885 2.2700984 2.4041542
[16] 2.5747045 2.8959725 4.2777927 5.2820763
> clngrp = cutree(hc.average,k=3)
> clngrp
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 2 2 2 2
> nh = length(hc.average$height)
> clhgt = cutree(hc.average,h=hc.average$height[nh-2])
> clhgt
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 2 2 2 2
> all.equal(clngrp,clhgt)
[1] TRUE
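This equivalence holds in general (assuming distinct merge heights, which occurs almost surely with continuous data): cutting to \(k\) groups matches cutting at the \(k\)th largest height. A sketch on fresh simulated data (the matrix y is an illustration, not from the notes):

```r
set.seed(1)
y  <- matrix(rnorm(15 * 2), ncol = 2)     # 15 illustrative observations
hc <- hclust(dist(y), method = "average")
nh <- length(hc$height)                   # n - 1 merge heights, in increasing order
for (k in 2:5) {
  # hc$height[nh - k + 1] is the k-th largest height; cutting there gives k groups
  stopifnot(all(cutree(hc, k = k) == cutree(hc, h = hc$height[nh - k + 1])))
}
```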
cutree to obtain \(k\) groups is the same as cutree at the \(k\)th largest height.
HC with scaled features (i.e., each feature has standard deviation 1), Euclidean distance, and complete linkage:
> xsc=scale(x) # standardize each column (i.e., feature)
> par(mfrow=c(1,1),mar = c(7,4.5,0,1.4))
> plot(hclust(dist(xsc),method="complete"),xlab="",sub="",
+ main="",ylab="Height")

With scaled features, obs. 6 and the last 10 obs. are put in one cluster; with the original observations, we obtain the true groups
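A comparison sketch makes the effect of scaling concrete: cross-tabulate the true group labels against the 2-group cuts from the unscaled and scaled data (truth is an illustrative label vector):

```r
set.seed(2)
x <- matrix(rnorm(20 * 2), ncol = 2)
x[1:10, 1]  <- x[1:10, 1] + 3
x[11:20, 2] <- x[1:10, 2] - 4
truth <- rep(1:2, each = 10)                                         # true group labels
table(truth, cutree(hclust(dist(x),        method = "complete"), 2)) # recovers the truth
table(truth, cutree(hclust(dist(scale(x)), method = "complete"), 2)) # obs. 6 switches (see note above)
```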

> set.seed(1); x=matrix(rnorm(30*3), ncol=3)
> dd=as.dist(1-cor(t(x))) #corr.-based distance
x has 30 rows and 3 columns, i.e., there are 30 observations of 3 features. cor acts columnwise; we need the correlation between each pair of observations, so we compute cor(t(x)). 1-cor(t(x)) is a 30-by-30 symmetric matrix, each of whose entries is between 0 and 2. The truth is that the 30 observations form only 1 cluster in terms of the distributions they follow; but sample correlations are not zero even for independent observations, leading to apparent clusters:
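The claimed range of the correlation-based dissimilarity is easy to verify: since correlations lie in \([-1, 1]\), every entry of 1-cor(t(x)) lies in \([0, 2]\). A sketch:

```r
set.seed(1)
x  <- matrix(rnorm(30 * 3), ncol = 3)  # 30 observations, 3 features
dd <- as.dist(1 - cor(t(x)))           # correlation-based dissimilarity
all(dd >= 0 & dd <= 2)                 # TRUE: cor is in [-1, 1], so 1 - cor is in [0, 2]
```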

HC with complete linkage for dissimilarities on same scale:

Human cancer microarray data:
The data matrix has dimension \(64 \times 6830\); each row is an observation of all features
> library(ISLR)
> nci.data=NCI60$data; nci.labs=NCI60$labs
> dim(nci.data)
[1] 64 6830
> nci.labs[1:4]
[1] "CNS" "CNS" "CNS" "RENAL"
> table(nci.labs)
nci.labs
     BREAST         CNS       COLON K562A-repro K562B-repro
          7           5           7           1           1
   LEUKEMIA MCF7A-repro MCF7D-repro    MELANOMA       NSCLC
          6           1           1           8           9
    OVARIAN    PROSTATE       RENAL     UNKNOWN
          6           2           9           1
nci.data: gene expression data matrix. nci.labs: cancer type for each sample; information on the cancer types is available at wiki-NCI60. table(nci.labs) counts how many samples each cancer type has.
> sd.data=scale(nci.data) # standardize data
> # Euclidean norm as pairwise dissimiliarty
> data.dist=dist(sd.data)
> plot(hclust(data.dist, method="average"), labels=nci.labs,
+ xlab="", sub="",ylab="", main="Average Linkage")
scale applies to each column of a matrix; each column of nci.data contains the observations for one feature (i.e., a variable), so scale(nci.data) standardizes each feature. Cell lines within a cancer type do tend to cluster together

> hc.al=hclust(dist(sd.data),method="average")
> hcal.clusters=cutree(hc.al,4)
> table(hcal.clusters,nci.labs)
             nci.labs
hcal.clusters BREAST CNS COLON K562A-repro K562B-repro
            1      7   5     7           0           0
            2      0   0     0           0           0
            3      0   0     0           1           1
            4      0   0     0           0           0
             nci.labs
hcal.clusters LEUKEMIA MCF7A-repro MCF7D-repro MELANOMA NSCLC
            1        0           1           1        8     8
            2        0           0           0        0     0
            3        6           0           0        0     0
            4        0           0           0        0     1
             nci.labs
hcal.clusters OVARIAN PROSTATE RENAL UNKNOWN
            1       6        2     8       1
            2       0        0     1       0
            3       0        0     0       0
            4       0        0     0       0
nci.labs: cancer type for each observation; the ground truth.
> cutheight=hc.al$height[length(hc.al$height)-3]
> all.equal(cutree(hc.al,h=cutheight),cutree(hc.al,4))
[1] TRUE
> par(mfrow=c(1,1),mar = c(3.5,2.5,1.4,1.4))
> plot(hc.al, labels=nci.labs,xlab="", sub="",main="")
> abline(h=cutheight, col="red",lty="dotted")

Human cancer microarray data:
The data matrix has dimension \(6830 \times 64\), each of whose columns is an observation for all features. We will randomly sample \(100\) genes and \(30\) samples, and use the measurements on these genes to cluster these samples.
Information on cancer types at wiki-NCI60
> library(ElemStatLearn) # library containing data
> data(nci); n = dim(nci)[2]; p = dim(nci)[1] #get dimensions
> set.seed(123)
> rSel = sample(1:p, size=100, replace = FALSE)
> cSel = sample(1:n, size=30, replace = FALSE)
> nci_a = nci[rSel,cSel]
> colnames(nci_a) = colnames(nci)[cSel]
Each column of nci is an observation for all features; each row of nci contains the observations for one feature. First few entries of the data matrix nci_a:
     LEUKEMIA UNKNOWN   NSCLC MELANOMA OVARIAN
[1,]   0.6800  0.0800  0.2400  -0.2600  0.0200
[2,]   0.2075  0.1675 -1.4725   0.2775  0.1475
[3,]   0.1650  0.6950 -0.5750  -0.1050 -0.5250
> # Euclidean norm as pairwise dissimilarity
> # applied to scaled data
> dMat = dist(scale(t(nci_a)))
> length(dMat)
[1] 435
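The value 435 is simply the number of unordered pairs among 30 observations, \(\binom{30}{2}\). A sketch with simulated data of the same shape (the matrix fake is a hypothetical stand-in for nci_a, so this does not require ElemStatLearn):

```r
set.seed(123)
fake <- matrix(rnorm(100 * 30), nrow = 100)  # stand-in: 100 genes x 30 samples
dMat <- dist(scale(t(fake)))                 # rows of t(fake) are the 30 samples
length(dMat) == choose(30, 2)                # TRUE: 30*29/2 = 435 pairwise distances
attr(dMat, "Size")                           # 30 observations
```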
The command dist(scale(t(nci_a))):
Each row of nci_a is a feature, so we pass t(nci_a) to scale, which standardizes each feature columnwise. Each row of scale(t(nci_a)) is then an observation of all 100 standardized features, and dist computes the Euclidean distance between each pair of rows of scale(t(nci_a)).
> EHC_al = hclust(dMat, method = "average")
> library(ggdendro)
> ggdendrogram(EHC_al, rotate = F)

> EHC_SL = hclust(dMat, method = "single")
> library(ggdendro)
> ggdendrogram(EHC_SL, rotate = F)

> EHC_CL = hclust(dMat, method = "complete")
> library(ggdendro)
> ggdendrogram(EHC_CL, rotate = F)

Customize dendrogram for EHC_al:
> library(ElemStatLearn); library(ggplot2)
> source("Plotggdendro.r")
> cutheight = EHC_al$height[length(EHC_al$height)-2]
> droplot = plot_ggdendro(dendro_data_k(EHC_al, 3),
+ direction = "tb",heightReferece=cutheight,expand.y = 0.2)
dendro_data_k(EHC_al,3): extract information for cutting the tree to obtain 3 groups. cutheight: height at which 3 groups are obtained when cutting the tree. plot_ggdendro: visualize the tree at the hierarchy of 3 groups; plot_ggdendro is contained in the file "Plotggdendro.r"; credit at website. Leaves and branches of each group are colored:

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] ggdendro_0.1-20 ElemStatLearn_2015.6.26
[3] ISLR_1.2 ggplot2_3.1.0
[5] knitr_1.21
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 pillar_1.3.1 compiler_3.5.0
[4] plyr_1.8.4 highr_0.7 tools_3.5.0
[7] digest_0.6.18 evaluate_0.12 tibble_2.1.3
[10] gtable_0.2.0 pkgconfig_2.0.2 rlang_0.4.4
[13] rstudioapi_0.8 yaml_2.2.0 xfun_0.4
[16] withr_2.1.2 stringr_1.3.1 dplyr_0.8.4
[19] revealjs_0.9 grid_3.5.0 tidyselect_0.2.5
[22] glue_1.3.0 R6_2.3.0 rmarkdown_1.11
[25] purrr_0.2.5 magrittr_1.5 MASS_7.3-49
[28] scales_1.0.0 codetools_0.2-15 htmltools_0.3.6
[31] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3
[34] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0
[37] crayon_1.3.4