2  Clustering Methods

Author

Daniel Redel

3 ISLR: Simulated Data

We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations:

set.seed(2)

data <- tibble(
  V1 = rnorm(n = 50, mean = rep(c(0, 3), each = 25)),
  V2 = rnorm(n = 50, mean = rep(c(0, -4), each = 25))
)

data1 <- data[1:25,] %>% 
  mutate(group = 1)
data2 <- data[26:50,] %>% 
  mutate(group = 2)

data0 <- rbind(data1, data2)

3.1 Summary Statistics

In Table 3.1 we can observe how the within-group means differ from each other:

vtable::sumtable(data0, group = "group",  
         summ=c('mean(x)', 'sd(x)', 'min(x)', 'max(x)'), 
         out = 'return') %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  add_header_above(c(" " = 1, "Group 1" = 4, "Group 2" = 4))
Table 3.1: Summary Statistics
                 Group 1 (group = 1)          Group 2 (group = 2)
Variable    Mean     Sd    Min    Max     Mean    Sd    Min    Max
V1          0.33     1.1   -2.3   2.1     2.8     1.1   0.55   5
V2          -0.076   1.2   -2     2.1     -4.2    1.3   -6.2   -2

Let’s also plot the groups using ggplot():

data %>%
  ggplot(aes(V1, V2, color = rep(c("A", "B"), each = 25))) +
  geom_point() +
  labs(color = "Group") + theme_bw()

Figure 3.1: True Cluster Groups

3.2 K-Means Clustering

We now perform K-means clustering with K = 2. To run the kmeans() function in R with multiple initial cluster assignments, we use the nstart argument. If a value of nstart greater than one is used, then K-means clustering will be performed using multiple initial random assignments, and the kmeans() function will report only the best results.

km.out <- kmeans(data, 2, nstart = 20)
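
To see the benefit of multiple starts, one can compare the total within-cluster sum of squares from a single random start with that from 20 starts. This is a quick sketch rather than part of the original analysis; the exact values depend on the seed, but the single-start result is typically no better:

set.seed(4)
# a single random start can get stuck in a poor local optimum
kmeans(data, 3, nstart = 1)$tot.withinss
# with 20 starts, kmeans() keeps the best of the 20 solutions
kmeans(data, 3, nstart = 20)$tot.withinss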

The cluster assignments of the 50 observations are contained in km.out$cluster:

km.out$cluster
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 1 2 2 2 2 2 2 2 2

If we compare their centroids to the means calculated in Table 3.1, we see that they are quite similar:

km.out$centers %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
        V1         V2
  0.361163  -0.162447
  2.877777  -4.262629

We can plot the data, with each observation colored according to its cluster assignment:

data %>%
  ggplot(aes(V1, V2, color = as.factor(km.out$cluster))) +
  geom_point() +
  labs(color = "Cluster") + theme_bw()

Figure 3.2: K-Means Clustering Results K = 2

Comparing Figure 3.2 with Figure 3.1, we see that the clusters do not contain exactly the same observations as the true groups.

fviz_cluster(km.out, data=data, geom="point", ggtheme = theme_bw())+
  ggtitle("K-means Clustering (K=2)")

We can run K-means with different values for the number of clusters such as K = 3 and plot the results.

km.out <- kmeans(data, 3, nstart = 20)
fviz_cluster(km.out, data=data, geom="point", ggtheme = theme_bw())+
  ggtitle("K-means Clustering (K=3)")

Figure 3.3: K-Means Clustering Results K = 3

3.2.1 Silhouette Criteria

To select the optimal number of clusters for k-means clustering, we can implement an “elbow”-style method: run k-means clustering on the dataset for a range of values of k (say, from 1 to 10) and, for each value of k, compute an average score for all clusters.

When these overall metrics for each model are plotted, it is possible to visually determine the best value for k. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best value of k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good indication that the underlying model fits best at that point.
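
For reference, a minimal sketch of the classic elbow plot, based on the total within-cluster sum of squares (not shown among the original figures), could look like this:

# total within-cluster sum of squares for k = 1 to 10
wss <- function(k) {
  kmeans(data, centers = k, nstart = 25)$tot.withinss
}

wss_values <- map_dbl(1:10, wss)

plot(1:10, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")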

Let’s first compute the mean silhouette coefficient. For each observation i, the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in its own cluster and b(i) is the mean distance to the points in the nearest other cluster; the criterion averages s(i) over all observations.

# function to compute average silhouette for k clusters
avg_sil <- function(k) {
  km.res <- kmeans(data, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(data))
  mean(ss[, 3])
}

# evaluate k = 2 to k = 10 (the silhouette is undefined for k = 1)
k.values <- 2:10

# average silhouette for 2-10 clusters; prepend 0 as a placeholder for k = 1
avg_sil_values <- c(0, map_dbl(k.values, avg_sil))

plot(c(1,k.values), avg_sil_values,
       type = "b", pch = 19, frame = FALSE, 
       xlab = "Number of clusters K",
       ylab = "Average Silhouettes")
abline(v = 2, lty=2)

Figure 3.4: Silhouette Criteria N°1

This process to compute the average silhouette has been wrapped up in a single function called fviz_nbclust() from the factoextra package:

fviz_nbclust(data, kmeans, method = "silhouette", linecolor="#00BFC4") +
  geom_vline(xintercept = 2, linetype = 2) +
  ggtitle("Silhouette Coefficient") + theme_bw()

Figure 3.5: Silhouette Criteria N°2

3.2.2 Calinski-Harabasz Score

Alternatively, the Calinski-Harabasz (CH) criterion can be used to evaluate the model. The Calinski-Harabasz score (also known as the variance ratio criterion) is the ratio of the between-cluster dispersion to the within-cluster dispersion, each normalized by its degrees of freedom; higher values indicate better-separated clusters.
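
The calinhara() function from the fpc package computes this score for a given partition. As a minimal illustration for a single value of k (an aside, not part of the original analysis), before we wrap it in a function and evaluate it over a range of k:

library(fpc)

# Calinski-Harabasz score of the K = 2 k-means solution
km2 <- kmeans(data, centers = 2, nstart = 25)
calinhara(data, km2$cluster)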

# function to compute the Calinski-Harabasz score for k clusters
# (calinhara() comes from the fpc package)
calinski <- function(k) {
  km.res <- kmeans(data, centers = k, nstart = 25)
  calinhara(data, km.res$cluster)
}

# evaluate k = 2 to k = 10 (the score is undefined for k = 1)
k.values <- 2:10

# Calinski-Harabasz score for 2-10 clusters; prepend 0 as a placeholder for k = 1
calinski_values <- c(0, map_dbl(k.values, calinski))
calinski_df <- data.frame(kvalues = c(1, k.values), calinski = calinski_values)

plot(c(1, k.values), calinski_values,
       type = "b", pch = 19, frame = FALSE, 
       xlab = "Number of clusters K",
       ylab = "Calinski-Harabasz Score")
abline(v = 2, lty = 2)

3.3 Hierarchical Clustering

We can also apply hierarchical clustering to the simulated dataset generated above, using the hclust() function.

We will use the data to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure.

hc.complete <- hclust(dist(data), method = "complete")
hc.average <- hclust(dist(data), method = "average")
hc.single <- hclust(dist(data), method = "single")
par(mfrow = c(1, 3))
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.9)
plot(hc.average, main = "Average Linkage", xlab = "", sub = "", cex = 0.9)
plot(hc.single, main = "Single Linkage", xlab = "", sub = "", cex = 0.9)

Figure 3.6: Dendrogram: Hierarchical Clustering

The factoextra package also provides the fviz_dend() function to visualize the clustering created:

complete_plot1 <- hc.complete %>% fviz_dend(main = "Complete Linkage", k = 2)
average_plot1 <- hc.average %>% fviz_dend(main = "Average Linkage", k = 2)
single_plot1 <- hc.single %>% fviz_dend(main = "Single Linkage", k = 2)
ggarrange(complete_plot1, average_plot1, single_plot1, ncol = 2, nrow = 2)

Figure 3.7: Dendrogram: Hierarchical Clustering (K=2)

3.3.1 Number of Clusters

We can color according to k = 4 clusters and we get the following separations:

complete_plot1 <- hc.complete %>% fviz_dend(main = "Complete Linkage", k = 4)
average_plot1 <- hc.average %>% fviz_dend(main = "Average Linkage", k = 4)
single_plot1 <- hc.single %>% fviz_dend(main = "Single Linkage", k = 4)
ggarrange(complete_plot1, average_plot1, single_plot1, ncol = 2, nrow = 2)

Figure 3.8: Dendrogram: Hierarchical Clustering (K=4)
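
The dendrogram coloring only visualizes the cut; to extract the actual cluster memberships, cutree() can be applied to any of the fitted trees. A minimal sketch on the complete-linkage tree:

# cluster assignments when cutting the complete-linkage tree into 2 and 4 groups
cutree(hc.complete, 2)
cutree(hc.complete, 4)

# cross-tabulate the 2-cluster cut against the true group labels
table(cutree(hc.complete, 2), rep(1:2, each = 25))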

3.3.2 Scaling the Data

If we don’t know the relative importance of the different variables in the data set, it can be beneficial to scale the data so that each variable has the same influence on the distance computations. We will use the scale() function to do this.

xsc <- scale(data)

hc.xsc.complete <- hclust(dist(xsc), method = "complete")
hc.xsc.average <- hclust(dist(xsc), method = "average")
hc.xsc.single <- hclust(dist(xsc), method = "single")

complete_plot1 <- hc.xsc.complete %>% fviz_dend(main = "Complete Linkage", k = 2)
average_plot1 <- hc.xsc.average %>% fviz_dend(main = "Average Linkage", k = 2)
single_plot1 <- hc.xsc.single %>% fviz_dend(main = "Single Linkage", k = 2)

Figure 3.9 reports the results after scaling our covariates:

ggarrange(complete_plot1, average_plot1, single_plot1, ncol = 2, nrow = 2)

Figure 3.9: Dendrogram: Hierarchical Clustering with Scaled Covariates (K=2)

3.3.3 Correlation-Based Distance

Correlation-based distance can be computed using the as.dist() function, which converts an arbitrary square symmetric matrix into a form that the hclust() function recognizes as a distance matrix. However, this only makes sense for data with at least three features since the absolute correlation between any two observations with measurements on two features is always 1. Hence, we will cluster a three-dimensional data set. Our results can be visually seen in Figure 3.10.

set.seed(3)
x <- matrix(rnorm(30 * 3), ncol = 3)
dd <- as.dist(1 - cor(t(x)))
plot(hclust(dd, method = "complete"), main = "Complete Linkage with Correlation-Based Distance", xlab = "", sub = "")

Figure 3.10: Dendrogram: Correlation-Based Hierarchical Clustering

4 ISLR: NCI60 Cancer Data

Let us now see what happens if we perform clustering on the NCI60 data set, with the goal of finding out whether or not the observations cluster into distinct types of cancer.

Before we start, we create a scaled version of the data set and precompute the Euclidean distance matrix.

nci.data <- NCI60$data
nci.labs <- NCI60$labs
sd.data <- scale(nci.data)
data_dist <- dist(sd.data)
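
As a quick check on what we are clustering (assuming the NCI60 object from the ISLR2 package), the data contain gene-expression measurements on 64 cancer cell lines, and the labels record their cancer types:

# 64 cell lines measured on 6,830 genes; one cancer-type label per cell line
dim(nci.data)
table(nci.labs)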

4.1 Hierarchical Clustering

par(mfrow = c(1, 3))
plot(hclust(data_dist, method = "complete"), labels = nci.labs, main = "Complete Linkage")
plot(hclust(data_dist, method = "average"), labels = nci.labs, main = "Average Linkage")
plot(hclust(data_dist, method = "single"), labels = nci.labs, main = "Single Linkage")

Figure 4.1: Dendrogram: Hierarchical Clustering

4.1.1 Choosing the Optimal K Clusters

# Ward
ward_HC <- function(x, k){hcut(x, k, hc_method = "ward.D2", hc_metric = "euclidean")}
sil_plot1 <- fviz_nbclust(sd.data, ward_HC, method = "silhouette", linecolor="#00BFC4") + ggtitle("Ward.D2") + theme_bw()

# Complete
complete_HC <- function(x, k){hcut(x, k, hc_method = "complete", hc_metric = "euclidean")}
sil_plot2 <- fviz_nbclust(sd.data, complete_HC, method = "silhouette", linecolor="#00BFC4") + ggtitle("Complete") + theme_bw()

# Average
average_HC <- function(x, k){hcut(x, k, hc_method = "average", hc_metric = "euclidean")}
sil_plot3 <- fviz_nbclust(sd.data, average_HC, method = "silhouette", linecolor="#00BFC4") + ggtitle("Average") + theme_bw()

# Single
single_HC <- function(x, k){hcut(x, k, hc_method = "single", hc_metric = "euclidean")}
sil_plot4 <- fviz_nbclust(sd.data, single_HC, method = "silhouette", linecolor="#00BFC4") + ggtitle("Single") + theme_bw()
ggarrange(sil_plot1, sil_plot2, sil_plot3, sil_plot4, ncol = 2, nrow = 2)

Figure 4.2: Silhouette Criteria: Hierarchical Clustering

We now fit each linkage on the full distance matrix and cut the trees to obtain the corresponding clusters:

nci.hc.ward <- hclust(data_dist, method = "ward.D2")
nci.hc.complete <- hclust(data_dist, method = "complete")
nci.hc.average <- hclust(data_dist, method = "average")
nci.hc.single <- hclust(data_dist, method = "single")

ward_plot2 <- nci.hc.ward %>% fviz_dend(main = "Ward.D2 Linkage", k = 7)
complete_plot2 <- nci.hc.complete %>% fviz_dend(main = "Complete Linkage", k = 2)
average_plot2 <- nci.hc.average %>% fviz_dend(main = "Average Linkage", k = 2)
single_plot2 <- nci.hc.single %>% fviz_dend(main = "Single Linkage", k = 2)
ggarrange(ward_plot2, complete_plot2, average_plot2, single_plot2, ncol = 2, nrow = 2)

Figure 4.3: Dendrogram: Hierarchical Clustering (K=7)
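
It is also informative to cross-tabulate a cut of the tree against the true cancer types. A sketch with four clusters on the complete-linkage tree (the cut used in ISLR):

# compare a 4-cluster cut of the complete-linkage tree with the cancer types
nci.hc.clusters4 <- cutree(nci.hc.complete, 4)
table(nci.hc.clusters4, nci.labs)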

4.2 K-Means Clustering

How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering? We first use the silhouette criterion to choose the number of clusters:

fviz_nbclust(sd.data, kmeans, method = "silhouette", linecolor="#00BFC4") +
  ggtitle("Silhouette Coefficient") + theme_bw()

Figure 4.4: Silhouette Criteria

Let’s save the cluster ids in nci.km.clusters and nci.hc.clusters and then use plot_confusion_matrix() to build a kind of “confusion matrix” between the two methods:

set.seed(2)
km.out <- kmeans(sd.data, 2, nstart = 20)
nci.km.clusters <- km.out$cluster
nci.hc.clusters <- cutree(nci.hc.complete, 2)
cfm <- tibble(kmeans = nci.km.clusters, hclust = nci.hc.clusters) 
basic_table <- table(cfm)
plot_confusion_matrix(as_tibble(basic_table), 
                      target_col = "kmeans", 
                      prediction_col = "hclust",
                      counts_col = "n",
                      add_normalized = FALSE) + 
  ylab("Hierarchical") + xlab("K-Means")

Figure 4.5: Confusion Matrix

The cluster labels themselves are arbitrary, so a one-to-one match between the label numbers is not expected; what matters is whether the two methods place the same observations together, and the confusion matrix shows that they largely do.
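
A label-invariant way to quantify this agreement (an aside, not part of the original analysis) is the adjusted Rand index, for instance via the mclust package:

# adjusted Rand index between the two partitions (1 = identical, near 0 = chance agreement)
mclust::adjustedRandIndex(nci.km.clusters, nci.hc.clusters)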

4.3 PCA & Hierarchical Clustering

One last thing: it is sometimes useful to perform dimensionality reduction before clustering. Let us use prcomp() to compute the principal components of NCI60 and keep the first 5 score vectors.

pr_out <- prcomp(NCI60$data, scale = TRUE)
pr_out_x <- pr_out$x %>% as_tibble(rownames = "Variable")
# cluster on the first 5 score vectors (PC1-PC5); note that column 1 of pr_out_x is the row label,
# so we index the score matrix pr_out$x directly
hc.out <- hclust(dist(pr_out$x[, 1:5]))
plot(hc.out, labels = nci.labs, main = "Hier. Clust. on First 5 Score Vectors")
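
As a quick check on how much information the first five score vectors retain (a sketch based on the prcomp fit above):

# standard deviation, proportion of variance and cumulative proportion of PC1-PC5
summary(pr_out)$importance[, 1:5]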

We will only use the complete linkage:

# Complete
complete_HC <- function(x, k){hcut(x, k, hc_method = "complete", hc_metric = "euclidean")}

fviz_nbclust(pr_out$x[, 1:5], complete_HC, method = "silhouette", linecolor="#00BFC4") + ggtitle("Complete Linkage") + theme_bw()

Figure 4.6: Silhouette Criteria

Figure 4.7 shows our results using K=3 clusters:

nci.pca.complete <- hclust(dist(pr_out$x[, 1:5]), method = "complete")
nci.pca.complete %>% fviz_dend(main = "PCA - Complete Linkage", k = 3)

Figure 4.7: Dendrogram: Hierarchical Clustering (K=3)

5 DSM Lab: Clustering Methods

The following two exercises are based on employment dataset employment.csv. This dataset is adapted from the LISS panel (Longitudinal Internet Studies for the Social sciences).

See https://www.lissdata.nl/about-panel for a detailed description of LISS. Our dataset contains survey information on 1980 employees in 2015.

employment <- read_csv("employment.csv")
employment <- na.omit(employment)
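
A quick look at the structure of the data (a sketch; beyond income and tenure, the variable names are not listed in this lab):

dim(employment)
glimpse(employment)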

5.1 Sampling & Scaling the Data

5.1.1 Questions 6-8

We want to group employees based on all the 7 features:

  • Because the clustering command involves random assignment of the initial clusters, we use set.seed(5829) to keep the results replicable.

  • We would like to work on a random subset of the original dataset. Use sample() to draw 60 random individuals to form a new, smaller dataset.

  • Scale the new data with scale(). Think it over: should we scale the data? Why or why not?

set.seed(5829)
employeesmall <- employment[sample(1:nrow(employment), 60, replace=FALSE),]
sd.employee <- scale(employeesmall)

5.2 Hierarchical Clustering

5.2.1 Question 9

We want to perform a hierarchical clustering to get a feeling of how the individuals are clustered. Use hclust() and "average" linkage to do it.

set.seed(5829)
hc.average <- hclust(dist(sd.employee),method="average")

5.2.2 Question 10

Plot the dendrogram.

If we cut the dendrogram at a height of 3.6, how many clusters do we get?

hc.average %>% fviz_dend(main = "Average Linkage", h = 3.6,
                         rect = TRUE, rect_fill = TRUE) +
geom_hline(yintercept = 3.6, linetype = 2)

Figure 5.1: Dendrogram: Hierarchical Clustering (h = 3.6)
hc.cluster <- cutree(hc.average, h=3.6)
Number of Clusters: 6
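
To verify the number of clusters and see how many observations fall into each, we can tabulate the cut:

# cluster sizes after cutting the tree at h = 3.6
table(hc.cluster)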

How many observations are in cluster 3? Which are they?

# observations 4, 36 and 49 fall into cluster 3
which(hc.cluster == 3)
[1]  4 36 49
Number of Observations in Cluster N°3:  3

5.2.3 Question 11

Instead of using the Euclidean distance, we now use the correlation-based distance to redo the hierarchical clustering.

First use cor() to calculate the 60-by-60 correlation matrix of these individuals. Note that correlation measures “similarity” among individuals; you need to transform it into a “dissimilarity”. Then use as.dist() to transform the dissimilarity matrix into a distance object. To identify the observations in a cluster, you may use which().

dd <- as.dist(1-cor(t(sd.employee)))
hc.corr <- hclust(dd, method = "average")

Plot the dendrogram. Does the plot change much? What if we would like to have 5 clusters?

hc.corr %>% fviz_dend(main = "Average Linkage", k = 5,
                         rect = TRUE, rect_fill = TRUE)

Figure 5.2: Correlation-Based Distance: Hierarchical Clustering (K = 5)

How many observations are in cluster 3?

# we can prune the tree such that we will have 5 branches
hc.cluster.cor <- cutree(hc.corr, k=5)
which(hc.cluster.cor == 3)
 [1]  4  5  9 11 12 22 24 25 37 47 49 60
Number of Observations in Cluster N°3:  12

5.3 K-Means Clustering

5.3.1 Question 12

We now come back to the full sample of 1965 employees. We decide to group them into 3 clusters. Scale the variables and perform a K-means clustering with kmeans(). Choose nstart to be 50. Display the total within-cluster sum of squares.

sd.employeefull <- scale(employment)
km.out <- kmeans(sd.employeefull, 3, nstart=50)
cat("Total Within-Cluster Sum of Squares:", round(km.out$tot.withinss, 1))
Total Within-Cluster Sum of Squares: 9942.9

5.3.2 Question 13

Plot income against tenure. Use different colors for the three groups. Interpret the three groups.

as.data.frame(sd.employeefull[,2:3]) %>%
  ggplot(aes(income, tenure, color = as.factor(km.out$cluster))) +
  geom_point(size = 2, alpha = 0.65) +
  labs(color = "Cluster") + theme_bw() + ylab("Tenure") + xlab("Income")

Figure 5.3: K-Means Clustering Results (K = 3)
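
To aid interpretation of the three groups, one could also inspect the cluster means of the original (unscaled) variables. A sketch based on the k-means fit above:

# mean of each variable within each cluster, on the original scale
employment %>%
  mutate(cluster = km.out$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))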