2 Clustering Methods
3 ISLR: Simulated Data
We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations:
# packages used throughout this section
library(tidyverse)   # tibble, dplyr, purrr, ggplot2
library(kableExtra)  # kbl(), kable_styling()
library(factoextra)  # fviz_cluster(), fviz_dend(), fviz_nbclust()
library(cluster)     # silhouette()
library(fpc)         # calinhara()
set.seed(2)
data <- tibble(
  V1 = rnorm(n = 50, mean = rep(c(0, 3), each = 25)),
  V2 = rnorm(n = 50, mean = rep(c(0, -4), each = 25))
)
data1 <- data[1:25, ] %>%
  mutate(group = 1)
data2 <- data[26:50, ] %>%
  mutate(group = 2)
data0 <- rbind(data1, data2)
3.1 Summary Statistics
In Table 7.2 we can observe how the within-group means differ from each other:
vtable::sumtable(data0, group = "group",
                 summ = c('mean(x)', 'sd(x)', 'min(x)', 'max(x)'),
                 out = 'return') %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  add_header_above(c(" " = 1, "Group 1" = 4, "Group 2" = 4))
Let's also plot the groups using ggplot():
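The plotting chunk itself is not shown above; a minimal sketch, assuming the data0 object created earlier with its group column, could look like this:
data0 %>%
  ggplot(aes(V1, V2, color = as.factor(group))) +
  geom_point() +
  labs(color = "Group") +
  theme_bw()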
3.2 K-Means Clustering
We now perform K-means clustering with K = 2. To run the kmeans() function in R with multiple initial cluster assignments, we use the nstart argument. If a value of nstart greater than one is used, then K-means clustering is performed with multiple initial random assignments, and the kmeans() function reports only the best result.
km.out <- kmeans(data, 2, nstart = 20)
The cluster assignments of the 50 observations are contained in km.out$cluster:
km.out$cluster
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 1 2 2 2 2 2 2 2 2
If we compare the cluster centroids to the group means in Table 7.2, we see that they are quite similar:
km.out$centers %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| V1 | V2 |
|---|---|
| 0.361163 | -0.162447 |
| 2.877777 | -4.262629 |
We can plot the data, with each observation colored according to its cluster assignment:
data %>%
  ggplot(aes(V1, V2, color = as.factor(km.out$cluster))) +
  geom_point() +
  labs(color = "Cluster") +
  theme_bw()
Comparing the clusters in Figure 3.2 with the true groups in Figure 3.1, we find that they do not contain exactly the same observations.
fviz_cluster(km.out, data = data, geom = "point", ggtheme = theme_bw()) +
  ggtitle("K-means Clustering (K=2)")
We can run K-means with different values for the number of clusters, such as K = 3, and plot the results.
km.out <- kmeans(data, 3, nstart = 20)
fviz_cluster(km.out, data = data, geom = "point", ggtheme = theme_bw()) +
  ggtitle("K-means Clustering (K=3)")
3.2.1 Silhouette Criteria
To select the optimal number of clusters for K-means clustering, we can implement the "elbow" method. The elbow method runs K-means on the dataset for a range of values of k (say, from 1 to 10) and, for each value of k, computes an average score across all clusters.
When these overall scores are plotted against k, it is often possible to determine the best value of k visually. If the line chart looks like an arm, the "elbow" (the point of inflection on the curve) is a good choice for k. The arm can bend either up or down, but a strong inflection point is a good indication that the underlying model fits best at that value.
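As a sketch of such an elbow plot, assuming the total within-cluster sum of squares (tot.withinss reported by kmeans()) as the score, one could write:
# elbow method: total within-cluster sum of squares for k = 1 to 10
wss_values <- map_dbl(1:10, function(k) {
  kmeans(data, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")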
Let’s first compute the mean silhouette coefficient, which compares, for each observation, the mean intra-cluster distance with the mean distance to the nearest neighbouring cluster.
# function to compute the average silhouette width for k clusters
avg_sil <- function(k) {
  km.res <- kmeans(data, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(data))
  mean(ss[, 3])
}
# compute the average silhouette width for k = 2 to k = 10
k.values <- 2:10
# the silhouette is undefined for k = 1, so we prepend 0 for that case
avg_sil_values <- c(0, map_dbl(k.values, avg_sil))
plot(c(1, k.values), avg_sil_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Average Silhouettes")
abline(v = 2, lty = 2)
This process of computing the average silhouette width has been wrapped into a single function, fviz_nbclust(), from the factoextra package:
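A call along the following lines produces the same plot in one step (a sketch; fviz_nbclust()'s default settings are assumed rather than the manual loop above):
fviz_nbclust(data, kmeans, method = "silhouette") +
  theme_bw()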
3.2.2 Calinski-Harabasz Score
Alternatively, the Calinski-Harabasz (CH) criterion can be used to evaluate the model. The Calinski-Harabasz score (also known as the variance ratio criterion) is the ratio of between-cluster dispersion to within-cluster dispersion.
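For reference, with n observations and k clusters the score can be written in terms of the between-cluster dispersion matrix B_k and the within-cluster dispersion matrix W_k, each scaled by its degrees of freedom (larger values indicate denser, better-separated clusters):

$$
\mathrm{CH}(k) = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)}
$$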
# function to compute the Calinski-Harabasz score for k clusters
calinski_score <- function(k) {
  km.res <- kmeans(data, centers = k, nstart = 25)
  calinhara(data, km.res$cluster)
}
# compute the score for k = 2 to k = 10
k.values <- 2:10
# the score is undefined for k = 1, so we prepend 0 for that case
calinski_values <- c(0, map_dbl(k.values, calinski_score))
calinski_df <- data.frame(kvalues = c(1, k.values), calinski = calinski_values)
plot(c(1, k.values), calinski_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Calinski-Harabasz Score")
abline(v = 2, lty = 2)
3.3 Hierarchical Clustering
We can apply hierarchical clustering to the dataset we generated in the previous exercise using the hclust() function.
We will use the data to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure.
hc.complete <- hclust(dist(data), method = "complete")
hc.average <- hclust(dist(data), method = "average")
hc.single <- hclust(dist(data), method = "single")
par(mfrow = c(1, 3))
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.9)
plot(hc.average, main = "Average Linkage", xlab = "", sub = "", cex = 0.9)
plot(hc.single, main = "Single Linkage", xlab = "", sub = "", cex = 0.9)
The factoextra package also provides the fviz_dend() function to visualize the clustering we created:
complete_plot1 <- hc.complete %>% fviz_dend(main = "Complete Linkage", k = 2)
average_plot1 <- hc.average %>% fviz_dend(main = "Average Linkage", k = 2)
single_plot1 <- hc.single %>% fviz_dend(main = "Single Linkage", k = 2)
3.3.1 Number of Clusters
We can instead color according to k = 4 clusters and obtain the following separations:
complete_plot1 <- hc.complete %>% fviz_dend(main = "Complete Linkage", k = 4)
average_plot1 <- hc.average %>% fviz_dend(main = "Average Linkage", k = 4)
single_plot1 <- hc.single %>% fviz_dend(main = "Single Linkage", k = 4)
3.3.2 Scaling the Data
If we don't know the relative importance of the different variables in the data set, it can be beneficial to scale the data so that each variable has the same influence. We can use scale() to do this.
xsc <- scale(data)
hc.xsc.complete <- hclust(dist(xsc), method = "complete")
hc.xsc.average <- hclust(dist(xsc), method = "average")
hc.xsc.single <- hclust(dist(xsc), method = "single")
complete_plot1 <- hc.xsc.complete %>% fviz_dend(main = "Complete Linkage", k = 2)
average_plot1 <- hc.xsc.average %>% fviz_dend(main = "Average Linkage", k = 2)
single_plot1 <- hc.xsc.single %>% fviz_dend(main = "Single Linkage", k = 2)
Figure 3.9 reports the results after scaling our covariates:
4 ISLR: NCI60 Cancer Data
Let us now see what happens if we perform clustering on the NCI60 data set, with the goal of finding out whether or not the observations cluster into distinct types of cancer.
Before we start, we create a scaled version of this data set.
nci.data <- NCI60$data
nci.labs <- NCI60$labs
sd.data <- scale(nci.data)
data_dist <- dist(sd.data)
4.1 Hierarchical Clustering
par(mfrow = c(1, 3))
plot(hclust(data_dist, method = "complete"), labels = nci.labs, main = "Complete Linkage")
plot(hclust(data_dist, method = "average"), labels = nci.labs, main = "Average Linkage")
plot(hclust(data_dist, method = "single"), labels = nci.labs, main = "Single Linkage")
4.1.1 Choosing the Optimal K Clusters
# Ward
ward_HC <- function(x, k) {hcut(x, k, hc_method = "ward.D2", hc_metric = "euclidean")}
sil_plot1 <- fviz_nbclust(sd.data, ward_HC, method = "silhouette", linecolor = "#00BFC4") + ggtitle("Ward.D2") + theme_bw()
# Complete
complete_HC <- function(x, k) {hcut(x, k, hc_method = "complete", hc_metric = "euclidean")}
sil_plot2 <- fviz_nbclust(sd.data, complete_HC, method = "silhouette", linecolor = "#00BFC4") + ggtitle("Complete") + theme_bw()
# Average
average_HC <- function(x, k) {hcut(x, k, hc_method = "average", hc_metric = "euclidean")}
sil_plot3 <- fviz_nbclust(sd.data, average_HC, method = "silhouette", linecolor = "#00BFC4") + ggtitle("Average") + theme_bw()
# Single
single_HC <- function(x, k) {hcut(x, k, hc_method = "single", hc_metric = "euclidean")}
sil_plot4 <- fviz_nbclust(sd.data, single_HC, method = "silhouette", linecolor = "#00BFC4") + ggtitle("Single") + theme_bw()
ggarrange(sil_plot1, sil_plot2, sil_plot3, sil_plot4, ncol = 2, nrow = 2)
We cut the tree to obtain the corresponding clusters:
nci.hc.ward <- hclust(data_dist, method = "ward.D2")
nci.hc.complete <- hclust(data_dist, method = "complete")
nci.hc.average <- hclust(data_dist, method = "average")
nci.hc.single <- hclust(data_dist, method = "single")
ward_plot2 <- nci.hc.ward %>% fviz_dend(main = "Ward.D2 Linkage", k = 7)
complete_plot2 <- nci.hc.complete %>% fviz_dend(main = "Complete Linkage", k = 2)
average_plot2 <- nci.hc.average %>% fviz_dend(main = "Average Linkage", k = 2)
single_plot2 <- nci.hc.single %>% fviz_dend(main = "Single Linkage", k = 2)
4.2 K-Means Clustering
How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 2?
fviz_nbclust(sd.data, kmeans, method = "silhouette", linecolor = "#00BFC4") +
  ggtitle("Silhouette Coefficient") + theme_bw()
Let's save the cluster IDs in nci.km.clusters and nci.hc.clusters and then use plot_confusion_matrix() to build a kind of "confusion matrix" between the two methods:
set.seed(2)
km.out <- kmeans(sd.data, 2, nstart = 20)
nci.km.clusters <- km.out$cluster
nci.hc.clusters <- cutree(nci.hc.complete, 2)
cfm <- tibble(kmeans = nci.km.clusters, hclust = nci.hc.clusters)
basic_table <- table(cfm)
plot_confusion_matrix(as_tibble(basic_table),
                      target_col = "kmeans",
                      prediction_col = "hclust",
                      counts_col = "n",
                      add_normalized = FALSE) +
  ylab("Hierarchical") + xlab("K-Means")
Although the cluster labels themselves are arbitrary, so exact label agreement is not expected, the two methods place most of the observations in the same groups.
4.3 PCA & Hierarchical Clustering
Finally, it is sometimes useful to perform dimensionality reduction before applying a clustering method. Let us use prcomp() to compute the principal components of the NCI60 data and keep the first 5 score vectors.
pr_out <- prcomp(NCI60$data, scale = TRUE)
pr_out_x <- pr_out$x %>% as_tibble(rownames = "Variable")
hc.out <- hclust(dist(pr_out$x[, 1:5]))
plot(hc.out, labels = nci.labs, main = "Hier. Clust. on First 5 Score Vectors")
We will only use complete linkage:
# Complete
complete_HC <- function(x, k) {hcut(x, k, hc_method = "complete", hc_metric = "euclidean")}
fviz_nbclust(pr_out$x[, 1:5], complete_HC, method = "silhouette", linecolor = "#00BFC4") + ggtitle("Complete Linkage") + theme_bw()
Figure 4.7 shows our results using K = 3 clusters:
nci.pca.complete <- hclust(dist(pr_out$x[, 1:5]), method = "complete")
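The chunk that draws this figure is not shown; a sketch consistent with the dendrogram calls used earlier would be:
nci.pca.complete %>% fviz_dend(main = "Complete Linkage on First 5 Score Vectors", k = 3)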
5 DSM Lab: Clustering Methods
The following two exercises are based on the employment dataset employment.csv. This dataset is adapted from the LISS panel (Longitudinal Internet Studies for the Social Sciences).
See https://www.lissdata.nl/about-panel for a detailed description of LISS. Our dataset contains survey information on 1980 employees in the year 2015.
employment <- read_csv("employment.csv")
employment <- na.omit(employment)
5.1 Sampling & Scaling the Data
5.1.1 Questions 6-8
We want to group employees based on all 7 features.
Because the clustering command involves random assignment of the initial clusters, we use set.seed(5829) to keep results replicable.
We would like to work on a random subset of the original dataset. Use sample() to draw 60 random individuals to form the new, smaller dataset.
Scale the new data with scale(). Think it over: should we scale the data? Why or why not?
set.seed(5829)
employeesmall <- employment[sample(1:nrow(employment), 60, replace=FALSE),]
sd.employee <- scale(employeesmall)
5.2 Hierarchical Clustering
5.2.1 Question 9
We want to perform a hierarchical clustering to get a feeling of how the individuals are clustered. Use hclust() and "average" linkage to do it.
set.seed(5829)
hc.average <- hclust(dist(sd.employee), method = "average")
5.2.2 Question 10
Plot the dendrogram.
If we cut the dendrogram at a height of 3.6, how many clusters do we get?
hc.average %>% fviz_dend(main = "Average Linkage", h = 3.6,
                         rect = TRUE, rect_fill = TRUE) +
  geom_hline(yintercept = 3.6, linetype = 2)
hc.cluster <- cutree(hc.average, h = 3.6)
Number of Clusters: 6
How many observations are in cluster 3? Which are they?
# observations 4, 36 and 49 fall into cluster 3
which(hc.cluster == 3)
[1] 4 36 49
Number of Observations in Cluster N°3: 3
5.2.3 Question 11
Instead of using the Euclidean distance, we now use the correlation-based distance to redo the hierarchical clustering.
First use cor() to calculate the 60 by 60 correlation matrix of these individuals. Note that correlation measures "similarity" among individuals, so you need to transform it into a "dissimilarity". Then use as.dist() to turn the dissimilarity matrix into a distance object. To identify the observations in a cluster, you may use which().
dd <- as.dist(1-cor(t(sd.employee)))
hc.corr <- hclust(dd, method = "average")
Plot the dendrogram. Does the plot change much? What if we would like to have 5 clusters?
hc.corr %>% fviz_dend(main = "Average Linkage", k = 5,
                      rect = TRUE, rect_fill = TRUE)
How many observations are in cluster 3?
# we can prune the tree such that we will have 5 branches
hc.cluster.cor <- cutree(hc.corr, k = 5)
which(hc.cluster.cor == 3)
[1] 4 5 9 11 12 22 24 25 37 47 49 60
Number of Observations in Cluster N°3: 12
5.3 K-Means Clustering
5.3.1 Question 12
We now come back to the full sample of 1965 employees. We decide to group them into 3 clusters. Scale the variables and perform a K-means clustering with kmeans(). Choose nstart to be 50. Display the total within-cluster sum of squares.
sd.employeefull <- scale(employment)
km.out <- kmeans(sd.employeefull, 3, nstart = 50)
km.out$tot.withinss
Total Within-Cluster Sum of Squares: 9942.9
5.3.2 Question 13
Plot income against tenure. Use different colors for the three groups. Interpret the three groups.
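A minimal sketch of such a plot, assuming the dataset has columns named income and tenure (names taken from the question wording, not verified against employment.csv), and using the cluster assignments from the K-means fit above:
# hypothetical column names income and tenure; clusters from km.out above
employment %>%
  mutate(cluster = as.factor(km.out$cluster)) %>%
  ggplot(aes(tenure, income, color = cluster)) +
  geom_point() +
  labs(color = "Cluster") +
  theme_bw()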