Stack Overflow Asked by JSouza on February 2, 2021
Could you help me find the ideal number of clusters using the clusGap function? There is a similar example at this link: https://www.rdocumentation.org/packages/factoextra/versions/1.0.7/topics/fviz_nbclust
But I would like to do it for my case. My code is below:
library(cluster)
library(factoextra) # provides hcut

df <- structure(
  list(Propertie = c(1, 2, 3, 4, 5, 6, 7, 8),
       Latitude = c(-24.779225, -24.789635, -24.763461, -24.794394, -24.747102, -24.781307, -24.761081, -24.761084),
       Longitude = c(-49.934816, -49.922324, -49.911616, -49.906262, -49.890796, -49.8875254, -49.8875254, -49.922244),
       waste = c(526, 350, 526, 469, 285, 433, 456, 825)),
  class = "data.frame", row.names = c(NA, -8L))

df <- scale(df)
hcluster = clusGap(df, FUN = hcut, K.max = 100, B = 50)
Clustering k = 1,2,..., K.max (= 100): .. Error in sil.obj[, 1:3] : incorrect number of dimensions
The issue here is that you have specified K.max as 100; however, you only have eight observations in your dataset. As noted in the clusGap documentation, K.max is the maximum number of clusters to consider; hence, in your case, K.max cannot be greater than seven, i.e. one fewer than the number of observations.
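As a rough illustration of that bound, the sketch below derives K.max from the data instead of hard-coding it. Note the assumptions: hclust_cut is a hypothetical base-R stand-in for hcut (it is not part of the original answer) that follows the clusGap convention of a function of x and k returning a list with a cluster component, requested_k is an illustrative name, and B = 50 is used only to keep the run short.

```r
library(cluster)

df <- data.frame(
  properties = c(1, 2, 3, 4, 5, 6, 7, 8),
  latitude   = c(-24.779225, -24.789635, -24.763461, -24.794394,
                 -24.747102, -24.781307, -24.761081, -24.761084),
  longitude  = c(-49.934816, -49.922324, -49.911616, -49.906262,
                 -49.890796, -49.8875254, -49.8875254, -49.922244),
  waste      = c(526, 350, 526, 469, 285, 433, 456, 825)
)
x <- scale(df)

# Hypothetical helper (assumption): hierarchical clustering cut into k groups,
# returned in the list(cluster = ...) form that clusGap expects.
hclust_cut <- function(x, k) list(cluster = cutree(hclust(dist(x)), k = k))

# Cap the requested number of clusters at one fewer than the number of rows.
requested_k <- 100
k_max <- min(requested_k, nrow(x) - 1)

set.seed(1)  # the reference data sets are simulated, so fix the seed
gap <- clusGap(x, FUNcluster = hclust_cut, K.max = k_max, B = 50, verbose = FALSE)
```

With eight rows, k_max evaluates to 7, so gap$Tab has one row per candidate k from 1 to 7.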
It is unclear to me whether clustering is appropriate for a dataset of such small size. Nevertheless, please see below for a working implementation. I have modified the plot_clusgap function from the R/Bioconductor phyloseq package to visualize the results.
library(data.table)
library(cluster)
library(factoextra) # for hcut function
library(ggplot2)    # for plotting

df <- data.table(properties = c(1, 2, 3, 4, 5, 6, 7, 8),
                 latitude = c(-24.779225, -24.789635, -24.763461, -24.794394, -24.747102, -24.781307, -24.761081, -24.761084),
                 longitude = c(-49.934816, -49.922324, -49.911616, -49.906262, -49.890796, -49.8875254, -49.8875254, -49.922244),
                 waste = c(526, 350, 526, 469, 285, 433, 456, 825))

# standardize every column (returns a matrix)
df <- scale(df)

# perform clustering, B = 500 is recommended
hcluster = clusGap(df, FUN = hcut, K.max = 7, B = 500)

# extract results; row i of Tab corresponds to k = i clusters
dat <- data.table(hcluster$Tab)
dat[, k := .I]

# visualize gap statistic with +/- one standard error bars
p <- ggplot(dat, aes(k, gap)) + geom_line() + geom_point(size = 3) +
  geom_errorbar(aes(ymax = gap + SE.sim, ymin = gap - SE.sim), width = 0.25) +
  ggtitle("Clustering Results") +
  labs(x = "Number of Clusters", y = "Gap Statistic") +
  theme(plot.title = element_text(size = 16, hjust = 0.5, face = "bold"),
        axis.title = element_text(size = 12, face = "bold"))

# display the plot
print(p)
Here is the resulting figure:
I should note that all the gap statistic values are negative. This indicates that the optimal number of clusters is k = 1 (i.e., no clustering).
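Rather than reading the number of clusters off the plot, it may help to let the cluster package pick it with maxSE, which applies a standard-error rule to the gap values and their simulation standard errors. This is only a sketch under assumptions: hclust_cut is a hypothetical base-R stand-in for hcut (not from the answer above), B = 50 keeps the run short, and the "firstSEmax" rule is one of several available methods, so the chosen k can differ between rules and seeds.

```r
library(cluster)

x <- scale(data.frame(
  properties = c(1, 2, 3, 4, 5, 6, 7, 8),
  latitude   = c(-24.779225, -24.789635, -24.763461, -24.794394,
                 -24.747102, -24.781307, -24.761081, -24.761084),
  longitude  = c(-49.934816, -49.922324, -49.911616, -49.906262,
                 -49.890796, -49.8875254, -49.8875254, -49.922244),
  waste      = c(526, 350, 526, 469, 285, 433, 456, 825)
))

# Hypothetical helper (assumption): hierarchical clustering cut into k groups.
hclust_cut <- function(x, k) list(cluster = cutree(hclust(dist(x)), k = k))

set.seed(1)  # the reference data sets are simulated, so fix the seed
res <- clusGap(x, FUNcluster = hclust_cut, K.max = 7, B = 50, verbose = FALSE)

# Pick k from the gap values and their standard errors.
tab <- res$Tab
best_k <- maxSE(f = tab[, "gap"], SE.f = tab[, "SE.sim"], method = "firstSEmax")
print(best_k)
```

Printing res itself also reports the number of clusters chosen by the default rule, alongside the full table of logW, E.logW, gap and SE.sim.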
Answered by Ayush Noori on February 2, 2021