Data Science – Asked by sh_student on April 1, 2021
I want to cluster my data via k-means/k-modes. Since the variables in my data are not normally distributed, I am not using the z-transformation to scale them. Instead, I scale the data by categorizing each column by its quantiles (the 0, 0.2, 0.4, 0.6, 0.8 and 1 quantiles) – e.g. if a value lies between the 0 and 0.2 quantile, it gets labelled as 1. Here is an example data frame – each column represents percentages (sorry for the long code, but I need to include a certain number of data points to get a distribution similar to that of my original data):
mydf <- structure(list(perc1 = c(0.639, 0, 0, 0, 0, 100, 0, 0, 0, 0,
0, 0, 0, 0, 5.5556, 0, 0, 0, 11.1111, 0, 0, 3.3058, 0, 0, 0,
0, 0, 0, 0.9901, 0, 0, 2.5641, 0, 16.6667, 0, 0, 0, 0, 0, 0,
33.3333, 0, 0, 0, 0, 100, 0, 0, 6.25, 8.6957, 11.1111, 0, 0,
0, 19.0476, 0, 3.8462, 0, 0, 100, 0, 0, 14.2857, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0.2041, 16.6667, 0, 4.878, 15.3846, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 37.5, 0, 0, 0, 0, 0, 0, 100, 0, 0),
perc2 = c(1.278, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 88.8889, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.9901, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 62.5,
0, 0, 0, 0, 0, 0, 0, 7.6923, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
13.3333, 0, 0, 0, 0, 0, 0, 0.8163, 16.6667, 0, 0, 0, 0, 0,
0, 0, 28.5714, 0, 0, 0, 100, 0, 0, 50, 0, 0, 0, 0, 0, 0,
0, 0, 0), perc3 = c(97.4441, 0, 0, 0, 0, 0, 68.5185, 0, 0,
0, 0, 76.4706, 0, 25, 33.3333, 30.7692, 0, 71.4286, 0, 0,
0, 76.0331, 0, 0, 0, 0, 0, 0, 95.5446, 0, 0, 64.1026, 0,
0, 92.3077, 88.8889, 0, 66.6667, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 31.5789, 0, 0, 47.619, 97.6077, 46.1538, 0,
0, 0, 0, 0, 0, 0, 55.5556, 0, 0, 0, 0, 20, 0, 35.7143, 50,
0, 98.6735, 0, 38.4615, 78.0488, 0, 100, 0, 0, 100, 0, 0,
100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), perc4 = c(0,
30, 50, 0, 0, 0, 5.5556, 40, 35.1351, 100, 0, 0, 16.6667,
0, 55.5556, 38.4615, 75, 7.1429, 0, 80, 100, 2.4793, 57.1429,
0, 0, 0, 0, 0, 0.495, 0, 0, 17.9487, 100, 25, 7.6923, 0,
100, 16.6667, 0, 100, 33.3333, 0, 50, 16.6667, 20, 0, 42.8571,
0, 0, 86.9565, 22.2222, 21.0526, 50, 33.3333, 4.7619, 0,
19.2308, 0, 71.4286, 0, 50, 25, 42.8571, 40, 11.1111, 100,
14.2857, 20, 0, 20, 0, 0, 50, 40, 0, 33.3333, 38.4615, 7.3171,
30.7692, 0, 0, 0, 0, 28.5714, 22.2222, 0, 88.8889, 0, 42.1053,
0, 12.5, 75, 0, 0, 0, 100, 50, 0, 18.75, 0), perc5 = c(0.639,
70, 50, 100, 100, 0, 25.9259, 60, 64.8649, 0, 100, 23.5294,
83.3333, 75, 5.5556, 30.7692, 25, 21.4286, 0, 20, 0, 18.1818,
42.8571, 100, 100, 100, 100, 100, 1.9802, 100, 100, 15.3846,
0, 58.3333, 0, 11.1111, 0, 16.6667, 100, 0, 33.3333, 100,
50, 83.3333, 80, 0, 57.1429, 100, 31.25, 4.3478, 66.6667,
47.3684, 50, 66.6667, 28.5714, 2.3923, 23.0769, 100, 28.5714,
0, 50, 75, 42.8571, 60, 33.3333, 0, 85.7143, 66.6667, 100,
60, 100, 64.2857, 0, 60, 0.3061, 33.3333, 23.0769, 9.7561,
53.8462, 0, 100, 100, 0, 42.8571, 77.7778, 0, 11.1111, 0,
57.8947, 100, 0, 25, 100, 100, 100, 0, 50, 0, 81.25, 100)), class = "data.frame", row.names = c(NA, -100L))
When checking the distribution of the five variables, we can see that variables 1 to 3 consist mostly of zero values and the last variable contains many 100% values:
> quantile(mydf[,1], probs = 0:5/5)
0% 20% 40% 60% 80% 100%
0.0000 0.0000 0.0000 0.0000 1.3049 100.0000
> quantile(mydf[,2], probs = 0:5/5)
0% 20% 40% 60% 80% 100%
0 0 0 0 0 100
> quantile(mydf[,3], probs = 0:5/5)
0% 20% 40% 60% 80% 100%
0.00000 0.00000 0.00000 0.00000 39.99996 100.00000
> quantile(mydf[,4], probs = 0:5/5)
0% 20% 40% 60% 80% 100%
0.00000 0.00000 0.29700 21.52044 50.00000 100.00000
> quantile(mydf[,5], probs = 0:5/5)
0% 20% 40% 60% 80% 100%
0.00000 0.57242 28.57140 60.00000 100.00000 100.00000
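The duplicated quantiles already hint at the problem: counting the distinct break points per column (a quick diagnostic, not part of the scaling itself) shows that the zero-heavy columns provide fewer than six usable breaks, so some bins collapse to zero width:
# number of distinct quantile break points per column
sapply(mydf, function(x) length(unique(quantile(x, probs = 0:5/5))))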
Now I scale my variables and use k-modes (with 10 clusters):
require(klaR)
# assign each value to one of five quantile-based categories per column
mydf_scaled <- do.call(cbind, lapply(mydf, function(x) {
  as.character(.bincode(x, quantile(x, probs = 0:5/5), include.lowest = TRUE))
}))
mymodel <- klaR::kmodes(mydf_scaled, modes = 10)
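(Note that kmodes picks its initial modes at random, so the exact clustering can differ between runs unless a seed is set via set.seed() beforehand.) As a quick check, not part of my actual pipeline, tabulating the codes per column shows how few of the five categories the zero-heavy columns actually use:
# count how often each code 1-5 occurs in each scaled column
apply(mydf_scaled, 2, function(col) table(factor(col, levels = as.character(1:5))))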
Then I get the following 10 cluster modes:
> mymodel$modes
perc1 perc2 perc3 perc4 perc5
1 1 1 1 3 4
2 1 1 5 1 1
3 1 1 1 5 1
4 5 1 1 5 2
5 1 1 1 1 4
6 1 1 1 4 3
7 1 1 4 3 3
8 1 1 5 3 2
9 5 1 5 3 2
10 5 1 1 1 1
The problem I am having now is that for perc1 I only get the values 1 or 5, due to the mostly-zero quantiles, and for perc2 I only get ones, as most values of that variable are zero. For perc5 I never get category 5, as the 80% quantile is already 100%. Therefore, I do not get a good differentiation for certain variables. For perc2 I cannot see any difference, although there are non-zero values which are of interest to me. Similarly for perc1, I would want a more detailed differentiation between the positive values instead of only having the two values 1 and 5 (I can only say a value is either zero or something positive, rather than getting an actual feeling for how the positive values differ between clusters).
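This can also be checked per cluster by cross-tabulating the codes against the cluster assignments (using the $cluster component returned by kmodes):
# how the perc1 codes spread over the 10 clusters
table(perc1 = mydf_scaled[, "perc1"], cluster = mymodel$cluster)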
How can I refine my clusters so that they give me more information about how the positive values differ between clusters, without getting a completely wrong picture of my data? I do not want to delete any data.
One idea I had was to compute the quantiles over only the positive values in each column, and to add a zero at the beginning to account for the zero values – so the breaks would be 0 followed by the 0.2, 0.4, 0.6, 0.8 and 1 quantiles of the positive values:
# breaks: 0, then the quintiles of the positive values only
mydf_scaled2 <- do.call(cbind, lapply(mydf, function(x) {
  as.character(.bincode(x, c(0, quantile(x[x > 0], probs = 1:5/5)), include.lowest = TRUE))
}))
mymodel2 <- klaR::kmodes(mydf_scaled2, modes = 10)
This returns the following cluster modes:
> mymodel2$modes
perc1 perc2 perc3 perc4 perc5
1 1 1 1 5 1
2 2 2 2 2 1
3 1 1 1 1 4
4 1 1 1 4 2
5 1 1 1 3 3
6 1 3 1 3 2
7 2 4 1 1 1
8 1 1 1 3 2
9 1 1 5 1 1
10 3 1 1 2 3
This results in more detailed information about the non-zero values of my variables. However, I am not sure whether it makes sense to use this approach and whether the outcome represents my data, or whether it over-represents the non-zero values due to the different way of calculating the quantiles.
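One subtlety I noticed with the breaks above: since .bincode with include.lowest = TRUE makes the first interval [0, 20%-quantile-of-the-positives], the zeros and the smallest positive values share code 1. If the zeros should instead form a category of their own, a -Inf break can isolate them (a sketch under that assumption; mydf_scaled3 is just an illustrative name):
# code 1 = exactly the zeros, codes 2-6 = quintiles of the positive values;
# ties among the positive quantiles (e.g. many 100s in perc5) can still
# collapse some of these bins
mydf_scaled3 <- do.call(cbind, lapply(mydf, function(x) {
  brks <- c(-Inf, 0, quantile(x[x > 0], probs = 1:5/5))
  as.character(.bincode(x, brks, include.lowest = TRUE))
}))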
Does anyone have an idea how I could tackle my problem (of not being able to differentiate between the positive values within my clusters) while still getting clusters that represent my data well? Should I use a different approach to scale my variables? Thanks!