How to handle categorical features in K-means?

Question

I am working on clustering algorithms with the Titanic dataset, which contains 6 categorical features. I ran k-means on this dataset, using label encoding for the categorical features. However, I found that categorical features should not be compared with Euclidean distance; Hamming distance is the appropriate measure for them. So how can I make k-means work well on mixed features? I don't want a different algorithm. I want to use k-means only, on a dataset with mixed features.

Victor Luu · Answer

You can quantify the correlation, or more precisely the association, between categorical variables using an entropy-based measure such as Theil's U (the uncertainty coefficient). The dython library can compute such association values. Also, I am curious: why do you want to do clustering? What is your expected output?
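
One entropy-based association measure of the kind mentioned above is Theil's U, which reports the fraction of the uncertainty in one categorical variable that is removed by knowing another. The following is a minimal standard-library sketch, not dython's implementation; the function names and the toy `sex` and `title` columns are made up for illustration.

```python
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy (in nats) of a sequence of category labels."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(x | y): uncertainty left in x once y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum(
        (c / n) * log(c / y_counts[y_val])
        for (_, y_val), c in xy_counts.items()
    )

def theils_u(x, y):
    """Fraction of the uncertainty in x explained by y, between 0 and 1."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x

# Hypothetical toy columns, not taken from the real Titanic data.
sex = ["male", "female", "female", "male", "male", "female"]
title = ["Mr", "Mrs", "Miss", "Mr", "Mr", "Mrs"]

print("U(sex | title):", theils_u(sex, title))
print("U(title | sex):", theils_u(title, sex))
```

The measure is asymmetric: in this toy data the title fully determines the sex, but the sex does not fully determine the title, so the two directions give different values.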

Answered by Victor Luu on January 11, 2021

Kasra Manshaei · Answer

Label encoding is not a good idea if the categories are not ordinal, because it imposes an arbitrary order and spacing that Euclidean distance then treats as meaningful (it is not my favorite anyway). Use one-hot encoding and see how it works. You may apply feature extraction on top of it, e.g. PCA, to reduce the noise coming from sparsity. Another idea is to encode each category by its relative frequency in the feature, for example:
[a,b,b,c,a,a] --> [3/6, 2/6, 2/6, 1/6, 3/6, 3/6]
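
Both encodings can be sketched in a few lines of standard-library Python. This is only an illustration with hypothetical helper names (`one_hot`, `frequency_encode`, `squared_euclidean`); in practice an existing encoder from a library would normally be used, and the PCA step is not shown.

```python
from collections import Counter

def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def frequency_encode(values):
    """Replace each category with its share of the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

def squared_euclidean(u, v):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

column = ["a", "b", "b", "c", "a", "a"]

categories, rows = one_hot(column)
freq = frequency_encode(column)

print(categories)  # column order of the one-hot vectors
print(rows)
print(freq)

# With one-hot vectors, any two different categories are equally far apart.
print(squared_euclidean(rows[0], rows[1]))  # a vs b
print(squared_euclidean(rows[0], rows[3]))  # a vs c
```

With one-hot encoding, the squared Euclidean distance between two rows of a single feature is 0 when the categories match and 2 when they differ, so it behaves like a scaled Hamming distance. Frequency encoding keeps one column per feature, but categories that happen to occur equally often become indistinguishable.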

Ubaid Usmani · Answer

The data can be encoded with any standard mechanism, such as a label encoder. Before handling the categorical variables, though, check how strongly each one is associated with the target variable, using a feature selection method such as the chi-square test with SelectKBest.
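
The idea behind chi-square feature selection can be sketched as follows. This is a hand-rolled stand-in that ranks features by the Pearson chi-square statistic of their contingency table with the target; it is not scikit-learn's SelectKBest and its scores need not match that library's. It also assumes a target column is available, and the toy columns and function names below are invented for illustration.

```python
from collections import Counter

def chi_square(feature, target):
    """Pearson chi-square statistic of the feature-by-target contingency table."""
    n = len(feature)
    f_counts = Counter(feature)
    t_counts = Counter(target)
    observed = Counter(zip(feature, target))
    stat = 0.0
    for f_val, f_n in f_counts.items():
        for t_val, t_n in t_counts.items():
            expected = f_n * t_n / n  # count expected under independence
            stat += (observed[(f_val, t_val)] - expected) ** 2 / expected
    return stat

def top_k_features(features, target, k):
    """Names of the k features with the largest chi-square statistic."""
    scores = {name: chi_square(col, target) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data, not the real Titanic columns.
survived = [1, 1, 0, 0, 1, 0, 1, 0]
features = {
    "sex": ["f", "f", "m", "m", "f", "m", "f", "m"],
    "embarked": ["S", "C", "S", "C", "S", "C", "S", "C"],
    "deck": ["A", "B", "A", "B", "A", "B", "B", "A"],
}

for name, col in features.items():
    print(name, chi_square(col, survived))

print(top_k_features(features, survived, k=2))
```

A larger statistic means a stronger departure from independence between the feature and the target. The statistic alone is not a p-value; a proper test would also account for the degrees of freedom of each table.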

Answered by Ubaid Usmani on January 11, 2021
