Data Science Asked by Sathya on January 11, 2021
I am working on clustering algorithms with the Titanic dataset, which contains 6 categorical features. I ran k-means on this dataset, using label encoding for the categorical features. However, I found that categorical features should not be compared with Euclidean distance; Hamming distance should be used instead. So how can I make k-means work well on mixed features? I don't want a different algorithm. I want to use only k-means on a dataset with mixed features.
You can quantify correlation, or more precisely association, between categorical variables using something like cross-entropy. There is a library, dython, that computes such association values. Also, I am curious: why do you want to do clustering? What is your expected output?
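The answer does not say which measure it has in mind. One entropy-based association measure for categorical variables is Theil's U (the uncertainty coefficient); choosing it here is an assumption. The sketch below computes it by hand with the standard library only. The function names are hypothetical and it does not reproduce any particular library's API.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) of a list of categories (natural log)."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """Conditional entropy H(X | Y) for two equally long lists of categories."""
    n = len(x)
    joint = Counter(zip(x, y))
    y_counts = Counter(y)
    # H(X|Y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))
    return -sum((c / n) * math.log(c / y_counts[yv]) for (_, yv), c in joint.items())

def theils_u(x, y):
    """Theil's U(X | Y): fraction of the uncertainty in X explained by Y, in [0, 1]."""
    h_x = entropy(x)
    if h_x == 0:
        # X is constant; returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (h_x - conditional_entropy(x, y)) / h_x

# Hypothetical toy columns: y fully determines x_dep and says nothing about x_ind.
y = ["p", "p", "q", "q"]
x_dep = ["a", "a", "b", "b"]
x_ind = ["a", "b", "a", "b"]
u_dep = theils_u(x_dep, y)
u_ind = theils_u(x_ind, y)
```

Note that the measure is asymmetric: `theils_u(x, y)` and `theils_u(y, x)` can differ.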
Answered by Victor Luu on January 11, 2021
Label encoding is not a good idea if the categories are not ordinal (it is not my favorite anyway). Use one-hot encoding and see how it works. You may apply feature extraction on top of it, e.g. PCA, to reduce the noise coming from sparsity. Another idea is to encode each category by its relative frequency in the feature, for example:
[a,b,b,c,a,a] --> [3/6, 2/6, 2/6, 1/6, 3/6, 3/6]
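As a rough illustration of the two encodings mentioned in this answer, here is a minimal sketch using only the Python standard library. The helper names `frequency_encode` and `one_hot_encode` are hypothetical, and the column is the toy example above rather than real Titanic data.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

column = ["a", "b", "b", "c", "a", "a"]  # toy column from the example above

freq = frequency_encode(column)               # one numeric value per row
categories, one_hot = one_hot_encode(column)  # one indicator column per category
```

With one-hot rows, the squared Euclidean distance between two samples is 0 when the categories match and 2 when they differ, so it is proportional to the Hamming distance on the original column and no artificial ordering is introduced. Frequency encoding keeps a single column, but two categories that occur equally often become indistinguishable.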
Answered by Kasra Manshaei on January 11, 2021
The data can be encoded with any standard encoding mechanism, such as a label encoder. But before handling the categorical variables, check the association of each categorical variable with the target variable using a feature selection method such as the chi-square test with SelectKBest.
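To make the chi-square idea concrete, here is a hand-rolled sketch of the Pearson chi-square statistic for one categorical feature against a target, using only the standard library. It ranks features by the raw statistic and is not a substitute for a library implementation. The data is a hypothetical toy sample, and it assumes a target column is available (for Titanic, this could be the survival label).

```python
from collections import Counter

def chi_square_statistic(feature, target):
    """Pearson chi-square statistic for the contingency table of two categorical lists."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            # Expected cell count if feature and target were independent.
            expected = f_count * t_count / n
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat

# Hypothetical toy data, not taken from the real dataset.
sex = ["m", "m", "f", "f", "m", "f", "m", "f"]
embarked = ["S", "C", "S", "C", "S", "C", "S", "C"]
survived = [0, 0, 1, 1, 0, 1, 0, 1]

scores = {
    "sex": chi_square_statistic(sex, survived),
    "embarked": chi_square_statistic(embarked, survived),
}
# Features sorted from strongest to weakest association with the target.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A larger statistic suggests a stronger association with the target, so a selector would keep the top-ranked features. Since clustering is unsupervised, this step only applies when such a target column exists.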
Answered by Ubaid Usmani on January 11, 2021