Will one hot encoding / unbalanced columns cause bias to Clustering Analysis?

Question

I'm wondering if having too many columns about one certain feature is gonna cause bias to the clustering analysis.

For example, if my dataset has columns = ['incoming calls', 'outgoing calls', 'missing calls', 'age'], and if I run clustering algorithms such as K-means or Mixture Model, will the clustering results be biased since it splits datasets mainly based on calls?

Another example is if I have two categorical columns: color ('red','blue','green'), and shape ('circle','square'), after one hot encoding, color will expand into three columns and shape will expand into two. If I cluster on the one-hot encoded dataset, will color have a larger weight than shape in terms of splitting the data?

Pushkaraj Joshi · Answer

To answer your question, we first need to understand the aim of your clustering analysis. Some common goals of clustering analysis are:

Outlier detection
Pattern detection
Grouping similar data together

Depending on the type of data, we can choose the algorithm that best fits the data at hand. If you have only numerical features, you can use K-Means. If you have only categorical features, K-Modes is the usual choice. If you have a mix of categorical and numerical features, you can use K-Prototypes, or K-Medoids with a dissimilarity measure that handles mixed types (for example, Gower distance).

To answer your question specifically: to avoid the issue you are worried about, choose a clustering algorithm that suits categorical data. One-hot encoding is intended for categorical features, so it makes sense to use an algorithm designed for them, for example K-Modes or K-Prototypes, which work on the category labels directly, or K-Medoids with a suitable dissimilarity measure.
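As a rough illustration of why a categorical-aware algorithm sidesteps the column-count worry: methods in the K-Modes / K-Prototypes family compare the category labels themselves with a simple matching dissimilarity, so each categorical feature contributes at most 1 no matter how many levels it has. The sketch below is not taken from any library; `matching_dissimilarity` is a hypothetical helper written only for this example.

```python
def matching_dissimilarity(a, b):
    """Count the attributes on which two records disagree (simple matching)."""
    return sum(x != y for x, y in zip(a, b))

# Records are (color, shape) tuples, using the levels from the question.
red_square = ("red", "square")
blue_square = ("blue", "square")
red_circle = ("red", "circle")

# A color mismatch and a shape mismatch each count exactly once,
# even though color has three levels and shape has two.
color_only = matching_dissimilarity(red_square, blue_square)
shape_only = matching_dissimilarity(red_square, red_circle)
print(color_only, shape_only)  # 1 1
```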

I would recommend that you first understand the exact business problem you are trying to solve, relate it to the data at hand, and then choose an appropriate algorithm from those mentioned above.

Regards,

Nicholas James Bailey · Answer

With purely one-hot encoded data this isn’t a problem. Take your second example and assume Euclidean distance: a red square and a blue square differ by 1 in the red dimension and 1 in the blue dimension, so the distance between them is sqrt(1+1) = sqrt(2) by Pythagoras. Similarly, a red square and a red circle differ by 1 in the circle dimension and 1 in the square dimension, which also gives sqrt(2). A mismatch in colour therefore costs exactly the same as a mismatch in shape, even though colour occupies three columns and shape only two.

However, things are messier if you have a mixture of one-hot and continuous features. In these cases you might get interesting results by making your features continuous rather than binary (e.g. describing shape by its number of vertices and colour by its coordinates in RGB colour space, although I know that’s a made-up example).

However you decide to engineer your features, you can reduce the risk of any particular feature dominating the clustering by scaling your features appropriately, and by using dimensionality reduction so that collinear features you hadn’t noticed (for instance, several call-count columns that move together) don’t accidentally get extra weight.
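The arithmetic above can be checked with a short sketch, which also shows what changes once a continuous column is mixed in. The `age` values and the three records are invented for illustration, and z-scoring is just one possible scaling choice, not something the answer prescribes.

```python
import math
from statistics import mean, pstdev

COLORS = ["red", "blue", "green"]
SHAPES = ["circle", "square"]

def one_hot(value, levels):
    """Return a 0/1 indicator list with a single 1 at the matching level."""
    return [1.0 if value == level else 0.0 for level in levels]

def encode(color, shape):
    """Concatenate one-hot color (3 columns) and one-hot shape (2 columns)."""
    return one_hot(color, COLORS) + one_hot(shape, SHAPES)

red_square = encode("red", "square")
blue_square = encode("blue", "square")
red_circle = encode("red", "circle")

# Pure one-hot case: each single-feature mismatch flips exactly two columns.
d_color = math.dist(red_square, blue_square)  # differs in color only
d_shape = math.dist(red_square, red_circle)   # differs in shape only
print(d_color, d_shape)  # both sqrt(2), about 1.414

# Mixed case with a made-up continuous column (age in years).
records = [("red", "square", 25.0), ("blue", "circle", 27.0), ("red", "square", 60.0)]
ages = [age for _, _, age in records]
mu, sigma = mean(ages), pstdev(ages)

def build_rows(scale_age):
    """Encode every record, optionally z-scoring the age column."""
    rows = []
    for color, shape, age in records:
        age_value = (age - mu) / sigma if scale_age else age
        rows.append(encode(color, shape) + [age_value])
    return rows

raw = build_rows(scale_age=False)
scaled = build_rows(scale_age=True)

# Record 0 vs 1: different color and shape, similar age.
# Record 0 vs 2: identical color and shape, very different age.
raw_01, raw_02 = math.dist(raw[0], raw[1]), math.dist(raw[0], raw[2])
scaled_01, scaled_02 = math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2])
print(raw_01, raw_02)        # unscaled age dwarfs the categorical differences
print(scaled_01, scaled_02)  # after scaling the two distances are comparable
```

Without scaling, the age gap alone makes records 0 and 2 look far more different than records 0 and 1, even though the latter pair shares neither colour nor shape. How much weight the continuous column should carry after scaling is still a modelling decision.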

Answered by Nicholas James Bailey on September 29, 2021
