TransWikia.com

One-Hot Encoding vs Binary Encoding for a Binary Attribute When Clustering

Data Science Asked by onelostdude on September 26, 2021

I am planning to use data for a clustering problem that contains a column with a binary value BUY/SELL.

Should I convert this attribute to binary values (BUY=1, SELL=0) and keep it in a single column, thus reducing the number of dimensions,

OR

one-hot encode the attribute (adding two columns, BUY and SELL, and putting a 1 in the appropriate column)?

How do these two methods of nominal-to-numeric conversion affect the final model for popular clustering algorithms (k-means, hierarchical, etc.)?

2 Answers

There is not much difference in your case. The two encodings differ by only one dimension, which does not affect the result much. The only point I can add is that if the numbers of BUY and SELL values are not the same, you can replace each value with its relative frequency, i.e. if 40% of the rows are BUY and 60% are SELL, replace BUY with 0.4 and SELL with 0.6.
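
The frequency replacement described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `sides` list is a made-up sample column chosen to reproduce the 40%/60% split, and the variable names are hypothetical, not taken from the question.

```python
from collections import Counter

# Hypothetical sample column: 2 BUY and 3 SELL rows (40% / 60%).
sides = ["BUY", "SELL", "SELL", "BUY", "SELL"]

# Relative frequency of each category in the column.
counts = Counter(sides)
freq = {value: n / len(sides) for value, n in counts.items()}

# Replace every value with the frequency of its category.
encoded = [freq[s] for s in sides]

# Gap between the two categories after encoding.
gap = abs(freq["BUY"] - freq["SELL"])

print(freq)
print(encoded)
print(gap)
```

One thing to keep in mind with this approach: the gap between the two encoded categories is the difference of their frequencies (about 0.2 here) rather than 1, so the attribute carries less weight in a distance-based clustering, and with a 50/50 split both categories would map to the same number.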

Answered by Kasra Manshaei on September 26, 2021

If one value has priority over the other (that is, the values are ordered), you can go with binary encoding. For example, if the values are based on education level, you can assign 0 to school-level education and 1 to college-level education.

If the values have no arithmetic relationship to each other, then you should go with one-hot encoding.

In your case, one-hot encoding is better.

Edit: If there are only two values, either binary encoding or one-hot encoding will work. This edit is based on the comment from @beamsadept.
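
To see why both encodings work for a two-valued attribute, it may help to compare the Euclidean distance between a BUY row and a SELL row under each one. The sketch below is dependency-free and purely illustrative; the helper names `binary_encode`, `one_hot_encode` and `euclidean` are hypothetical and only this one attribute is considered.

```python
import math

def binary_encode(side):
    # Single column: BUY -> 1, SELL -> 0.
    return [1.0 if side == "BUY" else 0.0]

def one_hot_encode(side):
    # Two columns: [is_BUY, is_SELL].
    return [1.0 if side == "BUY" else 0.0,
            1.0 if side == "SELL" else 0.0]

def euclidean(a, b):
    # Plain Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_binary = euclidean(binary_encode("BUY"), binary_encode("SELL"))
d_one_hot = euclidean(one_hot_encode("BUY"), one_hot_encode("SELL"))

print(d_binary)   # distance with one 0/1 column
print(d_one_hot)  # distance with two one-hot columns
```

Rows with the same value are at distance 0 in both encodings, while rows with different values are at distance 1 with the single column and sqrt(2) with the two one-hot columns. For a distance-based method such as k-means, one-hot encoding a binary attribute therefore acts like giving that attribute a somewhat larger weight; it does not add new information.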

Answered by Venkat on September 26, 2021
