Data Science Asked by kjtheron on December 14, 2020
I have a dataset that contains 15 categorical features (non-ordinal factors with 2 or 3 levels) and 3 continuous numeric features. Seeing as most machine learning algorithms require numerical input features, and some actually one-hot encode categorical features automatically on the fly (random forest, glmnet etc.), should you not perform one-hot encoding during data pre-processing, so that you can explore the relationships in the encoded feature data? Or is it better to explore relationships in the raw categorical data and only encode just before running the algorithms?
Basically, my question revolves around data exploration and data understanding: should these be performed on the raw or on the encoded categorical features?
To me it depends, because I would separate categorical variables into a few types:
These are the kinds of changes you can make. Whether you apply OHE (one-hot encoding) directly or pre-process the variable first depends on the variable.
If your question is, for example, whether to do feature selection before or after OHE, I'd suggest doing it mainly after: remove useless variables (those carrying no information), then OHE/pre-process the remaining ones, and then run feature selection again.
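As a rough illustration of that order (drop, encode, select), here is a minimal pure-Python sketch. The toy rows, the column names, and the `min_gap` selection rule are all assumptions made up for this example; in practice a library encoder such as pandas `get_dummies` and a proper selection method would replace the hand-written steps.

```python
# Toy rows; column names and values are hypothetical.
rows = [
    {"colour": "red",   "site": "A", "size": "S", "label": 1},
    {"colour": "blue",  "site": "A", "size": "M", "label": 0},
    {"colour": "red",   "site": "A", "size": "L", "label": 1},
    {"colour": "green", "site": "A", "size": "S", "label": 0},
    {"colour": "red",   "site": "A", "size": "M", "label": 1},
    {"colour": "blue",  "site": "A", "size": "L", "label": 0},
]
target = "label"


def drop_constant(rows, target):
    """Step 1: remove columns that hold a single value (no information)."""
    keep = [c for c in rows[0] if c == target or len({r[c] for r in rows}) > 1]
    return [{c: r[c] for c in keep} for r in rows]


def one_hot(rows, target):
    """Step 2: build one 0/1 column per (column, level) pair."""
    levels = sorted({(c, r[c]) for r in rows for c in r if c != target})
    return [
        {**{f"{c}_{v}": int(r[c] == v) for c, v in levels}, target: r[target]}
        for r in rows
    ]


def select(encoded, target, min_gap=0.5):
    """Step 3: keep dummies whose target rate differs between their 1 and 0 groups.

    This gap rule is a stand-in for a real feature-selection method.
    """
    kept = []
    for col in (c for c in encoded[0] if c != target):
        on = [r[target] for r in encoded if r[col] == 1]
        off = [r[target] for r in encoded if r[col] == 0]
        if on and off and abs(sum(on) / len(on) - sum(off) / len(off)) >= min_gap:
            kept.append(col)
    return kept


cleaned = drop_constant(rows, target)   # "site" is constant, so it is dropped
encoded = one_hot(cleaned, target)      # colour_* and size_* dummy columns
selected = select(encoded, target)      # selection runs on the encoded columns
print(selected)
```

In this toy data the `size_*` dummies carry no signal about the label, so only the `colour_*` dummies survive the last step.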
Let's take an example: a variable called Age binned into classes like [0;10], [10;20], ... It is often significant whether the value is >80 or <20, but it makes little difference whether it is 35 or 45, so feature selection after OHE will keep only Age_[0;10], Age_[10;20], Age_[80;90] and Age_90+.
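The Age example can be sketched the same way. The data below is invented so that the outcome depends on age only at the extremes, the 0.3 threshold is arbitrary, and boundary ages (10, 20, ...) are assumed to fall into the upper class, which the interval notation above leaves open.

```python
def age_bin(age):
    """Map an age to a 10-year class label; 90 and above share one class."""
    if age >= 90:
        return "Age_90+"
    low = (age // 10) * 10
    return f"Age_[{low};{low + 10}]"


# Made-up (age, outcome) pairs, sorted by age: the young all have outcome 1,
# the old all have outcome 0, and the middle classes are an even mix.
data = [
    (5, 1), (8, 1), (12, 1), (17, 1),
    (23, 0), (27, 1), (34, 1), (38, 0), (42, 0), (46, 1),
    (55, 1), (58, 0), (63, 0), (66, 1), (71, 1), (77, 0),
    (82, 0), (86, 0), (91, 0), (96, 0),
]

# One dummy column per age class, in order of first appearance.
labels = list(dict.fromkeys(age_bin(age) for age, _ in data))
encoded = [({lab: int(age_bin(age) == lab) for lab in labels}, y) for age, y in data]


def rate_gap(label):
    """Absolute difference in outcome rate between rows inside and outside a class."""
    on = [y for row, y in encoded if row[label] == 1]
    off = [y for row, y in encoded if row[label] == 0]
    return abs(sum(on) / len(on) - sum(off) / len(off))


# Keep only the dummies that separate the outcome; 0.3 is an arbitrary cut-off.
selected = [lab for lab in labels if rate_gap(lab) > 0.3]
print(selected)
```

With this data the middle classes have the same outcome rate as everything else, so only the four extreme classes are kept. Selecting on the raw Age variable before encoding could not drop some classes and keep others.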
Correct answer by BeamsAdept on December 14, 2020