When to One-Hot encode categorical data when following Crisp-DM

Question

I have a dataset that contains 15 categorical features (non-ordinal factors with 2 or 3 levels) and 3 continuous numeric features. Since most machine learning algorithms require numerical input features, and some (random forest, glmnet etc.) actually One-Hot encode them automatically on the fly, should you not perform One-Hot encoding during data pre-processing, so that you can explore the relationships between the encoded features? Or is it better to explore the relationships between the raw categorical data and only encode right before running the algorithms?
Basically, my question revolves around data exploration and data understanding, and whether these should be performed on the raw or on the encoded categorical features.

BeamsAdept · Accepted Answer

To me it depends, because I would distinguish several types of categorical variables:

Categorical variables with few classes: OneHot as soon as you can.
Categorical variables with some highly-represented classes and some low-represented classes: You can pre-process by regrouping all the low-represented classes into a single large "Other" class, and then OneHot to get a reasonable number of variables.
Categorical variables with A LOT of low-represented classes: If you OneHot directly, you'll create a huge number of variables, which quickly becomes impractical. Instead you can, for example, compute for each class the rate of "1" targets on your training set (X_train), and then replace the class by this rate. The result is continuous, between 0 and 1, carries information and is accepted by all models. This is called Target Encoding, and some packages built to be compatible with sklearn exist to do it automatically (with encoders like TargetEncoder, LeaveOneOut, WeightOfEvidence or JamesStein).
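As a rough illustration of the second and third cases, here is a minimal pandas sketch. The column names (`city`, `y`), the toy data, the `min_count` threshold and the helper functions `group_rare` and `target_encode` are all made up for this example; it is not how any particular encoding package implements it.

```python
import pandas as pd

def group_rare(series, min_count):
    """Replace classes seen fewer than min_count times with 'Other' (hypothetical helper)."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "Other")

def target_encode(train_col, train_target, new_col):
    """Map each class to the mean target observed on the training set (hypothetical helper)."""
    rates = train_target.groupby(train_col).mean()
    # Classes never seen in training fall back to the overall training rate.
    return new_col.map(rates).fillna(train_target.mean())

# Toy training data (assumed): one categorical feature and a binary target.
train = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C", "D"],
    "y":    [1,   0,   1,   0,   0,   1,   0],
})

# Case 2: regroup rare classes into "Other", then one-hot encode.
grouped = group_rare(train["city"], min_count=2)
one_hot = pd.get_dummies(grouped, prefix="city")

# Case 3: target encoding, with the rates computed on the training rows only.
encoded = target_encode(train["city"], train["y"], train["city"])
```

Note that encoding the training rows with rates computed from their own targets leaks the target into the feature, which is one reason variants such as leave-one-out encoding exist.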

These are the kinds of transformations you can apply; whether you OHE directly or pre-process first depends on the variable.
If your question is, for example, whether to do feature selection before or after OHE, I'd suggest doing it mainly after: remove useless variables (those carrying no information), then OHE/pre-process the remaining ones, and then do feature selection again.
Let's take an example: a variable called Age with classes like [0;10], [10;20], ... It is often significant whether the value is >80 or <20, but it makes little difference whether it's 35 or 45, so feature selection after OHE will only keep Age_[0;10], Age_[10;20], Age_[80;90] and Age_90+.
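To make the Age example concrete, here is a small sketch of "OHE first, then select". The data, the bin labels and the selection rule (keep a dummy column only if the target rate inside that bin differs from the overall rate by more than an arbitrary 0.2) are assumptions chosen for illustration; a real project would more likely use a proper feature selection method.

```python
import pandas as pd

# Toy data (assumed): four age bins with four rows each, and a binary target.
df = pd.DataFrame({
    "age_bin": ["[0;10]"] * 4 + ["[30;40]"] * 4 + ["[40;50]"] * 4 + ["90+"] * 4,
    "y": [1, 1, 1, 1,  1, 0, 1, 0,  0, 1, 0, 1,  0, 0, 0, 0],
})

# One-hot encode the binned variable: one column per age bin.
dummies = pd.get_dummies(df["age_bin"], prefix="Age")

overall_rate = df["y"].mean()
threshold = 0.2  # arbitrary cut-off, for illustration only

# Keep only the bins whose target rate deviates clearly from the overall rate.
kept = []
for col in dummies.columns:
    in_bin = dummies[col].astype(bool)
    bin_rate = df.loc[in_bin, "y"].mean()
    if abs(bin_rate - overall_rate) > threshold:
        kept.append(col)

selected = dummies[kept]
```

In this toy data only the extreme bins survive the selection, while the middle bins, whose target rate equals the overall rate, are dropped.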
