TransWikia.com

How to perform feature selection on dataset with categorical and numerical features?

Data Science Asked by Songyu Yan on September 5, 2021

I am working on a dataset with 30 columns (29 numerical, 1 non-ordinal categorical). I one-hot encoded the categorical feature and ended up with 35 columns. To improve training efficiency, I want to perform feature selection on my dataset. However, I am confused about how to handle a dataset with categorical and numerical features combined.

  1. I read that it is not reasonable to apply PCA to dummies given they are discrete. Is it reasonable to apply PCA to the numerical features first and then concatenate the result with the dummies?
  2. I tried to apply recursive feature elimination with cross-validation (RFECV) to the entire feature space. But I don't think it is reasonable to remove some but not all of the dummy features, given they are generated from a single category.

Any suggestions? Any help is appreciated.
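For concreteness, the RFECV setup described in point 2 can be sketched like this. This is a minimal example on synthetic data; the estimator choice and the column counts are placeholders, not the actual 35-column dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a mixed numerical + dummy feature matrix
X, y = make_classification(n_samples=200, n_features=12,
                           n_informative=4, random_state=0)

# RFECV recursively drops the weakest features, choosing how many to keep
# by cross-validation. Note it treats every column independently, so it
# may indeed keep only some of the dummies derived from one category.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3).fit(X, y)
print(selector.n_features_)   # number of features RFECV decided to keep
print(selector.support_)      # boolean mask over the original columns
```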

python pandas scikit-learn feature-selection

2 Answers

Feature selection and feature engineering are more of an art than a matter of applying readily available techniques.

I suggest you learn and practice intelligent EDA and try to eliminate/create/merge features.
- Kaggle has many kernels/discussions on this topic.
- For richer intuition, read the book Feature Engineering and Selection, especially Chapter 4. Observe how the author walks through different findings in EDA.


Categorical Features Encoding -
- You have only 1 categorical feature, and one with small cardinality, alongside 29 numerical features. I suggest focusing elimination on the numerical features. You can try PCA on a subset of features (Ref.).
Try it on the 29 numerical columns and see the results.

- Try other approaches to categorical encoding; see the category_encoders library. Read the links under its references to build understanding. Even for OHE, you will like this library.
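One of the alternatives such libraries offer is target (mean) encoding, which keeps a high-cardinality or non-ordinal category as a single column. A minimal hand-rolled sketch of the idea, on made-up toy data (in practice use a library such as category_encoders, which also handles train/test leakage via fold-wise fitting):

```python
import pandas as pd

# Toy data: one non-ordinal categorical column and a numeric target
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "y":     [1.0,   0.0,    1.0,   0.5,     0.0,    0.5],
})

# Target (mean) encoding: replace each category by the mean target value
# observed for that category. "red" -> (1.0 + 1.0 + 0.5) / 3 ≈ 0.833
means = df.groupby("color")["y"].mean()
df["color_te"] = df["color"].map(means)
print(df["color_te"].tolist())
```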

> I don't think it is reasonable to remove some but not all dummy features given they are generated out of one category

Once you encode a categorical feature, you have a new set of features, and you treat each one as an independent feature. It is quite possible, based on the analysis, that just a few of them are not useful, and we remove those.

PCA on One Hot Encoded data

- You will get an output, but I am not very sure it adds predictive power. There are a few conflicting references: Ref - Reddit, Ref - SE.
- There are other techniques suggested for categorical and mixed data: Ref - SE, Library.
Try different combinations and see.
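One common compromise between "PCA on everything" and "no PCA at all" is to run PCA only on the numerical block and pass the dummies through unchanged, which is what point 1 of the question proposes. A minimal sketch with scikit-learn's ColumnTransformer, using made-up column names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

# Toy frame: 4 numerical columns plus 2 dummy columns from one-hot encoding
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["n1", "n2", "n3", "n4"])
df["cat_a"] = rng.integers(0, 2, size=100)
df["cat_b"] = 1 - df["cat_a"]

num_cols = ["n1", "n2", "n3", "n4"]
dummy_cols = ["cat_a", "cat_b"]

# PCA only on the numerical block; the dummies pass through untouched
ct = ColumnTransformer(
    [("pca", PCA(n_components=2), num_cols)],
    remainder="passthrough",
)
X = ct.fit_transform(df[num_cols + dummy_cols])
print(X.shape)  # (100, 4): 2 principal components + 2 dummies
```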


Last, try the feature importance technique using a Random Forest. Ref - MachineLearning Mastery
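A minimal sketch of that technique, on synthetic data standing in for the 35-column matrix: fit a random forest, rank columns by impurity-based importance, and keep the top k (k here is an arbitrary placeholder):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 10 features, only 3 of which carry signal
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance; keep the top 5
order = np.argsort(rf.feature_importances_)[::-1]
top_k = order[:5]
print(top_k)
```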

Answered by 10xAI on September 5, 2021

It is fine to apply feature selection techniques to one-hot encoded variables. If one particular level of the variable is correlated with your target, that is good news: your model will understand the scenario better.

Or, you can label-encode your categorical variable first so that you still have 30 variables (29 numerical + 1 label-encoded categorical variable). Now find the importance of each variable and keep the relevant ones (use any method: RFE, random forest feature importance, Pearson's correlation, etc.). If, in the final list, the label-encoded variable also comes out as relevant, it is fine to put it into the model.
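A minimal sketch of that workflow on toy data (column names and the target construction are invented for illustration; note scikit-learn documents LabelEncoder for targets, with OrdinalEncoder as the feature-side equivalent):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["n1", "n2", "n3"])
df["cat"] = rng.choice(["a", "b", "c"], size=200)
# Target depends on n1 and on the category, so both should rank as relevant
y = (df["n1"] + (df["cat"] == "a") > 0.5).astype(int)

# Label-encode the single categorical column -> still one column, 4 total
df["cat"] = LabelEncoder().fit_transform(df["cat"])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(df, y)
imp = pd.Series(rf.feature_importances_, index=df.columns)
print(imp.sort_values(ascending=False))
```

If the `cat` column's importance holds up against the noise columns `n2` and `n3`, keeping the label-encoded version in the model is reasonable.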

Answered by Deepak on September 5, 2021
