Represent Integer Categorical feature as both Numeric and Categorical

Cross Validated Asked by user2991421 on February 27, 2021

I’m dealing with tabular datasets where it’s really hard to tell whether an integer column is numeric or categorical. My main consideration is the accuracy of the model that I am building (no deep learning). Thus, I’m wondering if I can treat the integer column as both numeric (use it as is) and categorical (one-hot encode it, or use a decision tree with set-based splits). That is, give the model multiple representations of the same column at the same time and let it figure out which features are suitable.
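To make the idea concrete, here is a minimal sketch of what "both representations at once" could look like for a single integer column. It uses only the standard library; the helper `dual_representation` and the column name `rooms` are hypothetical names chosen for illustration, not part of any library.

```python
from typing import Dict, List, Sequence


def dual_representation(values: Sequence[int], name: str = "x") -> List[Dict[str, float]]:
    """Hypothetical helper: one row per value, holding the raw number
    plus one indicator (one-hot) column per distinct level observed."""
    levels = sorted(set(values))  # distinct integer levels seen in the data
    rows = []
    for v in values:
        # Numeric view: keep the integer as is.
        row = {f"{name}_numeric": float(v)}
        for level in levels:
            # Categorical view: 1.0 for the matching level, 0.0 otherwise.
            row[f"{name}_is_{level}"] = 1.0 if v == level else 0.0
        rows.append(row)
    return rows


# Illustrative data only: "rooms" is an assumed column name.
rows = dual_representation([3, 1, 3, 7], name="rooms")
for row in rows:
    print(row)
```

Each row then carries `rooms_numeric` alongside `rooms_is_1`, `rooms_is_3`, and `rooms_is_7`, so a model sees both views of the same column. The sketch only encodes levels present in the data it is given; values unseen at training time would need separate handling. In scikit-learn, a comparable setup should be possible by listing the same column twice in a `ColumnTransformer`, once with `"passthrough"` and once with `OneHotEncoder`.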

My question is: are there scenarios where this multiple-representation approach makes sense, and scenarios where it does not? If so, how does that depend on the model being trained and on the bias-variance tradeoff, for instance logistic regression (high bias) versus random forest (high variance)? Are there any established theories or best practices that show the advantages or disadvantages of doing this? I’m asking in the context of classification problems.

Tags: categorical data, categorical encoding, continuous data, machine learning, random forest
