
Continuous VS Categorical variable

Data Science Asked by Shipra Sharan on December 8, 2020

My dataset contains both a continuous variable, AGE, and a categorical variable, AGE_CATEGORY. The two are highly correlated.

Which method should I use to decide which feature to drop, AGE or AGE_CATEGORY?

3 Answers

If your goal is to use them to train a supervised machine learning model, the best approach is to find out which of the two predicts your output better.

AGE carries more information than AGE_CATEGORY, so if I had to remove one of them, I would remove AGE_CATEGORY.
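
To make the "more information" point concrete, here is a minimal sketch. The bin edges and category names are hypothetical, since the question does not say how AGE_CATEGORY is defined. It shows that a category can always be computed from an age, while several different ages collapse into the same category and cannot be recovered from it.

```python
from collections import defaultdict


def age_category(age):
    # Hypothetical bin edges; the real AGE_CATEGORY definition is not given.
    if age < 18:
        return "child"
    if age < 40:
        return "young_adult"
    if age < 65:
        return "middle_aged"
    return "senior"


ages = [5, 17, 18, 25, 39, 40, 64, 65, 80]

# Group the ages by the category they map to.
by_category = defaultdict(list)
for age in ages:
    by_category[age_category(age)].append(age)

# Each category stands for several distinct ages, so the mapping is one-way.
for category, members in by_category.items():
    print(category, members)
```

Going from AGE to AGE_CATEGORY is a plain function call. Going back is impossible without extra data, which is the sense in which the continuous variable is the richer one.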

In addition, if you plan to train tree-based models, AGE_CATEGORY is unlikely to add much, since such models can choose their own split points on AGE.

You can run an A/B-style comparison to find out which feature predicts your output better: train the same model once with each feature and compare the results on held-out data.

Answered by nimar on December 8, 2020

It depends on the task at hand, as well as the type of modeling you are doing.

If the relationship between the response and the predictors is non-linear, and the type of model used cannot capture that non-linearity, converting continuous variables to categorical ones can be useful.

For example, if you are predicting how much people travel or earn and the model is something like linear regression, age categories make more sense: young and old people may not earn as much as those in between, and a single linear term in age cannot capture that.

If the model is tree-based, keeping the variable continuous could be more useful: it carries more information, and the model can handle the non-linearity itself.

You can pick the better one by cross-validation, using only the training data.
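
A rough sketch of that cross-validation comparison, using only the Python standard library. Everything here is an assumption for illustration: the data is synthetic, with a target that peaks in middle age, the bin edges are hypothetical, and both models are deliberately simple linear ones. In practice you would run your own model and data through something like scikit-learn's cross-validation helpers instead.

```python
import random
import statistics

random.seed(0)

# Synthetic training data (an assumption): the target peaks around age 45.
ages = [random.uniform(18, 80) for _ in range(300)]
target = [50 - 0.03 * (a - 45) ** 2 + random.gauss(0, 2) for a in ages]


def age_category(age):
    # Hypothetical bin edges.
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "old"


def fit_linear(xs, ys):
    """Least-squares line on raw AGE; returns a predict function."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_category_means(xs, ys):
    """Per-category training mean (equivalent to OLS on one-hot dummies)."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(age_category(x), []).append(y)
    means = {c: statistics.fmean(v) for c, v in groups.items()}
    overall = statistics.fmean(ys)  # fallback for a category unseen in training
    return lambda x: means.get(age_category(x), overall)


def cv_mse(fit, xs, ys, k=5):
    """Mean squared error averaged over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(1).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(
            statistics.fmean((predict(xs[i]) - ys[i]) ** 2 for i in fold)
        )
    return statistics.fmean(fold_errors)


mse_age = cv_mse(fit_linear, ages, target)
mse_category = cv_mse(fit_category_means, ages, target)
print(f"linear model on AGE:          CV MSE = {mse_age:.2f}")
print(f"linear model on AGE_CATEGORY: CV MSE = {mse_category:.2f}")
```

Because the synthetic relationship is an inverted U, the binned feature should come out ahead under a linear model here. On real data the result could go either way, which is why the comparison is worth running.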

Answered by Suren on December 8, 2020

I see you mentioned in nimar's answer that you want a statistical method to identify which of Age and Age_category is better. I assume that "better" here means a stronger relationship with the dependent variable (the response, or target). The good news is that various methods exist to quantify the bond between target and feature. However, each of these methods expresses the strength of that bond as a number, and different methods produce values on different scales, so their results are not comparable. Age and Age_category have different data types, so they cannot be measured with the same method. A value of 1 obtained from a method that measures Age is not comparable to a value of 1 obtained from a method that measures Age_category.
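
To illustrate the scale mismatch, here is a small sketch assuming a numeric target. The two measures are my own picks, not something named above: Pearson's r for the continuous Age, and the one-way ANOVA F statistic for Age_category. The data and bin edges are synthetic and hypothetical.

```python
import math
import random
import statistics

random.seed(0)

# Synthetic data, for illustration only: the target rises roughly linearly with age.
ages = [random.uniform(18, 80) for _ in range(200)]
target = [0.5 * a + random.gauss(0, 10) for a in ages]


def age_category(age):
    # Hypothetical bin edges.
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "old"


def pearson_r(xs, ys):
    """Pearson correlation coefficient; always lies in [-1, 1]."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def anova_f(labels, ys):
    """One-way ANOVA F statistic; non-negative and unbounded above."""
    groups = {}
    for label, y in zip(labels, ys):
        groups.setdefault(label, []).append(y)
    grand_mean = statistics.fmean(ys)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = statistics.fmean(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((y - group_mean) ** 2 for y in values)
    df_between = len(groups) - 1
    df_within = len(ys) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


r = pearson_r(ages, target)
f_stat = anova_f([age_category(a) for a in ages], target)
print(f"Age vs target:          Pearson r = {r:.3f}")
print(f"Age_category vs target: ANOVA F   = {f_stat:.1f}")
```

The first number is bounded by 1 in absolute value, while the second has no upper limit and grows with sample size, so putting the two side by side says nothing about which feature is "stronger".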

Answered by Tbone on December 8, 2020
