TransWikia.com

What to do if one of two one-hot encoded variables has a very high p-value?

Data Science Asked by Trina Ghosh on April 18, 2021

I ran an OLS model on a dataset with 2 categorical variables. One of them was gender. The other one had 3 different categories. I used one-hot encoding for it during pre-processing before running my model.

The one-hot encoded variables in the image are Embarked_C and Embarked_Q.
The results showed a p-value of 0.785 for Embarked_Q. In this case, should I remove both Embarked_Q and Embarked_C, or just Embarked_Q?

Regression Results

2 Answers

By all means keep Embarked_C. Consider the following example: you're predicting whether or not someone's favorite color is blue. You know the color of their favorite shirt: it's either blue, yellow, or red. Color_Blue is going to be significant; the other two one-hot encoded variables would not be. You'd still want to keep Color_Blue as a feature.
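
To make the shirt example concrete, here is a minimal sketch with made-up data (10 people per shirt color, and the 0/1 outcomes are invented purely for illustration). It fits plain least squares with NumPy, once with the Color_Blue and Color_Yellow dummies (red as the reference level, which is my assumption) and once with both dummies removed. The helper `fit` is hypothetical, not part of any library.

```python
import numpy as np

# Hypothetical data: 10 people per shirt color; y = 1 if their favorite color is blue.
shirt = np.array(["blue"] * 10 + ["yellow"] * 10 + ["red"] * 10)
y = np.array([1] * 8 + [0] * 2      # blue shirts: 8 of 10 like blue
             + [1] * 2 + [0] * 8    # yellow shirts: 2 of 10 like blue
             + [1] * 2 + [0] * 8,   # red shirts: 2 of 10 like blue
             dtype=float)

def fit(columns):
    """Least-squares fit with an intercept; returns (coefficients, residual sum of squares)."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    return beta, sse

color_blue = (shirt == "blue").astype(float)
color_yellow = (shirt == "yellow").astype(float)

# Red is the reference level, so it gets no dummy column.
beta_full, sse_full = fit([color_blue, color_yellow])
# Removing both dummies leaves an intercept-only model.
beta_none, sse_none = fit([])

print("intercept, Color_Blue, Color_Yellow:", beta_full)
print("SSE with dummies:", sse_full, "| SSE without:", sse_none)
```

In this toy data the Color_Yellow coefficient comes out at essentially zero while Color_Blue carries all of the signal, and throwing away both dummies makes the fit noticeably worse. That is the situation described above: one uninformative level is no reason to discard the informative one.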

Answered by Brandon Schabell on April 18, 2021

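You should keep all of the levels, as they collectively describe the feature. Removing the insignificant ones will bias your coefficients and distort your interpretation (i.e., it changes the reference level): a dropped level is absorbed into the reference category, so the remaining coefficients are measured against a different baseline.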
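
A rough sketch of the reference-level point, using invented numbers rather than the asker's data. It assumes the third category is Embarked_S and that it was the level left out of the encoding (the question only shows Embarked_C and Embarked_Q), and it ignores the gender variable to keep things small. The helper `ols` is hypothetical.

```python
import numpy as np

# Hypothetical outcome values grouped by embarkation port (not the asker's data).
embarked = np.array(["S"] * 6 + ["C"] * 3 + ["Q"] * 3)
y = np.array([8, 9, 10, 10, 11, 12,   # S: mean 10
              18, 20, 22,             # C: mean 20
              12, 13, 14],            # Q: mean 13
             dtype=float)

def ols(columns):
    """Least-squares coefficients for an intercept plus the given dummy columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

embarked_c = (embarked == "C").astype(float)
embarked_q = (embarked == "Q").astype(float)

# Both dummies kept: S is the baseline, so Embarked_C means "C compared with S".
beta_full = ols([embarked_c, embarked_q])
# Embarked_Q dropped: Q rows fall into the baseline, so Embarked_C now means
# "C compared with S and Q pooled together".
beta_drop_q = ols([embarked_c])

print("intercept, Embarked_C, Embarked_Q:", beta_full)
print("intercept, Embarked_C (Q dropped):", beta_drop_q)
```

With these numbers the intercept moves from the S mean to the pooled S-and-Q mean, and the Embarked_C coefficient shifts accordingly, even though nothing about the C passengers changed. The size of the shift depends on how different Q is from S and on the group sizes, so treat this as an illustration of the mechanism, not a prediction for the real model.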


Answered by nwaldo on April 18, 2021
