Asked by omdurg on December 29, 2020
I have gone through this regarding handling missing values in categorical data. My dataset has about 6 categorical columns with missing values, and this is for a binary classification problem.

I see different approaches: one is to just leave the missing values in the categorical columns as they are; another is to impute them, e.g. with from sklearn.preprocessing import Imputer (since replaced by sklearn.impute.SimpleImputer). I am unsure which is the better option.

If imputing is the better option, which libraries could I use before applying a model like Logistic Regression, Decision Tree, or Random Forest?

Thanks!
First of all, I would look at how many missing values there are in the column. If there are too many (~20%; generally it is difficult to say how much is too much), I would drop the column, because imputing 20% or more of your data without prior expert knowledge probably no longer gives you meaningful information.
Secondly, I would look at correlations between missing values and other features. Maybe you are lucky and there is some correlation between missing values in column x and a categorical value in column y. Simply look at the conditional distributions.
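A minimal pandas sketch of both checks; the file path and the column names "x" and "y" are placeholders, not from the question:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path; any DataFrame works

# Share of missing values per column (drop candidates above ~20%).
print(df.isna().mean().sort_values(ascending=False))

# Conditional distribution: does missingness in "x" vary with "y"?
print(pd.crosstab(df["x"].isna(), df["y"], normalize="columns"))
```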
If you choose to impute, check the distribution of the categorical values over the non-missing entries. If the distribution is heavily skewed, say 95% value 0 and only 5% value 1, you can impute with the median (which in such a skewed binary column is the same as the mode, 0); again, the question is how informative this is in the end. Otherwise, create an additional category that simply represents a missing value.
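Both options are one-liners in pandas, again assuming a DataFrame df with a hypothetical categorical column "x":

```python
# Pick one of the two:
df["x"] = df["x"].fillna(df["x"].mode()[0])  # impute with the most frequent value
# df["x"] = df["x"].fillna("Missing")        # or: make missingness its own category
```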
Answered by Tinu on December 29, 2020
How about training an ML classification model where all the other features are used as input and the label is your categorical column? That way you can predict the missing values.
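A rough sketch of this idea, assuming (hypothetically) a column "x" to impute and that the remaining features are already numeric and complete; RandomForestClassifier stands in for any classifier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv")  # placeholder dataset with categorical column "x"
features = df.drop(columns=["x"]).select_dtypes("number")  # assumed complete

# Train on the rows where "x" is known, predict where it is missing.
known = df["x"].notna()
clf = RandomForestClassifier(random_state=0)
clf.fit(features[known], df.loc[known, "x"])
df.loc[~known, "x"] = clf.predict(features[~known])
```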
Answered by vipin bansal on December 29, 2020
The first question we must ask is “why are these values missing?”
- Skip the feature if more than ~25% of its values are missing.
- Try to learn the reason from the data source/provider. They might give a clue you can use, e.g. one city had a power failure during data collection.
- Simply create a new category for the missing values and check the result. This will only work when there is an underlying reason for the missingness.
- Try calculating, or guessing from domain knowledge, the correlation with another feature, and then fill with the respective group values. I make this point to avoid a blind mean/median over the full column: the full-column mean might be ~750 when the right fill for the rows in question is ~100 (see the group-conditional fill in the sketch after this list).
- K Nearest Neighbours - this can do both steps of the previous point in one go. Fortunately, scikit-learn has an imputer for it: sklearn.impute.KNNImputer (keep one categorical column at a time; see the sketch after this list).
- Blind approach - simply replace with the mean/median; for categorical columns, use the most frequent value (the mode), i.e. SimpleImputer(strategy="most_frequent") (also in the sketch below).

Try a few and monitor the results to decide on the best approach.
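Below is a toy sketch of the group-conditional fill, the blind most_frequent fill, and the KNN route. All data and column names are invented, and since KNNImputer only handles numeric input, the categorical column is integer-encoded first and decoded after, which is one reasonable workaround, not the only one:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Invented toy data: "city" is categorical, "power" is numeric.
df = pd.DataFrame({"city": ["A", "B", np.nan, "A"],
                   "power": [100.0, 900.0, 110.0, np.nan]})

# Group-conditional fill: the full-column mean (~370 here) would be wrong;
# use the mean within the row's own "city" group (100 for city A) instead.
df["power"] = df["power"].fillna(df.groupby("city")["power"].transform("mean"))

# Blind approach: fill the categorical column with its most frequent value.
city_simple = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])

# KNN approach: encode categories as integer codes (NaN where missing),
# impute alongside the correlated numeric feature, then decode.
codes = df["city"].astype("category")
num = pd.DataFrame({"city": codes.cat.codes.where(codes.notna(), np.nan),
                    "power": df["power"]})
filled = KNNImputer(n_neighbors=2).fit_transform(num)
city_knn = codes.cat.categories[np.rint(filled[:, 0]).astype(int)]
```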
A thoughtful read - Max Kuhn and Kjell Johnson.
One point I want to make: look at the data as an event with a cause and an effect, and try to figure things out before reaching straight for the hammer/guns, especially if it is a real project. It is OK to go straight to the tools if it is just a learning exercise.
Answered by 10xAI on December 29, 2020