TransWikia.com

Handling categorical missing values in ML

Data Science Asked by omdurg on December 29, 2020

I have gone through this regarding handling missing values in categorical data.

My dataset has about six categorical columns with missing values. This is for a binary classification problem.

I see different approaches: one is to just leave the missing values in the categorical columns as they are; another is to impute them using sklearn.preprocessing.Imputer. I'm unsure which is the better option.

If imputing is the better option, which libraries could I use before applying a model like logistic regression, decision trees, or random forests?

Thanks!

3 Answers

First of all, I would look at how many missing values there are in the column. If there are too many (~20%; it's generally hard to say how much is too much), I would drop the column, because imputing 20% or more of your data without prior expert knowledge probably does not give you meaningful information anymore.
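As a minimal sketch of that first check (toy column names, assuming pandas), computing the missing fraction per column before deciding what to drop could look like:

```python
import pandas as pd

# Toy data: "color" and "size" are made-up categorical columns
df = pd.DataFrame({
    "color": ["red", None, "blue", "red", None],
    "size": ["S", "M", None, "L", "M"],
})

# Fraction of missing values per column
missing_frac = df.isna().mean()

# Drop columns above a chosen threshold (20% here, as a rule of thumb)
to_drop = missing_frac[missing_frac > 0.20].index.tolist()
df_reduced = df.drop(columns=to_drop)
```

The 20% threshold is just the rule of thumb mentioned above; in practice the right cutoff depends on the dataset and domain.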

Secondly, I would look at correlations between missing values and other features. Maybe you are lucky and there is some correlation between missing values in column x and a categorical value in column y. Simply look at the conditional distributions.
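A quick way to eyeball such conditional distributions (a sketch with made-up columns x and y, assuming pandas) is to cross-tabulate the missingness indicator of x against y:

```python
import pandas as pd

# Toy data: x has missing values, y is another categorical feature
df = pd.DataFrame({
    "x": ["a", None, "b", None, "a", "b"],
    "y": ["u", "v", "u", "v", "u", "u"],
})

# Distribution of y, conditional on whether x is missing
# (normalize="index" gives row-wise proportions)
table = pd.crosstab(df["x"].isna(), df["y"], normalize="index")
```

In this toy example every row where x is missing has y == "v", which is exactly the kind of pattern the answer suggests looking for.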

If you choose to impute, check the distribution of the categorical values over the non-missing entries. If the distribution is heavily skewed, say 95% of entries are value 0 and only 5% are value 1, you can impute with the median (which for such a skewed column is also the most frequent value). Again, the question is how informative this is in the end. Otherwise, create an additional category that simply represents a missing value.

Answered by Tinu on December 29, 2020

How about training an ML classification model where all the other features are used as input and the label is your categorical column? That way you can predict the missing values.
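A hypothetical sketch of this idea with made-up feature names: fit a classifier on the rows where the category is known, then predict it for the rows where it is missing.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: "cat" is the categorical column with missing values,
# "f1" and "f2" are made-up predictor features
df = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5, 6, 10, 11],
    "f2": [0, 0, 1, 1, 0, 1, 0, 1],
    "cat": ["a", "a", "b", "b", "a", "b", None, None],
})

known = df[df["cat"].notna()]
unknown = df[df["cat"].isna()]

# Train on complete rows, predict the missing labels
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(known[["f1", "f2"]], known["cat"])
df.loc[df["cat"].isna(), "cat"] = clf.predict(unknown[["f1", "f2"]])
```

One caveat with this approach: the imputed labels inherit whatever bias the model has, so it works best when the other features genuinely predict the missing column.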

Answered by vipin bansal on December 29, 2020

The first question we must ask is “why are these values missing?”

  1. Skip the feature if more than ~25% of its values are missing.

  2. Try to learn the reason from the data source/provider. They might give a clue you can use, e.g. one city had a power failure during data collection.

  3. Simply create a new category for the missing values and check the result. This will only work when there is an underlying reason for the missingness.

  4. Try calculating, or guessing from domain knowledge, the correlation with other features, and then fill with the corresponding conditional values. I make this point to avoid blindly applying the mean/median over the full column: e.g. the mean over the full column might be ~750, while for rows similar to the ones with missing values it is ~100, in which case we should fill with ~100.


  1. K Nearest Neighbours - this can do both of the steps in #3 above in one go. Fortunately, scikit-learn has an imputer for it, e.g. sklearn.impute.KNNImputer (keep one categorical column at a time).

  2. Blind approach - simply replace with the mean/median. For categorical data, use the most frequent value (the mode): SimpleImputer(strategy="most_frequent").

  3. Try a few approaches and monitor the results to decide which works best.

  4. A thoughtful read: Max Kuhn and Kjell Johnson.

    One point I wanted to make is to look at the data as an event/cause-effect and try to figure things out before reaching directly for the hammer/gun, especially if it is a real project. It's fine if it is just a learning exercise.
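A short sketch of the blind approach from point 2 above (column names are made up; note that KNNImputer from point 1 requires numeric input, so categorical columns would have to be encoded first):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with two made-up categorical columns
X = pd.DataFrame({
    "city": ["NY", "NY", np.nan, "LA", "NY"],
    "plan": ["basic", np.nan, "pro", "basic", "basic"],
})

# Mode imputation per column, as in point 2
imp = SimpleImputer(strategy="most_frequent")
X_filled = pd.DataFrame(imp.fit_transform(X), columns=X.columns)
```

As point 3 suggests, it is worth comparing downstream model performance with this against the other strategies rather than committing to one up front.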

Answered by 10xAI on December 29, 2020
