Can I deal with a missing not at random column by creating a new column? (Feature engineering)

Question

Task: Binary classification
Example problem:
Let's say we have two feature columns, A and B. A has no nulls and is binary: 1 if a user completed an action, 0 if they didn't. For all users who completed the action, B is the resulting score. As a result, B is null for those who didn't complete the action (missing not at random, since the nulls in B depend on A).
To deal with this missing not at random problem, is it possible to create a new variable that is equal to 1 if the user completes the action and achieves a certain score, 0 otherwise?
The column B is valuable but I'm trying to find a good way of dealing with the nulls.

A    B
1    94
0    (null)
1    45

Fab · Answer

Creating a new variable as you described would be redundant, since it is a function of the other two variables. In other words, it adds no new information.
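To make that concrete, here is a minimal sketch using pandas and the three example rows from the question. The threshold of 50 and the column name `passed` are assumptions chosen purely for illustration; the question only says "a certain score".

```python
import numpy as np
import pandas as pd

# The three example rows from the question; B is NaN where A == 0.
df = pd.DataFrame({"A": [1, 0, 1], "B": [94.0, np.nan, 45.0]})

# Hypothetical cut-off: the question does not say which score counts.
THRESHOLD = 50

# The proposed flag is computed from A and B alone.
# Comparing NaN with >= yields False, so rows with a missing B get 0.
df["passed"] = ((df["A"] == 1) & (df["B"] >= THRESHOLD)).astype(int)

print(df)
```

Every value of `passed` is determined by A and B, which is the sense in which the new column carries nothing the model could not already derive from those two columns.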
The suggestion below assumes the model cannot handle missing values. Many of the best models can, though (e.g. xgboost, which is typically one of the best for classification), and will deal with them in a smart way for you.
Just fill the missing values via imputation, e.g. by using the mean score, or by training a model on all the other features to predict the scores and then predicting the missing ones. The model will still be able to take into account that some users didn't complete the action, so it won't simply treat them as if they had earned the imputed score: their value in column A is 0, and that is useful information.
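A rough sketch of the mean-imputation option, assuming pandas and the example rows from the question (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# The three example rows from the question; B is NaN where A == 0.
df = pd.DataFrame({"A": [1, 0, 1], "B": [94.0, np.nan, 45.0]})

# Series.mean() skips NaN by default, so this is the mean of the
# observed scores only.
observed_mean = df["B"].mean()

# Fill the missing scores and leave A untouched, so the model can still
# tell imputed rows (A == 0) apart from real scores (A == 1).
df_imputed = df.assign(B=df["B"].fillna(observed_mean))

print(df_imputed)
```

In a real pipeline the mean would normally be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.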
