TransWikia.com

How to treat the undefined values which make sense?

Data Science Asked by qwertzuiop on July 27, 2021

I’m currently trying to create a few features to improve the performance of a model. One of those features is the difference in days between a customer’s purchase and their previous one. Creating this feature is not a problem. However, I don’t know which value to use when this is a customer’s first purchase. Which value should I set and, more generally, how should these cases be treated?

   customer_id date_purchase  diff_last_purchase  first_purchase
0            1    2018.02.12                 NaN               1
1            1    2018.02.18                 6.0               0
2            2    2018.02.25                 NaN               1
3            3    2018.03.15                 NaN               1
4            3    2018.03.18                 3.0               0
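For reference, a feature like this can be built with a grouped `diff` in pandas; this is a minimal sketch reproducing the table above (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "date_purchase": pd.to_datetime(
        ["2018-02-12", "2018-02-18", "2018-02-25", "2018-03-15", "2018-03-18"]
    ),
})

# Days since the same customer's previous purchase; NaN when none exists.
df["diff_last_purchase"] = (
    df.sort_values("date_purchase")
      .groupby("customer_id")["date_purchase"]
      .diff()
      .dt.days
)

# Flag the rows where the value is undefined (the customer's first purchase).
df["first_purchase"] = df["diff_last_purchase"].isna().astype(int)
print(df)
```

The NaN rows are exactly the first purchases, which is what the question asks how to handle.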

2 Answers

Recently had a discussion on the same topic at work. It boiled down to encoding missing values either as impossible values (negative, very high) or as information inferred from the dataset (mean, median). Some more sophisticated methods use models built on the rest of the data (the non-missing columns) to predict the missing ones.

If you are using a tree-based approach, setting it to -1 should be fine as a start, since there might be observations where the previous purchase was on the same day (0), so 0 is already taken.
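As a quick sketch, the sentinel fill is a one-liner; -1 works here because the real feature can never be negative (same-day repeats already occupy 0):

```python
import numpy as np
import pandas as pd

diff = pd.Series([np.nan, 6.0, np.nan, np.nan, 3.0], name="diff_last_purchase")

# -1 is an "impossible" value for days-since-last-purchase,
# so a tree can learn to split first purchases away from the rest.
diff_filled = diff.fillna(-1)
print(diff_filled.tolist())
```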

With mean-trending models (e.g. linear regression), setting it to the mean might also be fine, but you need to calculate the mean on the training set only and then propagate that same value to the test set.
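The train/test point above can be sketched like this (toy series, not the question's data); the key is that the test set is filled with the statistic computed on the training set, never with its own mean:

```python
import numpy as np
import pandas as pd

train = pd.Series([np.nan, 6.0, 3.0])
test = pd.Series([np.nan, 10.0])

# Compute the imputation value on the training set only...
train_mean = train.mean()  # (6.0 + 3.0) / 2 = 4.5

# ...then apply the same value to both splits to avoid leakage.
train_imputed = train.fillna(train_mean)
test_imputed = test.fillna(train_mean)
```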

Answered by ludan on July 27, 2021

In general, if we don't know the reason the data is missing, it's hard to treat it properly. That reason can strongly affect our conclusions, so my first recommendation is always to try to figure out why the data is missing in the first place.

Normally, there are three kinds of missing data (definitions from Wikipedia):

  • Missing completely at random

Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR.

  • Missing at random

Missing at random (MAR) occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information. Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness.

  • Missing not at random

Missing not at random (MNAR) (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR (i.e. the value of the variable that's missing is related to the reason it's missing).

Also from that Wikipedia page:

Missing data reduces the representativeness of the sample and can therefore distort inferences about the population. Generally speaking, there are three main approaches to handle missing data: (1) Imputation—where values are filled in the place of missing data, (2) omission—where samples with invalid data are discarded from further analysis and (3) analysis—by directly applying methods unaffected by the missing values.
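The three approaches quoted above can be illustrated on a toy series (my own sketch, not from the original answer):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 6.0, np.nan, 3.0])

# (1) Imputation: fill in the missing values, e.g. with the median.
imputed = s.fillna(s.median())

# (2) Omission: discard the samples that contain missing values.
omitted = s.dropna()

# (3) Analysis: pass the data as-is to a method that tolerates NaN
#     (gradient-boosted tree libraries such as LightGBM and XGBoost
#     handle missing values natively), so no preprocessing is needed.
```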

"In the case of MCAR, the missingness of data is unrelated to any study variable", so you can just drop them or do some reasonable imputation and an continue to analyze the data. I know only one test to test MCAR, which is Little's Test.

I don't know how to deal with the second and third cases, so all I can say is please treat them with care. If the ratio of missing data is too large, I simply drop the whole dataframe.

Answered by TQA on July 27, 2021

