What is the best solution to replace NaN values?

Question

I'm thinking about fitting a normal distribution to a specific column that has missing values and replacing the NaNs with random values drawn from that distribution using NumPy's normal random generator. Replacing them with zeros or the mode doesn't really make sense sometimes. When is it relevant to do so?

German C M · Answer

You are right that replacing missing values with a simple mean or mode is a common but, in many cases, unreliable imputation strategy.
scikit-learn provides some utilities for imputing missing values (have a look at https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute), for instance the KNN imputer (KNNImputer) as an additional strategy.
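As a rough illustration of the KNN imputer, here is a minimal sketch. The toy array and the choice of n_neighbors=2 are assumptions made up for this example, not recommendations for your data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data, made up for illustration; np.nan marks the missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled with the average of that feature over the
# n_neighbors nearest rows (distances computed on the observed features).
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Observed values are left untouched; only the NaN entries are replaced. Since the neighbours are found by distance, it usually helps to scale the features first.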
Keep in mind that you cannot assume your feature of interest follows a normal distribution. Instead, you can fit a kernel density estimator to model its distribution; see here: http://scikit-learn.org/stable/modules/density.html
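One way to combine this with your random-draw idea is sketched below: fit a kernel density estimator on the observed values of the column and sample from it to fill the gaps. The synthetic column, the Gaussian kernel and the bandwidth of 0.5 are all assumptions for illustration; in practice the bandwidth should be tuned (for example by cross-validation).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical skewed column with 20 values knocked out at random.
col = rng.exponential(scale=2.0, size=200)
col[rng.choice(col.size, size=20, replace=False)] = np.nan

missing = np.isnan(col)
observed = col[~missing]

# Fit the KDE on the observed values only (scikit-learn expects a 2D array).
# The bandwidth is an assumed value, not a tuned one.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(observed.reshape(-1, 1))

# Draw one sample per missing entry and write it into a copy of the column.
draws = kde.sample(n_samples=int(missing.sum()), random_state=0)
col_filled = col.copy()
col_filled[missing] = draws.ravel()
```

Note that a Gaussian kernel can produce draws slightly outside the observed range (here, small negative values for a non-negative column), so you may want to clip or resample depending on the feature.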
