In feature selection, I came across a situation where NaN values were filled with the median of the column values

Question

Why is the median value used to fill NaNs? Why not something else, like the mean? What is the logic behind using the median?

Darshan Jain · Answer

The process you described is known as imputation. Whether it makes sense to impute missing values with mean or median depends entirely on the dataset and the context of your problem.

Usually, it does not hurt to impute missing values with the mean. However, if the dataset contains outliers that distort the mean, it is probably a better idea to impute with the median, since the median is far less sensitive to outliers than the mean.
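To make the outlier effect concrete, here is a minimal sketch using only the Python standard library. The `impute` helper and the `incomes` values are made up for illustration; they are not from the original question.

```python
import math
import statistics


def impute(values, strategy="median"):
    """Return a copy of `values` with NaN entries replaced by the
    mean or median of the observed (non-NaN) entries."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if math.isnan(v) else v for v in values]


# Hypothetical column: four typical values, one extreme outlier, one NaN.
incomes = [30.0, 32.0, 35.0, 38.0, 1000.0, float("nan")]

mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")

print(mean_filled[-1])    # 227.0, pulled far up by the single outlier
print(median_filled[-1])  # 35.0, close to the typical values
```

If you work with pandas, the same idea is usually written as `df[col].fillna(df[col].median())`, with `df` and `col` standing in for your own DataFrame and column name.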

Allohvk · Answer

There is no rule that only the mean or median should be used. Depending on the situation, sometimes the mean is better and sometimes the median. In fact, there are occasions when the mode would be better, for example with categorical features, where a mean or median is not defined.
These are not the only techniques for filling NaNs; there are several other imputation methods. If you are starting out, the Titanic dataset is excellent hands-on training material: its 'Age' feature contains a number of NaNs, so you can experiment to find the best way to impute the missing data there. For some more advanced strategies (specific to the Titanic scenario), see "Missing Ages on the Titanic - Few perspectives from basic to the advanced": https://www.kaggle.com/c/titanic/discussion/157929
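As a rough illustration of when the mode is the natural choice, here is a small sketch for a non-numeric column. The `impute_mode` helper and the `ports` values are invented for this example, and missing entries are assumed to be represented by `None`.

```python
import statistics


def impute_mode(values):
    """Return a copy of `values` with None entries replaced by the
    most frequent observed value (the mode)."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical categorical column with two missing entries.
ports = ["S", "C", "S", None, "Q", "S", None]

filled = impute_mode(ports)
print(filled)  # ['S', 'C', 'S', 'S', 'Q', 'S', 'S']
```

A mean or median makes no sense for labels like these, so the most frequent value is a common simple baseline.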
