Data Science Asked by Darshan Jain on March 22, 2021
Why is the median value used to fill NaN? Why not something else, like the mean? What is the logic behind using the median?
The process you described is known as imputation. Whether it makes sense to impute missing values with mean or median depends entirely on the dataset and the context of your problem.
Usually, it does not hurt to impute missing values with the mean. However, if there are outliers in the dataset that distort the mean, it is probably better to impute with the median, since the median is far less sensitive to outliers than the mean.
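To make the difference concrete, here is a minimal sketch using pandas. The numbers are made up for illustration (they are not from any real dataset): five typical values, one outlier, and one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: five typical values, one outlier (500), one NaN.
s = pd.Series([20.0, 22.0, 21.0, 23.0, 24.0, 500.0, np.nan])

# Both statistics skip NaN by default.
mean_value = s.mean()      # pulled far upwards by the outlier
median_value = s.median()  # stays close to the typical values

# Fill the missing entry with each statistic.
filled_mean = s.fillna(mean_value)
filled_median = s.fillna(median_value)

print("mean:", mean_value)
print("median:", median_value)
```

Here the mean is about 101.7 while the median is 22.5, so filling with the mean would insert a value that looks nothing like a typical observation.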
Answered by Darshan Jain on March 22, 2021
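There is no rule that only the mean or the median should be used. Depending on the situation, sometimes the mean is better and sometimes the median. In fact, there are occasions when the mode is the better choice, for example for a categorical feature, where a mean or median is not defined.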
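As a rough illustration of the mode case, the sketch below fills a categorical column with its most frequent value. The column name `port` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
port = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

# mode() ignores NaN and returns a Series, because there can be ties;
# here we simply take the first entry.
most_common = port.mode().iloc[0]

filled = port.fillna(most_common)
print(filled.tolist())
```

If several values are tied for most frequent, taking the first entry is an arbitrary choice, so it is worth checking how many values `mode()` returns.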
These are not the only techniques for filling NaNs; there are several other imputation methods. If you are starting out, excellent hands-on training material is the Titanic data set, which contains a number of NaNs in the 'Age' feature. You can try your hand at finding the best way to impute the missing data there. For some of the more advanced strategies (specific to the Titanic scenario), see "Missing Ages on the Titanic - Few perspectives from basic to the advanced": https://www.kaggle.com/c/titanic/discussion/157929
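One possible step beyond a single global value is to impute within groups. The sketch below is only an assumption about how that could look, not a method taken from the linked discussion: it fills 'Age' with the median age of each passenger class. The `Pclass` grouping column and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented, not real Titanic records.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age per class, broadcast back to the shape of the original column.
group_median = df.groupby("Pclass")["Age"].transform("median")

# Fill each missing age with the median of its own class.
df["Age_filled"] = df["Age"].fillna(group_median)
print(df)
```

Whether a group-wise fill is actually better than a global median depends on the data, so it is worth comparing the options with a validation score.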
Answered by Allohvk on March 22, 2021