TransWikia.com

What is the best solution to replace NaN values?

Data Science Asked by user90379 on May 25, 2021

I’m thinking about fitting a normal distribution to a specific column that has missing values and replacing those values with random draws from that distribution, using NumPy’s normal random generator. Replacing them with zeros or the mode doesn’t really make sense sometimes… When is it relevant to do so?

One Answer

You are right that replacing missing values with a simple mean or mode is a common but, in many cases, unreliable imputation strategy. scikit-learn has utilities for imputing missing values (have a look at https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute), for instance the KNN imputer (KNNImputer) as an additional strategy.
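As a rough illustration of the difference between the two strategies, here is a minimal sketch that imputes the same toy array with SimpleImputer (mean) and with KNNImputer. The array and the choice of n_neighbors=2 are made up for the example and are not from the question.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data: the second column is roughly twice the first,
# and one value in the second column is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Mean imputation ignores the other column entirely.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation averages the value over the rows that are closest
# in the observed features (here, the first column).
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print("mean imputation:", mean_imputed[1, 1])
print("KNN imputation: ", knn_imputed[1, 1])
```

In this toy case the KNN estimate follows the relationship between the two columns more closely than the column mean does, which is the kind of situation where it tends to help.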

Keep in mind that you cannot assume your feature of interest follows a normal distribution. Instead, you can fit a kernel density estimator to model its actual distribution and draw the replacement values from that; see http://scikit-learn.org/stable/modules/density.html
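One possible way to do this, sketched under some assumptions: fit sklearn.neighbors.KernelDensity on the observed values of the column and fill the gaps with samples from it. The bimodal data, the 10% missing rate, and the bandwidth of 0.5 are all invented for illustration; in practice the bandwidth would need tuning (for example by cross-validation).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical bimodal column: a single normal distribution would fit it poorly.
values = np.concatenate([rng.normal(0, 1, 200), rng.normal(8, 1, 200)])

# Knock out 40 entries at random to simulate missing data.
values[rng.choice(values.size, 40, replace=False)] = np.nan

missing = np.isnan(values)
observed = values[~missing].reshape(-1, 1)  # KernelDensity expects a 2D array

# Fit the density on observed values only (bandwidth is an assumed value).
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(observed)

# Draw one sample per missing entry and write it back into a copy.
draws = kde.sample(n_samples=int(missing.sum()), random_state=0)
imputed = values.copy()
imputed[missing] = draws.ravel()

print("remaining NaNs:", int(np.isnan(imputed).sum()))
```

Note that this preserves the marginal distribution of the column but ignores any relationship with the other features, so the drawn values are plausible on their own rather than informed by the rest of the row.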

Answered by German C M on May 25, 2021
