How to impute Missing values not the usual way?

Question

I have a dataset of 4712 records and am working on a binary classification problem. Label 1 makes up 33% of the records and label 0 makes up 67%. I can't drop records because my sample is already small, but a few columns have around 250-350 missing values each.

How do I know whether the data is missing at random, missing completely at random, or missing not at random? For example, 4400 patients have the reading and 330 patients don't, but we would expect these 330 to have it because it is a very common measurement. What is this called?

In addition, for my dataset it doesn't make sense to simply fill the missing values with the mean or median. I have been reading about methods such as multiple imputation and maximum likelihood.

Are there any other algorithms that fill in missing values in a robust way?

Are there any Python packages for this?

Can someone help me with this?

vienna_kaggling · Accepted Answer

To decide which strategy is appropriate, it is important to investigate the mechanism that led to the missing values to find out whether the missing data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).

MCAR means that there is no relationship between the missingness of the data and any of the values.
MAR means that there is a systematic relationship between the propensity of a value to be missing and the observed data, but not the missing data itself.
MNAR means that there is a systematic relationship between the propensity of a value to be missing and the unobserved value itself.

Given what you have told us, it is likely that the data is MCAR. (This assumes you have already looked for such a relationship yourself, either through domain knowledge or by building a model that relates the missingness in those columns to the other features, and found none.)
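The "build a model" check mentioned above can be sketched roughly as follows. This is only an illustration on synthetic data: the column names (age, bmi, reading), the helper missingness_auc, and the choice of logistic regression are assumptions, not something from the question. The idea is to predict the missing indicator from the fully observed features; a cross-validated AUC close to 0.5 is consistent with MCAR, while a clearly higher AUC points towards MAR.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data: 'age' and 'bmi' are fully observed, 'reading' is not.
df = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "bmi": rng.normal(25, 4, n),
    "reading": rng.normal(100, 15, n),
})

# Simulate a MAR mechanism: older patients are more likely to lack a reading.
p_missing = 1 / (1 + np.exp(-(df["age"] - 65) / 5))
df.loc[rng.random(n) < p_missing, "reading"] = np.nan


def missingness_auc(frame, target, predictors):
    """Cross-validated ROC AUC for predicting whether `target` is missing."""
    is_missing = frame[target].isna().astype(int)
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(
        model, frame[predictors], is_missing, cv=5, scoring="roc_auc"
    )
    return scores.mean()


auc = missingness_auc(df, "reading", ["age", "bmi"])
print(f"AUC for predicting missingness: {auc:.2f}")
```

Keep in mind that this check cannot rule out MNAR, because there the missingness depends on values you never observed; that part still needs domain knowledge.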

As for other imputation techniques, I would suggest looking at KNN imputation (in my experience it always gives solid results), but you should try several methods.

The fancyimpute package supports this kind of imputation through the following API:

from fancyimpute import KNN

# Use 10 nearest rows which have a feature to fill in each row's missing features
X_fill_knn = KNN(k=10).fit_transform(X)
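If fancyimpute is not installed, a similar result can be obtained with scikit-learn's KNNImputer. The sketch below is an alternative to the call above, not the fancyimpute implementation, and the synthetic matrix X is made up purely for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic feature matrix; the last column is correlated with the first.
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)

# Knock out roughly 10% of the entries at random.
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# Fill each missing entry from the 10 nearest rows that have that feature.
imputer = KNNImputer(n_neighbors=10)
X_fill_knn = imputer.fit_transform(X_missing)
```

Since KNN works on distances, it usually helps to scale the features before imputing when they are on very different scales.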

Here are different methods also supported by this package:

• SimpleFill: Replaces missing entries with the mean or median of each
  column.

• KNN: Nearest neighbor imputation which weights samples using the
  mean squared difference on features for which two rows both have
  observed data.

• SoftImpute: Matrix completion by iterative soft thresholding of SVD
  decompositions. Inspired by the softImpute package for R, which is
  based on "Spectral Regularization Algorithms for Learning Large
  Incomplete Matrices" by Mazumder et al.

• IterativeSVD: Matrix completion by iterative low-rank SVD
  decomposition. Should be similar to SVDimpute from "Missing value
  estimation methods for DNA microarrays" by Troyanskaya et al.

• MICE: Reimplementation of Multiple Imputation by Chained Equations.

• MatrixFactorization: Direct factorization of the incomplete matrix
  into low-rank U and V, with an L1 sparsity penalty on the elements of
  U and an L2 penalty on the elements of V. Solved by gradient descent.

• NuclearNormMinimization: Simple implementation of "Exact Matrix
  Completion via Convex Optimization" by Emmanuel Candes and Benjamin
  Recht using cvxpy. Too slow for large matrices.

• BiScaler: Iterative estimation of row/column means and standard
  deviations to get a doubly normalized matrix. Not guaranteed to
  converge but works well in practice. Taken from "Matrix Completion
  and Low-Rank SVD via Fast Alternating Least Squares".

EDIT: MICE was deprecated in fancyimpute and moved to scikit-learn, where it is available as IterativeImputer.
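For reference, a minimal sketch of the scikit-learn version. IterativeImputer is still marked experimental, so it has to be enabled with an explicit import first. The synthetic data and the parameter values here are illustrative assumptions only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: the third column is almost a linear function of the others.
X = rng.normal(size=(300, 3))
X[:, 2] = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300)

# Remove about 20% of the third column.
mask = rng.random(300) < 0.2
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Each feature with missing values is regressed on the other features,
# cycling through the features for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

mean_abs_error = np.abs(X_filled[mask, 2] - X[mask, 2]).mean()
```

By default this returns a single imputed dataset. To get something closer to multiple imputation, you can set sample_posterior=True and run it several times with different random_state values.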

FrancoSwiss · Answer

A trick I have seen on Kaggle.

Step 1: Replace NaN with the mean or the median: the mean if the data is normally distributed, otherwise the median.

In my case, I have NaNs in Age.

Step 2: Add a new column "NAN_Age": 1 for NaN, 0 otherwise. If there's a pattern in the NaNs, you help the algorithm catch it. A nice bonus is that this strategy doesn't care whether the data is MAR or MNAR (see above).
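In pandas the two steps could look like the sketch below. The toy Age values are made up for illustration. Note that the indicator column has to be created before the NaNs are overwritten, so the steps run in reverse order in code.

```python
import numpy as np
import pandas as pd

# Toy data with missing ages (values invented for this example).
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 58.0, np.nan, 41.0]})

# Step 2 (done first in code): flag the rows where Age was missing.
df["NAN_Age"] = df["Age"].isna().astype(int)

# Step 1: fill the missing ages with the median of the observed ages.
age_median = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median)

print(df)
```

With a train/test split, compute the median on the training set only and reuse it for the test set.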

vivek · Answer

scikit-learn itself has some good ready-to-use tools for imputation; see the scikit-learn documentation for details.
As far as I know, MICE is not available in scikit-learn. Please check statsmodels for MICE:
statsmodels.imputation.mice.MICEData

BigDataScientist · Answer

A small remark to the often suggested mean/median imputation.

Applying this method assumes that your analysis depends only on the first moment of your variable's distribution.

Just imagine imputing all values of your variable with the mean/median. The mean/median would probably have very low bias, but the variance would go (close to) zero, and skewness and kurtosis would also be significantly biased.

A way around this is to add a random value x to each imputed value, with E(x) = 0 and E(x^2) > 0.
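A small numerical sketch of the effect, on synthetic data. Drawing the noise from a normal distribution with the observed standard deviation is my own assumption; the only requirements stated above are E(x) = 0 and E(x^2) > 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic variable with about 30% of the values removed at random.
x = rng.normal(loc=50, scale=10, size=5000)
x_missing = x.copy()
x_missing[rng.random(x.size) < 0.3] = np.nan

is_nan = np.isnan(x_missing)
observed = x_missing[~is_nan]

# Plain mean imputation: every gap gets the same value.
x_mean = np.where(is_nan, observed.mean(), x_missing)

# Mean imputation plus zero-mean noise scaled to the observed spread.
noise = rng.normal(0.0, observed.std(ddof=1), size=int(is_nan.sum()))
x_noisy = x_missing.copy()
x_noisy[is_nan] = observed.mean() + noise

print("std observed:     ", observed.std(ddof=1))
print("std mean-imputed: ", x_mean.std(ddof=1))
print("std noisy-imputed:", x_noisy.std(ddof=1))
```

The mean-imputed column comes out with a visibly smaller standard deviation than the observed data, while the noisy version stays close to it.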
