Data Science Asked on July 26, 2021
I have a dataset of 4712 records and I am working on binary classification. Label 1 makes up 33% of the data and label 0 makes up 67%. I can't drop records because my sample is already small, but a few columns have around 250-350 missing records each.
How do I know whether the data is missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR)? For example: 4400 patients have the readings and 330 patients don't, but we expect those 330 to have the readings because it is a very routine measurement. So what is this called?
In addition, for my dataset it doesn't make sense to use the mean or median straight away to fill in the missing values. I have been reading about algorithms like multiple imputation and maximum likelihood estimation.
Are there any other algorithms that are good at filling in missing values in a robust way? Are there any Python packages for this?
Can someone help me with this?
To decide which strategy is appropriate, it is important to investigate the mechanism that led to the missing values to find out whether the missing data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
Given what you have told us, it is likely MCAR (the assumption being that you have already tried to find this propensity yourself through domain knowledge, or built a model between the missingness and the other features, and failed in doing so). A rough check is sketched below.
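One way to run that check (a minimal sketch, not your actual setup: the file path "patients.csv" and the column name "reading" are placeholders) is to try to predict the missingness indicator from the observed features. An AUC near 0.5 is consistent with MCAR; a clearly higher AUC is evidence that the missingness depends on observed data (MAR).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("patients.csv")                 # placeholder dataset
is_missing = df["reading"].isna().astype(int)    # 1 where the value is missing

# Try to predict missingness from the other (numeric) features
X = df.drop(columns=["reading"]).select_dtypes("number")
X = X.fillna(X.median())

auc = cross_val_score(LogisticRegression(max_iter=1000), X, is_missing,
                      cv=5, scoring="roc_auc").mean()
print(auc)  # ~0.5: no detectable relation (MCAR-like); well above 0.5: evidence for MAR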
As for other techniques to impute the data, I would suggest looking at KNN imputation (in my experience it gives consistently solid results), but you should try different methods.
fancyimpute supports this kind of imputation, using the following API:
from fancyimpute import KNN
# X is the feature matrix (e.g. a NumPy array) with np.nan marking missing entries
# Use 10 nearest rows which have a feature to fill in each row's missing features
X_fill_knn = KNN(k=10).fit_transform(X)
Here are different methods also supported by this package:
• SimpleFill: Replaces missing entries with the mean or median of each column.
• KNN: Nearest neighbor imputation which weights samples using the mean squared difference on features for which two rows both have observed data.
• SoftImpute: Matrix completion by iterative soft thresholding of SVD decompositions. Inspired by the softImpute package for R, which is based on Spectral Regularization Algorithms for Learning Large Incomplete Matrices by Mazumder et al.
• IterativeSVD: Matrix completion by iterative low-rank SVD decomposition. Should be similar to SVDimpute from Missing value estimation methods for DNA microarrays by Troyanskaya et al.
• MICE: Reimplementation of Multiple Imputation by Chained Equations.
• MatrixFactorization: Direct factorization of the incomplete matrix into low-rank U and V, with an L1 sparsity penalty on the elements of U and an L2 penalty on the elements of V. Solved by gradient descent.
• NuclearNormMinimization: Simple implementation of Exact Matrix Completion via Convex Optimization by Emmanuel Candes and Benjamin Recht using cvxpy. Too slow for large matrices.
• BiScaler: Iterative estimation of row/column means and standard deviations to get a doubly normalized matrix. Not guaranteed to converge, but works well in practice. Taken from Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.
EDIT: MICE was deprecated in fancyimpute and the functionality was moved to scikit-learn as IterativeImputer.
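For completeness, a minimal sketch of that scikit-learn replacement (the toy matrix below is only for illustration; substitute your own feature matrix):
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrix with missing entries
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 5.0, 9.0],
              [4.0, 8.0, 12.0]])

# Each column with missing values is modelled from the others and the
# imputations are refined round-robin (MICE-style, single imputation)
X_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
print(X_filled)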
Correct answer by vienna_kaggling on July 26, 2021
A trick I have seen on Kaggle.
Step 1: Replace NaN with the mean or the median: the mean if the data is normally distributed, otherwise the median.
In my case, I have NaNs in Age.
Step 2: Add a new column "NAN_Age": 1 where Age is NaN, 0 otherwise. If there is a pattern in the missingness, you help the algorithm catch it. A nice bonus is that this strategy doesn't care whether the data is MAR or MNAR (see above). A sketch of both steps follows below.
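A minimal pandas sketch of the trick (the toy Age values are only for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22, np.nan, 35, np.nan, 41]})

# Step 2: keep the missingness pattern as its own feature
df["NAN_Age"] = df["Age"].isna().astype(int)

# Step 1: fill the gaps (median here, since Age is usually skewed)
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df)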
Answered by FrancoSwiss on July 26, 2021
scikit-learn itself has some good, ready-to-use classes for imputation; see the sklearn.impute documentation for details.
MICE is not available in scikit-learn as far as I know. Please check statsmodels for MICE: statsmodels.imputation.mice.MICEData.
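A minimal sketch of both options (toy numeric data for illustration; the statsmodels part assumes an all-numeric DataFrame):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from statsmodels.imputation.mice import MICEData

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
df.loc[df.sample(frac=0.1, random_state=0).index, "b"] = np.nan  # inject missing values

# scikit-learn: simple column-wise imputation (mean/median/most_frequent)
filled = SimpleImputer(strategy="median").fit_transform(df)

# statsmodels: chained-equations imputation; .data holds the imputed frame
imp = MICEData(df)
imp.update_all()          # one full round of imputations
print(filled[:3])
print(imp.data.head())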
Answered by vivek on July 26, 2021
A small remark on the often-suggested mean/median imputation.
Applying this method assumes that your analysis depends only on the first moment of your variable's distribution.
Just imagine imputing all missing values of your variable with the mean/median. The mean/median itself would probably have very low bias, but the variance would shrink to (close to) zero, and skewness/kurtosis would also be biased significantly.
A way around this is to add a random value x to each imputed value, with E(x) = 0 and E(x^2) > 0; a sketch of this follows below.
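A minimal sketch of such stochastic imputation (hypothetical series; the noise is scaled to the observed standard deviation):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series([4.2, np.nan, 5.1, 6.3, np.nan, 5.8, 4.9])

observed = s.dropna()
mean, std = observed.mean(), observed.std()

# Mean plus zero-mean noise: E(x) = 0 and E(x^2) > 0, so the imputed
# column keeps (roughly) the spread of the observed values
s_imputed = s.copy()
s_imputed[s.isna()] = mean + rng.normal(0.0, std, size=s.isna().sum())
print(s_imputed)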
Answered by BigDataScientist on July 26, 2021