Dealing with missing data

Question

I have a question about data cleaning. I am a novice and have just started learning in this field, so please pardon my ignorance. Suppose there are two columns, and based on some samples taken from both columns you find the correlation coefficient to be high. For the values that are missing, can you use linear regression to predict them, using the values you do know as training data?

Donald S · Accepted Answer

Hi Soumyadeep, and welcome to Data Science Stack Exchange.
What you are describing is called regression imputation, and it is a valid method to use on missing data. However, if the data is sparse (many missing values), there are fewer complete rows to fit the regression on, so the imputed values become less reliable.
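As a rough illustration of regression imputation for the two-column case, here is a minimal sketch in plain Python. The function names (`fit_line`, `regression_impute`) and the sample values are made up for this example, and it assumes missing entries are marked with `None` and occur only in the second column.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; cannot fit a line")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def regression_impute(xs, ys):
    """Fill None entries in ys with predictions from the rows where y is known."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    if len(known) < 2:
        raise ValueError("need at least two complete rows to fit a line")
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    # Keep observed values as they are; predict only the missing ones.
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]

# Hypothetical data: column x is complete, column y has two gaps.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.1, None, 12.0]
print(regression_impute(x, y))
```

Note that every imputed value lies exactly on the fitted line, so the filled-in column will look less noisy than the real data would.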
In general, missing data can be handled in several ways (row deletion, imputation, substitution, etc.). Regression imputation can be used if you have little or no knowledge about the data, but another method is usually a better choice. If you have domain knowledge about the missing values, for example a good idea of what a value should be, you can usually use that knowledge to fill them in. Try a few different methods and see which one works best.
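For comparison, two of the simpler options mentioned above, row deletion and substitution with the column mean, could be sketched as follows. The helper names and the toy rows are hypothetical, and `None` again stands in for a missing value.

```python
from statistics import mean

def drop_incomplete(rows):
    """Row deletion: keep only rows that have no missing value."""
    return [row for row in rows if None not in row]

def mean_impute(rows, col):
    """Substitution: replace missing values in one column with that column's mean."""
    observed = [row[col] for row in rows if row[col] is not None]
    fill = mean(observed)
    return [
        tuple(fill if (i == col and value is None) else value
              for i, value in enumerate(row))
        for row in rows
    ]

# Hypothetical two-column data with gaps in the second column.
rows = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (4.0, None)]
print(drop_incomplete(rows))
print(mean_impute(rows, col=1))
```

Row deletion throws away half of this toy data set, while mean substitution keeps every row but ignores the relationship between the two columns. That trade-off is one reason to try more than one method.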

A person pointed out that I should check for multicollinearity if the
features are both independent. Does it basically mean that one feature
is falling in the span of the other feature?

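Definition of multicollinearity:
There exist one or more exact linear relationships among some of the variables. With only two features, this means one is a linear function of the other (a scaled copy plus a constant), which matches the "span" picture you describe. In practice, the term is also used when the relationship is strong but not exact.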
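For the two-column case in the question, one rough check is the Pearson correlation: an absolute value of 1 corresponds to an exact linear relationship between the two features. The sketch below is only illustrative; the function names and the 0.95 threshold are assumptions for this example, not a standard cutoff. With more than two features, pairwise correlation alone is not enough, because a linear relationship can involve several variables at once.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def is_nearly_collinear(xs, ys, threshold=0.95):
    """Flag two features whose linear relationship is (almost) exact.

    The threshold is an arbitrary illustrative choice.
    """
    return abs(pearson_r(xs, ys)) >= threshold

a = [1.0, 2.0, 3.0, 4.0]
b = [3.0, 5.0, 7.0, 9.0]    # b = 2 * a + 1, an exact linear relationship
c = [2.0, -1.0, 1.0, -2.0]  # only loosely related to a

print(is_nearly_collinear(a, b))
print(is_nearly_collinear(a, c))
```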

References:
https://en.wikipedia.org/wiki/Multicollinearity
https://stats.stackexchange.com/questions/234870/is-multicollinearity-the-issue-here
