Data Science Asked on December 4, 2021
I have a question about data cleaning. I am a novice and have just started learning in this field so please pardon my ignorance. Suppose there are two columns and based on some samples taken from both the columns you find the correlation coefficient to be high. Now for the values that aren’t there, can you use linear regression to predict or find them out, by using the values you know as training data?
Hi Soumyadeep, and welcome to Data Science Stack Exchange.
What you are describing is called regression imputation, and it is a valid method for filling in missing data. However, if the data is sparse (lots of missing values), the regression has few complete rows to learn from, so the imputed values become less reliable.
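As a minimal sketch of regression imputation for the two-column case in the question, the snippet below fits a least-squares line on the rows where both values are known and uses it to fill the gaps. The data and the names `column_a`, `column_b`, `fit_line`, and `regression_impute` are made up for illustration, and it assumes only the second column has missing values.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def regression_impute(xs, ys):
    """Fill None entries of ys using a line fitted on the complete rows."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    return [y if y is not None else slope * x + intercept
            for x, y in zip(xs, ys)]

# Hypothetical toy data: column_b is roughly 2 * column_a, with two gaps.
column_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
column_b = [2.1, 3.9, None, 8.2, None, 11.8]

filled = regression_impute(column_a, column_b)
print(filled)
```

Known values are left untouched; only the `None` entries are replaced by predictions from the fitted line. Note that every imputed point lies exactly on that line, so the filled column will look more strongly correlated than the real data probably is.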
In general, missing data can be handled in several ways (row deletion, imputation, substitution, etc.). Regression imputation is an option when you have little or no knowledge about the data, but another method often works better. If you have domain knowledge about the missing values, for example a good idea of what a value should be, you can usually use that knowledge to fill them in. Try a few different methods and see which one works best.
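One way to "see which one works best" is to hide values you do know, impute them, and measure how far off the imputation is. The sketch below shows row deletion, mean substitution, and such a hold-out score on made-up data. The `rows` data and the `holdout_error` helper are hypothetical names chosen for this illustration, not part of any library.

```python
from statistics import mean

# Hypothetical toy rows of (feature_a, feature_b); None marks a missing value.
rows = [(1.0, 2.1), (2.0, None), (3.0, 6.2), (4.0, 7.9), (5.0, None)]

# Row deletion: keep only rows with no missing entries.
complete_rows = [r for r in rows if None not in r]

# Mean substitution: replace each gap in feature_b with the observed mean.
b_mean = mean(b for _, b in complete_rows)
mean_filled = [(a, b if b is not None else b_mean) for a, b in rows]

def holdout_error(impute_one, known_rows):
    """Mean absolute error when each known value is hidden and re-imputed."""
    errors = []
    for i, (a, b) in enumerate(known_rows):
        others = known_rows[:i] + known_rows[i + 1:]
        errors.append(abs(impute_one(a, others) - b))
    return mean(errors)

# Score mean substitution: it ignores feature_a and predicts the mean of the rest.
mean_error = holdout_error(lambda a, others: mean(b for _, b in others),
                           complete_rows)
print(mean_filled, mean_error)
```

Any other imputer that takes `(a, others)` and returns a guess for `feature_b` can be passed to `holdout_error` in the same way, which gives a like-for-like comparison between methods.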
Someone pointed out that I should check for multicollinearity if both features are independent variables. Does that basically mean that one feature lies in the span of the other feature?
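Definition of multicollinearity: there exist one or more exact linear relationships among some of the variables. With only two features, that means one is an exact linear function of the other, so yes, it lies in the span of the other feature together with a constant term.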
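For two features, a quick check is the Pearson correlation: an exact linear relationship gives |r| = 1. The sketch below also computes the variance inflation factor, a common diagnostic that the original answer does not mention. With exactly two predictors it reduces to 1 / (1 - r^2). The data and function names are made up for illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """VIF of either predictor when the model has exactly two: 1 / (1 - r^2)."""
    r2 = pearson_r(xs, ys) ** 2
    # Treat r^2 within rounding error of 1 as an exact linear relationship.
    return float("inf") if 1.0 - r2 < 1e-12 else 1.0 / (1.0 - r2)

# Hypothetical features.
feature_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
exact_copy = [2.0 * x + 1.0 for x in feature_1]  # exact linear function of feature_1
noisy = [2.0, 1.0, 4.0, 3.0, 5.0]                # related, but not exactly

print(pearson_r(feature_1, exact_copy), vif_two_predictors(feature_1, exact_copy))
print(pearson_r(feature_1, noisy), vif_two_predictors(feature_1, noisy))
```

An infinite (or very large) VIF signals that one predictor carries almost no information beyond the other. Which finite value counts as "too large" is a rule of thumb that varies by source.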
References: https://en.wikipedia.org/wiki/Multicollinearity https://stats.stackexchange.com/questions/234870/is-multicollinearity-the-issue-here
Answered by Donald S on December 4, 2021