Cross Validated question, asked by Amnon on February 19, 2021
It is common practice in data analysis to remove features (independent variables) with low variance as a form of dimensionality reduction, the justification being that a feature with low variance cannot explain much of the variance in the response (dependent) variable.
However, I don’t exactly understand this reasoning.
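For concreteness, the kind of filtering I am referring to looks roughly like the sketch below (the data frame X, its columns, and the 0.01 cutoff are made up purely for illustration):

# hypothetical data with one low-variance column
X <- data.frame(a = rnorm(100),             # variance around 1
                b = rnorm(100, sd = 0.001), # variance around 1e-06
                c = rnorm(100, sd = 10))    # variance around 100
feature_variances <- sapply(X, var)         # per-column sample variances
threshold <- 0.01                           # arbitrary cutoff
X_reduced <- X[, feature_variances > threshold, drop = FALSE]  # column b is dropped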
Here is a counterexample (in R syntax):
> independent_variable <- c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05)
> dependent_variable <- c(1, 2, 3, 4, 5, 6)
> cor(independent_variable, dependent_variable)
[1] 1             # Pearson's correlation = 1
> var(independent_variable)
[1] 0.00035
> var(dependent_variable)
[1] 3.5           # variance of the independent variable is much lower than that of the dependent variable
> var(independent_variable/mean(independent_variable))
[1] 3.499998e-14  # very low variance even after scaling to mean 1
> var(dependent_variable/mean(dependent_variable))
[1] 0.2857143     # variance of the dependent variable scaled to mean 1
What I am trying to demonstrate with this example is a case where the correlation between the dependent and independent variables is 1, i.e. the independent variable explains 100% of the variance of the dependent variable. Yet, both on the original scale and after scaling each variable to mean 1, the variance of the independent variable is much lower than that of the other variable (here, the dependent variable), so under the reasoning above it would have been removed.
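To make the point explicit, here is a short continuation of the example above (the factor of 1000 and the lm() check are just for illustration): rescaling the independent variable changes its variance by an arbitrary amount, while its correlation with the dependent variable, and hence the R² of a simple linear regression, stays at 1:

> scaled_iv <- independent_variable * 1000
> var(scaled_iv)                      # 350: no longer a "low" variance
> cor(scaled_iv, dependent_variable)  # still 1
> summary(lm(dependent_variable ~ independent_variable))$r.squared  # 1 (up to floating point)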
What am I missing here?