Cross Validated Asked by kellyyang on December 18, 2021
I have a Box-Cox regression where the explanatory variables are almost all dummy variables. If I want to see if there is multicollinearity among them, what would be an appropriate test? Do variance inflation factor (VIF) tests work here? What about the Pearson correlation coefficient matrix?
The VIF is probably the best way to go here. The Pearson correlation is a poor measure in this case because it behaves somewhat erratically for 0/1 dummy variables like these. Another possibility is to use a matrix of a different measure such as cosine similarity: $\sum x_i x_j / \sqrt{\sum x_i^2 \cdot \sum x_j^2}$, with the sums taken over observations. That is not the same thing as Spearman's rho or Kendall's tau; it is Pearson's correlation computed without first subtracting the means.
I'd stick to the VIF though, because it tells you, for each variable, whether it is highly collinear with the other variables combined. But if you want a visual diagnostic of which pairs of variables are similar, those other metrics are better than Pearson for categorical data.
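As a rough sketch of the pairwise diagnostic, here is one way to build a cosine similarity matrix in base R. The dummy variables `d1`, `d2`, `d3` are made up for illustration and are not from the question; the Pearson matrix is printed alongside only for comparison.

```r
# Hypothetical 0/1 dummy variables (illustration only, not data from the question).
X <- cbind(
  d1 = c(0, 1, 1, 1, 0, 1, 0, 1, 0),
  d2 = c(1, 1, 0, 1, 1, 0, 1, 1, 0),
  d3 = c(0, 1, 1, 1, 0, 1, 0, 0, 0)
)

# Pairwise cosine similarity: crossprod(X) holds sum(x_i * x_j) for every
# pair of columns, and the column norms supply the denominator.
norms <- sqrt(colSums(X^2))
cos_sim <- crossprod(X) / outer(norms, norms)

round(cos_sim, 3)   # cosine similarity matrix (diagonal is 1)
round(cor(X), 3)    # Pearson matrix, for comparison
```

For 0/1 data every entry of `cos_sim` lies between 0 and 1, so it cannot flip sign the way a Pearson entry can.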
----EDIT---
Sure. This has to do primarily with the fact that Pearson's correlation can swing up or down or go negative very easily. Here's an example:
> cor(c(0,1,1,1,0,1,0,1,0),c(1,1,0,1,1,0,1,1,0))
[1] -0.1581139
> cor(c(0,1,1,1,0,1,0,1,0),c(0,1,0,1,1,0,1,1,0))
[1] 0.1
Here, by changing just one entry to zero, we have swung the correlation from negative to positive. The VIF, by contrast, uses $1/(1-R_{i}^2)$, where $R_{i}^2$ comes from the regression of the variable in question on all the other variables. I would have to work it out, but I think that is basically a linear combination of something similar to the cosine measure I posted above, or a transform of it. Essentially, though, it can't go negative.
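To make the $1/(1-R_{i}^2)$ formula concrete, here is a minimal base-R sketch that computes the VIF by hand. The data frame `dat` and the helper `vif_by_hand` are hypothetical names introduced for this example; nothing here comes from the original question's data.

```r
# Hypothetical dummy variables (illustration only).
dat <- data.frame(
  d1 = c(0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0),
  d2 = c(1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0),
  d3 = c(0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1)
)

# VIF for one column: regress it on all the other columns,
# then apply 1 / (1 - R^2).
vif_by_hand <- function(data, var) {
  others <- setdiff(names(data), var)
  fit <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(dat), function(v) vif_by_hand(dat, v))
vifs   # each value is at least 1, because R^2 lies between 0 and 1
```

In practice you would more likely call a packaged function such as `car::vif` on the fitted model; the hand-rolled version is only meant to show where the number comes from.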
I don't know any literature on it off the top of my head, but I will think about it.
Answered by Mike Nute on December 18, 2021