TransWikia.com

Pearson correlation coefficient - is correlation estimator acceptable?

Data Science Asked by I.D.M on March 5, 2021

As far as I know, in theory the Pearson correlation is used to check the correlation between two variables that are both continuous or both discrete. For the mixed case (one continuous, one discrete) it is not so easy to compute the correlation coefficient. On the other hand, the sample Pearson correlation estimator can be calculated for the mixed case without any problem. Does the Pearson correlation coefficient give deceptive results in this case?

One Answer

If the discrete variable takes many distinct values, then it is almost the same as a continuous variable. Continuous variables are technically discrete anyway, because of the way numbers are represented in computers (a Python float is a 64-bit value).
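A quick way to check this is to discretize a continuous variable onto grids of different resolution and compare the coefficients. This is only a sketch on simulated data: the seed, the sample size, the slope 0.7 and the helper name `pearson` are assumptions made for illustration, not something taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility
n = 10_000

# Simulated continuous pair with a noisy linear relationship (assumed for illustration)
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(scale=0.5, size=n)


def pearson(a, b):
    """Sample Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_cont = pearson(x, y)                # x kept continuous
r_fine = pearson(np.round(x, 1), y)   # x rounded to one decimal: dozens of distinct levels
r_coarse = pearson(np.round(x), y)    # x rounded to integers: only a handful of levels

print(f"continuous: {r_cont:.3f}, fine grid: {r_fine:.3f}, coarse grid: {r_coarse:.3f}")
```

With many levels the coefficient should be practically unchanged, and even the coarse grid should move it only slightly.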

The worst case is a binary variable, but in my experience the Pearson coefficient works well with binary and continuous data together. I know that asymptotic distributions could lead to biased estimators in linear regression. But here I treat the Pearson coefficient as a quantity calculated from the data you have rather than as an estimate of something else, so I cannot say whether it is biased or unbiased. What I do know is that the closer the relationship is to linear, the closer the coefficient gets to 1 in absolute value, and if the relationship is far from linear you can get a small number even when the variables are clearly dependent.
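The linear versus non-linear behaviour can be seen on simulated data. A minimal sketch, assuming a uniform `x`, a small Gaussian noise term and a parabola as the non-linear example (all of these are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed
x = rng.uniform(-1.0, 1.0, size=5_000)
noise = rng.normal(scale=0.1, size=x.size)

# Linear relationship: the coefficient should be close to 1
r_linear = np.corrcoef(x, 2.0 * x + noise)[0, 1]

# Symmetric parabola: y depends strongly on x, yet the coefficient should be near 0
r_parabola = np.corrcoef(x, x**2 + noise)[0, 1]

print(f"linear: {r_linear:.3f}, parabola: {r_parabola:.3f}")
```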

For continuous data together with binary data, this means you need a clear threshold. If everything below the threshold corresponds to 0 and everything above corresponds to 1 for the binary variable, then the correlation is strong. If I remember correctly it will still not be 1, because one binary variable cannot explain 100% of the variance of a continuous variable that takes more than two values. But you can probably get a coefficient of about 0.8 (80%). It all depends on where the threshold is and how the variables are distributed. If the continuous variable forms two distinct clouds separated by a large distance, then you will get a coefficient close to 1 (100%). Similar logic can be used for discrete variables that have several values.
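Both situations can be simulated. For a standard normal variable cut at its median, the population value of the coefficient is sqrt(2/pi), roughly 0.80, which matches the figure of about 80% mentioned above. The sketch below assumes normal data, a threshold at 0, and clouds centred at -10 and +10 with unit spread; these numbers are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary fixed seed
n = 20_000

# Case 1: one normal cloud, binary variable defined by a threshold at the median (0)
x = rng.normal(size=n)
b = (x > 0).astype(float)
r_threshold = np.corrcoef(x, b)[0, 1]

# Case 2: two well-separated clouds, one for each value of the binary variable
labels = rng.integers(0, 2, size=n)               # 0 or 1, roughly balanced
centres = np.where(labels == 1, 10.0, -10.0)      # assumed cloud centres
x_clouds = rng.normal(loc=centres, scale=1.0)
r_clouds = np.corrcoef(x_clouds, labels)[0, 1]

print(f"threshold on one cloud: {r_threshold:.3f}")
print(f"two separated clouds:   {r_clouds:.3f}")
```

The first value should land near 0.8 and stay below 1, while the second should be very close to 1 because almost all of the variance of `x_clouds` lies between the two groups.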

Because of the way it is calculated, the Pearson coefficient is driven by the regions where you actually have points. If you have no points in a particular region, there is no way for the coefficient to reflect the relationship there. In practice this means that you usually have points in a limited range, and what you calculate is the linear correlation within that range.
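The dependence on the observed range can be illustrated with one relationship sampled over two different ranges. This is a sketch under assumptions: the sine curve, the noise level, the two ranges and the helper name `pearson_on_range` are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary fixed seed


def pearson_on_range(low, high, n=5_000):
    """Sample x uniformly on [low, high], set y = sin(x) + noise, return Pearson r."""
    x = rng.uniform(low, high, size=n)
    y = np.sin(x) + rng.normal(scale=0.05, size=n)
    return np.corrcoef(x, y)[0, 1]


# Narrow range: sin(x) is almost a straight line here, so r should be close to 1
r_narrow = pearson_on_range(-1.0, 1.0)

# Wide range: the same relationship oscillates, so the linear correlation is weak
r_wide = pearson_on_range(-3.0 * np.pi, 3.0 * np.pi)

print(f"narrow range: {r_narrow:.3f}, wide range: {r_wide:.3f}")
```

A high coefficient from the narrow sample says nothing about how the variables behave outside that range.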

Answered by keiv.fly on March 5, 2021
