# Which test should I use to compare 2 unrelated dichotomous variables?

Cross Validated Asked by Anna on August 10, 2020

I have two variables, both of which are categorical and binary/dichotomous, and I need to determine if there is any ‘correlation’ between them. Note that I am using this term carefully, as I know ‘correlation’ technically measures the change in one variable, which is not applicable for categorical, dichotomous variables. I simply want to know if there is a link between my two sets of data.

One variable is whether a gene is a ‘pseudogene’ or not (1 for pseudogene, and 0 for non-pseudogene), and the other is whether the gene is a ‘complement’ gene or not (1 for complement, and 0 for non-complement).

An example of the data is as follows, where each row is a single gene (imagine this but on a scale of about 500,000 rows):

pseudo    complement
0          1
0          0
1          1
0          1
0          1
1          0


Many of my extensive Google searches have told me that, for two categorical variables, the chi-square test is appropriate. I’ve tried using this but my results don’t seem to be very reliable – more research has told me that the context of my data is also not appropriate, as the test concerns comparing different populations, whilst my variables are unrelated. So chi-square is probably completely out of the ballpark.

Similarly, I see some suggestions that the phi coefficient test is designed for comparing 2 dichotomous variables – however it seems again that the context of the test is not appropriate for my data.

Which statistical test should I be using for testing any correlation/link between these two variables? (Bonus if you can tell me how to do this in R, but my main concern is just deciding which test is appropriate.)

Correct me if I'm misunderstanding, but it seems that your data can be exactly summarized in a $2 \times 2$ table of counts, and your notion of "correlation" is actually independence: whether $$\Pr(\text{pseudo} = a, \text{complement} = b) = \Pr(\text{pseudo} = a) \Pr(\text{complement} = b)$$ for all $a, b \in \{0, 1\}$.
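To make the reduction to a table concrete, here is a minimal Python sketch (standard library only) that tallies the six example rows from the question into a 2×2 table and computes the counts one would expect if the two variables were independent. The function names are my own, chosen for illustration; in R, `table(pseudo, complement)` would give the same tally.

```python
from collections import Counter

# The six example rows from the question, as (pseudo, complement) pairs.
rows = [(0, 1), (0, 0), (1, 1), (0, 1), (0, 1), (1, 0)]

def contingency_table(pairs):
    """Return table[a][b] = number of rows with pseudo == a and complement == b."""
    counts = Counter(pairs)
    return [[counts[(a, b)] for b in (0, 1)] for a in (0, 1)]

def expected_under_independence(table):
    """Expected counts under independence: row_total * column_total / n."""
    n = sum(map(sum, table))
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    return [[ri * cj / n for cj in col_totals] for ri in row_totals]

table = contingency_table(rows)
expected = expected_under_independence(table)
print("observed:", table)
print("expected:", expected)
```

The tests discussed next all measure, in one way or another, how far the observed table is from the expected one.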

This is the setting of Fisher's exact test, to which the chi-squared test is a good approximation. With 500,000 samples, as long as your probabilities aren't incredibly skewed, the two will probably give essentially the same result, though a G-test is supposedly a better approximation.
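As a rough illustration of how the three tests compare, the sketch below computes each of them on a 2×2 table using only the Python standard library. The counts are made up for the example and are not the asker's data, and the code assumes every row and column total is nonzero. The chi-squared statistic here has no continuity correction, whereas R's `chisq.test` applies Yates' correction to 2×2 tables by default (`correct = TRUE`); `fisher.test` is the R counterpart of the exact test.

```python
import math

def expected_counts(table):
    """Expected counts under independence (assumes nonzero margins)."""
    n = sum(map(sum, table))
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    return [[ri * cj / n for cj in col_totals] for ri in row_totals]

def chi2_statistic(table):
    """Pearson chi-squared statistic, without continuity correction."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

def g_statistic(table):
    """G statistic: 2 * sum(O * ln(O / E)); empty cells contribute zero."""
    exp = expected_counts(table)
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, exp)
                   for o, e in zip(obs_row, exp_row) if o > 0)

def p_value_1df(stat):
    """Upper tail of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2))

def fisher_exact_p(table):
    """Two-sided Fisher p-value: total probability of all tables with the
    same margins that are no more likely than the observed one."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    denom = math.comb(n, c1)

    def prob(k):  # hypergeometric probability that the top-left cell equals k
        return math.comb(r1, k) * math.comb(r2, c1 - k) / denom

    p_obs = prob(a)
    ks = range(max(0, c1 - r2), min(r1, c1) + 1)
    return min(1.0, sum(p for p in map(prob, ks) if p <= p_obs * (1 + 1e-7)))

# Hypothetical counts for illustration only: rows are pseudo = 0/1,
# columns are complement = 0/1.
example = [[30, 20], [15, 35]]
print("chi-squared p:", p_value_1df(chi2_statistic(example)))
print("G-test p:     ", p_value_1df(g_statistic(example)))
print("Fisher p:     ", fisher_exact_p(example))
```

Enumerating every table for the exact test is fine at this size but becomes slow with hundreds of thousands of rows, which is one practical reason to prefer the chi-squared or G approximation on large data.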

> more research has told me that the context of my data is also not appropriate, as the test concerns comparing different populations, whilst my variables are unrelated. So chi-square is probably completely out of the ballpark.

One can use the test to compare different populations, but nothing in the formulation of the test requires that; the test depends only on the contingency table, no matter how you arrive at it.

> I see some suggestions that the phi coefficient test is designed for comparing 2 dichotomous variables

I'm not sure what you mean by "the phi coefficient test," but the usual phi coefficient is, up to sign, the square root of the chi-squared test statistic divided by the sample size, $\phi = \sqrt{\chi^2 / n}$, so presumably a test based on it is the same as a chi-squared test.
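The link between the two quantities can be checked numerically. The sketch below, again using made-up counts and only the Python standard library, computes the phi coefficient from its usual cell-count formula and confirms that its square equals the uncorrected chi-squared statistic divided by the sample size. The function names are illustrative, and the code assumes all row and column totals are nonzero.

```python
import math

def phi_coefficient(table):
    """Phi for a 2x2 table [[a, b], [c, d]]:
    (a*d - b*c) / sqrt(product of the four marginal totals)."""
    (a, b), (c, d) = table
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def chi2_statistic(table):
    """Pearson chi-squared statistic from observed and expected counts,
    without continuity correction."""
    n = sum(map(sum, table))
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    return sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i in range(2) for j in range(2))

# Hypothetical counts for illustration only.
example = [[30, 20], [15, 35]]
n = sum(map(sum, example))
phi = phi_coefficient(example)
chi2 = chi2_statistic(example)
print("phi:", phi)
print("phi squared:", phi ** 2, "chi2 / n:", chi2 / n)
```

Unlike the chi-squared statistic, phi keeps a sign and does not grow with the sample size, so it is more useful as a measure of how strong the association is than as a test of whether one exists.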

Basically, chi-squared seems to be an appropriate test.

> I've tried using this but my results don't seem to be very reliable

What are you seeing that makes you think the chi-squared test isn't reliable on your data?

Answered by Dougal on August 10, 2020