How to Make Meaningful Conclusions here?

Question

I recently appeared for an Interview for my college and I was asked the following question. The Interviewer said that this question was a Data Science question.
The question-
Suppose 7.5% of the population has a certain Bone Disease. During COVID pandemic you go to a hospital and see the records. 25% of the COVID Infected patients also had the Bone Disease. Can we say for sure if the Bone Disease is a symptom of COVID-19?
My Response-
I said no, and explained that COVID-19 is not necessarily causing the bone disease: it could well be that the 7.5% of the country's population which already had the disease is more susceptible to the virus due to lowered immunity. Hence no such conclusion can be drawn from this data alone.
Then the interviewer asked me How can we be sure if it is a symptom or not?
I replied saying we can go to more Hospitals, collect more data and see if it correlates everywhere.
The Interviewer then said If we have the same results everywhere will you conclude it's a symptom?
I had no good answer, but I replied that correlation alone is not sufficient: we also need to check whether the people who have COVID-19 had the bone disease before getting infected, and see whether that percentage also correlates.
Here he stopped questioning; however, I couldn't judge whether I was right or wrong.
I am in Grade 12, so I have no experience in Data Science as such. I do know a fair bit of statistics, but I have never solved such questions. Can someone provide insights on how to solve such questions and make meaningful conclusions?

Benji Albert · Accepted Answer

It is very difficult (arguably impossible, if you want to get philosophical about it) to be absolutely, 100%, for sure about anything. For this reason, we talk in terms of probability/significance/confidence sets. A refresher on statistical hypothesis testing might help.
So to answer this type of question, people would usually fix a well-agreed-upon significance threshold for their problem: if the p-value falls below it, we reject the null hypothesis, and if it falls above it, we fail to reject the null hypothesis. The null hypothesis in this case is that the bone disease is not a symptom of COVID, and the alternative would be that it is a symptom.

Edit for demonstration as requested in the comments:
Firstly, these methods are purely for association analysis, not for proving whether bone disease is a symptom of COVID. Again, correlation $\ne$ causation!
Given that we are dealing with binary variables, you could use the Phi coefficient to measure the association of bone disease with COVID.
Consider this contingency table:
|----------|---------|---------|-----------|
|          | Bone =0 | Bone =1 | total     |
| COVID =0 |    A    |    B    | I=A+B     |
| COVID =1 |    C    |    D    | J=C+D     |
|----------|---------|---------|-----------|
|  total   |  K=A+C  |  L=B+D  | E=I+J=K+L |
|----------|---------|---------|-----------|

The same counts can be pictured as a Venn diagram of the COVID and bone disease groups (figure not included).

Then you can calculate
$ \phi=\frac{AD-BC}{\sqrt{IJKL}}=\frac{ED-JL}{\sqrt{JL(E-J)(E-L)}} $
The second form follows because $E-J=I$, $E-L=K$, and $ED-JL=AD-BC$.
This is related to the Chi-squared test: $ \phi= \sqrt{\frac{\chi^2}{n}} $, where $n=E$ is the total number of samples. So you can easily retrieve the p-value given that you know the degrees of freedom (in this case, it is just 1).
And you interpret it similarly to the Pearson correlation coefficient (both are due to the same statistician, Pearson).
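
As a rough sketch of the calculation above, the following Python computes $\phi$, the corresponding $\chi^2$ statistic, and its p-value for a 2x2 table. The cell counts are hypothetical and chosen only for illustration, and the function names are my own. For 1 degree of freedom the upper-tail p-value can be written with the complementary error function, so only the standard library is needed.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]] (rows: COVID 0/1, columns: Bone 0/1)."""
    i, j = a + b, c + d   # row totals
    k, l = a + c, b + d   # column totals
    return (a * d - b * c) / math.sqrt(i * j * k * l)

def chi2_p_value_df1(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical counts (not from the question), used only to show the steps.
a, b, c, d = 860, 50, 70, 20
n = a + b + c + d

phi = phi_coefficient(a, b, c, d)
chi2 = n * phi ** 2            # from phi = sqrt(chi2 / n)
p_value = chi2_p_value_df1(chi2)

print(f"phi = {phi:.3f}, chi2 = {chi2:.2f}, p = {p_value:.2g}")
```

Note that this uses the uncorrected $\chi^2$ statistic; library routines may apply a continuity correction by default and so report a slightly different p-value.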

So given that $7.5\%$ of the total has the bone disease, and $25\%$ of COVID patients have it, we can construct our contingency table in terms of $N$ (the number of samples, so $E=N$), where $Q$ is the fraction of people with COVID:
$ I=N(1-Q)= $ number of people without COVID
$ J=NQ= $ number of people with COVID
$ K=N(1-0.075)=$ number of people without bone disease
$ L=N(0.075)=$ number of people with bone disease
We know that 25% of people with COVID also have the bone disease, so $D=0.25\cdot J \Rightarrow$
$\phi=\frac{ED-JL}{\sqrt{IJKL}}=\frac{E(0.25\cdot J)-JL}{\sqrt{IJKL}}$
Finally, substituting $E=N$ and the totals above, we can calculate:
$\phi=\frac{0.25\cdot N^2Q-0.075\cdot N^2Q}{\sqrt{N^4\,Q(1-Q)(0.075)(1-0.075)}}=\frac{0.175\,Q}{\sqrt{Q(1-Q)(0.075)(0.925)}}\approx 0.664\sqrt{\frac{Q}{1-Q}}$
Note that $\phi$ itself does not depend on $N$, but the test statistic $\chi^2=N\phi^2$ does, so the sample size is still needed to get a p-value.
From here, we can find the associated p-value easily by looking it up in a Chi-Square p-value table, such as this one: http://chisquaretable.net/. Then you can reject or fail to reject the null hypothesis given your predefined $\alpha$ threshold.
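
To make this concrete, here is a small sketch that builds the four cell counts from the percentages in the question and evaluates $\phi$ two ways: directly from the table via $\frac{AD-BC}{\sqrt{IJKL}}$, and from the closed form obtained by substituting the totals. The values of `n` and `q` are assumptions (the question does not give them), and the helper names are hypothetical.

```python
import math

BONE_RATE = 0.075           # share of the whole population with the bone disease
BONE_RATE_IN_COVID = 0.25   # share of COVID patients with the bone disease

def table_from_rates(n, q):
    """Cell counts (A, B, C, D) for n people when a fraction q has COVID."""
    j = n * q                     # people with COVID
    l = n * BONE_RATE             # people with the bone disease
    d = BONE_RATE_IN_COVID * j    # COVID and bone disease
    c = j - d                     # COVID only
    b = l - d                     # bone disease only
    a = n - b - c - d             # neither
    return a, b, c, d

def phi_closed_form(q):
    """Phi as a function of q alone, after the sample size cancels out."""
    gap = BONE_RATE_IN_COVID - BONE_RATE
    return gap * math.sqrt(q / (1 - q)) / math.sqrt(BONE_RATE * (1 - BONE_RATE))

# Assumed values: neither the sample size nor the COVID fraction is given.
n, q = 10_000, 0.09

a, b, c, d = table_from_rates(n, q)
phi_table = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
phi = phi_closed_form(q)

chi2 = n * phi ** 2
p_value = math.erfc(math.sqrt(chi2 / 2.0))   # chi-squared upper tail, 1 d.o.f.

print(f"phi (table) = {phi_table:.4f}, phi (closed form) = {phi:.4f}")
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```

The counts only make sense when $0.25\,Q \le 0.075$, that is $Q \le 0.3$; otherwise the "bone disease only" cell would be negative.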

Nikos M. · Answer

This is a valid question for hypothesis testing. However, we only have part of the data needed for the hypothesis test, so we can do one of two things:

Make a rough plausible estimate of the missing data (see Fermi problem)
Treat the missing data as parameters of the problem so we provide a parametrised answer.

The null hypothesis is that Bone disease and COVID19 are independent.
The alternative hypothesis is that they are not independent.
First approach

Percent of USA COVID-19 patients over total population can be estimated to be around $9\%$.
Percent of the population with Bone disease is given to be $7.5\%$.
Percent of COVID19 patients who also have Bone disease is given to be $25\%$.

We can assume, for simplicity and maximal unbiasedness, that the joint probability is a uniform random variable on the $[0,1]$ interval.
If the null hypothesis is valid, then COVID19 and Bone disease are independent, and the joint probability of a patient having both is the product of the probabilities of having each one. In other words:
$$P(\text{COVID} \,\&\, \text{BONE}) = P(\text{COVID}) \cdot P(\text{BONE}) = 9\% \cdot 7.5\% = 0.675\%$$
So by chance alone a patient has only a $0.675\%$ chance of having both diseases.
The observed percent of patients with both diseases is estimated as $25\% \cdot 9\% = 2.25\%$.
Since we get $2.25\%$, well above the $0.675\%$ expected under independence, we can reject the null hypothesis and conclude (with some significance) that COVID19 and Bone disease are not independent.
Second approach
Same as the first approach, but now take the COVID19 percent as a parameter and draw a parametrised conclusion.
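
A minimal sketch of the parametrised version, treating the COVID19 rate as a free parameter (the specific rates tried below are arbitrary assumptions, and the function name is hypothetical). It compares the joint probability expected under independence with the one implied by the hospital records.

```python
BONE_RATE = 0.075           # P(BONE), given in the question
BONE_RATE_IN_COVID = 0.25   # P(BONE | COVID), given in the question

def joint_probabilities(covid_rate):
    """Return (expected under independence, observed) P(COVID & BONE)."""
    expected = covid_rate * BONE_RATE
    observed = covid_rate * BONE_RATE_IN_COVID
    return expected, observed

# Arbitrary illustrative values for the unknown COVID rate; 0.09 matches the first approach.
for covid_rate in (0.01, 0.05, 0.09, 0.20):
    expected, observed = joint_probabilities(covid_rate)
    ratio = observed / expected
    print(f"q = {covid_rate:.2f}: expected {expected:.4%}, "
          f"observed {observed:.4%}, ratio {ratio:.2f}")
```

The ratio of observed to expected is $0.25 / 0.075 \approx 3.33$ whatever the COVID19 rate is, so the direction of the conclusion does not hinge on the estimate of that rate. How significant it is still depends on the sample size.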
Finally an even faster approach which goes directly to the point.
If the null hypothesis holds, then the $7.5\%$ of patients with Bone disease in the population should stay the same within the population of COVID19 patients (it is simply a subset of the population and, assuming plausible unbiased sampling, the same percent would prevail). In other words, we have no reason to believe the sample of COVID19 patients is not a representative sample of the population (this reflects the fact that anybody can catch the virus equally, which is corroborated by both virology and historical/statistical data on epidemics). So the percent of COVID19 patients having Bone disease should still be $7.5\%$. Since we get $25\%$, we can (with some significance) reject the null hypothesis and conclude that the diseases are not independent.
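
To attach a number to "with some significance", one option is a one-sample z-test for a proportion, sketched below. The sample of 200 COVID19 patients (50 of them with Bone disease, i.e. 25%) is hypothetical, since the question does not say how many records were seen, and the normal approximation is an assumption that is reasonable only when the expected counts are not too small.

```python
import math

def one_sample_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: true proportion equals p0 (normal approximation)."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail
    return z, p_value

# Hypothetical hospital sample: 200 COVID19 patients, 50 with Bone disease (25%).
z, p_value = one_sample_proportion_z_test(successes=50, n=200, p0=0.075)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

With a much smaller sample the same 25% could be far less convincing, which is why the percentages alone do not settle the significance.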
Note that the above of course does not constitute a complete statistical study; it is only provided as an example outline of a systematic process to follow (one that can mark the beginning of a complete statistical study).
