Cross Validated Asked on November 2, 2021
Why, in the formula $\chi^2 = \sum_i \frac{(x_i - m_i)^2}{m_i}$, do we divide by $m_i$ instead of $m_i^2$? Dividing by the squared value would have a clear logic to me: we would be comparing the difference against the mean as a ratio (I am not saying that this is correct :). But dividing by the means themselves does not. Are we doing some kind of normalization here, like variance per unit of mean value?
There are several possible explanations. Here is one of them. It should be viewed as partly intuitive rather than entirely rigorous.
Suppose you have $K$ categories and your null hypothesis is that the number of occurrences in the $i$th category is Poisson. Then the count in the $i$th category is $X_i \sim \mathsf{Pois}(\lambda_i).$
For sufficiently large counts, $X_i$ is nearly normal with $\mu_i = E(X_i) = \lambda_i$ and $\sigma_i^2 = \operatorname{Var}(X_i) = \lambda_i.$ Standardizing, you get that $Z_i = \frac{X_i - \lambda_i}{\sqrt{\lambda_i}} \stackrel{\text{aprx}}{\sim} \mathsf{Norm}(0,1).$ And then $Z_i^2 = \frac{(X_i - \lambda_i)^2}{\lambda_i} \stackrel{\text{aprx}}{\sim} \mathsf{Chisq}(1).$
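A quick numerical sketch of that standardization step (the rate $\lambda = 100$ is an arbitrary choice, just large enough for the normal approximation to kick in):

```python
import numpy as np

# For large lambda, a Poisson count standardized by its mean and SD
# is approximately standard normal: Z = (X - lambda) / sqrt(lambda).
rng = np.random.default_rng(0)
lam = 100.0                        # assumed rate, large enough for normality
x = rng.poisson(lam, size=100_000)
z = (x - lam) / np.sqrt(lam)

print(round(z.mean(), 2), round(z.var(), 2))   # roughly 0 and 1
```

The sample mean and variance of the $Z_i$ come out near 0 and 1, as the normal approximation predicts.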
Then you estimate the $\lambda_i$ from data according to the null hypothesis. If, for example, the null hypothesis is that all $K$ categories are equally likely, then we would use $E_i = \hat\lambda_i = \frac{\sum_i X_i}{K} = \frac{T}{K}.$ If the terms $C_i = \frac{(X_i - E_i)^2}{E_i}$ were independent, then the chi-squared statistic $Q = \sum_i C_i$ would be approximately distributed as $\mathsf{Chisq}(K).$
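As a minimal sketch (the function name is illustrative), here is the statistic $Q$ computed under the equal-probability null, where $E_i = T/K$ for every category:

```python
import numpy as np

def chisq_stat(counts):
    """Chi-squared statistic for the equal-probability null: E_i = T / K."""
    counts = np.asarray(counts, dtype=float)
    expected = counts.sum() / len(counts)          # E_i = T / K
    return ((counts - expected) ** 2 / expected).sum()

# Example: K = 4 categories, T = 100 observations, so each E_i = 25.
print(chisq_stat([30, 25, 20, 25]))   # 1 + 0 + 1 + 0 = 2.0
```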
However, the terms are not quite independent, because $\sum_i E_i = \sum_i X_i = T.$ So it turns out that $Q \stackrel{\text{aprx}}{\sim} \mathsf{Chisq}(K-1)$ (arm-waving here) because of that one linear constraint on the $E_i$s.
Answered by BruceET on November 2, 2021