Asked on Cross Validated, December 18, 2021
Suppose I have two samples $S_1,S_2$ of categorical data, and I’d like to design a hypothesis test to check the null hypothesis that they are both iid samples from the same underlying distribution. Normally, we would select an appropriate test statistic (e.g., the total variation distance between the two samples) and then perform a hypothesis test. This requires a choice of test statistic that can occasionally feel a little bit arbitrary.
Is it reasonable to use the likelihood (under the null) as our test statistic? Here by the "likelihood" I mean the probability that partitioning $S_1 \cup S_2$ into two samples of the appropriate sizes will yield the split $S_1,S_2$, and I'd imagine we'd use this likelihood in a permutation test.
Let me justify and formalize this a bit more. Assume $S_1$ contains $n_1$ data points and $S_2$ contains $n_2$ data points. Under the null hypothesis, we can think of the random process as generating a larger sample $X$ of $n_1+n_2$ data points drawn iid from the underlying distribution, then partitioning $X$ uniformly at random into $X_1$ (of size $n_1$) and $X_2$ (of size $n_2$) (this is equivalent to first drawing $X_1$, then drawing $X_2$, iid from the underlying distribution). Thus we have
$$\Pr\nolimits_{\mathcal{H}_0}[X_1,X_2]
= \Pr\nolimits_{\mathcal{H}_0}[X_1,X_2 \mid X=X_1 \cup X_2] \times \Pr\nolimits_{\mathcal{H}_0}[X=X_1 \cup X_2].$$
The null hypothesis $\mathcal{H}_0$ does not give us enough information to compute $\Pr_{\mathcal{H}_0}[X]$, but it does give us enough information to compute $\Pr_{\mathcal{H}_0}[X_1,X_2 \mid X]$: since $X_1,X_2$ are obtained by a uniformly random partition of $X$, this is a multivariate hypergeometric probability (a ratio of binomial coefficients over the category counts). Fortunately, in some sense we don't care about $\Pr_{\mathcal{H}_0}[X]$, because it is the same for all possible partitions of the data.
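To make this concrete (writing the partition probability out explicitly): if the pooled sample $X$ contains $c_k$ items of category $k$, and $X_1$ contains $a_k$ of them, then under a uniformly random split
$$\Pr\nolimits_{\mathcal{H}_0}[X_1,X_2 \mid X] = \frac{\prod_k \binom{c_k}{a_k}}{\binom{n_1+n_2}{n_1}}.$$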
So, I propose that we define the likelihood of $X_1,X_2$ to be
$$\ell(X_1,X_2) = \Pr\nolimits_{\mathcal{H}_0}[X_1,X_2 \mid X=X_1\cup X_2],$$
i.e., the probability that partitioning $X_1 \cup X_2$ into two samples of sizes $n_1,n_2$ will yield $X_1,X_2$. This can be computed exactly for any pair of samples with a simple algorithm.
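Here is a minimal sketch of how one might compute this likelihood, assuming each sample is given as a list of category labels (the function name `likelihood` is my own):

```python
from collections import Counter
from math import comb

def likelihood(x1, x2):
    """Probability that a uniformly random split of the pooled sample into
    parts of sizes len(x1), len(x2) reproduces the category counts of x1
    (and hence of x2): prod_k C(c_k, a_k) / C(n1 + n2, n1)."""
    pooled = Counter(x1) + Counter(x2)   # pooled category counts c_k
    counts1 = Counter(x1)                # category counts a_k in x1
    n1, n = len(x1), len(x1) + len(x2)
    numerator = 1
    for cat, c in pooled.items():
        numerator *= comb(c, counts1.get(cat, 0))
    return numerator / comb(n, n1)
```

For example, `likelihood(['a', 'a', 'b'], ['a', 'b', 'b'])` returns $\binom{3}{2}\binom{3}{1}/\binom{6}{3} = 9/20$.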
Now, I propose that we do a hypothesis test using $\ell(X_1,X_2)$ as our test statistic. Thus, we estimate the distribution of $\ell(X_1,X_2)$ when $X_1,X_2$ are random partitions of $S_1 \cup S_2$, and then we check the tail probability (the probability that the resulting likelihood is no larger than $\ell(S_1,S_2)$). That gives us our $p$-value. We can estimate the distribution of $\ell(X_1,X_2)$ by Monte Carlo: in each iteration we choose uniformly at random a partition of $S_1 \cup S_2$ into $X_1,X_2$ and compute its likelihood (this resampling of partitions is the usual permutation-test scheme, rather than a bootstrap, since we sample without replacement).
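A sketch of that Monte Carlo procedure, reusing the `likelihood` function above (the function name and the add-one smoothing of the $p$-value are my own choices):

```python
import random

def permutation_p_value(s1, s2, n_iter=10_000, seed=0):
    """Monte Carlo p-value: the fraction of uniformly random re-partitions of
    s1 + s2 whose likelihood is no larger than that of the observed split."""
    rng = random.Random(seed)
    observed = likelihood(s1, s2)
    pooled = list(s1) + list(s2)
    n1 = len(s1)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                        # a uniformly random partition
        if likelihood(pooled[:n1], pooled[n1:]) <= observed:
            hits += 1
    # add-one smoothing keeps the estimated p-value strictly positive
    return (hits + 1) / (n_iter + 1)
```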
Question: Is this a reasonable approach? Does it yield an approximately "optimal"/"most sensitive" test statistic for testing whether $S_1,S_2$ came from the same distribution? Is there some flaw in my reasoning? Is there any reason to prefer total variation distance instead of likelihood?