Data Science Asked by Molitoris on May 14, 2021
I am trying to better understand the formula model for two sample t-tests in R. When I calculate the test in the formula model I get a wrong result.
set.seed(41)
df = data.frame(x1=c(rep(1, 10), rep(0, 10))+ rnorm(20, mean = 0, sd = 0.1),
x2=c(rep(0, 10), rep(1, 10)))
t.test(x1 ~ x2, data=df)
Output
Welch Two Sample t-test
data: x1 by x2
t = 22.365, df = 17.85, p-value = 1.668e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.9247087 1.1165780
sample estimates:
mean in group 0 mean in group 1
1.0530115 0.0323681
If I use the variable model, I get the expected result.
t.test(x = df$x1, y = df$x2)
Output
Welch Two Sample t-test
data: df$x1 and df$x2
t = 0.2581, df = 37.945, p-value = 0.7977
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2921655 0.3775450
sample estimates:
mean of x mean of y
0.5426898 0.5000000
Which result is right or wrong here depends on your objective. You have created two variables (vectors) $x_1, x_2$.
Assume $x_1, x_2$ are samples of i.i.d. random variables $X_1$ and $X_2$, respectively, and that, under some further assumptions, you want to test the null hypothesis $\mathbb E(X_1) = \mathbb E(X_2)$.
For this, your second output is the correct one. However, with the data you have generated, the test is not really applicable, because the values within each of your samples $x_1, x_2$ do not come from a single distribution: in both vectors the mean of the first ten values is different from the mean of the last ten.
Ignoring your data, this analysis can be done using the formula approach as well. Join the two vectors $x_1, x_2$ into a single column $x$ and add another column (say, y) that identifies which sample each data point comes from. Call this new data frame df1. Then an equivalent way of doing the above-mentioned test is t.test(x~y, data = df1).
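As a rough sketch of the reshaping described above, the snippet below rebuilds the data frame from the question, stacks its two columns into a long-format df1, and checks that the formula call agrees with the two-vector call. The group labels "x1" and "x2" and the names res_formula and res_vectors are illustrative choices, not something from the question.

```r
# Rebuild the data frame from the question.
set.seed(41)
df <- data.frame(x1 = c(rep(1, 10), rep(0, 10)) + rnorm(20, mean = 0, sd = 0.1),
                 x2 = c(rep(0, 10), rep(1, 10)))

# Long format: one column of values (x) and one column of sample labels (y).
# The labels "x1"/"x2" are arbitrary; any two distinct labels would do.
df1 <- data.frame(x = c(df$x1, df$x2),
                  y = factor(rep(c("x1", "x2"), each = nrow(df))))

res_formula <- t.test(x ~ y, data = df1)       # formula interface
res_vectors <- t.test(x = df$x1, y = df$x2)    # two-vector interface

# Both calls run the same Welch test, so these comparisons should agree.
all.equal(unname(res_formula$statistic), unname(res_vectors$statistic))
all.equal(res_formula$p.value, res_vectors$p.value)
```

The point of the check is only that the two interfaces describe the same test once the data are laid out with one value column and one label column.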
The formula approach is helpful when your data is already organized in such a format. For example, say you have a data frame with two columns: height ($x$) and gender ($y$). Then running t.test(x~y, data = df1) will test whether the mean height is different between genders.
Your first approach can be considered right only when $x_2$ is a grouping (factor) variable that identifies the group or sample each data point in vector $x_1$ belongs to.
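To make this concrete, here is a small sketch of what the formula call from the question does with your data: it treats x2 as a group label and compares the x1 values where x2 is 0 against the x1 values where x2 is 1. The names res_formula and res_split are only illustrative.

```r
# Rebuild the data frame from the question.
set.seed(41)
df <- data.frame(x1 = c(rep(1, 10), rep(0, 10)) + rnorm(20, mean = 0, sd = 0.1),
                 x2 = c(rep(0, 10), rep(1, 10)))

# Formula interface: x2 is used as the grouping variable for x1.
res_formula <- t.test(x1 ~ x2, data = df)

# The same comparison written out by hand: split x1 by the value of x2.
res_split <- t.test(x = df$x1[df$x2 == 0], y = df$x1[df$x2 == 1])

# The two results should coincide.
all.equal(unname(res_formula$statistic), unname(res_split$statistic))
all.equal(res_formula$p.value, res_split$p.value)
```

This is why the first output reports "mean in group 0" and "mean in group 1": both means are means of x1, not of x1 and x2.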
Correct answer by Dayne on May 14, 2021