# Help with choosing appropriate way to test hypothesis

Cross Validated Asked by sleepy on March 2, 2021

This is undoubtedly a basic question but I suffer from being in the situation where I do not even know what to google so I can’t solve this one myself. On the data below I want to test the hypothesis that species distributions between communities are different than would be expected from a random distribution of each species across the study site.

I have a list of species counts in different communities, and I have the proportions of these communities across the study site.

Can I calculate the expected distribution for each community by multiplying the total count of each species by the proportion of that community across the study site (species1_total*studysite_c1)? In my mind this is a rational way to calculate the likely distribution of each species in each community were they randomly situated across the study site.

Can I then do a chi-squared test on these data, where species1_total*studysite_c1 is the expected value and species1_c1 is the observed value?

              c1    c2    c3    c4    c5    c6    c7    c8    c9   c10  total
species1       0    38     0     6    94     2     0     0    12     6    158
species2       1     7     0     0     0     0     0     0     0     1      9
species3       3    30     0     0     1     1     0     0    11     3     49
species4       7     5     1     3    11     0     0     0     1     2     30
species5       5     2     0     0     0     4     0     0     9     0     20
species6      24    78     0     0     7     2     5     0    19   242    377
species7       3    13     0     0     0     3     0     0    28     9     56
species8       0    29     0     0     4    16     0     0     2     2     53
species9      44    66    13     0     1     0     0     0    37    10    171
species10      0    20     0     0     3     4     0     0     6     0     33
species11      1     0     0     0     0     0     0     8     0     0      9
species12      0     0     0     0     0     0     0     0     5     0      5
study site  0.22  0.40  0.01  0.01  0.03  0.01  0.00  0.00  0.07  0.25      1


I guess you are on the right track, but I am not familiar with your data and study site, so I can't be sure. I can be sure that your terminology is not quite right. You can't use the numbers in your last row, study site, as expected counts, because they are estimated probabilities adding to $1$. Expected counts come from multiplying these probabilities by a species total, as you describe.

study.site = c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
sum(study.site)
[1] 1
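
To make the distinction concrete, expected counts are the species total multiplied by each probability. The lines below are a minimal sketch for Species 1, with the counts copied from the table in the question; the name expected.1 is only illustrative.

```r
# Proportions of the ten communities across the study site (last row of the table)
study.site <- c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
# Observed counts of Species 1 in communities c1..c10
species.1 <- c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)

# Expected counts under the null hypothesis: total count times each probability
expected.1 <- sum(species.1) * study.site
round(expected.1, 2)

# Number of communities whose expected count is below 5
sum(expected.1 < 5)
```

Six of the ten expected counts come out below 5, which matters for the chi-squared approximation.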


One-category chi-squared test in R. In the R procedure chisq.test, there is provision for a parameter p of probabilities against which counts x are to be compared.

Thus, suppose I have a fair die with its faces relabeled so that there are two 1's plus faces 2 through 5. Then the probabilities of the outcomes should be p.d = c(1/3, 1/6, 1/6, 1/6, 1/6). Suppose also that I have counts x from 58 rolls of this relabeled die. Then I should expect chisq.test not to reject the null hypothesis that p.d has the correct probabilities. Indeed, this is what happens below: the P-value is higher than 5%.

x = c(24,7,6,14,7)
p.d = c(2,1,1,1,1)/6
chisq.test(x, p=p.d)

Chi-squared test for given probabilities

data:  x
X-squared = 5.931, df = 4, p-value = 0.2044
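
For readers who want to see where these numbers come from, the statistic is the sum over categories of (observed - expected)^2 / expected, referred to a chi-squared distribution with one fewer degree of freedom than there are categories. The following sketch recomputes it by hand for the die example; the names expected, X2 and p.value are illustrative, and the result should agree with the chisq.test output above up to rounding.

```r
# Observed counts and hypothesized probabilities from the die example
x   <- c(24, 7, 6, 14, 7)
p.d <- c(2, 1, 1, 1, 1) / 6

# Expected counts: number of rolls times each probability
expected <- sum(x) * p.d

# Pearson chi-squared statistic and its degrees of freedom
X2 <- sum((x - expected)^2 / expected)
df <- length(x) - 1

# Upper-tail probability of the chi-squared distribution
p.value <- pchisq(X2, df = df, lower.tail = FALSE)
c(X2 = X2, df = df, p.value = p.value)
```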


Not enough data for Species 1. So if I guess correctly what you have done to get the vector study.site, and if the counts in species.1 are indeed not randomly distributed, I might expect chisq.test to reject. However, there is a difficulty. You have only 158 specimens in Species 1, with none at all in many communities.

species.1 = c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)
sum(species.1)
[1] 158


This means you do not have enough data for the chi-squared test to work properly. In particular, R is finding 'expected counts' for various communities, and too many of them are below the minimum required (some authors say all should be above 5; others say most should be above 5 and all should be above 3). The technical difficulty is that the chi-squared statistic has only approximately a chi-squared distribution, and a good approximation requires a certain amount of data.

species.1 = c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)
# study.site is passed positionally here, so R matches it to the argument y, not p
chisq.test(species.1, study.site)

Pearson's Chi-squared test

data:  species.1 and study.site
X-squared = 38.333, df = 30, p-value = 0.1414

Warning message:
In chisq.test(species.1, study.site) :
Chi-squared approximation may be incorrect
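
One thing worth checking in the call above: the second positional argument of chisq.test is y, not p, so the proportions are treated as a second variable and R reports a test on the cross-tabulation of the two vectors (hence df = 30), not a comparison of counts with given probabilities. A sketch of the comparison against probabilities follows, using the named argument p. Because two proportions are listed as 0.00, it also pools the communities with proportion at most 0.03 into one category; that cutoff and the names rare, x.pooled, p.pooled and fit are illustrative assumptions, not part of the original analysis.

```r
# Study-site proportions and Species 1 counts, labeled by community
study.site <- c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
species.1  <- c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)
names(study.site) <- names(species.1) <- paste0("c", 1:10)

# Hypothetical pooling rule: communities covering at most 3% of the site
rare <- study.site <= 0.03

# Pool rare communities into a single "other" category
p.pooled <- c(study.site[!rare], other = sum(study.site[rare]))
x.pooled <- c(species.1[!rare],  other = sum(species.1[rare]))

# Goodness-of-fit test: counts compared with given probabilities via p =
fit <- chisq.test(x.pooled, p = p.pooled)
fit$expected
fit
```

With this pooling every expected count is above 5, so the approximation warning is not triggered. Whether these particular communities can sensibly be pooled depends on the study site.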


Combine communities or species? A common remedy for such sparse data is to combine categories (communities). If some communities are adjacent, then it might make sense to combine them. You might also consider whether it is appropriate to combine counts for several species, especially if some species are similar to others.

Simulated P-value for sparse data. Another remedy, for the implementation of chisq.test in R, is to let the program simulate a P-value, but we still don't get a rejection with simulation.

 chisq.test(species.1, study.site, sim=T)

Pearson's Chi-squared test
with simulated p-value
(based on 2000 replicates)

data:  species.1 and study.site
X-squared = 38.333, df = NA, p-value = 0.1644
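
For reference, a simulated P-value can also be requested when counts are compared with given probabilities through the argument p. The sketch below assumes that communities with a listed proportion of 0.00 and no observed specimens can simply be dropped (otherwise their expected count of zero makes the statistic undefined); the seed and the names keep and fit.sim are arbitrary choices for illustration.

```r
# Study-site proportions and Species 1 counts for communities c1..c10
study.site <- c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
species.1  <- c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)

# Keep only communities with a positive listed proportion
keep <- study.site > 0
# Dropping is only harmless if nothing was observed in the dropped communities
stopifnot(all(species.1[!keep] == 0))

# Monte Carlo P-value for the goodness-of-fit statistic (2000 replicates)
set.seed(2021)  # arbitrary seed so that the simulation can be repeated
fit.sim <- chisq.test(species.1[keep], p = study.site[keep],
                      simulate.p.value = TRUE, B = 2000)
fit.sim
```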


Somewhat better results with higher counts. Trying again for Species 6, which has more specimens. This time we reject at the 10% level, but not at the 5% level.

species.6 = c(24, 78, 0, 0, 7, 2, 5, 0, 19, 242)
chisq.test(species.6, study.site, sim=T)

Pearson's Chi-squared test
with simulated p-value
(based on 2000 replicates)

data:  species.6 and study.site
X-squared = 54.444, df = NA, p-value = 0.07696
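
A related check for Species 6: the table lists 5 specimens in c7, whose study-site proportion is given as 0.00 (presumably rounded). Under given probabilities that community has an expected count of zero, so the goodness-of-fit statistic is not finite unless the proportion is recorded with more decimals or the community is pooled with another. A minimal sketch of the check, with illustrative names:

```r
# Study-site proportions and Species 6 counts for communities c1..c10
study.site <- c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
species.6  <- c(24, 78, 0, 0, 7, 2, 5, 0, 19, 242)

# Communities with observed specimens but a listed proportion of zero
impossible <- species.6 > 0 & study.site == 0
which(impossible)        # index 7, i.e. community c7

# Expected counts and the resulting Pearson statistic
expected.6 <- sum(species.6) * study.site
stat.6 <- sum((species.6 - expected.6)^2 / expected.6)
is.finite(stat.6)        # FALSE: division by an expected count of zero
```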


Correct answer by BruceET on March 2, 2021