Help with choosing appropriate way to test hypothesis

Question

This is undoubtedly a basic question but I suffer from being in the situation where I do not even know what to google so I can't solve this one myself. On the data below I want to test the hypothesis that species distributions between communities are different than would be expected from a random distribution of each species across the study site.

I have a list of species counts in different communities, and I have the proportions of these communities across the study site.

Can I calculate the expected distribution for each community by multiplying the total count of each species by the proportion of that community across the study site (species1_total*studysite_c1)? In my mind this is a rational way to calculate the likely distribution of each species in each community were they randomly situated across the study site.

Can I then do a chi-squared test on these data, where species1_total*studysite_c1 is the expected value and species1_c1 is the actual value?

            c1    c2    c3    c4    c5    c6    c7    c8    c9    c10   total
species1    0     38    0     6     94    2     0     0     12    6     158
species2    1     7     0     0     0     0     0     0     0     1     9
species3    3     30    0     0     1     1     0     0     11    3     49
species4    7     5     1     3     11    0     0     0     1     2     30
species5    5     2     0     0     0     4     0     0     9     0     20
species6    24    78    0     0     7     2     5     0     19    242   377
species7    3     13    0     0     0     3     0     0     28    9     56
species8    0     29    0     0     4     16    0     0     2     2     53
species9    44    66    13    0     1     0     0     0     37    10    171
species10   0     20    0     0     3     4     0     0     6     0     33
species11   1     0     0     0     0     0     0     8     0     0     9
species12   0     0     0     0     0     0     0     0     5     0     5
study site  0.22  0.40  0.01  0.01  0.03  0.01  0.00  0.00  0.07  0.25  1

BruceET · Accepted Answer

I guess you are on the right track, but I am not familiar with
your data and study site, so I can't be sure. I can be sure
that your terminology is not quite right. You can't use the
numbers in your last row (study site) as expected counts,
because they are estimated probabilities adding to $1.$
The expected counts are those probabilities multiplied by the
total count for the species.

study.site = c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
sum(study.site)
[1] 1

One-category chi-squared test in R. In the R procedure chisq.test, there is provision for a parameter p of probabilities against
which counts x are to be compared.

Thus, suppose I have a fair die with faces relabeled so that there are
two 1's and faces 2 through 5. Then the probabilities of
outcomes should be p.d = c(1/3, 1/6, 1/6, 1/6, 1/6).
Suppose also that I have counts x from 58 rolls of this relabeled die.
Then I should expect chisq.test not to reject the null hypothesis
that p.d has the correct probabilities. Indeed, this is what happens
below. The P-value is higher than 5%.

x = c(24,7,6,14,7)
p.d = c(2,1,1,1,1)/6
chisq.test(x, p=p.d)

Chi-squared test for given probabilities

data:  x
X-squared = 5.931, df = 4, p-value = 0.2044
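To make the statistic in this output less of a black box, here is a minimal sketch that recomputes it by hand from the same x and p.d. The names n, expected, x.sq, df and p.value are introduced only for this illustration.

```r
x   <- c(24, 7, 6, 14, 7)
p.d <- c(2, 1, 1, 1, 1) / 6

n        <- sum(x)                 # total number of rolls recorded in x
expected <- n * p.d                # expected count for each face under p.d
x.sq     <- sum((x - expected)^2 / expected)   # Pearson chi-squared statistic
df       <- length(x) - 1          # categories minus 1
p.value  <- pchisq(x.sq, df = df, lower.tail = FALSE)

round(c(X.squared = x.sq, df = df, p.value = p.value), 4)
```

The values should agree with the chisq.test output above, since the test for given probabilities is this same calculation.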

Not enough data for Species 1. So if I guess correctly what you have done to get the vector
study.site, and if the counts in species.1 are indeed not
randomly distributed, I might expect chisq.test to reject.
However, there is a difficulty. You have only 158 specimens
in Species 1, with none at all in many communities.

sum(species.1)
[1] 158

This means you do not have enough data for the chi-squared test
to work properly. In particular, R computes 'expected counts'
for the various communities, and too many of them are below the
minimum required. (Some authors say all should be above 5; others
say most should be above 5 and all should be above 3.)
The technical difficulty is that the chi-squared statistic
has only approximately a chi-squared distribution, and
a good approximation requires a certain amount of data.
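The expected counts can be inspected directly. The sketch below assumes, as in the question, that the expected count for a community is the species total times that community's study-site proportion; the name expected.1 is made up for this illustration.

```r
study.site <- c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
species.1  <- c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)

# Expected count per community if the 158 specimens were spread
# in proportion to community area.
expected.1 <- sum(species.1) * study.site
round(expected.1, 2)

# How many communities fall below the usual rule-of-thumb minimum of 5?
sum(expected.1 < 5)
```

With these proportions, six of the ten expected counts are below 5, and two of them are exactly 0 because the proportions for c7 and c8 are given as 0.00.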

species.1 = c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)
chisq.test(species.1, study.site)

Pearson's Chi-squared test

data:  species.1 and study.site
X-squared = 38.333, df = 30, p-value = 0.1414

Warning message:
In chisq.test(species.1, study.site) :
  Chi-squared approximation may be incorrect
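One point worth checking against the R documentation: when chisq.test is given two vectors positionally, the second is taken as y, and the two vectors are cross-tabulated as factors for a test of independence. That would account for df = 30 in the output above. A goodness-of-fit comparison against the study-site proportions needs the named argument p. The sketch below is only an illustration under two assumptions that are not part of the original analysis: that the 0.00 entries in study.site are proportions rounded to zero, and that it is acceptable to drop communities with zero proportion and zero count.

```r
study.site <- c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
species.1  <- c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)

# Assumption: drop communities whose stated proportion is 0, because an
# expected count of 0 leaves the statistic undefined. This is only safe
# when the observed counts there are also 0, which is checked here.
keep <- study.site > 0
stopifnot(all(species.1[!keep] == 0))

# Goodness-of-fit test against the given proportions (named argument p).
# rescale.p = TRUE guards against the retained proportions not summing
# exactly to 1.
gof.1 <- chisq.test(species.1[keep], p = study.site[keep], rescale.p = TRUE)

gof.1$parameter   # degrees of freedom: retained communities minus 1
gof.1$expected    # expected counts; several are still below 5
```

R should still warn that the approximation may be incorrect, because several expected counts stay small. For a species with nonzero counts in c7 or c8 (Species 6 has 5 specimens in c7), unrounded proportions would be needed before a test of this kind makes sense.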

Combine communities or species? A common remedy for such sparse data is to combine categories (communities).
If some communities are adjacent, then it might make sense to combine them. You might also consider whether it is appropriate
to combine counts for several species, especially if some
species are similar to others.
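As a rough sketch of what combining communities could look like in code, the block below pools every community whose expected count is below 5 into a single "other" category. This pooling rule is purely mechanical and is an assumption made for illustration; in practice the grouping should follow adjacency or ecological similarity, as suggested above. The names sparse, pooled.counts, pooled.p and pooled.fit are hypothetical.

```r
study.site <- c(0.22, 0.40, 0.01, 0.01, 0.03, 0.01, 0.00, 0.00, 0.07, 0.25)
species.1  <- c(0, 38, 0, 6, 94, 2, 0, 0, 12, 6)

# Flag communities whose expected count is below 5.
expected.1 <- sum(species.1) * study.site
sparse     <- expected.1 < 5

# Keep the well-populated communities and merge the rest into "other",
# adding up both the observed counts and the proportions.
pooled.counts <- c(species.1[!sparse], other = sum(species.1[sparse]))
pooled.p      <- c(study.site[!sparse], sum(study.site[sparse]))

# Goodness-of-fit test on the pooled categories.
pooled.fit <- chisq.test(pooled.counts, p = pooled.p, rescale.p = TRUE)
pooled.fit$expected
pooled.fit
```

After pooling, every expected count is at least 5, so the chi-squared approximation is on firmer ground, at the cost of no longer distinguishing the merged communities.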

Simulated P-value for sparse data. Another remedy, for the implementation of chisq.test in R,
is to let the program simulate a P-value, but we still don't
get a rejection with simulation.

chisq.test(species.1, study.site, sim=T)

Pearson's Chi-squared test 
         with simulated p-value 
         (based on 2000 replicates)

data:  species.1 and study.site
X-squared = 38.333, df = NA, p-value = 0.1644
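The idea behind a simulated P-value can be sketched by hand for the goodness-of-fit case: draw many samples of the same size from the hypothesized probabilities, compute the statistic for each, and see how often it is at least as large as the observed one. The example below reuses the relabeled-die data because it has no zero-probability categories; the seed, the number of replicates B, and the variable names are arbitrary choices for illustration.

```r
set.seed(2024)                      # arbitrary seed, for reproducibility only

x   <- c(24, 7, 6, 14, 7)
p.d <- c(2, 1, 1, 1, 1) / 6
n   <- sum(x)
B   <- 2000                         # number of simulated samples

expected <- n * p.d
obs.stat <- sum((x - expected)^2 / expected)

# Each column of sims is one simulated set of counts under p.d.
sims      <- rmultinom(B, size = n, prob = p.d)
sim.stats <- colSums((sims - expected)^2 / expected)

# Proportion of simulated statistics at least as extreme as the observed
# one, with the usual +1 correction in numerator and denominator.
p.sim <- (1 + sum(sim.stats >= obs.stat)) / (B + 1)
p.sim
```

The result should land near the P-value from the chi-squared approximation for the die example, and it does not rely on the expected counts being large.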

Somewhat better results with higher counts. Let's try again with Species 6, which has more specimens. This time
we reject at the 10% level, but not at the 5% level.

species.6 = c(24, 78, 0, 0, 7, 2, 5, 0, 19, 242)
chisq.test(species.6, study.site, sim=T)

Pearson's Chi-squared test 
        with simulated p-value 
        (based on 2000 replicates)

data:  species.6 and study.site
X-squared = 54.444, df = NA, p-value = 0.07696
