Bioinformatics Asked by RMM on September 30, 2021
How would one determine the significance of a variable in a glm model?
If I have, for example, a data frame like the one below, how would I determine whether the origin of the sample has a significant effect on the value? (The value is the number of enzymes capable of degrading the substrate, and that is what matters.)
Substrate  variable  value  origin
cellulose  M09       8      free
mannan     M12       2      free
glycogen   M65       2      free
chitin     M87       4      free
cellulose  M90       2      isolate
mannan     M78       1      isolate
glycogen   M21       4      isolate
chitin     M21       1      isolate
So far I have tried:
library(MASS)  # glm.nb() comes from the MASS package
mcomp = glm.nb(value ~ origin, data = my_data)
summary(mcomp)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9625  -0.9047  -0.9047   0.1212   3.5232
Coefficients:
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    -0.01657     0.06571   -0.252   0.80097
originisolate  -0.21911     0.08180   -2.679   0.00739 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.3418) family taken to be 1)
Null deviance: 2053.5 on 2679 degrees of freedom
Residual deviance: 2046.3 on 2678 degrees of freedom
AIC: 6517.5
Number of Fisher Scoring iterations: 1
Theta: 0.3418
Std. Err.: 0.0186
2 x log-likelihood: -6511.4590
So free becomes the intercept, and isolate is significantly different from that. Does this mean origin has a significant effect on the value?
Would the better approach be to do the following?:
mcomp = glm.nb(value ~ origin + Substrate, data = comb_data)
summary(aov(mcomp))
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
origin         1      23    22.55    6.612  0.0102 *
Substrate     44    1445    32.84    9.631  <2e-16 ***
Residuals   2634    8981     3.41
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If I understand correctly, this shows that origin and Substrate both have an effect on value?
Neither method is better; it depends on what you want to test, i.e. what your question is.
Using anova() or aov() tests the terms collectively. For example, in your example with Substrate, the null hypothesis is that the coefficients are all zero, meaning cellulose = 0, mannan = 0, ...
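As a rough illustration of a collective test, here is a minimal sketch that compares two nested negative binomial fits with a likelihood-ratio test, which is what anova() does when given several glm.nb() models. The data are simulated and every number (sample size, effect sizes, size = 2) is made up for illustration; the names sim, m_reduced and m_full are hypothetical and not from the question.

```r
library(MASS)  # provides glm.nb() and the anova() method for its fits

# Simulated stand-in for the real data; all values here are invented.
set.seed(1)
n <- 400
substrates <- c("cellulose", "mannan", "glycogen", "chitin")
sim <- data.frame(
  origin    = factor(rep(c("free", "isolate"), each = n / 2)),
  Substrate = factor(sample(substrates, n, replace = TRUE), levels = substrates)
)

# Assumed effects on the log scale (illustrative only)
substrate_effect <- c(cellulose = 0, mannan = -0.4, glycogen = 0.3, chitin = -0.2)
log_mu <- 1 - 0.5 * (sim$origin == "isolate") +
  unname(substrate_effect[as.character(sim$Substrate)])
sim$value <- rnbinom(n, size = 2, mu = exp(log_mu))

# Nested models: without and with Substrate
m_reduced <- glm.nb(value ~ origin, data = sim)
m_full    <- glm.nb(value ~ origin + Substrate, data = sim)

# One test for all Substrate coefficients at once
lrt <- anova(m_reduced, m_full)
print(lrt)
```

The second row of the printed table holds the test of "all Substrate coefficients are zero" against the fuller model. Note that with four substrate levels only three coefficients are estimated, because the reference level is absorbed into the intercept.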
If the question is, "do the isolate samples have a higher value than the free samples?", then you can use your first model, where free is set as the reference and you test whether the effect of isolate is non-zero. Likewise, you can do this for substrate by setting one of them as your reference. You can also do other pairwise comparisons using this model.
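To make the role of the reference level concrete, here is a small sketch on simulated data (the group means of 3 and 2 and size = 2 are invented, and sim, fit_free_ref and fit_iso_ref are hypothetical names). It fits the same one-factor model twice, once with free and once with isolate as the reference, using relevel().

```r
library(MASS)  # provides glm.nb()

# Simulated stand-in data; the numbers are invented for illustration.
set.seed(2)
n <- 300
sim <- data.frame(origin = factor(rep(c("free", "isolate"), each = n / 2)))
sim$value <- rnbinom(n, size = 2, mu = ifelse(sim$origin == "isolate", 2, 3))

# The first factor level ("free") is the reference by default
print(levels(sim$origin))
fit_free_ref <- glm.nb(value ~ origin, data = sim)

# Same model with "isolate" as the reference instead
sim$origin_iso_ref <- relevel(sim$origin, ref = "isolate")
fit_iso_ref <- glm.nb(value ~ origin_iso_ref, data = sim)

# Wald tests for the non-reference level in each parameterisation
print(coef(summary(fit_free_ref)))
print(coef(summary(fit_iso_ref)))

# glm.nb() uses a log link by default, so exp(coefficient) is the
# ratio of expected counts, isolate relative to free
print(exp(coef(fit_free_ref)["originisolate"]))
```

Swapping the reference only flips the sign of the origin coefficient; the test itself is unchanged. On the log-link scale, the -0.219 reported in the question corresponds to exp(-0.219), roughly 0.80, i.e. an expected count about 20% lower for isolate than for free.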
If the question is, "does origin have a significant effect on value, after controlling for substrate?", then you can use your second model.
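A minimal sketch of that second case, again on simulated data with invented effect sizes (sim, m_adj and origin_row are hypothetical names): fit the model with both terms and read off the row for origin, which is then the effect of isolate versus free at a fixed substrate.

```r
library(MASS)  # provides glm.nb()

# Simulated stand-in data; all values are invented for illustration.
set.seed(3)
n <- 400
substrates <- c("cellulose", "mannan", "glycogen", "chitin")
sim <- data.frame(
  origin    = factor(rep(c("free", "isolate"), each = n / 2)),
  Substrate = factor(sample(substrates, n, replace = TRUE), levels = substrates)
)
substrate_effect <- c(cellulose = 0, mannan = -0.4, glycogen = 0.3, chitin = -0.2)
log_mu <- 1 - 0.5 * (sim$origin == "isolate") +
  unname(substrate_effect[as.character(sim$Substrate)])
sim$value <- rnbinom(n, size = 2, mu = exp(log_mu))

# Model with origin adjusted for Substrate
m_adj <- glm.nb(value ~ origin + Substrate, data = sim)

# Wald test for origin, holding Substrate fixed
origin_row <- coef(summary(m_adj))["originisolate", ]
print(origin_row)
```

Because origin has only two levels, this single row is the whole test for origin. For a factor with more levels, a collective test of the term would be the more natural choice.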
Answered by StupidWolf on September 30, 2021
On a second look at the question: from what I can see, -0.22 as a coefficient for origin is a strong negative association, so yes, it has a major impact. It's not how I would have done it, but that looks to be the result.
On first viewing:
I'm going to throw my hat in here. We don't know what 'origin' is about; in any case, just put everything, i.e. each substrate and the origin, into the same regression calculation. Check for low residuals and preferably do a Q-Q plot; transform your data if this doesn't look good.
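A rough sketch of such a residual check is below. It applies the Q-Q plot to the deviance residuals of a glm.nb() fit rather than to an ordinary regression, which is my assumption about how to carry the suggestion over to the model in the question. The data are simulated and all numbers are invented; sim, fit and res are hypothetical names.

```r
library(MASS)  # provides glm.nb()

# Simulated stand-in data; all values are invented for illustration.
set.seed(4)
n <- 400
substrates <- c("cellulose", "mannan", "glycogen", "chitin")
sim <- data.frame(
  origin    = factor(rep(c("free", "isolate"), each = n / 2)),
  Substrate = factor(sample(substrates, n, replace = TRUE), levels = substrates)
)
log_mu <- 1 - 0.5 * (sim$origin == "isolate")
sim$value <- rnbinom(n, size = 2, mu = exp(log_mu))

# Everything in one model: origin plus each substrate
fit <- glm.nb(value ~ origin + Substrate, data = sim)

# Deviance residuals and a normal Q-Q plot of them
res <- residuals(fit, type = "deviance")
qqnorm(res, main = "Deviance residuals, negative binomial fit")
qqline(res)
```

For low counts, deviance residuals are only approximately normal, so some departure from the line is expected even when the model is adequate; the plot is a rough screen, not a formal test.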
The key and the thing you are missing is your regression weights, without that I couldn't say very much. If the regression weight is near zero for 'origin' then it has zero impact. If the regression weight of 'origin' is positively greater than everything else ... I assume there are skewed distributions of 'substrates' between the 'origins'. If the regression weight of 'origin' is negative but still greater than all other regression weights then it is adversely affecting the 'value' you are seeking.
I don't know the experiment, the biological system or really the 'substrate' assays, so I can't comment any further.
The two issues I have are:
Answered by M__ on September 30, 2021