Biology Asked by DaveRowan on February 1, 2021
The genetic relationship matrix (GRM) estimates the genetic relationship between two individuals ($j$ and $k$) over $m$ SNPs, where $i$ indexes a specific SNP. What I don't understand from their equation is why we divide the summation by $m$ (the number of SNPs). Here $x_{ij}$ is the number of copies of the minor allele carried by the $j$-th individual at SNP $i$, and $p_i$ is the frequency of the minor allele at SNP $i$.
This expression is a mean
$$\frac{1}{m}\sum_{i=1}^m \dots$$
($m$ is the number of SNPs) of the ratio
$$\frac{\text{numerator}}{\text{denominator}}$$
where the numerator is a covariance
$$(a_i-C)(b_i-C)$$
and the denominator is the expected heterozygosity (it is also the variance of a binomial distribution with $n=2$)
$$2p_i(1-p_i)$$
Therefore, it represents how much two individuals covary, $(x_{ij}-2p_i)(x_{ik}-2p_i)$, relative to what is expected on average, $2p_i(1-p_i)$, averaged over all SNPs, $\frac{1}{m}\sum_{i=1}^m \dots$, where $m$ is the number of SNPs.
It is a relative measure (relative to the expected heterozygosity) of the covariance between two individuals, averaged over all SNPs.
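As a rough illustration of this decomposition, here is a minimal Python sketch that computes the per-SNP ratios and then their mean. The function names (`per_snp_terms`, `grm_entry`) and the toy genotypes and allele frequencies are made up for the example; they are not taken from any paper or software package.

```python
def per_snp_terms(x_j, x_k, p):
    """Per-SNP ratio: (x_ij - 2p_i)(x_ik - 2p_i) / (2 p_i (1 - p_i))."""
    terms = []
    for x_ij, x_ik, p_i in zip(x_j, x_k, p):
        numerator = (x_ij - 2 * p_i) * (x_ik - 2 * p_i)  # covariance-like product
        denominator = 2 * p_i * (1 - p_i)                # expected heterozygosity
        terms.append(numerator / denominator)
    return terms


def grm_entry(x_j, x_k, p):
    """Mean of the per-SNP ratios over all m SNPs."""
    terms = per_snp_terms(x_j, x_k, p)
    return sum(terms) / len(terms)


# Hypothetical toy data: minor-allele counts (0, 1 or 2) at m = 4 SNPs.
x_j = [0, 1, 2, 1]
x_k = [0, 1, 2, 0]
p = [0.5, 0.5, 0.5, 0.25]  # assumed minor-allele frequencies

terms = per_snp_terms(x_j, x_k, p)
a_jk = grm_entry(x_j, x_k, p)
print(terms, a_jk)
```

Dividing by $m$ is what turns the sum into an average, so the value does not grow simply because more SNPs were genotyped.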
Does it help?
Answered by Remi.b on February 1, 2021
$2p_i$ is the expectation of $SNP_i$:
$$E(SNP_i) = 0 \times (1-p_i)^2 + 1 \times 2p_i(1 - p_i) + 2 \times p_i^2 = 2 p_i$$
$(x_{ij} - 2p_i)(x_{ik} - 2p_i)$ measures how the genotypes of the two individuals covary at SNP $i$. I have no idea why they divide it by $2p_i(1 - p_i)$, but if you leave that out, you have the plain definition of covariance.
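One possible reading of the divisor, assuming the same Hardy-Weinberg genotype proportions used for the expectation above, is that it is the variance of the allele count at SNP $i$:

$$E(SNP_i^2) = 0^2 \times (1-p_i)^2 + 1^2 \times 2p_i(1 - p_i) + 2^2 \times p_i^2 = 2p_i(1-p_i) + 4p_i^2$$

$$Var(SNP_i) = E(SNP_i^2) - E(SNP_i)^2 = 2p_i(1-p_i) + 4p_i^2 - (2p_i)^2 = 2p_i(1-p_i)$$

Under that assumption, dividing by $2p_i(1-p_i)$ standardizes each SNP to unit variance, so that common and rare SNPs are put on the same scale before averaging.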
Answered by qed on February 1, 2021
The matrix gives you an estimate of the average linear relationship between any two individuals' genomes; it essentially takes the average of the betas (like linear regression betas) across loci. One of the formulas for 'beta' is the covariance divided by the variance, which is exactly what is happening here. Each locus's beta predicts the state of one person's genome at that locus from another person's genome at the same locus. Taking the average of these betas across the entire genome gives a coefficient that can be thought of as a measure of how well we can predict one person's genome from another's.
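A small simulation may make this interpretation more concrete. The sketch below is only an illustration under simplifying assumptions that are not part of the answer: SNPs are unlinked, genotypes follow Hardy-Weinberg proportions, the true allele frequencies are known, and all names (`grm_entry`, `simulate_trio`) are invented for the example. With these assumptions the averaged coefficient should come out near 0.5 for a mother and her child and near 0 for an unrelated individual and the child.

```python
import random


def grm_entry(x_j, x_k, p):
    """Mean over SNPs of (x_ij - 2p_i)(x_ik - 2p_i) / (2 p_i (1 - p_i))."""
    terms = [(a - 2 * q) * (b - 2 * q) / (2 * q * (1 - q))
             for a, b, q in zip(x_j, x_k, p)]
    return sum(terms) / len(terms)


def simulate_trio(m=20000, seed=1):
    """Simulate a mother, an unrelated stranger and a child at m unlinked SNPs."""
    rng = random.Random(seed)
    # Assumed minor-allele frequencies, kept away from 0 to avoid tiny divisors.
    p = [rng.uniform(0.1, 0.5) for _ in range(m)]

    def haplotype():
        # One allele copy per SNP: 1 = minor allele, 0 = major allele.
        return [1 if rng.random() < q else 0 for q in p]

    def genotype(pair):
        # Diploid genotype = number of minor alleles (0, 1 or 2).
        return [a + b for a, b in zip(pair[0], pair[1])]

    mother = (haplotype(), haplotype())
    father = (haplotype(), haplotype())
    stranger = (haplotype(), haplotype())
    # The child receives one randomly chosen allele from each parent at every SNP.
    child = [mother[rng.randrange(2)][i] + father[rng.randrange(2)][i]
             for i in range(m)]
    return p, genotype(mother), genotype(stranger), child


p, mother, stranger, child = simulate_trio()
mother_child = grm_entry(mother, child, p)
stranger_child = grm_entry(stranger, child, p)
print(round(mother_child, 2), round(stranger_child, 2))
```

The exact numbers depend on the seed and on the number of SNPs; more SNPs give a less noisy average.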
Answered by user3642885 on February 1, 2021
Though the question was posted long ago, I feel that a clarification could be beneficial.
Heterozygosity / gene diversity
The factor in the denominator is the variance of the allele count at site $i$ under Hardy-Weinberg proportions (that is, twice the variance $p_i(1-p_i)$ of a single allele copy), also known as expected heterozygosity or gene diversity:
$$
H_i = 2p_i(1-p_i).
$$
What might appear confusing here is that the factor $1-p_i$ is written explicitly instead of the traditional $q_i=1-p_i$, which gives the familiar form $$H=2pq$$ (since the probabilities of the minor and major alleles sum to $1$).
Covariance
In statistical terms, this factor can be interpreted as the variance of the number of minor alleles at the site (twice the variance of a single allele copy), which is the motivation for including it in the denominator of the covariance, much as one does when converting a covariance to a correlation coefficient. It is necessary, however, to note that this scaling can be done in different ways, depending on which statistical aspects one wants to emphasize.
This is however different from a more traditional definition of Genetic relationship matrix, as in VanRaden's work, where such division is done after calculating the covariance: $$ G_{jk} = \frac{\frac{1}{m}\sum_i(x_{ij}-2p_i)(x_{ik}-2p_i)}{\frac{1}{m}\sum_i 2p_i(1-p_i)} = \frac{\sum_i(x_{ij}-2p_i)(x_{ik}-2p_i)}{\sum_i 2p_i(1-p_i)}. $$
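To see that the two scalings can give different numbers, here is a minimal Python sketch comparing the mean of per-SNP ratios with the ratio of sums. The function names and the toy data are hypothetical and chosen only for illustration.

```python
def grm_mean_of_ratios(x_j, x_k, p):
    """Scale each SNP by its own 2 p_i (1 - p_i), then average over SNPs."""
    terms = [(a - 2 * q) * (b - 2 * q) / (2 * q * (1 - q))
             for a, b, q in zip(x_j, x_k, p)]
    return sum(terms) / len(terms)


def grm_ratio_of_sums(x_j, x_k, p):
    """Sum the cross-products over SNPs, then divide by the summed 2 p_i (1 - p_i)."""
    numerator = sum((a - 2 * q) * (b - 2 * q) for a, b, q in zip(x_j, x_k, p))
    denominator = sum(2 * q * (1 - q) for q in p)
    return numerator / denominator


# Hypothetical toy data: minor-allele counts at 4 SNPs.
x_j = [0, 1, 2, 1]
x_k = [0, 1, 2, 0]
p_mixed = [0.5, 0.5, 0.5, 0.25]  # unequal allele frequencies
p_equal = [0.5, 0.5, 0.5, 0.5]   # identical allele frequencies

mixed = (grm_mean_of_ratios(x_j, x_k, p_mixed),
         grm_ratio_of_sums(x_j, x_k, p_mixed))
equal = (grm_mean_of_ratios(x_j, x_k, p_equal),
         grm_ratio_of_sums(x_j, x_k, p_equal))
print(mixed, equal)
```

When all SNPs share the same allele frequency the two definitions coincide; otherwise the per-SNP scaling gives relatively more weight to SNPs with low $2p_i(1-p_i)$, that is, to rarer variants.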
Finally, as a word of caution, let me mention that $x_{ij}, x_{ik}$ in these expressions are allele counts for a diploid individual, so they can take only the values $0$, $1$, or $2$.
Answered by Vadim on February 1, 2021