Bioinformatics Asked on May 31, 2021
I was able to compute the significance of the overlap between 2 gene sets using the cdf function of scipy hypergeometric distribution.
I wish to be able to perform the same calculation for more than 2 gene sets; should I use the multivariate hypergeometric distribution cdf function for that?
Are there any websites that provide the same calculations over gene sets so I can validate my results?
Here is a shot at it. First, an example dataset:
import matplotlib.pyplot as plt
import numpy as np
import functools
from matplotlib_venn import venn3
# define universe
uni = ["gene"+str(i) for i in range(1000)]
# some overlap
gs1 = uni[250:300] + uni[900:950]
gs2 = uni[:300]
gs3 = uni[250:500]
The ever amazing venn diagram:
venn3([set(gs1),set(gs2),set(gs3)],set_labels=["gs1","gs2","gs3"])
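Then a function that, for each gene set, draws a random set of the same size from the universe and returns the size of the intersection of all the random sets (here, all 3):
def sim_intersect(uni, set_lengths):
    # sample without replacement so each random set has no duplicate genes
    randomsets = [np.random.choice(uni, n, replace=False) for n in set_lengths]
    # size of the intersection across all random sets
    return len(functools.reduce(np.intersect1d, randomsets))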
We run this 1000 times:
permuted_values = [sim_intersect(uni,[len(gs1),len(gs2),len(gs3)]) for i in range(1000)]
plt.hist(permuted_values,bins=range(50))
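As a sanity check on the simulated null distribution: when each random set is drawn without replacement, a given gene lands in all sets with probability equal to the product of n_i/N, so the expected intersection size is N times that product (7.5 for set sizes 100, 300 and 250 in a universe of 1000). The sketch below is a standard-library re-implementation for illustration only; `sim_intersect_size` and the fixed seed are assumptions made here, not part of the original answer.

```python
import random
from math import prod

def sim_intersect_size(universe_size, set_lengths, rng):
    """Size of the intersection of random subsets drawn without replacement."""
    sets = [set(rng.sample(range(universe_size), n)) for n in set_lengths]
    return len(set.intersection(*sets))

universe_size = 1000
set_lengths = [100, 300, 250]   # len(gs1), len(gs2), len(gs3) in the example
rng = random.Random(0)          # fixed seed, chosen only for reproducibility

sims = [sim_intersect_size(universe_size, set_lengths, rng) for _ in range(1000)]

# Expected intersection size under the null: N * prod(n_i / N)
expected = universe_size * prod(n / universe_size for n in set_lengths)
sim_mean = sum(sims) / len(sims)

print("expected:", expected)
print("simulated mean:", sim_mean)
print("largest simulated intersection:", max(sims))
```

If the simulated mean sits far from the expected value, the sampling step is the first thing to check (for instance, whether it draws with replacement).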
The probability of observing an overlap at least as large as the observed one, using (B+1)/(M+1) as the estimator (see this post):
obs_n = len(set(gs1) & set(gs2) & set(gs3))
(sum(np.array(permuted_values) >= obs_n) + 1)/(1000 + 1)
0.000999000999000999
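Note that 1/1001 is the smallest value this estimator can return with 1000 permutations, so it is only an upper bound on the p-value here. One possible way to cross-check it, assuming the same null model (each set drawn independently and uniformly without replacement from the universe), is to chain hypergeometric distributions: the overlap of the first two sets is hypergeometric, and the overlap of that intersection with the next set is again hypergeometric given its size. The following is a minimal standard-library sketch of that idea; the function names are hypothetical and it is not the method used in the answer.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k marked items among n draws without replacement; N items, K marked)."""
    if k < max(0, n + K - N) or k > min(K, n):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def intersection_pmf(N, set_sizes):
    """Distribution of the intersection size of independent random subsets."""
    # The running intersection starts as the first set, with a fixed size.
    dist = {set_sizes[0]: 1.0}
    for n in set_sizes[1:]:
        new_dist = {}
        for j, pj in dist.items():
            # Given a running intersection of size j, the overlap with a
            # fresh random set of size n is hypergeometric(N, j, n).
            for k in range(min(j, n) + 1):
                p = hypergeom_pmf(k, N, j, n)
                if p > 0:
                    new_dist[k] = new_dist.get(k, 0.0) + pj * p
        dist = new_dist
    return dist

def intersection_pvalue(N, set_sizes, observed):
    """Upper-tail probability P(intersection size >= observed)."""
    dist = intersection_pmf(N, set_sizes)
    return sum(p for k, p in dist.items() if k >= observed)

# Example dataset: universe of 1000, set sizes 100, 300, 250, observed overlap 50
p_exact = intersection_pvalue(1000, [100, 300, 250], 50)
print(p_exact)
```

With two sets this reduces to the ordinary hypergeometric tail, so it should agree with the two-set calculation from the question.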
Correct answer by StupidWolf on May 31, 2021