Bioinformatics Asked by Iriel on December 17, 2020
I am trying to run the ADMIXTURE software for the first time to analyse the population structure of an in-house dataset of samples by comparing it to the 1000 Genomes Project reference populations. I am trying to determine ancestry from the X chromosome of those individuals alone.
To do that, I downloaded the X chromosome file from 1000 Genomes (ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz). I removed duplicates with bcftools and converted the result to plink binary format with the following command in plink 1.9:
plink --bcf ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.bcf --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed
Next, I modified the .bim files to change the polymorphism IDs so that they were the same as their positions, as a way to standardize their names. Then I filtered out related individuals, also with plink, and found the intersection of the .bim files from 1000 Genomes and my in-house dataset (a bed file with microarray-genotyped variants from the X non-PAR region only), which had already been filtered to contain only variants genotyped against the plus strand (without indels).
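The renaming and intersection step described above can be sketched in Python. This is only an illustration, not the exact procedure used in the question: the function names are hypothetical, and the `chrom:pos` ID format is an assumption (the question only says the IDs were set to the position). Each row is a .bim line split into its six columns (chromosome, variant ID, genetic distance, position, allele 1, allele 2).

```python
def read_bim(path):
    """Read a PLINK .bim file into a list of 6-column rows (lists of strings)."""
    with open(path) as handle:
        return [line.split() for line in handle if line.strip()]


def rename_to_position(rows):
    """Return copies of the rows with the variant ID (column 2) set to 'chrom:pos'.

    The 'chrom:pos' format is an assumption made for this sketch.
    """
    return [[r[0], f"{r[0]}:{r[3]}", r[2], r[3], r[4], r[5]] for r in rows]


def shared_ids(rows_a, rows_b):
    """Variant IDs present in both datasets, in the order of the first one."""
    ids_b = {r[1] for r in rows_b}
    return [r[1] for r in rows_a if r[1] in ids_b]


def write_bim(rows, path):
    """Write rows back out as a tab-separated .bim file."""
    with open(path, "w") as handle:
        for row in rows:
            handle.write("\t".join(row) + "\n")


def write_id_list(ids, path):
    """Write one variant ID per line, e.g. for use with plink --extract."""
    with open(path, "w") as handle:
        handle.write("\n".join(ids) + "\n")
```

The resulting ID list could then be passed to plink's `--extract` option on each dataset. Note that position-based IDs collide when two variants share a position, so duplicates should be removed first, as was done here with bcftools.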
Then I merged both datasets with the following plink command:
plink --bfile filename1 --bmerge filename2.bed filename2.bim filename2.fam --make-bed --out filemerged
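Before a merge like the one above, it can help to check which shared variants would end up with more than two alleles. The following is a minimal sketch under stated assumptions: the function name is hypothetical, both inputs are .bim rows already split into six columns with matching variant IDs, and `0` is treated as a missing allele code. It does not try to detect strand flips.

```python
def multiallelic_after_merge(rows_a, rows_b):
    """Return the sorted variant IDs that would carry more than two alleles
    once both datasets are combined.

    Each row is a .bim line split into
    [chrom, variant_id, cm, pos, allele1, allele2].
    """
    alleles = {}
    for row in rows_a + rows_b:
        # Collect the observed alleles per variant ID, ignoring the '0' code.
        observed = {a for a in (row[4], row[5]) if a != "0"}
        alleles.setdefault(row[1], set()).update(observed)
    return sorted(vid for vid, seen in alleles.items() if len(seen) > 2)
```

Writing the returned IDs to a text file, one per line, gives a list that can be removed from both datasets with plink's `--exclude` option before merging.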
The merge command generated a file listing all the SNPs that would have had three or more alleles in the merged file, so I removed all of them from both datasets and merged again. On that merged file I pruned SNPs by linkage disequilibrium:
plink --bfile cromXmerged2 --indep-pairwise 50 10 0.1
Then I assigned sample sex in the .fam file and filtered out individuals with missingness higher than 10%:
plink --file mergedfile --mind 0.1 --make-bed --out
I also set all heterozygous calls as missing with the plink option --set-hh-missing.
With the final input I ran a supervised ADMIXTURE analysis as follows:
./admixture finalinput.bed 3 --haploid="male:23" --supervised -j4
Before running it, of course, I created the finalinput.pop file in OpenOffice Calc by making a list with the assigned population for each sample, leaving the samples from my in-house dataset blank. I used only the African, European and East Asian 1000 Genomes samples because my own dataset consists of admixed Brazilian samples. The final dataset has approximately 1500 SNPs and about 1600 individuals. My computer setup is CPU: i3-2370M with two cores, 4 [email protected] GHz, RAM 8 GB@1333 MHz and an SSD 120 GB SATA 3.0. OS: Ubuntu 20.04 64-bit.
The thing is that ADMIXTURE has been running for three days without showing any error or producing any result yet. I would like to know whether it is normal for this analysis to take so long or whether there could be a problem with my dataset.
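As an alternative to building the .pop file by hand in a spreadsheet, it can be generated from the .fam file so that the line order is guaranteed to match. This is a rough sketch with hypothetical function names. It assumes sample IDs are in the second .fam column (as they are after plink's `--const-fid`) and that `-` marks samples of unknown ancestry, which is my recollection of the ADMIXTURE manual's convention and worth checking against your version.

```python
def read_fam(path):
    """Read a PLINK .fam file into a list of 6-column rows."""
    with open(path) as handle:
        return [line.split() for line in handle if line.strip()]


def build_pop_lines(fam_rows, reference_pop, unknown="-"):
    """Return one population label per .fam row, in the same order.

    reference_pop maps sample ID (second .fam column) to a label such as
    'AFR', 'EUR' or 'EAS'; samples not in the mapping get the `unknown`
    placeholder (assumed here to be '-').
    """
    return [reference_pop.get(row[1], unknown) for row in fam_rows]


def write_pop(lines, path):
    """Write the labels, one per line, e.g. to finalinput.pop."""
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
```

For example, `write_pop(build_pop_lines(read_fam("finalinput.fam"), reference_pop), "finalinput.pop")` would produce a file with exactly one line per individual, where `reference_pop` is a dictionary built from the 1000 Genomes sample panel.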
I would appreciate any suggestion or advice!!!
Thanks in advance,
Iriel
Apparently there is a bug in ADMIXTURE that prevents it from converging when haploid mode and multithreading are used at the same time (the --haploid and -j flags, respectively). The problem is solved by dropping one of the two flags.
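When ADMIXTURE is launched from a script, one way to avoid the combination is to refuse to build a command that carries both flags. The helper below is a hypothetical sketch, not part of ADMIXTURE. It only assembles the argument list using the flags that appear in the question.

```python
def admixture_command(bed, k, haploid=None, supervised=False, threads=1):
    """Build an ADMIXTURE argument list, rejecting --haploid together with -j.

    `haploid` is the value given to --haploid (e.g. "male:23");
    `threads` > 1 adds the -j flag.
    """
    if haploid and threads > 1:
        raise ValueError("use either --haploid or -j, not both")
    cmd = ["./admixture", bed, str(k)]
    if haploid:
        cmd.append(f"--haploid={haploid}")
    if supervised:
        cmd.append("--supervised")
    if threads > 1:
        cmd.append(f"-j{threads}")
    return cmd
```

The returned list can be handed to `subprocess.run`. For the run in the question, `admixture_command("finalinput.bed", 3, haploid="male:23", supervised=True)` gives the original command without `-j4`.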
Answered by Iriel on December 17, 2020