Bioinformatics Asked by CoderGuy123 on April 11, 2021
I am working with data from the HGDP public files. These are massive VCFs from whole genome sequencing. To begin with, I converted these to BEDs using plink (1.9). The next problem is that these files use rsid names instead of chrpos names. I would like to change to chrpos. I am wondering whether one can simply edit the BIMs or whether I have to do use plink’s --update-name
call. The problem with the latter is that it doesn’t support duplicates. The HGDP files are full of duplicate variant names because many of the variants don’t have rsid’s and are just named ".":
$> plink --bfile hgdp_chr21 --update-name updateids --make-bed --out hgdp_chr21_chrpos
PLINK v1.90b6.17 64-bit (28 Apr 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to hgdp_chr21_chrpos.log.
Options in effect:
--bfile hgdp_chr21
--make-bed
--out hgdp_chr21_chrpos
--update-name updateids
64259 MB RAM detected; reserving 32129 MB for main workspace.
1086522 variants loaded from .bim file.
929 people (0 males, 0 females, 929 ambiguous) loaded from .fam.
Ambiguous sex IDs written to hgdp_chr21_chrpos.nosex .
Error: Duplicate ID '.'.
Thus, one gets an error when trying --update-name
and I am wondering if I can simply edit the BIM files to be chrpos instead of rsid. I can’t find an answer to this anywhere. It seems that if the BED files don’t contain a reference to the variant IDs, but only to the row in the BIM file, then this approach should be possible.
Here’s the current BIM file:
user@desktop:/media/luks8tb1/data/genomics/HGDP$ head hgdp_hg38.bim
1 rs1331781659 0 10153 G A
1 rs1179689498 0 10163 C T
1 rs201694901 0 10180 C T
1 rs199706086 0 10250 C A
1 rs111200574 0 10257 C A
1 rs145427775 0 10291 T C
1 rs868413313 0 10297 T C
1 rs112750067 0 10327 C T
1 rs1035171912 0 10330 A C
1 rs868804735 0 10333 T C
As can be seen, very few lines have the "." variant name. In fact, in the first 100 lines there are 0.
I would reccomend using bcftools
on the original vcf files before you convert them to plink, to fill in missing IDs using the command:
bcftools annotate --set-id +'%CHROM_%POS_%REF_%FIRST_ALT' file.vcf
This means you won't have any duplicate ID names and plink should be happy about that. It will preserve the positional information in the .bim file as well.
Answered by user438383 on April 11, 2021
I found this 2019 question on biostars in which user zx8754 mentions that plink2 has a command for this purpose --set-all-var-ids
, from the plink2 docs:
Whole-exome and whole-genome sequencing results frequently contain variants which have not been assigned standard IDs. If you don't want to throw out all of that data, you'll usually want to assign them chromosome-and-position-based IDs.
--set-missing-var-ids (which just replaces missing IDs) and --set-all-var-ids (which overwrites everything) provide one way to do this. The parameter taken by these flags is a special template string, with a '@' where the chromosome code should go, and a '#' where the base-pair position belongs. (Exactly one @ and one # must be present.) For example, given a .pvar file starting with
#CHROM POS ID REF ALT chr1 10583 . G A chr1 886817 . T C chr1 886817 . C CATTTT
"--set-missing-var-ids @:#[b37]" would name the first variant 'chr1:10583[b37]', the second variant 'chr1:886817[b37]'... and the third variant also gets the name 'chr1:886817[b37]'.
Thus, in my case, I used:
user@desktop:/media/luks8tb1/data/genomics/HGDP$ plink2 --bfile hgdp_hg38 --set-all-var-ids @:# --make-bed --out hgdp_hg38_chrpos
PLINK v2.00a3LM AVX2 Intel (27 Jul 2020) www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to hgdp_hg38_chrpos.log.
Options in effect:
--bfile hgdp_hg38
--make-bed
--out hgdp_hg38_chrpos
--set-all-var-ids @:#
Start time: Tue Aug 25 19:16:10 2020
64259 MiB RAM detected; reserving 32129 MiB for main workspace.
Using up to 8 compute threads.
929 samples (0 females, 0 males, 929 ambiguous; 929 founders) loaded from
hgdp_hg38.fam.
56475980 variants loaded from hgdp_hg38.bim.
Note: No phenotype data present.
Writing hgdp_hg38_chrpos.fam ... done.
Writing hgdp_hg38_chrpos.bim ... done.
Writing hgdp_hg38_chrpos.bed ... done.
End time: Tue Aug 25 19:17:41 2020
Results, BIM:
emil@test-desktop:/media/luks8tb1/data/genomics/HGDP$ head hgdp_hg38_chrpos.bim
1 1:10153 0 10153 G A
1 1:10163 0 10163 C T
1 1:10180 0 10180 C T
1 1:10250 0 10250 C A
1 1:10257 0 10257 C A
1 1:10291 0 10291 T C
1 1:10297 0 10297 T C
1 1:10327 0 10327 C T
1 1:10330 0 10330 A C
1 1:10333 0 10333 T C
Answered by CoderGuy123 on April 11, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP