TransWikia.com

How is the GT field in a VCF file defined?

Bioinformatics Asked on January 7, 2021

As my question on SO was closed and I was asked to post it in this forum, I am posting it here.

I am not from the bioinformatics domain. However, for the sake of analysis, I am trying to pick up certain basics related to the GT field in the VCF file.

I know we have a field called Alleles. May I know under what circumstances GT takes certain values and what those values are called? Can you confirm my understanding?

Ref  Alt     GT   Name
A    A       0/0  Homozygous
A    G       0/1  Heterozygous (does 1/0 also mean the same?) What's this 0 and 1 actually?
A    [C,CA]  ??   ??
??   ??      1/1  HOM_ALT? How and why?

Can the experts here help me fill in the question marks? I would also like to understand, for a given Ref and Alt combination, when a genotype takes a value of 0/0, 1/0, 1/1, 1/2 etc., and what the names for those values are. For example, when is it called HOM_ALT?

Any simple explanation for a beginner like me (with no background in bioinformatics/biology) would be helpful.

2 Answers

You can get most of the info from this paper. See Fig. 1 and the surrounding text. Quoting from there, "GT, genotype, encodes alleles as numbers: 0 for the reference allele, 1 for the first allele listed in ALT column, 2 for the second allele listed in ALT and so on."
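To make the quoted numbering rule concrete, here is a minimal Python sketch that turns a GT string back into allele sequences. The function name `decode_gt` is made up for illustration (it is not part of VCFtools or any library), and it assumes a single record's REF, ALT and GT values have already been extracted as strings.

```python
import re

def decode_gt(ref, alt, gt):
    """Translate a GT string such as '0/1' into the alleles it refers to.

    Hypothetical helper for illustration only.
    """
    # Index 0 is the REF allele; indexes 1, 2, ... are the
    # comma-separated ALT alleles, in the order they are listed.
    alleles = [ref] + ([] if alt == "." else alt.split(","))
    decoded = []
    # Allele indexes are separated by '/' (unphased) or '|' (phased).
    for index in re.split(r"[/|]", gt):
        # '.' marks a missing allele call.
        decoded.append(None if index == "." else alleles[int(index)])
    return decoded

print(decode_gt("A", "G", "0/1"))     # ['A', 'G']
print(decode_gt("A", "C,CA", "1/2"))  # ['C', 'CA']
print(decode_gt("A", "G", "1/1"))     # ['G', 'G']
```

So the numbers in GT are just positions in the list "REF, then each ALT in order".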

In your case, the reference allele, here a single nucleotide, is A. When the sample carries A, matching the reference, the allele is coded as reference, or 0. There are 2 copies of each non-sex chromosome (chr1-chr22) in the human genome, so the genotype has two entries: 0/0, called homozygous reference, or HOM_REF.

When ALT=G, and the GT column is 0/1, this means that you have 1 reference allele (0), and 1 alternate allele (1). This means that you have A on one copy of this locus, and G on the other. The convention is to write the GT field in ascending order, so 0/1 rather than 1/0. This is called heterozygous, or HET.

When ALT=C,CA, the GT is probably 1/2, because there are 2 alternate alleles, and I assume we continue with the same chromosome present in 2 copies. This means there are no reference alleles here at all, only alternate alleles. It is a heterozygous genotype composed of two different ALT alleles, or HET_ALT. Note that it is not enclosed in square brackets in the vcf file format: A <tab> C,CA <tab> 1/2 ....

Finally, these are some examples of HOM_ALT:

A    C     1/1
A    G     1/1
A    CA    1/1

This means that the same ALT allele (either C, or G, or CA) is present in 2 copies. There is no reference allele present. This is called a homozygous alternate genotype, or HOM_ALT.

In general, the name homo means the same, and hetero means different, in the context of genotypes.
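The naming rules above can be summarised in a short Python sketch. The function `classify_gt` is a hypothetical helper, not something from a VCF library; the HOM_REF, HET, HOM_ALT and HET_ALT labels are the ones used in this answer, while the MISSING label for genotypes containing `.` is my own assumption.

```python
import re

def classify_gt(gt):
    """Name a diploid genotype such as '0/1' (illustrative sketch only)."""
    # Split on '/' (unphased) or '|' (phased); the name does not depend on phase.
    indexes = re.split(r"[/|]", gt)
    if "." in indexes:
        # Assumption: any missing allele call is reported as MISSING.
        return "MISSING"
    distinct = set(indexes)
    if len(distinct) == 1:
        # Both copies are the same allele: homozygous.
        return "HOM_REF" if distinct == {"0"} else "HOM_ALT"
    # Two different alleles: heterozygous, with or without the reference.
    return "HET" if "0" in distinct else "HET_ALT"

for gt in ["0/0", "0/1", "1/1", "1/2"]:
    print(gt, classify_gt(gt))
```

Note that 2/2 would also be HOM_ALT under this rule, since both copies carry the same non-reference allele.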

REFERENCES:

Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156-2158. doi:10.1093/bioinformatics/btr330 : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/

SEE ALSO:

VCF - Variant Call Format
What Does Genotype ("0/0", "0/1" Or "1/1") In *.Vcf File Represent?
Difference between 0/0 and ./. for genotype in VCF
1/2 in VCF genotype field?

Correct answer by Timur Shtatland on January 7, 2021

0 and 1 are just a way of coding the reference (0) and alternate (1) alleles. These could be A/G, C/T etc. It's just a simplified way of expressing the different alleles.

0/1 and 1/0 functionally mean the same thing (that the individual is a heterozygote): since the genotype is unphased, the alleles aren't ordered. The / symbol tells you the genotype is unphased. However, in practice, a heterozygous genotype is usually written as 0/1 as a matter of convention. 0/0 is referred to as homozygous reference, and 1/1 as homozygous alternative.

0|1 and 1|0 do mean something different, since the pipe symbol | tells us the genotype is phased, so the order of the alleles matters.
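As a rough illustration of the difference between / and |, here is a small Python sketch that compares two GT strings. `same_genotype` is a made-up helper name, and treating a phased/unphased mix as unordered is my own simplifying assumption rather than anything the VCF specification prescribes.

```python
def same_genotype(gt_a, gt_b):
    """Compare two GT strings, respecting allele order only when both are phased."""
    both_phased = "|" in gt_a and "|" in gt_b
    # Normalise the separator so both strings can be split the same way.
    a = gt_a.replace("|", "/").split("/")
    b = gt_b.replace("|", "/").split("/")
    if both_phased:
        # Phased: the order of the alleles carries information.
        return a == b
    # Unphased (or mixed, by assumption): order is ignored.
    return sorted(a) == sorted(b)

print(same_genotype("0/1", "1/0"))  # True
print(same_genotype("0|1", "1|0"))  # False
print(same_genotype("0|1", "0|1"))  # True
```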

Occasionally you will have multiallelic positions, where you have more than one alternate allele, and therefore the field will look like:

Ref   Alt
A   [C,GT]

The extra alt allele is given the number 2, so in this case an individual carrying one reference A allele and one GT allele would have the genotype code 0/2, while an individual carrying C and GT would be 1/2.
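For the Ref A, Alt C,GT site, the possible unphased diploid genotype codes can be listed with a short Python sketch. `possible_genotypes` is a hypothetical helper written for this answer, and it assumes a diploid sample and an ALT field given as a comma-separated string without brackets.

```python
from itertools import combinations_with_replacement

def possible_genotypes(ref, alt):
    """Map each unphased diploid GT code to the pair of alleles it stands for."""
    # Index 0 is REF; ALT alleles follow in the order they are listed.
    alleles = [ref] + alt.split(",")
    result = {}
    # Pairs (i, j) with i <= j, matching the ascending-order convention.
    for i, j in combinations_with_replacement(range(len(alleles)), 2):
        result[f"{i}/{j}"] = (alleles[i], alleles[j])
    return result

for code, pair in possible_genotypes("A", "C,GT").items():
    print(code, pair)
```

With one reference and two alternate alleles this gives six codes: 0/0, 0/1, 0/2, 1/1, 1/2 and 2/2.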

Full detail is given in the official VCF specification page.

Answered by user438383 on January 7, 2021
