How can I subset WGS data to the level of WES variants?

Question

I would like to compare mutational signatures [1] in patients from different studies. However, some studies are based on exome sequencing (i.e. ~20,000 coding variants) and some on whole-genome sequencing (i.e. ~22,000 coding variants). Is there a way to 'downsample' the WGS data to better reflect the WES data and effectively ignore the coordinates of those ~2,000 extra coding variants in the VCF files?

[1] Alexandrov, L.B., Kim, J., Haradhvala, N.J., Huang, M.N., Ng, A.W.T., Wu, Y., Boot, A., Covington, K.R., Gordenin, D.A., Bergstrom, E.N. and Islam, S.A., 2020. The repertoire of mutational signatures in human cancer. Nature, 578(7793), pp.94-101.

Carambakaracho · Answer

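bcftools would be my choice; I'm sure bedtools could do the trick, too.
Something along these lines (the file names are placeholders):
bcftools view --regions-file exome_regions.bed whole_genome.vcf.gz

or use --targets-file instead. --regions-file needs a bgzip-compressed, indexed input (e.g. a tabix index), whereas --targets-file streams through the whole file and works without an index.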
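
To make the two options a bit more concrete, here is a minimal sketch wrapped in two shell functions. It assumes you have a file of exome intervals or positions (for example a capture-kit BED file, which the question does not mention), and the function and file names are made up for illustration.

```shell
# Sketch only: subset_to_regions, subset_by_targets and all file names
# are hypothetical; adapt them to your own data.

# Indexed route: bgzip + index the WGS calls, then let bcftools jump
# straight to the listed intervals with -R/--regions-file.
subset_to_regions() {
    wgs_vcf=$1      # input WGS VCF
    regions=$2      # BED file, or tab-delimited CHROM/POS list
    out=$3          # output VCF
    bcftools view -Oz -o "${out}.input.vcf.gz" "$wgs_vcf" || return 1
    bcftools index -f -t "${out}.input.vcf.gz" || return 1
    bcftools view -R "$regions" -o "$out" "${out}.input.vcf.gz"
}

# Streaming route: -T/--targets-file reads the whole file and needs no index.
subset_by_targets() {
    wgs_vcf=$1
    targets=$2
    out=$3
    bcftools view -T "$targets" -o "$out" "$wgs_vcf"
}
```

Usage would look like `subset_to_regions whole_genome.vcf exome_regions.bed whole_genome.exome.vcf`. The indexed route tends to pay off when the same WGS file is subset repeatedly; for a one-off, the streaming route is simpler.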

user438383 · Answer

I'll expand slightly on the previous answer. First print the positions from the exome file, then use bcftools view to keep only the whole-genome variants at those positions. You could also bgzip and index whole_genome.vcf and use -R instead of -T to make the filtering faster.
bcftools query \
    -f '%CHROM\t%POS\n' \
    exome.vcf > exome_variants.txt

bcftools view \
    -T exome_variants.txt \
    whole_genome.vcf > whole_genome.exome_positions.vcf
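
As a quick cross-check of what the position filter is doing, the same CHROM/POS matching can be sketched with plain awk. This is a stand-in for illustration, not what bcftools runs internally, and the toy records below are invented so the snippet runs on its own.

```shell
# Toy stand-in data (hypothetical); swap in your real files.
workdir=$(mktemp -d)
printf 'chr1\t100\nchr1\t300\n' > "$workdir/exome_variants.txt"
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n'
    printf 'chr1\t200\t.\tC\tT\t.\tPASS\t.\n'
    printf 'chr1\t300\t.\tG\tA\t.\tPASS\t.\n'
    printf 'chr2\t100\t.\tT\tC\t.\tPASS\t.\n'
} > "$workdir/whole_genome.vcf"

# First file: remember every CHROM/POS pair.
# Second file: pass header lines through, keep records whose pair was seen.
awk -F'\t' '
    NR == FNR { keep[$1 FS $2] = 1; next }
    /^#/      { print; next }
    { key = $1 FS $2; if (key in keep) print }
' "$workdir/exome_variants.txt" "$workdir/whole_genome.vcf" > "$workdir/subset.vcf"

# Count the data records that survived the filter.
kept=$(grep -vc '^#' "$workdir/subset.vcf")
echo "records kept: $kept"
```

With the toy data only the chr1:100 and chr1:300 records survive. Note that matching on CHROM and POS alone ignores REF/ALT, so a different allele at the same position is kept as well.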
