Bioinformatics Asked by jared_mamrot on February 12, 2021
I would like to compare mutational signatures [1] in patients from different studies. However, some studies are based on exome sequencing (i.e. ~20,000 coding variants) and some on whole-genome sequencing (i.e. ~22,000 coding variants). Is there a way to 'downsample' the WGS data to better reflect the WES data and effectively 'ignore' the coordinates of those ~2,000 extra coding variants in the VCF files?
[1] Alexandrov, L.B., Kim, J., Haradhvala, N.J., Huang, M.N., Ng, A.W.T., Wu, Y., Boot, A., Covington, K.R., Gordenin, D.A., Bergstrom, E.N. and Islam, S.A., 2020. The repertoire of mutational signatures in human cancer. Nature, 578(7793), pp.94-101.
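bcftools would be my choice, though I'm sure bedtools could do the trick too. Something along these lines:
bcftools view --regions-file
or --targets / --targets-file. Note that --regions-file needs a bgzip-compressed, indexed VCF (e.g. a tabix index), whereas the targets options stream through the file and work without one.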
Answered by Carambakaracho on February 12, 2021
I'll expand slightly on the previous answer. First extract the variant positions from the exome file, then use bcftools view to keep only the variants at those positions in the whole-genome file. You could also index the whole_genome.vcf file to make the filtering faster.
bcftools query -f '%CHROM\t%POS\n' exome.vcf > exome_variants.txt
bcftools view -T exome_variants.txt whole_genome.vcf > whole_genome.exome_positions.vcf
Answered by user438383 on February 12, 2021
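For readers who want to see what the position filter does, or who do not have bcftools at hand, here is a minimal awk sketch of the same idea. It is not part of either answer: it assumes an uncompressed VCF and a tab-separated CHROM/POS list like exome_variants.txt, and the toy records it creates are made-up data for illustration only. It matches on chromosome and position alone, so it does not check alleles.

```shell
# Toy inputs so the sketch runs on its own (hypothetical data, tab-separated).
printf 'chr1\t100\nchr2\t300\n' > exome_variants.txt
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > whole_genome.vcf
printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\n' >> whole_genome.vcf
printf 'chr1\t200\t.\tG\tC\t.\tPASS\t.\n' >> whole_genome.vcf
printf 'chr2\t300\t.\tC\tA\t.\tPASS\t.\n' >> whole_genome.vcf

# First file: remember every CHROM/POS pair.
# Second file: keep header lines and any record whose CHROM/POS pair was seen.
awk -F'\t' '
    NR == FNR   { keep[$1 FS $2] = 1; next }
    /^#/        { print; next }
    ($1 FS $2) in keep
' exome_variants.txt whole_genome.vcf > whole_genome.exome_positions.vcf
```

On real data, drop the printf lines and point awk at your own files. For large or compressed VCFs the bcftools route is the more robust choice.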