Bioinformatics Asked by user3390486 on March 19, 2021
I had my DNA tested by MyHeritage and they sent me a CSV file with RSID, chromosome, position and result (which base), with about 700,000 rows.
I understand most analyses of DNA use VCF files, but is there anything I can do with this CSV file, e.g. check for health-related genetic variants?
I am not a bioinformatician, but I am a scientist and I can use R and Python. I've heard of the gnomAD database but I'm not sure if I can match things to my CSV file.
A CSV file with RSID, chromosome, position and result is enough for what you want to do; these are essentially the core columns of a VCF (which is just a TSV with some header lines describing how it was made).
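As a starting point, here is a minimal sketch of loading such a CSV with pandas. The column names, the `#` comment lines, the `--` no-call code and the rsIDs in the sample are all assumptions made for illustration, so check them against your actual export.

```python
import io
import pandas as pd

# Hypothetical excerpt of a raw-data export. The header names, the '#'
# comment line and the "--" no-call code are assumptions; the rsIDs are
# made up. Replace `raw_csv` with the path to your real file.
raw_csv = io.StringIO(
    "# example header comment\n"
    "RSID,CHROMOSOME,POSITION,RESULT\n"
    "rs0000001,1,930336,AG\n"
    "rs0000002,1,930248,GG\n"
    "rs0000003,X,2700157,--\n"
)

genotypes = pd.read_csv(
    raw_csv,
    comment="#",  # skip any leading comment lines
    dtype={"CHROMOSOME": str, "RESULT": str},  # keep "1" and "X" as the same type
)

# Drop positions without a call (assumed to be encoded as "--").
genotypes = genotypes[genotypes["RESULT"] != "--"].reset_index(drop=True)
print(genotypes)
```

Reading the chromosome as a string avoids a column that mixes integers with values such as X, Y and MT.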
Given that you have ~700k rows, I suspect that yes, there will be health-related SNVs (single nucleotide variants) among them. Disclaimer: these data could include information pertaining to your health and your family's health, so I strongly advise that you speak to a genetic counsellor to understand these things.
gnomAD is a population database, so generally speaking (with plenty of exceptions, of course), if an SNV appears in it more than a few times, it's probably not causing disease. https://www.ncbi.nlm.nih.gov/clinvar/ and https://omim.org/ are examples of disease databases.
Here's a pandas example with ClinVar (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/; you'll need the version that matches your data):
import pandas as pd

# Read the ClinVar VCF as a tab-separated table, skipping the '#' header lines.
df = pd.read_csv(
    'clinvar.vcf.gz',
    sep='\t',
    comment='#',
    header=None,
    names=['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'],
)
df.head()
CHROM POS ID REF ALT QUAL FILTER INFO
0 1 930188 846933 G A . . ALLELEID=824438;CLNDISDB=MedGen:CN517202;CLNDN...
1 1 930203 972363 C T . . ALLELEID=959431;CLNDISDB=MedGen:CN517202;CLNDN...
2 1 930248 789256 G A . . AF_ESP=0.00347;AF_EXAC=0.00622;AF_TGP=0.00280;...
3 1 930275 969662 T G . . ALLELEID=959432;CLNDISDB=MedGen:CN517202;CLNDN...
4 1 930336 843786 G A . . ALLELEID=824439;CLNDISDB=MedGen:CN517202;CLNDN...
Then let's say one of your 700k SNVs is at chromosome 1, position 930336, and you have an A:
print(*df.query('(CHROM == 1) & (POS == 930336) & (ALT == "A")')['INFO'].str.split(';'))
This gives a list from which you could pull out CLNSIG, which here is Uncertain_significance:
['ALLELEID=824439', 'CLNDISDB=MedGen:CN517202', 'CLNDN=not_provided', 'CLNHGVS=NC_000001.11:g.930336G>A', 'CLNREVSTAT=criteria_provided,_single_submitter', 'CLNSIG=Uncertain_significance', 'CLNVC=single_nucleotide_variant', 'CLNVCSO=SO:0001483', 'GENEINFO=SAMD11:148398', 'MC=SO:0001583|missense_variant', 'ORIGIN=1']
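To pull out a single field such as CLNSIG without scanning the list by eye, the INFO string can be turned into a dictionary. This is a small sketch, not part of the original answer; `parse_info` is a hypothetical helper name, and the INFO string is the one shown above.

```python
def parse_info(info: str) -> dict:
    """Turn a VCF INFO string ('KEY=value;KEY=value;FLAG') into a dict."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag-style entries have no '=', so store them as True.
        fields[key] = value if sep else True
    return fields


# INFO string for chromosome 1, position 930336, copied from the output above.
info = (
    "ALLELEID=824439;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;"
    "CLNHGVS=NC_000001.11:g.930336G>A;"
    "CLNREVSTAT=criteria_provided,_single_submitter;"
    "CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;"
    "CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;"
    "MC=SO:0001583|missense_variant;ORIGIN=1"
)

record = parse_info(info)
print(record["CLNSIG"])    # Uncertain_significance
print(record["GENEINFO"])  # SAMD11:148398
```

Not every ClinVar row carries every key (the third row in the `df.head()` output starts with allele-frequency fields instead), so `record.get("CLNSIG")` is the safer lookup in bulk.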
So you could easily parse your 700k SNVs and get their respective clinical info this way. Please think about this carefully before you do it, though: there are many implications, including for life insurance and mental health! I'm of the opinion that this sort of data is useful, but MANY people disagree. Further, this advice is general in nature only; I accept no liability for what you or anyone else chooses to do with this publicly available data.
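Rather than querying one position at a time, all 700k rows can be matched in a single join. The sketch below is only illustrative. It assumes the CSV column names RSID, CHROMOSOME, POSITION and RESULT, that RESULT holds both alleles as a two-letter string, and that the CSV and the ClinVar file use the same genome build and strand. The rsIDs and genotypes are made up; the two ClinVar rows come from the `df.head()` output above, with INFO shortened.

```python
import pandas as pd

# Hypothetical genotype rows (column names and values are assumptions).
genotypes = pd.DataFrame({
    "RSID": ["rs0000001", "rs0000002"],
    "CHROMOSOME": ["1", "1"],
    "POSITION": [930336, 930248],
    "RESULT": ["AG", "GG"],
})

# Two ClinVar rows from the df.head() output above, INFO shortened.
clinvar = pd.DataFrame({
    "CHROM": ["1", "1"],
    "POS": [930248, 930336],
    "REF": ["G", "G"],
    "ALT": ["A", "A"],
    "INFO": [
        "AF_ESP=0.00347;AF_EXAC=0.00622;AF_TGP=0.00280",
        "ALLELEID=824439;CLNSIG=Uncertain_significance",
    ],
})

# Join on chromosome and position; both key pairs must share dtypes.
merged = genotypes.merge(
    clinvar,
    left_on=["CHROMOSOME", "POSITION"],
    right_on=["CHROM", "POS"],
    how="inner",
)

# Keep only rows where the genotype actually contains the ClinVar ALT allele.
has_alt = [alt in result for alt, result in zip(merged["ALT"], merged["RESULT"])]
carried = merged[has_alt].copy()

# Pull the clinical significance out of INFO (NaN when the key is absent).
carried["CLNSIG"] = carried["INFO"].str.extract(r"CLNSIG=([^;]+)", expand=False)
print(carried[["RSID", "CHROM", "POS", "RESULT", "ALT", "CLNSIG"]])
```

With the real ClinVar table, converting `CHROM` to strings first (for example `df["CHROM"].astype(str)`) keeps the join keys comparable. Positions where several ALT alleles are listed will produce one merged row per allele.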
Answered by Liam McIntyre on March 19, 2021