Find rsIDs for GRCh37 SNPs and rsIDs for GRCh38 SNPs and compare for overlap

Question

I have a long list of variant IDs that I generated as a result of research done one GRCh37 genomes (e.g. 13_28025615_G_C_b37). I want to get their rsIDs and compare them to a newly-obtained list of GRCh38 rsIDs that have also been retrieved from similarly formatted variant IDs (e.g. chr22_50473581_G_A_b38). How would I go about doing this? I've tried installing biomaRt as a package but I can't quite make heads or tails of it and it's not clear to me that it would support both versions of genome builds.

swbarnes2 · Accepted Answer

it's not clear to me that it would support both versions of genome
  builds.

You can specify what version of the database you are querying.

I'd start with biomart on the ensembl website, see if you can get it to do what you want, then work on getting biomaRt to do what you want.

Maxim Masiutin · Answer

rsIDs change with the dbSNP builds, not with the reference genome builds. rsIDs do not depend on a reference genome. They point to a specified locus regardless of the differences in genomic assemblies. rsIDs numbersare assigned by dbSNP database maintainers. The numbers may be deleted from database (e.g. when a submitter withdraw data) and may change on merges (e.g. when multiple rsIDs were assigned at the same genomic location).
Specifying the location by rsID is stable, and can be only affected by an eventual merge or deletion, if any, with a new dbSNP database build. It cannot be affected by a reference genome build.
However, a base position (offset) of a locus within a chromosome may change between the builds of the reference genome. So, when specifying the location by the chromosome and base position, the reference genome build should be also given, to locate an appropriate rsID.
You can download dbSNP database files (in VCF format) of the database build that was actual at a time when a specific build of a reference genome was also actual, but these files are huge. A possible place to download is https://ftp.ncbi.nih.gov/snp/pre_build152/organisms/ or https://ftp.ncbi.nih.gov/snp/archive/
However, these files are arranged by the dbSNP database build (e.g. 152, 153, 154), not by the reference genome build (e.g. 37 or 38).
If your list is relatively small, you can fetch the data rs numbers via dbSNP API by chromosome and location. The API is available at https://www.ncbi.nlm.nih.gov/snp/docs/eutils_help/
For instance, do “advanced search” of dbSNP at https://www.ncbi.nlm.nih.gov/snp/advanced
When it comes to search by the base position, you can only specify positions for GRCh37 (called “Base Position Previous”) and GRCh38 (“Base Position”). This URL will help you construct the request string that you will further use for the dbSNP API requests.
You may also need to add “NOT MergedRS[All Fields]” to exclude rsIDs that have already been merged. I wish I knew what to type instead of “All Fields” here to refine the search, but the documentation is very sparse.
In your examples, for 13_28025615_G_C_b37 specify Chromosome=13 and Base Position Previous=28025615 in the “Advanced Search”.
As about the API, for 13_28025615_G_C_b37 query https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&term=13[Chromosome]+AND+28025615[Base+Position+Previous]+NOT+MergedRS[All+Fields]
and for chr22_50473581_G_A_b38 query https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&term=22[Chromosome]+AND+50473581[Base+Position]+NOT+MergedRS[All+Fields]
The first rsID that you will get will be the main current rsID, the others (if any) will be the merged ones.
Here is a small perl example on how to retrieve data from the dbSNP. The fetch_GRCh37 function returns the rsID for a given chromosome and base position. For GRCh38, remove the “+Previous” from the URL. You can call the fetch_GRCh37 in a script multiple times. It’s better that you get rsIDs for your both GRCh37 and GRCh38 data to make sure that you refer to the same rsIDs that are actual after eventual merges.
Please make sure to read and comply to the requirements before using the API.
A sample Perl script snippet follows:
use XML::LibXML;
use LWP::UserAgent;

sub fetch_GRCh37 {
        my $chromosome = $_[0];
        my $basepos = $_[1];
        my $url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&term='.$chromosome.'[Chromosome]+AND+'.$basepos.'[Base+Position+Previous]+NOT+MergedRS[All+Fields]';
        my $UserAgent = new LWP::UserAgent ();
        my $GET = $UserAgent->get($url)->content;
        my $dom = XML::LibXML->load_xml(string => $GET);
        my $numberofnodes = int($dom->findnodes('eSearchResult/Count')->to_literal());
        if ($numberofnodes > 0) {foreach my $title ($dom->findnodes('eSearchResult/IdList/Id')){return("rs".$title->to_literal());}}
}

print fetch_GRCh37(13, 28025615);

Find rsIDs for GRCh37 SNPs and rsIDs for GRCh38 SNPs and compare for overlap

2 Answers

Add your own answers!

Ask a Question