Bioinformatics Asked on June 22, 2021
Most of the time, I rely on gene IDs to combine different datasets. However, in some instances I have to combine datasets based on gene names. If I don’t know the source of the gene names in a dataset, I have to choose a naming source, be it Ensembl, HGNC, etc. for human genes. I wonder whether this is a common issue and whether there is a reliable method for dealing with it.
To demonstrate the mismatch between sources, I compared gene names for all human genes. I obtained the names from the 4 sources listed below, using BioMart (via pybiomart):
+-----------------+----------------------+-------------------------------------------+
| source          | attribute_name       | display_name                              |
+-----------------+----------------------+-------------------------------------------+
| HGNC            | hgnc_symbol          | HGNC symbol                               |
| NCBI            | entrezgene_accession | NCBI gene (formerly Entrezgene) accession |
| Uniprot         | uniprot_gn_symbol    | UniProtKB Gene Name symbol                |
| Ensembl (maybe) | external_gene_name   | Gene name                                 |
+-----------------+----------------------+-------------------------------------------+
This comparison revealed several clear patterns.
Protein-coding genes match best across the sources (left, measured by the Jaccard index), and the majority of these genes have a single unique name (shown on right).
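For reference, the pairwise Jaccard index used above can be sketched as follows. This is a minimal illustration, assuming the index is computed on the pooled set of names from each source; the gene names and the `names_by_source` dictionary are made-up placeholders, not actual BioMart results.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of names."""
    a, b = set(a), set(b)
    union = a | b
    # Convention chosen for this sketch: two empty sets score 0.0.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: the set of gene names reported by each source.
names_by_source = {
    "hgnc": {"GENE_A", "GENE_B", "GENE_C"},
    "ncbi": {"GENE_A", "GENE_B", "GENE_D"},
    "ensembl": {"GENE_A", "GENE_B", "GENE_C", "GENE_E"},
}

# Jaccard index for every unordered pair of sources (keys sorted by name).
pairwise = {
    (s1, s2): jaccard(names_by_source[s1], names_by_source[s2])
    for s1, s2 in combinations(sorted(names_by_source), 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 2))
```

A value of 1 would mean two sources use exactly the same set of names, and 0 would mean they share none.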
However, the match is considerably worse for non-protein-coding genes. Here, HGNC and Ensembl match best. (I don’t expect UniProt gene names to match, because they exist only for protein-coding genes.) Remarkably, most of these genes have 2 unique gene names (shown on right).
Comparing all genes shows that some pairs of sources match poorly, e.g. Ensembl and UniProt, with many genes having 2 unique gene names(!).
I saw a similar pattern for genes on the chromosomes (autosomes, X, Y) and on the scaffolds.
Mitochondrial genes clearly have different names in different databases. None of them has a single unique gene name(!).
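The "number of unique names per gene" mentioned in the observations above could be counted roughly like this. It is only a sketch: the `records` mapping, the gene IDs and the names are invented placeholders, and names are compared case-sensitively with missing values ignored.

```python
from collections import Counter


def count_unique_names(names):
    """Number of distinct, non-missing names reported for one gene."""
    return len({name for name in names if name})


# Hypothetical toy data: gene ID -> name from each source (None = no name).
records = {
    "ID1": {"hgnc": "GENE_A", "ncbi": "GENE_A", "ensembl": "GENE_A"},
    "ID2": {"hgnc": "MT-GENE_B", "ncbi": "GENE_B", "ensembl": "MT-GENE_B"},
    "ID3": {"hgnc": None, "ncbi": "LOC_C", "ensembl": "GENE_C"},
}

# Unique-name count per gene, then how many genes fall into each count.
unique_counts = {
    gene_id: count_unique_names(by_source.values())
    for gene_id, by_source in records.items()
}
distribution = Counter(unique_counts.values())

print(unique_counts)
print(dict(distribution))
```

A gene with a count of 1 is named consistently wherever it is named at all; a count of 2 or more flags a disagreement between sources.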
How should I deal with such mismatches between sources?
Should I prefer one particular source, or is there a way to make use of the synonymous gene names from different sources?
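To make the second option concrete, here is a minimal sketch of pooling synonymous names into a single lookup keyed by gene ID. It is not an established method: `build_name_index` and `resolve` are hypothetical helpers, the data are made up, and matching names case-insensitively is an assumption that may not suit every dataset. Names that point to more than one gene are reported as unresolved rather than guessed.

```python
from collections import defaultdict


def build_name_index(records):
    """Map every known name (upper-cased) to the set of gene IDs using it."""
    index = defaultdict(set)
    for gene_id, names in records.items():
        for name in names:
            if name:  # skip missing names
                index[name.upper()].add(gene_id)
    return index


def resolve(name, index):
    """Return the gene ID for a name, or None if unknown or ambiguous."""
    hits = index.get(name.upper(), set())
    return next(iter(hits)) if len(hits) == 1 else None


# Hypothetical toy data: gene ID -> names collected from all sources.
records = {
    "ID1": ["GENE_A", "GENE_A"],
    "ID2": ["MT-GENE_B", "GENE_B"],
    "ID3": ["GENE_C", "GENE_B"],  # "GENE_B" is shared with ID2: ambiguous
}

index = build_name_index(records)
print(resolve("gene_a", index))     # unambiguous name
print(resolve("GENE_B", index))     # ambiguous name
print(resolve("MT-GENE_B", index))  # source-specific synonym
```

With such an index, two datasets that use names from different sources can each be translated to gene IDs first and then combined on the IDs, with the ambiguous and unknown names set aside for manual review.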