Bioinformatics Asked by DN1 on February 11, 2021
I have a list of HGNC gene symbols and want to get the length of each gene. I already describe these genes using many UCSC datasets as features, so I am wondering whether there is a UCSC dataset I can also use to get gene length.
I've been looking through the data downloadable from the UCSC Table Browser, aiming to find the start and end of each gene and subtract them to get the gene length, but there are a lot of files and I'm not sure which dataset to use that will also match my HGNC gene symbols.
You could do a few command-line operations to answer this question. This assumes the use of the hg38 assembly.
First, get a list of genes from GENCODE (convert2bed is part of BEDOPS):
$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    > genes.bed
Then use grep to filter those genes with your list of HGNC symbols:
$ grep -wFf hgnc_symbols.txt genes.bed > filtered_genes.bed
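One caveat with the word-based match: grep -w treats a hyphen as a word boundary, so a symbol such as A1BG also matches the line for A1BG-AS1. If that matters for your list, a stricter filter can compare the whole gene_name value instead. The following is only a sketch, assuming the BED attributes column still carries a gene_name= tag as GENCODE GFF3 records do; the sample file names and coordinates are made up for illustration.

```shell
# Hypothetical sample inputs (made-up coordinates and IDs), standing in for
# hgnc_symbols.txt and genes.bed.
printf 'A1BG\n' > sample_symbols.txt
printf 'chr19\t1000\t2000\tENSG_X\t.\t-\tHAVANA\tgene\t.\tID=ENSG_X;gene_name=A1BG;level=2\n' > sample_all_genes.bed
printf 'chr19\t1500\t3000\tENSG_Y\t.\t+\tHAVANA\tgene\t.\tID=ENSG_Y;gene_name=A1BG-AS1;level=2\n' >> sample_all_genes.bed

# First file: load the wanted symbols into an array.
# Second file: keep a line only if its full gene_name value is in that array.
awk -F'\t' '
    NR == FNR { wanted[$1]; next }
    match($0, /gene_name=[^;]+/) {
        # "gene_name=" is 10 characters long, so skip past it
        symbol = substr($0, RSTART + 10, RLENGTH - 10)
        if (symbol in wanted) print
    }
' sample_symbols.txt sample_all_genes.bed > sample_exact_filtered.bed
```

On this sample, only the A1BG line is kept, whereas grep -wFf would keep both lines.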
You can run this through awk to get lengths:
$ awk -vFS="\t" '{ print $3-$2 }' filtered_genes.bed > filtered_gene_lengths.txt
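A file of bare lengths loses track of which gene each number belongs to, so it may be handier to print the symbol next to the length. This is a rough sketch rather than part of the original recipe: it assumes each line of filtered_genes.bed contains a gene_name= tag in its attributes, and it uses a small made-up sample file so it can be run on its own.

```shell
# Hypothetical two-line sample standing in for filtered_genes.bed
# (tab-separated; coordinates and IDs are made up).
printf 'chr1\t100\t600\tENSG_A\t.\t+\tHAVANA\tgene\t.\tID=ENSG_A;gene_name=GENE1;level=2\n' > sample_genes.bed
printf 'chr2\t50\t250\tENSG_B\t.\t-\tHAVANA\tgene\t.\tID=ENSG_B;gene_name=GENE2;level=2\n' >> sample_genes.bed

# Print the symbol and the length (end minus start), tab-separated.
awk -F'\t' -v OFS='\t' '
    match($0, /gene_name=[^;]+/) {
        # "gene_name=" is 10 characters long, so skip past it
        symbol = substr($0, RSTART + 10, RLENGTH - 10)
        print symbol, $3 - $2
    }
' sample_genes.bed > sample_gene_lengths.tsv
```

Because BED start coordinates are zero-based and end coordinates are exclusive, end minus start gives the full span of the gene, introns included.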
Answered by Alex Reynolds on February 11, 2021