Bioinformatics Asked by fp1234 on April 13, 2021
I am new to working with the NCBI database so I am not sure how trivial this question is. I wanted to get DNA sequence of a specific version of the Xenopus tropicalis genome of a specific region (ie 1-100bp) without doing it manually each time (going on genome browser). I know there is a faster way and efficient way to do this through the command line but I found the NCBI tutorial hard to follow.
Samtools
Samtools is a toolkit that includes a binary called samtoools
, which can be used to query a FASTA file — to extract a subsequence based on some coordinate range you specify.
If you do not have samtools
installed, you need to add it to your computer. For OS X, you could use Homebrew, for example:
$ brew install samtools
For Ubuntu, you could use apt-get
:
$ sudo apt-get install samtools
Reference assembly
The next step is to download a reference assembly for your genome and genomic build-of-interest.
There is a research group dedicated to making Xenopus datasets available. If you were interested in v10 of the Xentr
genome, for instance, you could do the following or similar to download it:
$ curl --output "XENTR_10.0_genome.fasta.gz" "http://ftp.xenbase.org/pub/Genomics/JGI/Xentr10.0/XENTR_10.0_genome.fasta.gz"
You could also dig through the NCBI site to get a similar link. It's the same idea — just change the web address, accordingly.
Indexing
Indexing makes it possible to extract an arbitrary subsequence of interest from this FASTA file. You need to index, but you only need to do it once:
$ samtools faidx XENTR_10.0_genome.fasta.gz
This creates two new files: XENTR_10.0_genome.fasta.gz.fai
and XENTR_10.0_genome.fasta.gz.gzi
.
Note: You need to keep these two files in the same directory as the original, compressed FASTA file XENTR_10.0_genome.fasta.gz
.
Querying
Now you are ready to query, or extract a subsequence of interest. If you wanted the first hundred bases of the first chromosome, for instance:
$ samtools faidx XENTR_10.0_genome.fasta.gz Chr1:1-100
>Chr1:1-100
cctcagcgtcttcttctttctgctcccactgcattagcatagggagagagggccacagag
gctaggtagcattgccggctagcaatcttcagcgtcctct
The search term used here is Chr1:1-100
. Replace Chr1
with the chromosome of interest and 1
and 100
with the start and stop coordinates, respectively, of interest. Coordinates used with samtools
queries use a one-based indexing scheme.
Note: This uses the chromosome naming scheme in the XENTR_10.0_genome.fasta.gz
file (Chr1
, etc.). Other research groups (NCBI) might use different chromosome names. If you get your FASTA from somewhere else, investigate the FASTA file to determine the naming scheme being used.
If you want to redirect output to a file, use the output redirect operator (>
):
$ samtools faidx XENTR_10.0_genome.fasta.gz Chr1:1-100 > mySubsequence.fa
Correct answer by Alex Reynolds on April 13, 2021
I'd download the fasta, and use samtools, or maybe bedtools.
Answered by swbarnes2 on April 13, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP