How to retrieve the DNA sequence from a particular species and region through command line?

Question

I am new to working with the NCBI database so I am not sure how trivial this question is. I wanted to get DNA sequence of a specific version of the Xenopus tropicalis genome of a specific region (ie 1-100bp) without doing it manually each time (going on genome browser). I know there is a faster way and efficient way to do this through the command line but I found the NCBI tutorial hard to follow.

Alex Reynolds · Accepted Answer

Samtools
Samtools is a toolkit that includes a binary called samtoools, which can be used to query a FASTA file — to extract a subsequence based on some coordinate range you specify.
If you do not have samtools installed, you need to add it to your computer. For OS X, you could use Homebrew, for example:
$ brew install samtools

For Ubuntu, you could use apt-get:
$ sudo apt-get install samtools

Reference assembly
The next step is to download a reference assembly for your genome and genomic build-of-interest.
There is a research group dedicated to making Xenopus datasets available. If you were interested in v10 of the Xentr genome, for instance, you could do the following or similar to download it:
$ curl --output "XENTR_10.0_genome.fasta.gz" "http://ftp.xenbase.org/pub/Genomics/JGI/Xentr10.0/XENTR_10.0_genome.fasta.gz"

You could also dig through the NCBI site to get a similar link. It's the same idea — just change the web address, accordingly.
Indexing
Indexing makes it possible to extract an arbitrary subsequence of interest from this FASTA file. You need to index, but you only need to do it once:
$ samtools faidx XENTR_10.0_genome.fasta.gz

This creates two new files: XENTR_10.0_genome.fasta.gz.fai and XENTR_10.0_genome.fasta.gz.gzi.
Note: You need to keep these two files in the same directory as the original, compressed FASTA file XENTR_10.0_genome.fasta.gz.
Querying
Now you are ready to query, or extract a subsequence of interest. If you wanted the first hundred bases of the first chromosome, for instance:
$ samtools faidx XENTR_10.0_genome.fasta.gz Chr1:1-100
>Chr1:1-100
cctcagcgtcttcttctttctgctcccactgcattagcatagggagagagggccacagag
gctaggtagcattgccggctagcaatcttcagcgtcctct

The search term used here is Chr1:1-100. Replace Chr1 with the chromosome of interest and 1 and 100 with the start and stop coordinates, respectively, of interest. Coordinates used with samtools queries use a one-based indexing scheme.
Note: This uses the chromosome naming scheme in the XENTR_10.0_genome.fasta.gz file (Chr1, etc.). Other research groups (NCBI) might use different chromosome names. If you get your FASTA from somewhere else, investigate the FASTA file to determine the naming scheme being used.
If you want to redirect output to a file, use the output redirect operator (>):
$ samtools faidx XENTR_10.0_genome.fasta.gz Chr1:1-100 > mySubsequence.fa

swbarnes2 · Answer

I'd download the fasta, and use samtools, or maybe bedtools.

How to retrieve the DNA sequence from a particular species and region through command line?

2 Answers

Add your own answers!

Ask a Question