How to create Phylogenetic Trees from fasta files in Python or R?

Question

I have around a hundred Fasta files (and will collect several thousand) with DNA sequences and +50x coverage. What is a recommended method to construct a phylogenetic tree? Solutions in Python or R are sought.
I found that Phylo from Biopython only handles already-calculated trees.

M__ · Answer

The obvious single answer is the R package "ape". It gives you access to PhyML for tree building and Clustal/MUSCLE for alignment building. ape calls these as external programs, so the paths to the binaries are important. It also includes several distance methods, such as NJ and BIONJ. Its distance approaches don't look mainstream to me, but I could be wrong.

There are some nice functions within ape. The tree sorting in particular is very good, although I still need to read through it with much greater care. Personally, I wouldn't perform a core phylogenetic analysis within R, because the standalone tools are sufficient and the analysis is computationally intensive.

https://cran.r-project.org/web/packages/ape/ape.pdf

Chris_Rands · Answer

I would not look for a package for this, but instead build a small pipeline calling external tools with something like the following workflow:

1. Cluster the ~100 sequences with CD-HIT-EST/PSI-CD-HIT or one of the many other options
2. Take all the sequences that form one individual cluster and build a multiple sequence alignment (MSA) with MAFFT/ClustalOmega or similar
3. Take the MSA and build a phylogenetic tree with a Maximum-Likelihood approach like iqtree or similar
4. Visualize the tree file with Jalview or similar

Of course, this is rather general, and depending on exactly what you're doing you may want a different workflow and/or different tools. You should also explore the parameter space; do not assume the defaults are necessarily good choices.
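
As a rough illustration of the alignment and tree-building steps of such a pipeline, here is a minimal Python sketch. It assumes that `mafft` and `iqtree2` are installed and on your PATH, and that each input file holds one sequence to be placed in the tree. The function names and the file names `combined.fasta` and `aligned.fasta` are made up for this example, and the clustering step is left out. The flags shown (`--auto` for MAFFT; `-s`, `-m MFP` and `-B` for IQ-TREE 2) are only a starting point, not tuned settings.

```python
import subprocess
from pathlib import Path


def combine_fasta(fasta_paths, combined_path):
    """Concatenate several FASTA files into one multi-FASTA file."""
    with open(combined_path, "w") as out:
        for path in fasta_paths:
            text = Path(path).read_text().strip()
            if text:
                out.write(text + "\n")
    return combined_path


def mafft_command(combined_path):
    # MAFFT prints the alignment to stdout; --auto lets it choose a strategy.
    return ["mafft", "--auto", str(combined_path)]


def iqtree_command(alignment_path, bootstraps=1000):
    # -s: alignment file, -m MFP: automatic model selection,
    # -B: ultrafast bootstrap replicates (IQ-TREE 2; IQ-TREE 1 uses -bb).
    # The executable name "iqtree2" is an assumption about your installation.
    return ["iqtree2", "-s", str(alignment_path), "-m", "MFP", "-B", str(bootstraps)]


def run_pipeline(fasta_paths, workdir):
    """Combine inputs, align with MAFFT, then infer a tree with IQ-TREE."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    combined = combine_fasta(fasta_paths, workdir / "combined.fasta")
    alignment = workdir / "aligned.fasta"
    with open(alignment, "w") as out:
        subprocess.run(mafft_command(combined), stdout=out, check=True)
    subprocess.run(iqtree_command(alignment), check=True)
    # By default IQ-TREE writes the best tree to <alignment>.treefile.
    return alignment.with_name(alignment.name + ".treefile")


# Hypothetical usage:
# tree_file = run_pipeline(sorted(Path("fastas").glob("*.fasta")), "phylo_out")
```

Keeping the command construction in separate functions makes it easy to swap in a different aligner or tree builder, or to change parameters, without touching the rest of the pipeline.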

bli · Answer

I agree with Chris Rands that a reasonable approach would be to call external tools.

However, if you really want to do the phylogeny from within Python, you could use the P4 package, which is a bit complicated to handle but gives you lots of options for building MCMC-based Bayesian phylogenies:

https://github.com/pgfoster/p4-phylogenetics

You would still need something else to align the sequences beforehand.

To visualize the tree using Python, you could use the ETE Toolkit, which is likely more powerful than what you can find in Biopython: http://etetoolkit.org/

Vass · Answer

(If I understand your situation correctly.) The seqinr documentation at https://www.rdocumentation.org/packages/seqinr/versions/3.6-1/topics/read.alignment shows how to use the function read.alignment, which can read FASTA, MSF and other alignment formats. The docs provide this example: read.alignment(file = system.file("sequences/LTPs128_SSU_aligned_First_Two.fasta", package = "seqinr"), format = "fasta", whole.header = TRUE). You can use the code below (it assumes your sequences are already aligned) to go from reading the alignment to computing the distances, building the neighbor-joining phylogenetic tree, and then plotting the tree.
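library("seqinr")  # read.alignment(), dist.alignment()
library("ape")     # nj() and the plot method for trees

# Read an already-aligned FASTA file (the file name is a placeholder)
fasta.res <- read.alignment(file = "geneticAlignment.fasta", format = "fasta")
# Pairwise distances based on sequence identity
fasta.res.dist.alignment <- dist.alignment(fasta.res, matrix = "identity")
# Build the neighbor-joining tree from the distance matrix
fasta.res.dist.alignment.nj <- nj(fasta.res.dist.alignment)
plot(fasta.res.dist.alignment.nj, main = "from fasta files")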
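
For readers who prefer Python, the distance step can be sketched with the standard library alone. As I understand the seqinr documentation, dist.alignment with matrix = "identity" returns the square root of (1 - identity) for each pair. The version below follows that definition but simply skips any column where either sequence has a gap. That gap handling is an assumption of this sketch and may not match seqinr exactly. The function names are made up for the example.

```python
import math
from itertools import combinations


def identity_distance(seq_a, seq_b):
    """Return sqrt(1 - identity) over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption of this sketch: columns containing a gap are ignored.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    identity = sum(a == b for a, b in pairs) / len(pairs)
    return math.sqrt(1.0 - identity)


def distance_matrix(alignment):
    """Build a symmetric dict-of-dicts distance matrix from {name: sequence}."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = identity_distance(alignment[a], alignment[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix
```

A matrix like this is what a neighbor-joining routine takes as input. Two sequences that are 80% identical get a distance of about 0.447 under this definition.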
