How to create Phylogenetic Trees from fasta files in Python or R?

Bioinformatics Asked on October 3, 2021

I have around a hundred FASTA files (and will collect several thousand) containing DNA sequences with 50x or greater coverage. What is a recommended method for constructing a phylogenetic tree? I am looking for solutions in Python or R.

I found Phylo from Biopython only handles already calculated trees.

4 Answers

The obvious single answer is the R package "ape". It gives you access to PhyML for tree building and Clustal/Muscle for alignment building. These are called as external programs, so the paths to the binaries must be set correctly. It also includes several distance methods, such as NJ and BIONJ. Its distance approaches don't look mainstream to me, but I could be wrong.

ape has some neat functions. The tree sorting in particular is very cool, though I need to read through it with much greater care. Personally, I wouldn't perform a core phylogenetic analysis within R, because the standalone tools are sufficient and the analysis is computationally intensive.

Answered by M__ on October 3, 2021

I would not look for a package for this, but instead build a small pipeline calling external tools with something like the following workflow:

  • Cluster the ~100 sequences with CD-HIT-EST/PSI-CD-HIT or many other options
  • Take all the sequences that form one individual cluster and build a multiple sequence alignment (MSA) with MAFFT/ClustalOmega or similar
  • Take the MSA and build a phylogenetic tree with a maximum-likelihood approach like IQ-TREE or similar
  • Visualize the tree file with Jalview or similar

Of course, this is rather general, and depending on exactly what you're doing you may want a different workflow and/or different tools. You should also explore the parameter space; do not assume the defaults are necessarily good choices.
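The alignment and tree-building steps above could be wired together from Python roughly as follows. This is only a sketch under several assumptions: `mafft` and `iqtree2` are installed and on the PATH, the input is a FASTA file holding the sequences of one cluster (the clustering step is left out), and the function names, the `phylo_out` directory, and the chosen options are illustrative rather than recommended settings. The flags shown are for IQ-TREE 2; version 1 spells the bootstrap and thread options differently.

```python
import subprocess
from pathlib import Path


def build_commands(cluster_fasta, workdir="phylo_out", threads=4):
    """Return the alignment path plus the MAFFT and IQ-TREE command lines.

    The names and defaults here are hypothetical; adjust them to your setup.
    """
    workdir = Path(workdir)
    aligned = workdir / (Path(cluster_fasta).stem + ".aln.fasta")
    # MAFFT writes the alignment to stdout; --auto lets it pick a strategy.
    mafft_cmd = ["mafft", "--auto", "--thread", str(threads), str(cluster_fasta)]
    # -m MFP: model selection, -B 1000: ultrafast bootstrap, -T: threads.
    iqtree_cmd = ["iqtree2", "-s", str(aligned), "-m", "MFP",
                  "-B", "1000", "-T", str(threads)]
    return aligned, mafft_cmd, iqtree_cmd


def run_pipeline(cluster_fasta, workdir="phylo_out", threads=4):
    """Align one cluster with MAFFT, then build an ML tree with IQ-TREE."""
    aligned, mafft_cmd, iqtree_cmd = build_commands(cluster_fasta, workdir, threads)
    aligned.parent.mkdir(parents=True, exist_ok=True)
    with open(aligned, "w") as out:
        subprocess.run(mafft_cmd, stdout=out, check=True)
    subprocess.run(iqtree_cmd, check=True)
    # IQ-TREE names its outputs after the alignment file by default.
    return Path(str(aligned) + ".treefile")
```

Calling `run_pipeline("cluster1.fasta")` would then return the path of a Newick tree file that can be opened in a tree viewer. Keeping the command construction in its own function makes it easy to inspect or log the exact commands before running anything.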

Answered by Chris_Rands on October 3, 2021

I agree with Chris Rands that a reasonable approach would be to call external tools.

However, if you really want to build the phylogeny from within Python, you could use the P4 package. It is a bit complicated to handle, but it gives you lots of options for building MCMC-based Bayesian phylogenies.

You would still need a separate tool to align the sequences beforehand.

To visualize the tree in Python, you could use the ETE Toolkit, which is likely more powerful than what you can find in Biopython.

Answered by bli on October 3, 2021

The seqinr package documentation (if I understand your situation correctly) shows how to use the function read.alignment, which can read FASTA, MSF, and other alignment formats. The docs provide this example:

read.alignment(file = system.file("sequences/LTPs128_SSU_aligned_First_Two.fasta", package = "seqinr"), format = "fasta", whole.header = TRUE)

You can use the code below (it assumes the sequences in the file are already aligned) to go from reading the alignment to computing the distances, building the neighbor-joining phylogenetic tree, and then plotting the tree.


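library(seqinr)  # read.alignment(), dist.alignment()
library(ape)     # nj() and the plot method for trees
# Read an already-aligned file; the file name is a placeholder and format must match the file
fasta.res <- read.alignment(file = "geneticAlignment.fasta", format = "fasta")
fasta.res.dist.alignment <- dist.alignment(fasta.res, matrix = "identity")
fasta.res.dist.alignment.nj <- nj(fasta.res.dist.alignment)
plot(fasta.res.dist.alignment.nj, main = "from fasta files")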
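For readers wondering what the distance step computes, here is a small pure-Python sketch of an identity-based distance between aligned sequences. As I understand the seqinr documentation, dist.alignment with matrix = "identity" returns the square root of one minus the fraction of identical sites; the sketch follows that definition. The function names are hypothetical, and skipping any column where either sequence has a gap is my own assumption, so the numbers may not match seqinr exactly on gapped alignments.

```python
from itertools import combinations
from math import sqrt


def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def identity_distance(a, b):
    """sqrt(1 - identity), using only columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    identity = sum(x == y for x, y in pairs) / len(pairs)
    return sqrt(1.0 - identity)


def distance_matrix(records):
    """Pairwise distances keyed by (name1, name2) for every pair of sequences."""
    return {(p, q): identity_distance(records[p], records[q])
            for p, q in combinations(records, 2)}
```

A matrix like this is the input that a neighbor-joining routine such as nj then turns into a tree.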

Answered by Vass on October 3, 2021
