Bioinformatics Asked on November 26, 2020
I have around 500 annotated proteomes of different bacterial strains and would like to quantify their similarity (or difference). I found that gt genomediff from genometools gives me scores that I can use to generate nice clusters, but I am not sure whether the tool really works. The FASTA files that I use contain multiple sequences.
I ran some tests and it looks ok.
1_reference.fna:
>1
TAAGTTACT
>2
TAAGTTACA
2_eq_to_ref.fna:
>1
TAAGTTACT
>2
TAAGTTACA
3_tags_diff.fna:
>1asdfadf
TAAGTTACT
>2asdfasdfffa
TAAGTTACA
4_orde_diff.fna:
>2
TAAGTTACA
>1
TAAGTTACT
5_add_subse.fna:
>1
TAAGTTACT
>2
TAAGTTACA
>2
TTACA
6_point_mut.fna:
>1
AAAGTTACT
>2
TAAGTTACA
7_different.fna:
>1
TAAGTTACT
ATTACCTAA
>2
AAAAAAAAA
Then:
gt genomediff --indexname test *fna
7
1_reference.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
2_eq_to_ref.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
3_tags_diff.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
4_orde_diff.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
5_add_subse.fna 0.206969 0.206969 0.206969 0.206969 0.000000 0.212596 1.349206
6_point_mut.fna 0.199527 0.199527 0.199527 0.199527 0.212596 0.000000 0.569180
7_different.fna 0.794496 0.794496 0.794496 0.794496 1.349206 0.569180 0.000000
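For anyone who wants to sanity-check a matrix like the one above programmatically, here is a minimal Python sketch. It assumes the output looks exactly as shown (a count line, then one "name value value ..." row per file). The function names parse_genomediff, is_symmetric and zero_distance_groups are my own and are not part of genometools.

```python
# Output copied from the run above.
OUTPUT = """\
7
1_reference.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
2_eq_to_ref.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
3_tags_diff.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
4_orde_diff.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
5_add_subse.fna 0.206969 0.206969 0.206969 0.206969 0.000000 0.212596 1.349206
6_point_mut.fna 0.199527 0.199527 0.199527 0.199527 0.212596 0.000000 0.569180
7_different.fna 0.794496 0.794496 0.794496 0.794496 1.349206 0.569180 0.000000
"""


def parse_genomediff(text):
    """Return (names, matrix) from a count line followed by 'name d1 d2 ...' rows."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    n = int(rows[0][0])
    body = rows[1:1 + n]
    names = [row[0] for row in body]
    matrix = [[float(value) for value in row[1:1 + n]] for row in body]
    return names, matrix


def is_symmetric(matrix, tol=1e-9):
    """A distance matrix should satisfy d(i, j) == d(j, i)."""
    n = len(matrix)
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(n))


def zero_distance_groups(names, matrix, tol=1e-9):
    """Group files whose distance to the first member of a group is zero."""
    groups = []
    for i in range(len(names)):
        for group in groups:
            if matrix[i][group[0]] <= tol:
                group.append(i)
                break
        else:
            groups.append([i])
    return [[names[i] for i in group] for group in groups]


names, matrix = parse_genomediff(OUTPUT)
groups = zero_distance_groups(names, matrix)
for group in groups:
    print(group)
```

On the test data this puts files 1 to 4 in one group, which matches the expectation that header names and record order should not affect the distance.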
I am running the clustering on the difference matrix calculated with genomediff: “These distances are Jukes-Cantor corrected divergence between the pairs of genomes, that is, the number of mutations per base between them.”
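To make the quoted description concrete: the textbook Jukes-Cantor correction converts an observed proportion p of differing sites into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below only illustrates that formula. I am not claiming this is how genomediff computes its numbers internally, and the function name jukes_cantor is hypothetical.

```python
import math


def jukes_cantor(p):
    """Textbook Jukes-Cantor distance for an observed proportion p of differing sites.

    The correction is only defined for 0 <= p < 0.75; at p = 0.75 the
    sequences are as different as random ones and the estimate diverges.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# The corrected distance always exceeds the raw proportion for p > 0,
# because it accounts for multiple substitutions at the same site.
for p in (0.1, 0.3, 0.6):
    print(p, round(jukes_cantor(p), 6))
```

Because d counts substitutions per site and is not a proportion, values above 1 (such as the 1.349206 in the matrix) are possible for very divergent inputs.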
Currently, we are studying S. aureus. The genomes are assembled genomes (from three different methods). My guess is that the plasmid sequences are present. Furthermore, we have drug resistance measured in culture, so we will be able to compare the genomes with the resistances.
... stuff of original post deleted. On second thoughts, what you might be doing is a template assembly of your genomes. That is one possible interpretation of the sequences 1 and 2 in the FASTA files above (i.e. 1 is the template).
A microbial genomics professional would advise also performing a de novo assembly, particularly if you are interested in presence and absence. The reason is that if a gene is present in your query and absent in your template, you will miss it. Again, it all depends on which bacteria are being assessed; some are more prone to "genetic islands" than others.
You need a collaborator beyond that.
For the onlookers here: bacterial genetic behaviour is very different from that of eukaryotes, and what bacteria get up to (switching DNA, etc.) would appear bizarre from a higher eukaryotic world.
You mentioned a comparative analysis of 500 strains: I have worked on this in MRSA. Note that the bacterium matters when deciding which approach to adopt.
Anyway, you want a single aligned file to produce a phylogeny, Bayesian or likelihood-based. This is a model of point mutations. Bootstrapping doesn't really help because of SNP differences between isolates.
There is a complex problem. Generic phylogeny can hit the buffers in my opinion because often:
Put it all together and you can get a nice tree but the topology of a given MLST is in my opinion unlikely to be correct. A referee may not bother with this, but they might.
You then map the phenotype against the phylogeny (the tips of the tree) and look for clusters. HOWEVER, there are problems:
Hierarchical clustering has many meanings. It is used on presence/absence data; I demonstrated that the method failed (on bacteria) based on cluster analysis. It is currently being revised and is used in unsupervised deep (or machine) learning prior to a training method. The question then is: what are you modelling? Presence/absence (of genes), point mutations, epidemiological data, drug resistance/non-drug resistance?
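To illustrate why the choice of model matters, here is a small sketch of a presence/absence distance, which looks at gene content and ignores point mutations entirely. I have used the Jaccard distance as an example measure; that is my choice and not something prescribed above. The gene sets are toy data for illustration only, and the function name jaccard_distance is hypothetical.

```python
def jaccard_distance(genes_a, genes_b):
    """1 - |A intersect B| / |A union B| for two sets of gene names."""
    union = genes_a | genes_b
    if not union:
        # Two empty gene sets are treated as identical.
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(union)


# Toy example: two isolates that differ only by the presence of one gene.
isolate_a = {"mecA", "blaZ", "gyrA"}
isolate_b = {"blaZ", "gyrA"}
print(round(jaccard_distance(isolate_a, isolate_b), 4))
```

Two isolates with identical sequences in their shared genes still get a non-zero distance here, whereas a point-mutation model run on an alignment of those shared genes would see no difference between them.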
My assessment is that, given the potential complexity of these biological scenarios and the quality of the question, you should seek formal collaboration, and this is before considering which bacteria are being assessed. Some epidemic drug-resistant bacteria rip up this rule book.
Answered by Michael on November 26, 2020