How to quantify similarity of genomes and find differences in a set of S. aureus genomes?

Question

I have around 500 annotated proteomes of different bacterial strains and would like to quantify their similarity (or difference). I found that gt genomediff from GenomeTools gives me some scores that I can use to generate nice clusters, but I am not sure whether that tool really works. The FASTA files that I use contain multiple sequences.

I ran some tests and it looks ok.

1_reference.fna:
>1
TAAGTTACT
>2
TAAGTTACA

2_eq_to_ref.fna:
>1
TAAGTTACT
>2
TAAGTTACA

3_tags_diff.fna:
>1asdfadf
TAAGTTACT
>2asdfasdfffa
TAAGTTACA

4_orde_diff.fna:
>2
TAAGTTACA
>1
TAAGTTACT

5_add_subse.fna:
>1
TAAGTTACT
>2
TAAGTTACA
>2
TTACA

6_point_mut.fna:
>1
AAAGTTACT
>2
TAAGTTACA

7_different.fna:
>1
TAAGTTACT
ATTACCTAA
>2
AAAAAAAAA

Then:

gt genomediff --indexname test *fna
7
1_reference.fna 0.000000        0.000000        0.000000        0.000000        0.206969        0.199527        0.794496
2_eq_to_ref.fna 0.000000        0.000000        0.000000        0.000000        0.206969        0.199527        0.794496
3_tags_diff.fna 0.000000        0.000000        0.000000        0.000000        0.206969        0.199527        0.794496
4_orde_diff.fna 0.000000        0.000000        0.000000        0.000000        0.206969        0.199527        0.794496
5_add_subse.fna 0.206969        0.206969        0.206969        0.206969        0.000000        0.212596        1.349206
6_point_mut.fna 0.199527        0.199527        0.199527        0.199527        0.212596        0.000000        0.569180
7_different.fna 0.794496        0.794496        0.794496        0.794496        1.349206        0.569180        0.000000
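
A quick sanity check on a matrix like this can be scripted. The sketch below assumes the output layout shown above (a count line, then one row per file with the file name followed by the distances); the function names are made up for illustration and are not part of GenomeTools. It checks that the matrix is symmetric with a zero diagonal and lists which files are at distance zero from each other.

```python
def parse_distance_matrix(text):
    """Parse a count line followed by one 'name d1 d2 ...' row per file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    names, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    return names, rows


def is_valid_distance_matrix(rows, tol=1e-9):
    """True if the matrix is square, symmetric and has a zero diagonal."""
    n = len(rows)
    if any(len(r) != n for r in rows):
        return False
    return all(
        abs(rows[i][j] - rows[j][i]) <= tol and (i != j or abs(rows[i][j]) <= tol)
        for i in range(n)
        for j in range(n)
    )


def zero_distance_groups(names, rows, tol=1e-9):
    """Group files that are at distance zero from the first member of a group."""
    groups = []  # each group is a list of row indices
    for i in range(len(names)):
        for group in groups:
            if rows[i][group[0]] <= tol:
                group.append(i)
                break
        else:
            groups.append([i])
    return [[names[i] for i in group] for group in groups]


# The matrix printed by the test run above.
OUTPUT = """\
7
1_reference.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
2_eq_to_ref.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
3_tags_diff.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
4_orde_diff.fna 0.000000 0.000000 0.000000 0.000000 0.206969 0.199527 0.794496
5_add_subse.fna 0.206969 0.206969 0.206969 0.206969 0.000000 0.212596 1.349206
6_point_mut.fna 0.199527 0.199527 0.199527 0.199527 0.212596 0.000000 0.569180
7_different.fna 0.794496 0.794496 0.794496 0.794496 1.349206 0.569180 0.000000
"""

names, rows = parse_distance_matrix(OUTPUT)
print(is_valid_distance_matrix(rows))
for group in zero_distance_groups(names, rows):
    print(group)
```

On this test set the four files that differ only in headers or sequence order fall into one zero-distance group, and the other three files each stand alone, which is what the test design expects.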

I am running the clustering on the difference matrix calculated with genomediff: "These distances are Jukes-Cantor corrected divergence between the pairs of genomes, that is, the number of mutations per base between them."
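
For reference, the textbook Jukes-Cantor correction turns an observed proportion p of differing sites into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). The sketch below shows only that standard formula; how genomediff estimates the uncorrected divergence in the first place is not covered here, and the function name is an illustrative choice.

```python
import math


def jukes_cantor_distance(p):
    """Standard Jukes-Cantor (JC69) corrected distance for an observed
    proportion p of differing sites. Defined only for 0 <= p < 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# The corrected distance is always at least as large as the observed
# proportion, because it accounts for multiple hits at the same site.
for p in (0.0, 0.01, 0.1, 0.3):
    print(p, jukes_cantor_distance(p))
```

Because the correction grows without bound as p approaches 0.75, corrected distances above 1 (as for 5_add_subse.fna versus 7_different.fna) are possible.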

Currently, we are studying S. aureus. Our genomes are assembled genomes (three different methods). My guess is that the sequences from the plasmids are present. Furthermore, we have drug resistance measured in culture, so we will be able to compare the genomes and the resistances.

Michael · Answer

... stuff of original post deleted.
On second thoughts, what you might be doing is a template assembly of your genomes. That is a possible interpretation of the 1 and 2 FASTA sequences above (i.e. 1 is the template).

A microbial genomics professional would advise also performing a de novo assembly, particularly if you are interested in gene presence and absence. The reason is that if a gene is present in your query and absent in your template, you will miss it. Again, it all depends on what bacteria are being assessed; some are more prone to "genetic islands" than others.

You need a collaborator beyond that.

For the onlookers here: bacterial genetic behaviour is very different from that of eukaryotes, and what they get up to (switching DNA, etc.) would appear bizarre from a higher eukaryotic world.

You mentioned a comparative analysis of 500 strains: I have worked on this in MRSA. Note that the bacterial species matters when deciding which approach to adopt.

Anyway, you want a single aligned file to produce a phylogeny, Bayesian or likelihood. This is a model of point mutations. Bootstrapping doesn't really help because of the low number of SNP differences between isolates.

This is a complex problem. In my opinion, generic phylogeny can hit the buffers because often:

There are low numbers of SNP differences between isolates.
The error across the genome is poorly defined.
Multi-clonal infection is ignored and your isolates are not cloned.
The other issue is the QC of the genome. I've been fairly astonished by the variation (it affects the tree).

Put it all together and you can get a nice tree, but the topology for a given MLST type is, in my opinion, unlikely to be correct. A referee may not bother with this, but they might.

You then map the phenotype against the phylogeny (the tips of the tree) and look for clusters. HOWEVER, there are problems:

Drug-resistance determinants, e.g. the mec cassette (methicillin resistance), are often on plasmids and these can be lost during isolation, although they can also integrate into the genome. So an isolate can be drug resistant but you fail to find the gene.
Drug-resistance transmission is unlikely to cluster. It's too quick, so you don't see nice tight clusters against a theoretically perfect tree.
The best approach is drug resistance measured in culture.
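
As a rough illustration of "looking for clusters", one crude check is to compare the mean genetic distance between isolate pairs that share a phenotype with the mean distance between pairs that do not. This is only a sketch under stated assumptions: the isolate names, distances and resistance labels below are invented toy values, and a real analysis would work on the tree itself rather than on raw pairwise distances.

```python
from itertools import combinations
from statistics import mean


def within_between(dist, labels):
    """Return (mean distance for same-phenotype pairs,
    mean distance for different-phenotype pairs)."""
    within, between = [], []
    for a, b in combinations(sorted(labels), 2):
        d = dist[frozenset((a, b))]
        (within if labels[a] == labels[b] else between).append(d)
    return mean(within), mean(between)


# Toy pairwise distances (hypothetical): A and B are close, C and D are close.
dist = {
    frozenset(("isoA", "isoB")): 0.01,
    frozenset(("isoC", "isoD")): 0.02,
    frozenset(("isoA", "isoC")): 0.10,
    frozenset(("isoA", "isoD")): 0.10,
    frozenset(("isoB", "isoC")): 0.10,
    frozenset(("isoB", "isoD")): 0.10,
}

# Case 1: resistance follows the genetic grouping.
clustered = {"isoA": "R", "isoB": "R", "isoC": "S", "isoD": "S"}
# Case 2: resistance cuts across the genetic grouping.
scattered = {"isoA": "R", "isoB": "S", "isoC": "R", "isoD": "S"}

print(within_between(dist, clustered))
print(within_between(dist, scattered))
```

In the first case same-phenotype pairs are much closer than mixed pairs; in the second they are not, which is the pattern described above when resistance spreads faster than the tree can resolve.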

Hierarchical clustering has many meanings. It is used on presence/absence data; I demonstrated that the method failed (on bacteria) based on cluster analysis. It has since been revised and is now used in unsupervised deep (or machine) learning prior to a training method. The question then is: what are you modelling? Presence/absence (of genes), point mutations, epidemiological data, drug resistance/non-drug resistance?
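
To make the presence/absence option concrete, here is a minimal sketch that treats each isolate as a set of genes and computes a Jaccard distance between isolates, plus the set of genes a template-based approach would miss. The isolate names and gene labels are hypothetical placeholders, and Jaccard is just one common choice of distance for this kind of data, not necessarily the one any particular tool uses.

```python
def jaccard_distance(genes_a, genes_b):
    """1 - |A and B| / |A or B| for two gene sets; 0.0 if both are empty."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def genes_missing_from_template(query_genes, template_genes):
    """Genes present in the query but absent from the template."""
    return set(query_genes) - set(template_genes)


# Hypothetical gene content per isolate.
isolates = {
    "iso1": {"geneA", "geneB", "geneC"},
    "iso2": {"geneB", "geneC", "geneD"},
    "iso3": {"geneA", "geneB", "geneC"},
}

print(jaccard_distance(isolates["iso1"], isolates["iso2"]))
print(jaccard_distance(isolates["iso1"], isolates["iso3"]))
print(genes_missing_from_template(isolates["iso2"], isolates["iso1"]))
```

A pairwise matrix of such distances could then be fed to any hierarchical clustering routine, bearing in mind that it models gene content only and says nothing about point mutations.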

My assessment is that, given the potential complexity of these biological scenarios and the quality of the question, you should seek formal collaboration, and this is before considering what bacteria are being assessed. Some of the epidemic drug-resistant bacteria rip up this rule book.
