Bioinformatics Asked on April 14, 2021
First of all, let me apologize if this is not the right place for this question. Although it is quite broad, I will try to make it as specific as I can.
Context
I am testing the correlation between Whole Genome Bisulfite Sequencing (WGBS) and Oxford Nanopore (ONT) methylation calls at all CpG sites for one individual (mammal, blood tissue), with replicates. In the (very) best cases I find a correlation close to 0.9. Yet, for the part that does not correlate, I would like to know which technology is closer to the "biological truth".
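For readers who want to reproduce this kind of comparison, here is a minimal sketch of a per-site correlation. It assumes (this is not from the post) that each platform's calls have already been reduced to a dictionary mapping `(chrom, pos)` to `(methylated_reads, total_reads)`, and that a minimum coverage of 10 reads is a sensible filter. The function names and the `min_cov` parameter are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def shared_site_correlation(wgbs, ont, min_cov=10):
    """Correlate methylation fractions at CpG sites covered by both platforms.

    wgbs, ont: dict mapping (chrom, pos) -> (methylated_reads, total_reads).
    min_cov is an assumed coverage cutoff, applied to both platforms.
    Returns (pearson_r, number_of_sites_used).
    """
    shared = sorted(
        site
        for site in wgbs.keys() & ont.keys()
        if wgbs[site][1] >= min_cov and ont[site][1] >= min_cov
    )
    wgbs_frac = [wgbs[s][0] / wgbs[s][1] for s in shared]
    ont_frac = [ont[s][0] / ont[s][1] for s in shared]
    return pearson(wgbs_frac, ont_frac), len(shared)
```

Low-coverage sites are dropped before correlating because a methylation fraction estimated from a handful of reads is very noisy, and that noise alone can pull the correlation down regardless of platform.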
My strategy
My plan would be to first find CpG sites or short CpG regions showing different methylation states between WGBS and ONT. Then I would validate the methylation states of these sites using a third-party technology such as PyroMark pyrosequencing.
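The first step of this plan could be sketched as follows. This is only an illustration under assumptions that are not in the post: per-site calls are stored as `(methylated_reads, total_reads)` keyed by `(chrom, pos)`, and a site counts as discordant when the methylation fractions differ by at least 0.3 with at least 10 reads on each platform. Both thresholds and the function name are hypothetical and would need tuning on real data.

```python
def discordant_sites(wgbs, ont, min_cov=10, min_delta=0.3):
    """List CpG sites whose methylation fraction differs between platforms.

    wgbs, ont: dict mapping (chrom, pos) -> (methylated_reads, total_reads).
    min_cov and min_delta are assumed cutoffs, not values from the post.
    Returns tuples (site, wgbs_fraction, ont_fraction, delta) where
    delta = ont_fraction - wgbs_fraction, largest absolute delta first.
    """
    hits = []
    for site in wgbs.keys() & ont.keys():
        meth_w, cov_w = wgbs[site]
        meth_o, cov_o = ont[site]
        if cov_w < min_cov or cov_o < min_cov:
            continue  # too few reads to trust either fraction
        frac_w = meth_w / cov_w
        frac_o = meth_o / cov_o
        delta = frac_o - frac_w
        if abs(delta) >= min_delta:
            hits.append((site, frac_w, frac_o, delta))
    # Strongest disagreements first; ties broken by genomic position.
    hits.sort(key=lambda h: (-abs(h[3]), h[0]))
    return hits
```

The sites at the top of the returned list would be the natural candidates to take forward to the validation assay, since a large disagreement is easier to resolve with a third method than a marginal one.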
What I am asking the SE community
First, has anyone done this before? I would love to hear about your experience.
Do you see any critical aspects I should watch out for?
Last, if you have comments on the strategy I propose here, they are welcome.
Methylation can be cell-specific, which makes it difficult to evaluate accuracy on a bulk-cell level (even within the same tissue). How can you tell that the differences you're seeing are due to platform differences, or due to biological variation?
I find that adding more haystacks doesn't help much in working out the truth of a dataset. If you want to investigate biological truth, it would be better to create a biological system with a known methylation (or demethylation) pattern and test that.
Answered by gringer on April 14, 2021
I would assess directionality and accuracy of prediction by 1) WGBS predicting ONT and then 2) ONT predicting WGBS.
Firstly, I would use deep learning (or machine learning) and train WGBS against ONT, parameterise and then test. Then, conversely, train ONT against WGBS, parameterise and then test. The approach with the highest prediction accuracy (using the 'accuracy' index) would be assumed to be "closer to the biological truth". If both calculations produced comparable accuracies, I would conclude, as @gringer has stated, that natural variation / natural heterogeneity is the predominant signal in the data. This conclusion depends on the ability of deep learning to map the biological process.
The problem is well suited to a deep learning estimate regardless. It may not replace a truly controlled test (that depends on what controls you have run), but in the absence of one it could provide a valuable and easily obtainable insight.
If you did run a control sample, then you are truly in business, because this would provide the training set, with WGBS and ONT providing the test data.
You asked me how it works. It doesn't replace good controls, but it is a 'trendy workaround'.
Two steps
The idea is that if you give it a bit of DNA, it will assign whether it is CpG (see caveat below).
The caveat is that you need to supply the training set with sites that are known to be negative under WGBS or ONT. You need to think about the best strategy for doing this. If some of those negatives were positive in the other method, that would affect the outcome a lot.
There is an issue with vectorisation. This is often done using k-mers, and it is a headache. Basically you have to turn a bit of DNA into numbers, and that, in my opinion, is the difficult part. In your case I'm not so sure it's so difficult. If you can get that bit correct and biologically meaningful, you are in business. My stuff is phylogeny based and I understand the relationship between biology (mutation) and numbers. In your case that needs thought; however, someone will have thought about it and solved it.
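To make the k-mer idea concrete, here is a minimal sketch of one common way to turn a DNA sequence into numbers: counting every overlapping k-mer over the A/C/G/T alphabet. It is a generic illustration, not the specific method referred to in this answer, and the function name and the default `k=3` are assumptions.

```python
from itertools import product


def kmer_vector(seq, k=3):
    """Count overlapping k-mers in seq and return a fixed-length vector.

    The vector has 4**k entries, one per k-mer over A/C/G/T in
    lexicographic order. Windows containing any other character
    (for example N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1
    return counts
```

For example, `kmer_vector("ACGCG", k=2)` yields a 16-element vector in which the entry for `CG` is 2 and the entries for `AC` and `GC` are 1. Every sequence maps to a vector of the same length, which is what most classifiers expect as input; whether raw counts are biologically meaningful for methylation is exactly the open question raised above.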
As a personal statement, I would do this via random forests, because ANNs are difficult.
One such vectorisation method is here, by Li et al (2017); however, as they have used it, the method has flaws due to equal weighting of mutation frequencies. These flaws do not apply to your work, where it would actually be a good method; they apply solely to tree building.
The final thing is that the more experimental data you can feed it, the better. You don't have to be complicated about it: just create a new column in the input section, e.g. alongside the vectorised DNA, and the algorithm will figure out whether it helps. You can't feed it too much data, because if a column doesn't help, the algorithm won't use it.
Answered by M__ on April 14, 2021