Bioinformatics Asked on August 31, 2021
I have a dataset of genes for which I am collecting data from public databases, to use as features in machine learning. I am trying to take some features from the UCSC Genome Browser (e.g. number of CpG islands per gene, number of DNase clusters per gene, regulatory enrichment scores, etc.). However, I am not sure how to control for the bias where a longer gene will have more CpG islands or a higher regulatory enrichment score simply because of its length.
Is there a way to correct for gene length when taking/condensing variant data to individual genes?
For reference, my machine learning model aims to predict which gene is most likely to be causal for a disease (out of all the genes given to the model). The model scores each gene as a regression output between 0 and 1 (0 being least likely to cause disease and 1 being most likely). I plan to further investigate the genes with the highest scores later.
The model uses a variety of multi-omic features (e.g. GTEx gene expression across many tissues, GWAS Catalog data, gene intolerance scores, protein-protein interaction data, drug interaction data, phenotypic scores, etc.). However, I am missing epigenetic data to describe my genes, so I've been looking to collect it from UCSC's variant-level data (CpG islands, histone modifications, DNase clusters). This leads to my gene length problem when I try to reliably condense data from the variant level to the gene level.
I've been plotting my features against gene length, and have seen that the UCSC epigenetic data does correlate with gene length: longer genes have higher counts of regulatory sites (r² of 0.8 for some features). This is what I'm looking to correct.
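The check described above can be sketched roughly as follows. This is only an illustration on synthetic data: the arrays `gene_length` and `cpg_count` and the helper `r_squared` are hypothetical names, and the simulated counts are not real UCSC values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption, not real UCSC values): 500 genes.
gene_length = rng.integers(1_000, 200_000, size=500).astype(float)
# Simulated counts whose expected value grows with gene length.
cpg_count = rng.poisson(gene_length / 10_000)


def r_squared(x, y):
    """Squared Pearson correlation between two 1-D arrays."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2


r2 = r_squared(gene_length, cpg_count)
print(f"r^2 between gene length and CpG island count: {r2:.2f}")
```

A high value here only says that the count tracks length in this toy data; it does not by itself say how the dependence should be handled.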
It's very easy: just let the ML sort this out for you, which is its advantage. You're thinking of a GLM-style calculation, where you pre-screen the data with bivariate plots and need nice Q-Q plots and low residuals.
For ML, simply include gene length as one of your features along with CpG count etc., and the ML method (SVM, lasso, ridge, random forest) will figure out the relationship between gene length and CpG. You do nothing and the ML does everything. Statistical purists object to this because you don't know what relationship the ML has deduced between the variables, but for non-DNN models you will get regression weights (or feature importances, for a random forest), which will give you some idea of the impact of length.
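A minimal sketch of this idea with scikit-learn, assuming a feature matrix with gene length in one column. The data is synthetic and the names (`gene_length`, `cpg_count`, `dnase_count`, `score`) are hypothetical placeholders, so the fitted numbers mean nothing biologically; the point is only where the weights and importances come from.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_genes = 400

# Synthetic stand-in features (assumption, not real data).
gene_length = rng.uniform(1_000, 200_000, size=n_genes)
cpg_count = rng.poisson(gene_length / 10_000).astype(float)
dnase_count = rng.poisson(gene_length / 5_000).astype(float)
# Hypothetical target: a causal-likelihood score in [0, 1].
score = rng.uniform(0, 1, size=n_genes)

feature_names = ["gene_length", "cpg_count", "dnase_count"]
X = np.column_stack([gene_length, cpg_count, dnase_count])

# Ridge regression: standardise first so the weights are comparable.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, score)
ridge_weights = dict(zip(feature_names, ridge.named_steps["ridge"].coef_))

# Random forest: no weights, but impurity-based feature importances.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, score)
forest_importances = dict(zip(feature_names, forest.feature_importances_))

print("ridge weights:", ridge_weights)
print("forest importances:", forest_importances)
```

Because gene length and the counts are strongly correlated, individual ridge weights can be unstable, so they are best read as a rough indication of the impact of length rather than a precise estimate.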
Transformations are a separate issue and can be complicated, but I'd try untransformed data first. The only disadvantage of this approach is that users will have to supply the gene length when they want to run your trained model.
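If you do want to look at transformations after trying untransformed data, one possible way is to compare cross-validated scores with and without a transform. The sketch below uses a `log1p` transform as an example; that choice, the synthetic data, and all variable names are assumptions for illustration, not a recommendation for your dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(1)
n_genes = 300

# Synthetic stand-in data (assumption): length, one count feature, a score.
gene_length = rng.uniform(1_000, 200_000, size=n_genes)
cpg_count = rng.poisson(gene_length / 10_000).astype(float)
X = np.column_stack([gene_length, cpg_count])
y = rng.uniform(0, 1, size=n_genes)

# Untransformed features (tried first, as suggested above).
raw_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# Same model with log(1 + x) applied to every feature beforehand.
log_model = make_pipeline(
    FunctionTransformer(np.log1p), StandardScaler(), Ridge(alpha=1.0)
)

raw_scores = cross_val_score(raw_model, X, y, cv=5, scoring="r2")
log_scores = cross_val_score(log_model, X, y, cv=5, scoring="r2")

print(f"untransformed mean R^2: {raw_scores.mean():.3f}")
print(f"log1p mean R^2:         {log_scores.mean():.3f}")
```

Putting the transform inside the pipeline means it is applied consistently within each cross-validation fold and again at prediction time.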
Correct answer by M__ on August 31, 2021