Details of DESeq2 modeling a batch effect

Question

When correcting my data for a batch effect using removeBatchEffect, some of the gene expression values become negative.
When searching for differentially expressed genes, I do not use the corrected data above, but rather model the batch in DESeq2 (design = ~ Batch + Condition).
However, I started worrying. When DESeq2 introduces a batch into the model, does it allow for negative values "behind the scenes"?
If it does, I do not understand how that can make sense, in the context of RNA-Seq data.

ATpoint · Accepted Answer

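DESeq2 uses the batch information (and everything else in the design) as covariates in its GLM, estimating a coefficient for each term. For background on that, please check how linear models work, e.g. using the StatQuest series of statistics videos on YouTube.
It still operates on the raw counts, which are never modified. The same goes for normalization: the size factors enter the model as offsets rather than being applied to the counts. Because the GLM uses a log link, the fitted means stay positive whatever the sign of the batch coefficient, so no negative expression values arise behind the scenes.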
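To make the log-link point concrete, here is a minimal sketch in plain Python. It is not DESeq2 code: the function name, the coefficient values and the size factors are all made up for illustration. It only assumes the model form mean = size factor × 2^(linear predictor), with coefficients on the log2 scale.

```python
def fitted_mean(size_factor, intercept, batch_coef, condition_coef, in_batch_b, treated):
    """Expected count under a log2-link model: s * 2 ** (linear predictor).

    in_batch_b and treated are 0/1 indicators from the design matrix.
    """
    eta = intercept + batch_coef * in_batch_b + condition_coef * treated
    return size_factor * 2 ** eta


# Hypothetical log2-scale coefficients; both effects are strongly negative.
coefs = dict(intercept=6.0, batch_coef=-4.0, condition_coef=-3.0)

# Every combination of size factor, batch and condition.
means = [
    fitted_mean(size_factor=s, in_batch_b=b, treated=t, **coefs)
    for s in (0.5, 1.0, 2.0)
    for b in (0, 1)
    for t in (0, 1)
]

# A negative coefficient shrinks the mean towards zero but never below it.
print(min(means) > 0)
```

The batch term only rescales the expected count multiplicatively, and the observed counts themselves are never touched.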
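removeBatchEffect (from limma) fits a linear model to the data including the batch information and then subtracts the batch component from the expression values (that is basically the baseline difference). It is intended for log-scale values such as logCPM rather than raw counts, so values that were already close to zero can end up negative after the subtraction.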
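A rough illustration of why subtraction produces negative values, again in plain Python. This is a simplified stand-in and not limma's implementation: it assumes a balanced design, ignores the condition covariate, and estimates each batch's effect as the deviation of its mean from the overall mean. The function name and the numbers are invented.

```python
from statistics import mean


def subtract_batch_means(values, batches):
    """Remove each batch's deviation from the overall mean (one gene)."""
    overall = mean(values)
    offsets = {
        b: mean([v for v, vb in zip(values, batches) if vb == b]) - overall
        for b in set(batches)
    }
    return [v - offsets[b] for v, b in zip(values, batches)]


# Hypothetical log2 expression values for one gene across four samples.
log_values = [0.2, 0.4, 0.5, 6.0]
batches = ["A", "A", "B", "B"]

corrected = subtract_batch_means(log_values, batches)

# The low sample in the higher batch is pushed below zero.
print(min(corrected) < 0)
```

The overall mean is preserved, but the sample that was low within the higher batch ends up below zero. That is harmless on a log scale, yet it is exactly what trips up code expecting zero as the minimum.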
If you are interested in preserving the integer nature of counts, and in keeping zero as the smallest value after explicit batch correction, you may want to check ComBat_seq() from the sva package. It operates on raw counts and returns batch-corrected raw counts, which you could then normalize and calculate CPMs from (if you need batch-corrected CPMs with no negative values). I find it useful and prefer it over removeBatchEffect, as it avoids the unfortunate negative values that sometimes mess up plotting scripts expecting zero as the smallest possible value.
