Data Science Asked by The Half-Blood Prince on January 29, 2021
I am wondering which statistical tools to use when analysing data that have multiple strong batch effects (distributions vary from one batch to another). I would like to correct for the batch effect that originates from one variable, without removing the potential batch effects from other variables.
If this is unclear, a short example is probably the best way to explain my problem:
Imagine that we have 10 persons taking part in an experiment. The experiment is as follows:
Following the experiment, we will know for all 10 000 tennis balls:
Now, because not everyone has the same capabilities for throwing tennis balls (be it in terms of muscle strength or something else), we can expect to see some strong batch effects within the data (for instance, we could observe that a ball thrown by the first participant will, on average, travel farther than a ball with the same weight and diameter thrown by the second participant, etc.).
Correcting for this kind of batch effect could be done in multiple ways if everyone had been given the same set of balls (in a setting with normal distributions, standardisation would probably work fine). Now, imagine that when organising the experiment, we did not pay enough attention and ended up giving some people heavier tennis balls, some others smaller tennis balls, etc.
At the end of the experiment, we indeed find, using a chi-squared test (or, say, a Kruskal-Wallis H test), that not everyone was given a set of balls that looks like a random sample of all 10 000 balls.
How can we then correct for who threw the ball, without removing the batch effects originating from the fact that the sets of balls were different?
The main problem is that by correcting for batch effect using regular standardisation (for instance), we will probably end up removing the effect due to the fact that some people were given heavier or larger balls.
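To make the concern concrete, here is a small sketch with made-up numbers (the two throwers and their distances are hypothetical, not data from the experiment). Both throwers are assumed to be equally strong, but thrower B received heavier balls, so the 5 m gap between them is entirely due to the balls. Standardising within each thrower wipes that gap out:

```python
from statistics import fmean, pstdev

# Hypothetical toy data: two equally strong throwers, B was given heavier balls.
distances = {
    "A": [20.0, 21.0, 19.0, 20.5, 19.5],  # lighter balls, longer throws
    "B": [15.0, 16.0, 14.0, 15.5, 14.5],  # heavier balls, shorter throws
}


def standardise_per_thrower(groups):
    """Z-score each thrower's distances using that thrower's own mean and sd."""
    out = {}
    for thrower, values in groups.items():
        mu, sigma = fmean(values), pstdev(values)
        out[thrower] = [(v - mu) / sigma for v in values]
    return out


z = standardise_per_thrower(distances)

# Gap between the two throwers before and after standardisation.
gap_before = fmean(distances["A"]) - fmean(distances["B"])
gap_after = fmean(z["A"]) - fmean(z["B"])
print(gap_before, gap_after)
```

After standardisation both groups are centred on zero, so the difference caused by ball weight is gone along with any difference in strength.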
Or, in other words, using the example: how could we account for the difference in strength between the first and the second participant while not correcting for the fact that the first participant had, on average, heavier tennis balls than the second?
At first, I was thinking of running a Generalized Linear Model with the dependent variable being the distance at which the balls were thrown and all the other variables as regressors, and then subtracting from the dependent variable only the estimated effect of the variable for who threw the ball. I am, however, unsure whether this would make statistical sense, which is why I ask whether this approach would work or whether other techniques can be used.
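A minimal sketch of that idea, under several assumptions that are not part of the question: the model is the simplest GLM (Gaussian with identity link, i.e. ordinary least squares), distance depends linearly on weight and diameter, each thrower gets their own intercept, and the data are simulated. The names `fit_thrower_effects` and `remove_thrower_effect`, and all the numbers, are hypothetical. The fit uses the standard within-thrower demeaning trick, so only the standard library is needed:

```python
import random
from collections import defaultdict
from statistics import fmean


def fit_thrower_effects(rows):
    """Least-squares fit of distance = alpha[thrower] + b_w*weight + b_d*diameter.

    rows is a list of (thrower, weight, diameter, distance) tuples.
    """
    by_thrower = defaultdict(list)
    for thrower, weight, diameter, distance in rows:
        by_thrower[thrower].append((weight, diameter, distance))
    # Per-thrower means of (weight, diameter, distance).
    means = {t: tuple(fmean(col) for col in zip(*obs)) for t, obs in by_thrower.items()}

    # Demean within each thrower, then accumulate the 2x2 normal equations.
    sww = swd = sdd = swy = sdy = 0.0
    for thrower, weight, diameter, distance in rows:
        mw, md, my = means[thrower]
        cw, cd, cy = weight - mw, diameter - md, distance - my
        sww += cw * cw
        swd += cw * cd
        sdd += cd * cd
        swy += cw * cy
        sdy += cd * cy
    det = sww * sdd - swd * swd
    b_w = (swy * sdd - swd * sdy) / det
    b_d = (sww * sdy - swd * swy) / det
    # Thrower intercepts: what is left of each thrower's mean once the balls are accounted for.
    alpha = {t: my - b_w * mw - b_d * md for t, (mw, md, my) in means.items()}
    return b_w, b_d, alpha


def remove_thrower_effect(rows, alpha):
    """Subtract each thrower's intercept (relative to the average intercept) from distance."""
    ref = fmean(alpha.values())
    return [(t, w, d, y - (alpha[t] - ref)) for t, w, d, y in rows]


# Simulated data (assumption): throwers differ in strength AND in the balls they received.
rng = random.Random(0)
true_strength = {"p1": 30.0, "p2": 35.0, "p3": 25.0}
weight_shift = {"p1": 0.0, "p2": 2.0, "p3": -1.0}  # p2 was handed heavier balls
rows = []
for thrower, strength in true_strength.items():
    for _ in range(200):
        weight = rng.gauss(57.0 + weight_shift[thrower], 1.0)
        diameter = rng.gauss(6.7, 0.1)
        distance = strength - 0.3 * weight + 2.0 * diameter + rng.gauss(0.0, 0.2)
        rows.append((thrower, weight, diameter, distance))

beta_w, beta_d, intercepts = fit_thrower_effects(rows)
corrected = remove_thrower_effect(rows, intercepts)
print(beta_w, beta_d, intercepts)
```

Only the thrower intercepts are subtracted, so in the corrected distances a thrower who received heavier balls still throws shorter on average. Whether this is statistically sound on real data depends on the model being correctly specified (linearity, no thrower-by-weight interaction) and on the throwers' ball sets overlapping enough for the two effects to be separable.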
My first attempt at this would be to cluster the independent variables and use the cluster labels as a feature in a supervised learning predictor. That said, I think this is a good research problem that goes beyond the scope of a short Stack Overflow answer. I would recommend starting with the Exceptional Model Mining paper.
Hope this helps!
Answered by Adelson Araújo on January 29, 2021