Data Science Asked by chekhovana on April 7, 2021
This is a quotation from "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron:
“Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that predictors end up being less correlated so the ensemble’s variance is reduced.”
I can’t understand why bagging, as compared to pasting, results in higher bias and lower variance. Can anyone provide an intuitive explanation of this?
Let's say we have a set of 40 numbers from 1 to 40. We have to pick 4 subsets of 10 numbers.
Case 1 - Bagging -
We pick the first number, put it back, and then pick the next. This makes all the draws independent, so the resulting subsets have very little correlation with one another.
So, if you build a tree on the first subset of 10 samples and another tree on the next, the two trees will have little correlation and high variance among them (more independent splits).
At the same time, because of selection with replacement, data points will be repeated within each sample (a full-size bootstrap sample contains only ~63% unique points) [Ref], which increases the bias of the individual trees.
In actual bagging the sample size is equal to the size of the dataset; the smaller subsets above are used only to make the comparison with pasting easier.
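A quick simulation makes the repetition concrete (a minimal NumPy sketch; the sizes mirror the toy example above and are otherwise arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1, 41)          # the 40 numbers from the toy example

# Bagging-style draw: 10 numbers WITH replacement -> duplicates are likely
bag_subset = rng.choice(data, size=10, replace=True)

# Pasting-style draw: 10 numbers WITHOUT replacement -> all unique
paste_subset = rng.choice(data, size=10, replace=False)

print("bagging subset:", np.sort(bag_subset))
print("pasting subset:", np.sort(paste_subset))

# For a full-size bootstrap sample, only about 63% of points are unique (1 - 1/e)
n = 10_000
full_bootstrap = rng.choice(n, size=n, replace=True)
print("unique fraction:", np.unique(full_bootstrap).size / n)   # ~0.632
```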
The same logic applies to splitting on a random subset of features, i.e. a Random Forest.
A split on a particular feature may always lead to a correlated next split, so if we randomly pick a subset of features before each split, we further reduce the correlation between trees. But again, this comes at the cost of increased bias; see the sketch below.
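As a rough sketch of that idea in scikit-learn (the synthetic dataset and hyperparameters here are arbitrary), the max_features parameter controls how many randomly chosen features are considered at each split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# All features at every split: individually stronger but more correlated trees
rf_all = RandomForestClassifier(n_estimators=200, max_features=None,
                                oob_score=True, random_state=0).fit(X, y)

# Random sqrt(n_features) subset per split: slightly more biased but less correlated trees
rf_sub = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)

print("OOB score, all features:", rf_all.oob_score_)
print("OOB score, sqrt subset :", rf_sub.oob_score_)
```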
Case 2 - Pasting -
Here, because of selection without replacement, the data points within each sample are unique, which results in lower bias for the individual trees. However, the samples share more information with one another, so the trees end up more correlated and the ensemble's variance is reduced less than with bagging. A side-by-side comparison follows below.
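To see both settings side by side, here is a minimal sketch using scikit-learn's BaggingClassifier on an arbitrary synthetic dataset; bootstrap=True gives bagging and bootstrap=False gives pasting, with decision trees as the default base estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: each tree trains on a sample drawn WITH replacement
bagging = BaggingClassifier(n_estimators=50, max_samples=0.25,
                            bootstrap=True, random_state=0)

# Pasting: each tree trains on a sample drawn WITHOUT replacement
pasting = BaggingClassifier(n_estimators=50, max_samples=0.25,
                            bootstrap=False, random_state=0)

print("bagging accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("pasting accuracy:", cross_val_score(pasting, X, y, cv=5).mean())
```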
Answered by 10xAI on April 7, 2021