Data Science Asked on September 2, 2021
I am building a model to predict whether a customer will use a coupon for a given campaign, using logistic regression. I took 5 previous campaigns; in each, the conversion rate is around 10%. To handle this imbalanced data set and capture more information, I took a stratified sample from the pooled data (all 5 campaigns) so that the sample is 50% negatives and 50% positives. In other words, I am oversampling my positives.
My doubt: since logistic regression estimates its coefficients by maximizing the log-likelihood, will this oversampling produce biased results?
Also, am I right that this oversampling won't cause any problem for a random forest?
The effect on logistic regression is an increase in the intercept. I don't recommend oversampling unless every other solution has failed. Besides, 10% is not that severe an imbalance.
I've taken part in Kaggle competitions with far more imbalance where no resampling was used; logistic regression and random forest worked quite well without it.
Edit
After @Ben Reiniger's input, here's a theoretical justification of how the intercept changes.
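As a sketch of the standard case-control sampling result (assuming the positives are drawn at sampling rate $s_1$ and the negatives at rate $s_0$, relative to the original data):

```latex
% Under outcome-dependent (case-control) sampling, Bayes' rule gives
\log\frac{P(y=1 \mid x,\ \text{sampled})}{P(y=0 \mid x,\ \text{sampled})}
  = \beta_0 + \log\frac{s_1}{s_0} + \beta^\top x .
% Only the intercept shifts, by \log(s_1/s_0); the slope coefficients are unchanged.
% Going from 10% positives to a 50/50 sample means s_1/s_0 = 9,
% so the fitted intercept increases by about \log 9 \approx 2.2.
```

In other words, the oversampled model predicts inflated probabilities, but its ranking of customers is unchanged, which is why ranking metrics like AUC are unaffected.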
Also, a quick experiment showing that oversampling doesn't improve a metric like AUC, and that it does increase the intercept of the logistic regression model.
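The experiment can be reproduced along these lines (a minimal sketch on synthetic, campaign-like data with ~10% positives, using scikit-learn; the dataset and parameters are illustrative, not the ones from the original experiment):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives, mimicking the ~10% conversion rate.
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: fit on the imbalanced training data as-is.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Oversample the positives (with replacement) up to a 50/50 class ratio.
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0]
pos_up = np.random.default_rng(0).choice(pos, size=len(neg), replace=True)
idx = np.concatenate([neg, pos_up])
over = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])

# The intercept increases after oversampling; the AUC barely moves.
print("intercepts:", base.intercept_[0], over.intercept_[0])
print("AUC (base):", roc_auc_score(y_te, base.predict_proba(X_te)[:, 1]))
print("AUC (over):", roc_auc_score(y_te, over.predict_proba(X_te)[:, 1]))
```

The slope coefficients stay roughly the same; only the intercept absorbs the artificial change in base rate, so predicted probabilities are inflated while the ranking (and hence AUC) is essentially unchanged.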
Answered by David Masip on September 2, 2021