Data Science Asked by towi_parallelism on June 26, 2021
I am using repeated k-fold cross-validation (RepeatedKFold(n_splits=10, n_repeats=10, random_state=999) from sklearn) to get reliable scores for a linear regression on my dataset.
The dataset has some outliers that should stay, since similar cases may appear in future observations. When a model trained on the other folds tries to predict such observations, I get negative scores (at least, this is my interpretation).
Question: what should I do with one (or a few) bad score(s) out of many? How should I report them, and how useful would that be?
Using 10 splits and 10 repeats on a dataset of ~3000 observations, I get 100 R-squared scores, almost all of which are in a good range (0.97 to 0.99). There is only one guy ruining the game, and its score is so bad (-11535) that I cannot even get a meaningful average!
[ 9.87345591e-01 9.73912516e-01 ... -1.15353090e+04 ... 9.72986827e-01]
What should I do in this case? How do I report it and/or how do I cure it?
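Your result looks a bit strange at first sight (it's the R2, right?). On held-out data, though, R2 is not confined to the range 0 to 1: it turns negative whenever the model predicts worse than simply using the mean of the test fold, and a single wildly wrong prediction can push it far below zero. When you do 10-fold CV, each observation ends up in a test fold exactly once per repeat. So when almost all folds are okay but one fold in one repeat is very bad, it could be a coincidental clustering of your outliers in that one fold.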
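To see how a score like this can come out of sklearn's r2_score, here is a minimal sketch with made-up numbers (not your data): a test fold whose targets barely vary, plus one prediction that is far off.

```python
from sklearn.metrics import r2_score

# Made-up test fold (assumption): the targets barely vary ...
y_true = [1.0, 1.1, 0.9]
# ... and the model gets one of them badly wrong.
y_pred = [1.0, 1.1, 50.9]

# R^2 = 1 - SS_res / SS_tot. SS_tot is tiny here and SS_res is huge,
# so the score ends up far below zero.
r2 = r2_score(y_true, y_pred)
print(r2)
```

So a strongly negative value is not a bug in the scoring by itself; it says that, in that fold, the model did much worse than predicting the fold mean.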
For me this raises the question of robustness. So I would run "a lot" of 10-fold CV over my data to see if the problem occurs again (and to what extent). This is cheap in your case and gives you good arguments.
This also gives you a good idea of how sensitive your model actually is to potential "strange" outliers, which might be part of the real-world problem, if I understand you correctly.
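Such a robustness check could look roughly like this. It is only a sketch: the data is synthetic (one row is extreme in a feature that is otherwise almost constant), and the 0.9 threshold for a "bad" fold is arbitrary. With scores this skewed, the median and a few percentiles probably say more than the mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the real data (assumption, not the asker's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[:, 2] *= 0.001          # third feature is almost constant ...
X[0, 2] = 100.0           # ... except for one extreme observation
y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=300)

# Many repeats of 10-fold CV; each repeat reshuffles the folds.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=999)
scores = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=cv)

bad = scores < 0.9        # arbitrary threshold for a "bad" fold
print("folds:", scores.size, "bad folds:", int(bad.sum()))
print("median R^2:", np.median(scores))
print("5th/50th/95th percentile:", np.percentile(scores, [5, 50, 95]))
print("worst fold:", scores.min())
```

Reporting the number of bad folds next to the median and the worst score keeps the outlier-driven folds visible instead of hiding them in an average.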
If so, I would also check where outliers are particularly problematic, if possible. I don't know the scale of your model, but a look at predicted values vs. actual y may give you an idea.
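One way to do that check, again as a sketch on synthetic data: collect out-of-fold predictions with cross_val_predict and list the observations with the largest errors. A plain KFold is used here because cross_val_predict needs every observation to be predicted exactly once, which RepeatedKFold does not provide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the real data (assumption, not the asker's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[:, 2] *= 0.001          # third feature is almost constant ...
X[0, 2] = 100.0           # ... except for one extreme observation
y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=300)

# Each row is predicted by a model that never saw it during training.
cv = KFold(n_splits=10, shuffle=True, random_state=999)
y_oof = cross_val_predict(LinearRegression(), X, y, cv=cv)

# Rank observations by absolute out-of-fold error, largest first.
abs_err = np.abs(y - y_oof)
worst = np.argsort(abs_err)[::-1][:5]
for i in worst:
    print(f"row {i}: actual={y[i]:.2f} predicted={y_oof[i]:.2f} abs_err={abs_err[i]:.2f}")
```

The rows at the top of this list are the ones to inspect by hand, or to highlight in a predicted-vs-actual scatter plot.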
Answered by Peter on June 26, 2021