Data Science Asked by Tennessee Leeuwenburg on October 5, 2021
I was working through a tutorial on the Titanic disaster from Kaggle, and I'm getting different results depending on the details of how I use cross_validation.cross_val_score.
If I call it like:
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())
0.801346801347
I get a different set of scores than if I call it like:
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())
0.785634118967
These numbers are close, but different enough to be significant. As far as I understand, both code snippets are asking for a 3-fold cross validation strategy. Can anyone explain what is going on under the hood of the second example which is leading to the slightly lower score?
From the sklearn docs for cross_val_score's cv argument:
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
I believe that in the first case, StratifiedKFold is being used as the default. In the second case, you are explicitly passing a KFold generator.
The difference between the two is also documented in the docs.
KFold divides all the samples in $k$ groups of samples, called folds (if $k = n$, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). [...]

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
This difference in folds is what is causing the difference in scores.
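A minimal sketch of that difference, written against the modern sklearn.model_selection API rather than the deprecated cross_validation module (the class names are the same). With an imbalanced, ordered label array, plain KFold can produce folds with a very different class mix than StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced binary labels: 9 zeros followed by 3 ones, in order
y = np.array([0] * 9 + [1] * 3)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

# Plain KFold splits by position, so the class mix per fold can be skewed
kfold_counts = [np.bincount(y[test], minlength=2).tolist()
                for _, test in KFold(n_splits=3).split(X)]

# StratifiedKFold keeps roughly the same class ratio in every fold
skf_counts = [np.bincount(y[test], minlength=2).tolist()
              for _, test in StratifiedKFold(n_splits=3).split(X, y)]

print(kfold_counts)  # [[4, 0], [4, 0], [1, 3]] -- two folds have no positives at all
print(skf_counts)    # [[3, 1], [3, 1], [3, 1]] -- each fold has one positive
```

With folds that skewed, the classifier is trained and evaluated on quite different class distributions per fold, which is exactly the kind of thing that nudges the mean score.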
As a side note, I noticed that you are passing a random_state argument to the KFold object. However, you should note that this seed is only used if you also set KFold's shuffle parameter to True, which by default is False.
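To illustrate that side note, a small sketch (again using the modern sklearn.model_selection API; note that recent sklearn versions go further and raise an error if you set random_state while shuffle is False):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(-1, 1)  # 12 dummy samples

# shuffle=False (the default): folds are contiguous blocks, fully deterministic,
# so a random_state would have nothing to do
plain = [test.tolist() for _, test in KFold(n_splits=3).split(X)]
print(plain)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Only with shuffle=True does random_state control the split
s1 = [test.tolist()
      for _, test in KFold(n_splits=3, shuffle=True, random_state=1).split(X)]
s2 = [test.tolist()
      for _, test in KFold(n_splits=3, shuffle=True, random_state=1).split(X)]
assert s1 == s2  # same seed -> identical folds
```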
Correct answer by Reii Nakano on October 5, 2021
As mentioned by @Reii Nakano, if your estimator is a classifier and your $y$ is binary or multiclass, StratifiedKFold will be used; otherwise KFold will be used.
Another interesting point here is that you are passing random_state=1 to KFold, so the splits produced by your KFold object are not necessarily the same as the splits cross_val_score builds internally. Hence, your final scores may differ.
Answered by Sagar Sitap on October 5, 2021