TransWikia.com

sklearn.cross_validation.cross_val_score "cv" parameter question

Data Science Asked by Tennessee Leeuwenburg on October 5, 2021

I was working through a tutorial on the titanic disaster from Kaggle and I’m getting different results depending on the details of how I use cross_validation.cross_val_score.

If I call it like:

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

print(scores.mean())

0.801346801347

I get a different set of scores than if I call it like:

kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

print(scores.mean())

0.785634118967

These numbers are close, but different enough to be significant. As far as I understand, both code snippets are asking for a 3-fold cross validation strategy. Can anyone explain what is going on under the hood of the second example which is leading to the slightly lower score?

2 Answers

From the sklearn docs for cross_val_score's cv argument:

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

I believe that in the first case, StratifiedKFold is being used as the default. In the second case, you are explicitly passing a KFold generator.
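To see this default in action, here is a small sketch using the modern sklearn.model_selection API (the old sklearn.cross_validation module was later removed) and a synthetic dataset rather than the Titanic data: for a classifier with a binary target, passing cv=3 should produce exactly the same scores as explicitly passing an unshuffled StratifiedKFold(n_splits=3).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem standing in for the Titanic data.
X, y = make_classification(n_samples=90, random_state=0)
clf = LogisticRegression()

# cv=3 with a classifier and binary y -> StratifiedKFold under the hood.
default_scores = cross_val_score(clf, X, y, cv=3)
explicit_scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=3))

print((default_scores == explicit_scores).all())  # identical fold assignments
```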

The difference between the two is also documented in the docs.

KFold divides all the samples in $k$ groups of samples, called folds (if $k = n$, this is equivalent to the Leave One Out strategy), of equal sizes (if possible).

[...]

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

This difference in folds is what is causing the difference in scores.
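You can make the difference in folds concrete with a toy example (again using the modern sklearn.model_selection API; the imbalanced labels below are just a stand-in for titanic["Survived"]). Without shuffling, KFold slices the data into consecutive blocks, so the class mix per fold can drift badly, while StratifiedKFold keeps roughly the same class ratio in every fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((12, 1))            # features are irrelevant to how folds are cut
y = np.array([0] * 8 + [1] * 4)  # imbalanced binary target, sorted by class

# KFold takes consecutive index blocks: here two folds contain only class 0
# and one fold contains only class 1.
print([y[test].tolist() for _, test in KFold(n_splits=3).split(X, y)])

# StratifiedKFold puts samples of both classes into every fold, preserving
# the overall 2:1 class ratio approximately.
print([y[test].tolist() for _, test in StratifiedKFold(n_splits=3).split(X, y)])
```

With sorted labels like these, a classifier evaluated on the unstratified folds can be tested on a class it barely saw during training, which is exactly the kind of effect that nudges the averaged score.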

As a side note, I noticed that you are passing a random_state argument to the KFold object. However, that seed only takes effect if you also set KFold's shuffle parameter to True, which defaults to False.
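A minimal sketch of that interaction, assuming the modern sklearn.model_selection.KFold signature: without shuffling there is no randomness at all, and random_state only changes anything when combined with shuffle=True (and then the shuffled splits are reproducible across runs).

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)

# Without shuffle, folds are consecutive index blocks -- fully deterministic.
ordered = [test.tolist() for _, test in KFold(n_splits=3).split(X)]
print(ordered)  # [[0, 1], [2, 3], [4, 5]]

# random_state takes effect only together with shuffle=True; the same seed
# reproduces the same (no longer consecutive) splits every run.
shuffled = [test.tolist() for _, test in
            KFold(n_splits=3, shuffle=True, random_state=1).split(X)]
print(shuffled)
```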

Correct answer by Reii Nakano on October 5, 2021

As mentioned by @Reii Nakano, if your estimator is a classifier and your $y$ is binary, StratifiedKFold will be used; otherwise KFold will be used.

Another interesting point here is that you are using random_state=1 in KFold. So the splits produced by your KFold object are not necessarily the same as the splits cross_val_score builds internally.
Hence, your final scores may differ.

Answered by Sagar Sitap on October 5, 2021
