Data Science Asked by Tennessee Leeuwenburg on October 5, 2021
I was working through a tutorial on the Titanic disaster from Kaggle, and I'm getting different results depending on the details of how I use cross_validation.cross_val_score.
If I call it like:
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())
0.801346801347
I get a different set of scores than if I call it like:
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())
0.785634118967
These numbers are close, but different enough to be significant. As far as I understand, both code snippets are asking for a 3-fold cross validation strategy. Can anyone explain what is going on under the hood of the second example which is leading to the slightly lower score?
From the sklearn docs for cross_val_score's cv argument:
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
I believe that in the first case, StratifiedKFold is being used as the default. In the second case, you are explicitly passing a KFold generator.
The difference between the two is also documented in the docs.
KFold divides all the samples in $k$ groups of samples, called folds (if $k = n$, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). [...]

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
This difference in folds is what is causing the difference in scores.
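A minimal sketch of that difference, written against the modern sklearn.model_selection API rather than the deprecated cross_validation module (the class names are the same). With an imbalanced, ordered label array, plain KFold can produce folds with a very different class mix than StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced binary labels: 9 zeros followed by 3 ones, in order
y = np.array([0] * 9 + [1] * 3)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

# Plain KFold splits by position, so the class mix per fold can be skewed
kfold_counts = [np.bincount(y[test], minlength=2).tolist()
                for _, test in KFold(n_splits=3).split(X)]

# StratifiedKFold keeps roughly the same class ratio in every fold
skf_counts = [np.bincount(y[test], minlength=2).tolist()
              for _, test in StratifiedKFold(n_splits=3).split(X, y)]

print(kfold_counts)  # [[4, 0], [4, 0], [1, 3]] -- two folds have no positives at all
print(skf_counts)    # [[3, 1], [3, 1], [3, 1]] -- each fold has one positive
```

With folds that skewed, the classifier is trained and evaluated on quite different class distributions per fold, which is exactly the kind of thing that nudges the mean score.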
As a side note, I noticed that you are passing a random_state argument to the KFold object. However, you should note that this seed is only used if you also set KFold's shuffle parameter to True, which by default is False.
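To illustrate that side note, a small sketch (again using the modern sklearn.model_selection API; note that recent sklearn versions go further and raise an error if you set random_state while shuffle is False):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(-1, 1)  # 12 dummy samples

# shuffle=False (the default): folds are contiguous blocks, fully deterministic,
# so a random_state would have nothing to do
plain = [test.tolist() for _, test in KFold(n_splits=3).split(X)]
print(plain)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Only with shuffle=True does random_state control the split
s1 = [test.tolist()
      for _, test in KFold(n_splits=3, shuffle=True, random_state=1).split(X)]
s2 = [test.tolist()
      for _, test in KFold(n_splits=3, shuffle=True, random_state=1).split(X)]
assert s1 == s2  # same seed -> identical folds
```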
Correct answer by Reii Nakano on October 5, 2021
As mentioned by @Reii Nakano, if your estimator is a classifier and your $y$ is binary or multiclass, StratifiedKFold will be used; otherwise KFold will be used.
Another interesting point here is that you are passing random_state=1 to KFold, so the splits produced by your KFold object are not necessarily the same as the splits cross_val_score builds internally. Hence, your final scores may differ.
Answered by Sagar Sitap on October 5, 2021