Asked by moirenn on April 7, 2021
For a project I want to perform stratified 5-fold cross-validation, where for each fold the data is split into a test set (20%), a validation set (20%) and a training set (60%). I want both the test sets and the validation sets to be non-overlapping across the five folds.
This is how it’s more or less described on Wikipedia:
A single k-fold cross-validation is used with both a validation and test set. The total data set is split in k sets. One by one, a set is selected as test set. Then, one by one, one of the remaining sets is used as a validation set and the other k – 2 sets are used as training sets until all possible combinations have been evaluated. The training set is used for model fitting and the validation set is used for model evaluation for each of the hyperparameter sets. Finally, for the selected parameter set, the test set is used to evaluate the model with the best parameter set. Here, two variants are possible: either evaluating the model that was trained on the training set or evaluating a new model that was fit on the combination of the train and the validation set.
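For illustration, that scheme could be sketched roughly like this (just a sketch on my part, assuming X has already been loaded; KFold is only used to partition the indices into k folds):

import numpy as np
from sklearn.model_selection import KFold

# partition the indices into k folds (illustration only)
kf = KFold(n_splits=5, shuffle=True, random_state=2)
folds = [fold_idx for _, fold_idx in kf.split(X)]

for i, test_idx in enumerate(folds):
    rest = folds[:i] + folds[i + 1:]                          # the k - 1 remaining folds
    for j, val_idx in enumerate(rest):
        train_idx = np.concatenate(rest[:j] + rest[j + 1:])   # the k - 2 training folds
        # fit on train_idx, score each hyperparameter set on val_idx
    # evaluate the best hyperparameter set on test_idx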
Right now, I’ve implemented something that looks like the following (described here):
from sklearn.model_selection import KFold, train_test_split

kf = KFold(n_splits=5, shuffle=True, random_state=2)
for train_index, test_index in kf.split(X):
    # the held-out fold (20%) becomes this fold's test set
    X_tr_va, X_test = X.iloc[train_index], X.iloc[test_index]
    y_tr_va, y_test = y[train_index], y[test_index]
    # split the remaining 80% into 60% training / 20% validation
    X_train, X_val, y_train, y_val = train_test_split(X_tr_va, y_tr_va, test_size=0.25)
    print("TRAIN:", list(X_train.index), "VALIDATION:", list(X_val.index), "TEST:", test_index)
Although this nicely gives me 5 folds, each with a training set, a validation set and a test set, the validation sets overlap across the five folds (i.e. some instances in the validation set of fold 1 also appear in the validation set of fold 2, and so on). The test sets do not have this problem: they do not overlap between folds.
Is there a way to prevent this overlap between folds from occurring for the validation set as well?
You can set aside a portion of your dataset as a hold-out test set for the final model evaluation from the beginning, and run the k-fold cross-validation strategy on the rest of the data, as illustrated in the cross-validation diagram in the scikit-learn documentation (source: https://scikit-learn.org/stable/modules/cross_validation.html).
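A minimal sketch of this idea, assuming X is a pandas DataFrame and y a NumPy array as in your code (I use StratifiedKFold here since you want stratified folds, but plain KFold works the same way):

from sklearn.model_selection import StratifiedKFold, train_test_split

# hold out 20% of the data once as the final test set
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2)

# stratified 5-fold CV on the remaining 80%: each fold's held-out part
# is the validation set, so the validation sets never overlap
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
for train_index, val_index in skf.split(X_trval, y_trval):
    X_train, X_val = X_trval.iloc[train_index], X_trval.iloc[val_index]
    y_train, y_val = y_trval[train_index], y_trval[val_index]
    # fit on (X_train, y_train), tune hyperparameters on (X_val, y_val)

# evaluate the model with the chosen hyperparameters once on (X_test, y_test)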
This way, you avoid the overlap you mention and, most importantly, you make sure that the model is finally evaluated on data it has never seen before.
Answered by German C M on April 7, 2021