
Oversampling before Cross-Validation, is it a problem?

Data Science Asked by Debadri Dutta on December 29, 2020

I have a highly imbalanced multi-class classification problem to solve. Obviously I'm oversampling, but I'm running cross-validation on the oversampled dataset, so there should be repeated records in both the training and validation sets. I'm using the lightgbm algorithm, but surprisingly there is not much difference between the cross-validation score and the score on the unseen dataset.

However, I just want to know whether it's fine to do cross-validation after oversampling the dataset, and if not, why am I getting such close scores on the validation set and the unseen test set?

Also, if it's not correct to oversample before cross-validation, the workflow becomes lengthy: split the data into training and validation sets, oversample only the training set, and then, for the final model, if you want to use all the data, append the validation data to the training data and oversample again. Is there any shortcut to solve this problem?

2 Answers

Oversampling the training data may help the classifier to better predict the originally under-represented class. This does not mean that oversampled data should be used when computing performance metrics, as oversampling changes the original target distribution and thus biases the results.

Imagine the problem of cancer detection, where your original dataset is unbalanced: 10% of the patients have cancer (y=1) and the remaining 90% don't (y=0). If you train a classifier that is prone to error on unbalanced datasets (such as an artificial neural network), you may end up always predicting the majority class: y=0.
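As a quick numeric illustration of that trap (a sketch on synthetic labels, not from the original example):

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    # synthetic labels: 10% positive (cancer), 90% negative
    y_true = np.array([1] * 10 + [0] * 90)

    # a degenerate "classifier" that always predicts the majority class
    y_pred = np.zeros_like(y_true)

    print(accuracy_score(y_true, y_pred))  # 0.90 -- deceptively high
    print(recall_score(y_true, y_pred))    # 0.00 -- misses every cancer case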

If you oversample to a new distribution, let's say 50/50, your classifier is expected to improve, especially on the positive class. Nonetheless, since the real data you want to perform well on is itself skewed, measuring performance on oversampled data may not be the best choice.

Thus, if you are optimizing hyperparameters or choosing among a set of classifiers, cross-validating with oversampled data may give you a useful perspective on the classifier's ability to predict both classes with equal importance. Nonetheless, if you are estimating real-life prediction capability, I would not advise oversampling the validation data!
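To make that separation concrete, here is a minimal sketch (assuming imblearn's RandomOverSampler; the data and model are illustrative) that oversamples only the training split and evaluates on the untouched, skewed validation split:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import RandomOverSampler

    # illustrative imbalanced data: ~90% negative, ~10% positive
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

    # oversample the training split only; the validation split keeps
    # the original skewed distribution
    ros = RandomOverSampler(random_state=0)
    X_train_os, y_train_os = ros.fit_resample(X_train, y_train)

    clf = LogisticRegression(max_iter=1000).fit(X_train_os, y_train_os)
    print(classification_report(y_val, clf.predict(X_val)))  # unbiased estimate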

Correct answer by UrbanoFonseca on December 29, 2020

I suggest having a read of this article. The article explains:

When upsampling before cross-validation, you will end up picking the most over-optimistic model, because the oversampling allows data to leak from the validation folds into the training folds.

Instead, we should first split into training and validation folds. Then, on each fold, we should:

  1. Oversample the minority class in the training folds only
  2. Train the classifier on the training folds
  3. Validate the classifier on the remaining fold
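To see the inflation this leakage causes, here is a sketch (synthetic data, illustrative settings) contrasting oversampling before cross-validation with per-fold oversampling via an imblearn pipeline:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    # wrong: oversample the whole dataset first, then cross-validate;
    # synthetic minority points leak into the validation folds
    X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)
    leaky = cross_val_score(RandomForestClassifier(random_state=0),
                            X_os, y_os, scoring='recall', cv=cv).mean()

    # right: oversample inside each training fold via a pipeline
    pipe = make_pipeline(SMOTE(random_state=0), RandomForestClassifier(random_state=0))
    honest = cross_val_score(pipe, X, y, scoring='recall', cv=cv).mean()

    print(leaky, honest)  # expect the leaky recall to be noticeably higher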

Therefore, to avoid this leakage, try using the imblearn make_pipeline function so that the upsampling happens inside each cross-validation fold, like so:

    from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import recall_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline

    # shuffle=True is required when setting random_state on KFold
    kf = KFold(n_splits=5, random_state=42, shuffle=True)

    # define parameters for hyperparameter tuning
    params = {
        'n_estimators': [50, 100, 200],
        'max_depth': [4, 6, 10, 12],
        'random_state': [13]
    }

    # SMOTE runs inside each fold: it resamples only the training folds
    imba_pipeline = make_pipeline(SMOTE(random_state=42),
                                  RandomForestClassifier(n_estimators=100, random_state=13))
    cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)

    # prefix each parameter with the step name that make_pipeline assigns
    new_params = {'randomforestclassifier__' + key: params[key] for key in params}
    grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                             return_train_score=True)
    grid_imba.fit(X_train, y_train)

    # check recall on the validation folds
    grid_imba.best_score_

    # check recall on the test set
    y_test_predict = grid_imba.predict(X_test)
    recall_score(y_test, y_test_predict)

This will result in the validation set recall being a good estimate of the test set recall.

Answered by sums22 on December 29, 2020
