Data Science Asked on April 28, 2021
I am new to Machine Learning and trying to construct models that adhere to good practice and are not susceptible to bias. I have decided to use Sklearn’s Pipeline
class to ensure that my model is not prone to data leakage. I am building an ML model that attempts to predict the trend (Buy, Hold, Sell) for the next hour.
However, my multi-class classification dataset is extremely imbalanced. Whilst it is not necessarily a concern that the test set is imbalanced, it is important that the training set is balanced. I have researched this, but I cannot find an answer as to where the rebalancing step should be conducted. Should it be done before scaling or after? Should it be done before the train/test split or after?
I cannot figure out where this crucial step should be done. For simplicity’s sake, I will not be using SMOTE, but rather random minority upsampling. Any answer would be greatly appreciated.
My code is as follows:
```
# All necessary packages have already been imported
# Note: selecting multiple columns requires a list inside the indexer (double brackets)
x = df[['MACD', 'MFI', 'ROC', 'RSI', 'Ultimate Oscillator', 'Williams %R', 'Awesome Oscillator', 'KAMA',
        'Stochastic Oscillator', 'TSI', 'Volume Accumulator', 'ADI', 'CMF', 'EoM', 'FI', 'VPT', 'ADX', 'ADX Negative',
        'ADX Positive', 'EMA', 'CRA']]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

pipe = Pipeline([('sc', StandardScaler()),
                 ('svc', SVC(decision_function_shape = 'ovr'))])

candidate_parameters = [{'C': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3],
                         'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3],
                         'kernel': ['poly']}]

clf = GridSearchCV(estimator = pipe, param_grid = candidate_parameters, cv = 5, n_jobs = -1)
clf.fit(X_train, y_train)
```
According to this post, you should scale the data first:
My thought would be to standardize the data first (normalizing typically uses the min and max values, not the mean and standard deviation) and then over-sample, if that is what you are thinking in terms of balancing. I say this because you will want to use the same mean/standard deviation of the original set when you standardize new data, so that it mirrors the training set that was used.
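In practice, one way to respect that ordering without leaking information is to make both the scaler and the oversampler steps of an imbalanced-learn pipeline, so the resampling is fitted and applied only to the training folds during cross-validation and never to the validation or test data. This is a minimal sketch; the imblearn package and RandomOverSampler are my suggestion and are not part of your original code:
```
# Sketch: scale first, then randomly upsample the minority classes, all inside
# the pipeline so resampling only happens on the training folds during GridSearchCV.
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = ImbPipeline([
    ('sc', StandardScaler()),                    # standardize, fitted on the training fold only
    ('ros', RandomOverSampler(random_state=0)),  # random minority upsampling, applied during fit only
    ('svc', SVC(decision_function_shape='ovr'))
])
# pipe can be passed to GridSearchCV exactly like your current sklearn Pipeline.
```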
According to this post:
Sampling should always be done on the train dataset. If you are using Python, scikit-learn has some really useful packages to help you with this. Random sampling is a very bad option for splitting; try stratified sampling instead. This splits your classes proportionally between the training and test sets.
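With scikit-learn this is just the stratify argument of train_test_split. A sketch, reusing the x and y from your question:
```
# Stratified split: each class keeps the same proportion in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0)
```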
You should also set the class_weight parameter of sklearn.svm.SVC:
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
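In your pipeline that would look something like the sketch below; 'balanced' is one option, and you can also pass an explicit per-class weight dictionary instead:
```
# Reweight the SVC loss inversely to class frequency, in addition to (or instead of) resampling.
pipe = Pipeline([('sc', StandardScaler()),
                 ('svc', SVC(decision_function_shape='ovr', class_weight='balanced'))])
```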
Hope this helps.
Answered by Rusoiba on April 28, 2021