Data Science Asked on April 28, 2021
I am new to Machine Learning and trying to construct models that adhere to good practice and are not susceptible to bias. I have decided to use Sklearn’s Pipeline
class to ensure that my model is not prone to data leakage. I am building an ML model that attempts to predict the trend (Buy, Hold, Sell) for the next hour.
However, my multi-class classification dataset is extremely imbalanced. Whilst it is not necessarily a concern that the test set is imbalanced, it is important that the training set is balanced. I have researched this, but I cannot find an answer as to where the rebalancing step should be conducted. Should it be done before scaling or after? Should it be done before the train/test split or after?
I cannot figure out where this crucial step should be done. For simplicity’s sake, I will not be using SMOTE, but rather random minority upsampling. Any answer would be greatly appreciated.
My code is as follows:
```
# All necessary packages have already been imported
# Note: selecting multiple columns requires a list inside the indexer (double brackets)
x = df[['MACD', 'MFI', 'ROC', 'RSI', 'Ultimate Oscillator', 'Williams %R', 'Awesome Oscillator', 'KAMA',
        'Stochastic Oscillator', 'TSI', 'Volume Accumulator', 'ADI', 'CMF', 'EoM', 'FI', 'VPT', 'ADX', 'ADX Negative',
        'ADX Positive', 'EMA', 'CRA']]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

pipe = Pipeline([('sc', StandardScaler()),
                 ('svc', SVC(decision_function_shape = 'ovr'))])

candidate_parameters = [{'C': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3],
                         'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3],
                         'kernel': ['poly']}]

clf = GridSearchCV(estimator = pipe, param_grid = candidate_parameters, cv = 5, n_jobs = -1)
clf.fit(X_train, y_train)
```
According to this post, you should scale the data first:
My thought would be to standardize the data first (normalizing typically uses the min and max values, not the mean and standard deviation) and then over-sample, if that is what you are thinking in terms of balancing. I say this because you will want to use the same mean/standard deviation of the original set when you standardize new data, so that it mirrors the training set that was used.
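In practice, one way to respect that ordering without leaking information is to make both the scaler and the oversampler steps of an imbalanced-learn pipeline, so the resampling is fitted and applied only to the training folds during cross-validation and never to the validation or test data. This is a minimal sketch; the imblearn package and RandomOverSampler are my suggestion and are not part of your original code:
```
# Sketch: scale first, then randomly upsample the minority classes, all inside
# the pipeline so resampling only happens on the training folds during GridSearchCV.
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = ImbPipeline([
    ('sc', StandardScaler()),                    # standardize, fitted on the training fold only
    ('ros', RandomOverSampler(random_state=0)),  # random minority upsampling, applied during fit only
    ('svc', SVC(decision_function_shape='ovr'))
])
# pipe can be passed to GridSearchCV exactly like your current sklearn Pipeline.
```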
According to this post:
Sampling should always be done on the train dataset. If you are using Python, scikit-learn has some really useful packages to help you with this. Random sampling is a very bad option for splitting; try stratified sampling instead. This splits your classes proportionally between the training and test sets.
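With scikit-learn this is just the stratify argument of train_test_split. A sketch, reusing the x and y from your question:
```
# Stratified split: each class keeps the same proportion in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0)
```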
You should also set the class_weight parameter of sklearn.svm.SVC:
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
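In your pipeline that would look something like the sketch below; 'balanced' is one option, and you can also pass an explicit per-class weight dictionary instead:
```
# Reweight the SVC loss inversely to class frequency, in addition to (or instead of) resampling.
pipe = Pipeline([('sc', StandardScaler()),
                 ('svc', SVC(decision_function_shape='ovr', class_weight='balanced'))])
```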
Hope this helps.
Answered by Rusoiba on April 28, 2021