Data Science Asked by hanzgs on May 29, 2021
I have the imbalanced dataset:
data['Class'].value_counts()
Out[22]:
0 137757
1 4905
Name: Class, dtype: int64
X_train, X_valid, y_train, y_valid = train_test_split(input_x, input_y, test_size=0.20, random_state=seed)
print(sorted(Counter(y_train).items()))
[(0, 110215), (1, 3914)]
I tried different imblearn functions:
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour, EditedNearestNeighbours, RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN, InstanceHardnessThreshold, NeighbourhoodCleaningRule, TomekLinks
smote_enn = SMOTEENN(random_state=27)
smote_tomek = SMOTETomek(random_state=27)
adasyn = ADASYN(random_state=27)
borderline = BorderlineSMOTE(random_state=27)
ran_oversample = RandomOverSampler(random_state=27)
smote = SMOTE(random_state=27)
cnn = CondensedNearestNeighbour(random_state=27)
enn = EditedNearestNeighbours(random_state=27)
renn = RepeatedEditedNearestNeighbours(random_state=27)
allknn = AllKNN(random_state=27)
iht = InstanceHardnessThreshold(random_state=0)
ncr = NeighbourhoodCleaningRule(random_state=27)
tomek = TomekLinks(random_state=27)
Created different trains:
def BalancingData(function):
X_train_resampled, y_train_resampled = function.fit_sample(X_train, y_train)
print(sorted(Counter(y_train_resampled).items()))
return X_train_resampled, y_train_resampled
X_train_smote_enn, y_train_smote_enn = BalancingData(smote_enn)
X_train_smote_tomek, y_train_smote_tomek = BalancingData(smote_tomek)
X_train_adasyn, y_train_adasyn = BalancingData(adasyn)
X_train_borderline, y_train_borderline = BalancingData(borderline)
X_train_ran_oversample, y_train_ran_oversample = BalancingData(ran_oversample)
X_train_smote, y_train_smote = BalancingData(smote)
X_train_cnn, y_train_cnn = BalancingData(cnn)
X_train_enn, y_train_enn = BalancingData(enn)
X_train_renn, y_train_renn = BalancingData(renn)
X_train_allknn, y_train_allknn = BalancingData(allknn)
X_train_iht, y_train_iht = BalancingData(iht)
X_train_ncr, y_train_ncr = BalancingData(ncr)
X_train_tomek, y_train_tomek = BalancingData(tomek)
Then built the Keras model:
def my_model(X,y):
Keras_model = Sequential()
Keras_model.add(Dense(33,activation='sigmoid', kernel_initializer='glorot_uniform', kernel_constraint=maxnorm(12), input_shape=(input_len,)))
Keras_model.add(Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform'))
Keras_model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
Keras_model.fit(X, y, validation_data=(X_valid,y_valid), batch_size = 1000, epochs = 10, verbose = 0)
scores_valid = Keras_model.evaluate(X_valid, y_valid, verbose=1)
scores_train = Keras_model.evaluate(X, y, verbose=1)
scores_full = Keras_model.evaluate(X_train, y_train, verbose=1)
call with different imb train and valid sets
my_model(X_train, y_train)
my_model(X_train_smote_enn, y_train_smote_enn)
my_model(X_train_smote_tomek, y_train_smote_tomek)
and so on
I find very minor improvement in the accuracy in allKNN. This is the initial benchmark, I know we can still tune the parameters in allKNN and try to improve it. My question is: why can’t I find very much improvement without using the balancing functions the base model itself showing a better accuracy?
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP