Data Science Asked by Srinivas on August 10, 2021
I am participating in a Kaggle multiclass classification competition in which submissions are scored by log loss. I am using Keras and scikit-learn with a deep neural network, and have taken the approach below.
I corrected the class imbalance in the training data by oversampling the minority classes. I then split the training data into a training set (X_train, y_train) and a validation set (X_test, y_test). I have also scaled the features and one-hot encoded the labels.
When I run the model, I get a very good validation loss (1.708) and validation accuracy compared with the Kaggle leaderboard, where the top log loss is 1.744. But when I submit my predicted class probabilities for test_set, I get a very high loss (4+). (Separately, a different modelling approach gave me a decent score of 2.02, which is the score currently shown on the leaderboard.)
Why is this? Any suggestions on what I should do, or where I am going wrong?
total classes:
Class_3 51811
Class_7 51811
Class_2 51811
Class_5 51811
Class_1 51811
Class_9 51811
Class_6 51811
Class_8 51811
Class_4 51811
Name: target, dtype: int64
466299
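The oversampling code itself is not shown in the post. Balanced counts like the ones above could be produced by something along these lines. This is only a sketch: the helper name `oversample_to_max`, the toy frame, and the use of pandas sampling with replacement are assumptions, not the code that was actually run.

```python
import pandas as pd

def oversample_to_max(df: pd.DataFrame, label_col: str = "target", seed: int = 9) -> pd.DataFrame:
    """Resample each class with replacement up to the size of the largest class."""
    n_max = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)
        n_extra = n_max - len(group)
        if n_extra > 0:
            # Draw duplicated rows from the minority class
            parts.append(group.sample(n=n_extra, replace=True, random_state=seed))
    # ignore_index=True gives every row, including duplicates, a fresh index label
    return pd.concat(parts, ignore_index=True)

# Toy data (hypothetical): 4 rows of Class_1, 2 rows of Class_2
toy = pd.DataFrame({"feature_0": [0, 1, 2, 3, 4, 5],
                    "target": ["Class_1"] * 4 + ["Class_2"] * 2})
balanced = oversample_to_max(toy)
print(balanced["target"].value_counts())
```

If the oversampling is done like this before the split, copies of the same original row can land in both the training and the validation set, which is worth keeping in mind when reading the validation numbers.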
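from sklearn.model_selection import train_test_split as tts
# Hold out 30% of the (already oversampled) data as a validation set
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.3, stratify=y, random_state=9)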
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(326409, 75)
(326409, 9)
(139890, 75)
(139890, 9)
display(X_train.head(3))
display(X_test.head(3))
display(y_train[:3])
display(y_test[:3])
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_65 feature_66 feature_67 feature_68 feature_69 feature_70 feature_71 feature_72 feature_73 feature_74
425643 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 3 0 1 0 0 0
303754 2 3 2 2 5 0 0 1 1 1 ... 1 0 0 0 0 0 0 4 6 0
80710 2 8 2 0 18 2 0 2 1 3 ... 0 0 4 1 0 3 0 0 1 0
3 rows × 75 columns
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_65 feature_66 feature_67 feature_68 feature_69 feature_70 feature_71 feature_72 feature_73 feature_74
300226 0 0 1 4 0 0 0 4 1 1 ... 1 0 1 0 0 1 0 0 2 2
124793 0 0 0 6 0 0 0 3 7 2 ... 0 0 0 0 0 0 0 0 0 0
439437 0 3 0 0 5 0 0 2 1 1 ... 2 0 0 0 3 0 4 0 0 0
3 rows × 75 columns
array([[0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
array([[0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)
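The label encoding step is not shown either. A minimal sketch of one-hot encoding that would give arrays of the shape printed above, assuming the columns are ordered Class_1 through Class_9 (the post does not confirm this order, and the toy labels are hypothetical):

```python
import numpy as np

# Assumed column order of the one-hot matrix: Class_1, Class_2, ..., Class_9
classes = [f"Class_{i}" for i in range(1, 10)]
class_to_idx = {name: i for i, name in enumerate(classes)}

# Toy labels, chosen only for illustration
labels = ["Class_9", "Class_6", "Class_2"]
codes = np.array([class_to_idx[name] for name in labels])

# Row i of the identity matrix is the one-hot vector for class index i
one_hot = np.eye(len(classes), dtype="float32")[codes]
print(one_hot)
```

Whatever order is used here is also the order of the columns returned by `model.predict`, so the submission file has to label its probability columns in the same order.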
print(X_train.index.isin(X_test.index).sum())
print(X_test.index.isin(X_train.index).sum())
0
0
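The check above compares index labels only. If the oversampling step assigned new index values to duplicated rows, identical rows could still appear on both sides of the split without being detected. A sketch of a content-based check follows; the helper `count_shared_rows` and the toy frames are hypothetical.

```python
import pandas as pd

def count_shared_rows(a: pd.DataFrame, b: pd.DataFrame) -> int:
    """Count rows of `a` whose values in every column also occur in some row of `b`."""
    b_unique = b.drop_duplicates()
    # Left merge on all columns; '_merge' is 'both' where a matching row exists in b
    merged = a.merge(b_unique, how="left", on=list(a.columns), indicator=True)
    return int((merged["_merge"] == "both").sum())

# Toy frames with disjoint indexes but overlapping content
train_toy = pd.DataFrame({"feature_0": [0, 2, 2], "feature_1": [1, 3, 3]}, index=[10, 11, 12])
valid_toy = pd.DataFrame({"feature_0": [2, 5], "feature_1": [3, 7]}, index=[20, 21])

print(train_toy.index.isin(valid_toy.index).sum())   # index overlap
print(count_shared_rows(train_toy, valid_toy))        # content overlap
```

With count data like these features, some identical rows can occur naturally, so a non-zero count is a prompt to look closer rather than proof of leakage.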
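from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fits mean/std on the training split
X_test = scaler.fit_transform(X_test)  # note: refits the scaler on the validation split
test_set = scaler.fit_transform(test_set)  # note: refits again on the competition test set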
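The scaling snippet above calls `fit_transform` three times, so each dataset is standardized with its own mean and standard deviation. The usual recommendation is to fit the scaler once on the training data and only call `transform` on everything else. The sketch below uses made-up Poisson data to show the difference; whether this accounts for the gap between validation and leaderboard loss is not established here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy count features (hypothetical); the "test" data is deliberately shifted
train_toy = rng.poisson(2.0, size=(1000, 3)).astype(float)
test_toy = rng.poisson(5.0, size=(200, 3)).astype(float)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learn mean/std from training data only
test_scaled = scaler.transform(test_toy)        # reuse the training statistics

# What refitting on the test data does instead: the shift is erased
test_refit = StandardScaler().fit_transform(test_toy)

print(test_scaled.mean(axis=0))  # clearly above 0: the shift is preserved
print(test_refit.mean(axis=0))   # approximately 0 in every column
```

When the scaler is refit, the same raw feature value maps to different inputs for the network depending on which dataset it came from, so the model may see test inputs on a different scale from the one it was trained on.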
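from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
# Fully connected network: 75 input features, 9 softmax outputs (one per class)
model = Sequential()
model.add(Dense(1024, input_shape=(75,), activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(9, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
# Stop training when val_loss has not improved for 5 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
# validation_split=0.3 holds out the last 30% of X_train for the val_loss reported per epoch
model.fit(X_train, y_train, epochs=50, validation_split=0.3, callbacks=[early_stopping], batch_size=1024)
# evaluate() returns [loss, accuracy]; X_test/y_test is the hold-out split made earlier
accuracy = model.evaluate(X_test, y_test)[1]
print('Accuracy:', accuracy)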
............
Epoch 28/30
45/45 [==============================] - 5s 117ms/step - loss: 1.6676 - accuracy: 0.3626 - val_loss: 1.7675 - val_accuracy: 0.3333
Epoch 29/30
45/45 [==============================] - 5s 114ms/step - loss: 1.6140 - accuracy: 0.3809 - val_loss: 1.7815 - val_accuracy: 0.3357
Epoch 30/30
45/45 [==============================] - 5s 117ms/step - loss: 1.5942 - accuracy: 0.3869 - val_loss: 1.7126 - val_accuracy: 0.3563
4372/4372 [==============================] - 11s 2ms/step - loss: 1.7085 - accuracy: 0.3582
Accuracy: 0.3581957221031189
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
preds_val = model.predict(X_test)
preds_val[:3]
array([[1.13723904e-01, 5.20741269e-02, 4.70720865e-02, 1.59640312e-02,
1.92086305e-02, 2.25828230e-01, 1.81854114e-01, 1.99746847e-01,
1.44528091e-01],
[6.04994688e-03, 1.40825182e-01, 9.95656699e-02, 5.96038415e-04,
5.59030111e-09, 4.57442701e-02, 3.05081338e-01, 1.77178025e-01,
2.24959582e-01],
[6.54266328e-02, 9.87399742e-02, 1.07230745e-01, 1.46904245e-01,
6.80148089e-03, 1.52257413e-01, 1.22348621e-01, 1.58026025e-01,
1.42264828e-01]], dtype=float32)
log_loss(y_test, preds_val)
1.708450169537806
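For reference, multiclass log loss is the mean negative log of the probability assigned to the true class. The hand-rolled version below is a sketch for illustration (the function name and the toy inputs are assumptions, and it is not scikit-learn's implementation). It makes one baseline explicit: predicting 1/9 for every class scores ln(9) ≈ 2.197, so a leaderboard score above 4 is worse than a uniform guess, which means the submitted probabilities are putting very little mass on the true class.

```python
import numpy as np

def multiclass_log_loss(y_true_one_hot, probs, eps=1e-15):
    """Mean negative log-probability of the true class."""
    probs = np.clip(probs, eps, 1 - eps)              # avoid log(0)
    probs = probs / probs.sum(axis=1, keepdims=True)  # renormalize rows after clipping
    return float(-np.mean(np.sum(y_true_one_hot * np.log(probs), axis=1)))

# Toy one-hot labels for three samples (hypothetical)
y_toy = np.eye(9)[[8, 5, 1]]

# Baseline: the same probability for every class
uniform = np.full((3, 9), 1 / 9)
print(multiclass_log_loss(y_toy, uniform))

# Confidently wrong: only 1% of the mass on the true class
wrong = np.full((3, 9), 0.99 / 8)
wrong[y_toy == 1] = 0.01
print(multiclass_log_loss(y_toy, wrong))
```

The second case comes out at -ln(0.01) ≈ 4.6, the same order as the score reported for the submission.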