Data Science Asked on February 12, 2021
I want to build a classifier for a data frame that has four classes, where each row belongs to exactly one class. I have two predictive models: a neural network and a tree classifier, but they put every sample into a single class, during training and therefore during testing.
The problem is that the classification from my neural network is no better than random.
Here is how I call the model:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model
model = create_model(x_train.shape[1], y_train.shape[1])
epochs = 30
batch_sz = 64
print("Beginning model training with batch size {} and {} epochs".format(batch_sz, epochs))
checkpoint = ModelCheckpoint("lc_model.h5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto', period=1)
# train the model
history = model.fit(x_train.to_numpy(),
                    y_train.to_numpy(),
                    validation_split=0.2,
                    epochs=epochs,
                    batch_size=batch_sz,
                    verbose=2,
                    # class_weight=weights,  # class_weight tells the model to "pay more attention" to samples from an under-represented grade class.
                    # callbacks=[checkpoint]
                    )
# revert to the best model encountered during training
model = load_model("lc_model.h5", compile=False)
Here is the architecture of the model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm
# from tensorflow.python.compiler.tensorrt import trt_convert as trt
def create_model(input_dim, output_dim):
    print(output_dim)
    # create model
    model = Sequential()
    print("sequential")
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))
    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))
    # output layer
    model.add(Dense(output_dim, activation='softmax'))
    # Compile model
    model.compile(loss=focal_loss(alpha=1), loss_weights=None, optimizer='nadam', metrics=['accuracy'])
    return model
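The focal_loss function itself is not shown here; it is a custom loss defined elsewhere. For context, a typical Keras-style categorical focal loss looks roughly like the sketch below (an illustration only, not necessarily the implementation used above). Without the clipping step, log(0) can push the loss to ±inf, which may be related to the -inf values in the training log further down.
import tensorflow.keras.backend as K

def focal_loss(alpha=0.25, gamma=2.0):
    # Returns a loss function usable in model.compile; it down-weights easy examples.
    def loss(y_true, y_pred):
        # Clip predictions away from 0 and 1 to avoid log(0)
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        cross_entropy = -y_true * K.log(y_pred)
        weight = alpha * K.pow(1.0 - y_pred, gamma)
        return K.sum(weight * cross_entropy, axis=-1)
    return loss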
Here is a part of x_train:
id reg 0.0_x 1.0_x 17.0 21.0 30.0 40.0 50.0 60.0 70.0 Célibataire Divorcé(e) Marié et j'ai des enfants à charge Marié et je n'ai pas encore d'enfants à charge Refus de répondre Veuf (ve) 1er cycle universitaire / Licence 2e cycle universtaire / Master 3e cycle universtaire / Doctorat BTS Je n'ai jamais été à l'école Niveau collège Niveau lycée Niveau primaire Autre. Merci de préciser :@NS$ Infirme J'ai une société Je ne travaille pas Je suis commerçant Je suis encore étudiant Je suis independent Je suis journalier, je travaille de temps à autre Je suis retraité Je travaille dans le secteur privé Je travaille dans le secteur public 0.0_y 250.0 3750.0 7500.0 8750.0 11250.0 11500.0 18750.0 25000.0 35000.0 45000.0 50000.0 0.0_x.1 1.0_y 0.0_y.1 1.0_x.1 Je ne suis pas d'accord Je suis d'accord False_x True_y False_y True_x False_x.1 True_y.1 False_y.1 True_x.1 False_x.2 True_y.2 False_y.2 True_x.2 False_x.3 True_y.3 False_y.3 True_x.3 False_x.4 True_y.4 False_y.4 True_x.4 False_x.5 True_y.5 False_y.5 True_x.5 0.0_x.2 1.0_y.1 0.0_y.2 1.0_x.2 0.0_x.3 1.0_y.2 0.0_y.3 1.0
0 NaN 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1
1 NaN 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0
...
And here is part of y_train:
Voting intention in 2021_Cast a blank vote Voting intention in 2021_I know who I will be voting for in 2021 Voting intention in 2021_I won't vote Voting intention in 2021_I'm going to vote in 2021 but don't know for who
0 0 1 0 0
1 0 0 0 1
...
So when I train and validate this model, it does no better than random:
sequential
Beginning model training with batch size 64 and 30 epochs
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of samples seen.
Train on 768 samples, validate on 192 samples
Epoch 1/30
768/768 - 1s - loss: -inf - acc: 0.2448 - val_loss: -inf - val_acc: 0.2708
Epoch 2/30
768/768 - 0s - loss: -inf - acc: 0.2409 - val_loss: -inf - val_acc: 0.2708
...
Epoch 30/30
768/768 - 0s - loss: -inf - acc: 0.2409 - val_loss: -inf - val_acc: 0.2708
Indeed, the accuracy is just below 25%, which is what I would expect from randomly assigning the classes, and the model never seems to learn anything, as the loss stays at -inf.
I then evaluate the model on the test set, and it is even worse. Indeed, with the following code:
import numpy as np
from sklearn.metrics import f1_score
y_pred = model.predict(x_test.to_numpy())
# Revert one-hot encoding to classes
y_pred_classes = pd.DataFrame((y_pred.argmax(1)[:,None] == np.arange(y_pred.shape[1])),
columns=y_test.columns,
index=y_test.index)
y_test_vals = y_test.idxmax(1)
y_pred_vals = y_pred_classes.idxmax(1)
# F1 score
# Use idxmax() to convert back from one-hot encoding
f1 = f1_score(y_test_vals, y_pred_vals, average='weighted')
print("Test Set Accuracy: {:.2%} (But results would have been better if trained on the FULL dataset)".format(f1))
I get: Test Set Accuracy: 10.92%
I don't understand it; this is an architecture that I had managed to make work on another loan classification problem.
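As a side note, the snippet above actually prints a weighted F1 score under the "Accuracy" label. Plain accuracy could be computed alongside it with something like this (a small addition reusing y_test_vals and y_pred_vals from above):
from sklearn.metrics import accuracy_score

# Fraction of rows where the predicted class matches the true class
acc = accuracy_score(y_test_vals, y_pred_vals)
print("Test set accuracy: {:.2%}".format(acc))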
All of my earlier modelling attempts were either unweighted or did not use focal loss.
I tried to deal with the class imbalance in different ways, such as resampling. Without resampling, I used class weights:
weights = df_en2['Voting intention in 2021'].value_counts(normalize=True)
weights = weights.sort_index().tolist()
weights = {0: 1 / weights[0],
           1: 1 / weights[1],
           2: 1 / weights[2],
           3: 1 / weights[3]}
where df_en2 is the dataframe from which x_train, y_train, x_test and y_test are obtained with the split_data() function, which you can find here (it's basically the same architecture, but for the loan classification problem).
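For reference, a minimal sketch of how that weights dict could be fed back into the Keras training call shown earlier (this reuses create_model and the DataFrames from above; Keras maps one-hot targets to class indices internally when applying class_weight):
# Hypothetical re-run of training with the inverse-frequency weights enabled
model = create_model(x_train.shape[1], y_train.shape[1])
history = model.fit(x_train.to_numpy(),
                    y_train.to_numpy(),
                    validation_split=0.2,
                    epochs=30,
                    batch_size=64,
                    class_weight=weights,  # dict keys 0..3 match the four one-hot columns
                    verbose=2)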
In comparison, with a tree classifier, if I leave max_depth at None, the nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples, and I get an average accuracy of 42%.
# importing necessary libraries
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# dividing X, y into train and test data
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier().fit(x_train, y_train)
dtree_predictions = dtree_model.predict(x_test)
# creating a confusion matrix
cm = confusion_matrix(y_test.values.argmax(axis = 1), dtree_predictions.argmax(axis = 1))
And it returns:
array([[0.19047619, 0.15873016, 0.45454545, 0.27118644],
[0.15873016, 0.38095238, 0.2 , 0.30508475],
[0.15873016, 0.15873016, 0.4 , 0.22033898],
[0.19047619, 0.19047619, 0.21818182, 0.38983051]])
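Note that confusion_matrix returns raw counts, so the proportions above come from a normalization step that is not shown in the snippet. A common choice (an assumption here, not necessarily what produced these exact numbers) is to normalize each row by the true-class counts:
# Row-normalize the confusion matrix so each row sums to 1 (per-class recall view)
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(cm_normalized)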
With multi-label classification you want to use binary_crossentropy loss and sigmoid activation on the final layer since each class can take a value between 0 and 1. In your current case, your model is assuming that the sum of all values from the last layer should total 1, when in reality the total could be more than 1. Here is a good reference: https://towardsdatascience.com/multi-label-image-classification-with-neural-network-keras-ddc1ab1afede
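For example, the output layer and compile step could be changed along these lines (a sketch based on the architecture in the question, with everything else assumed unchanged):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm

def create_model(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(60, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))
    # sigmoid gives each class an independent score in [0, 1] instead of a softmax distribution
    model.add(Dense(output_dim, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
    return model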
Answered by LiamFiddler on February 12, 2021