Data Science Asked by throwawayz932 on June 14, 2021
I am training a multi-class LSTM classifier on approximately 700k documents of 40 words each.
My classes are very imbalanced: some have only 2 or 3 samples, while the biggest class has 48,548 documents.
My data was originally unlabeled, so I previously trained another model to cluster it. That process was: cleaning, etc. -> LSTM auto-encoder -> DBSCAN clustering on the codes -> automatic merge of clusters using difflib's ratio and a threshold -> manual checks to filter, re-merge, and drop noisy clusters -> automatic duplicate removal inside each class using FastDamerauLevenshtein (for example, the biggest class went from 100k down to 48k documents thanks to this).
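For illustration, the difflib-based merge step was essentially a pairwise similarity check of this form (the threshold value shown is arbitrary, just a sketch):

from difflib import SequenceMatcher

def should_merge(doc_a, doc_b, threshold=0.9):
    # difflib's ratio: 1.0 = identical strings, 0.0 = nothing in common
    return SequenceMatcher(None, doc_a, doc_b).ratio() >= threshold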
Now, my classes are very clean. Although imbalanced, texts inside a single class are very similar to each other, sometimes differing by just 1 or 2 words. So this should be an easy job for another classifier to learn.
I start by creating my training and validation sets, using a 90%-10% split (only 10% for validation, because some classes have as few as 3 documents and I don't want to take more documents away from them).
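A minimal sketch of that split, assuming the documents sit in a pandas DataFrame df with a label column and using a stratified split from scikit-learn (the exact split code isn't shown here):

from sklearn.model_selection import train_test_split

# Stratify on the label so each class keeps roughly the 90/10 ratio
# (stratification needs at least 2 documents per class, which holds here)
df_train, df_val = train_test_split(df, test_size=0.1, stratify=df['label'], random_state=42)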
I use a tokenizer to build a vocabulary and turn the texts into padded sequences:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 20000  # vocabulary size kept by the tokenizer
max_len = 40       # documents are 40 words long

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
X = pad_sequences(sequences, maxlen=max_len)
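For completeness, the labels are encoded into integer ids (Y_encoded) and one-hot targets for categorical_crossentropy, roughly like this (a sketch, not the exact code; the df_train['label'] column is assumed from the split above):

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

label_encoder = LabelEncoder()
Y_encoded = label_encoder.fit_transform(df_train['label'])   # integer class ids 0..n_classes-1
n_classes = len(label_encoder.classes_)
Y_train = to_categorical(Y_encoded, num_classes=n_classes)   # one-hot targets for categorical_crossentropy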
Then, to deal with the class imbalance, I found a solution online that smooths the class weights, assigning more weight to the less populated classes:
import math
import numpy as np

# Count the documents per class (Y_encoded holds the integer class ids)
class_weights = {}
for k_class in range(n_classes):
    class_weights[k_class] = np.count_nonzero(Y_encoded == k_class)

# Smooth the counts into weights: score = log(mu * total / count),
# clamped below at 1.0 so no class gets a weight smaller than 1
total = df_train.shape[0]
mu = 0.5
class_weights_smooth = dict()
for key in class_weights:
    score = math.log(mu * total / float(class_weights[key]))
    class_weights_smooth[key] = score if score > 1.0 else 1.0
Thanks to this, the weight of my biggest class (48k documents) is 1, and the weight of the smallest class is ~11. A ratio of 11:1 is still tiny compared to the actual imbalance in document counts, but it's at least better than nothing.
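These smoothed weights are then passed to Keras through the class_weight argument of model.fit once the model below is compiled; roughly like this (batch size shown is just an example, and X_train/X_val are the padded sequences of each split):

model.fit(
    X_train, Y_train,
    validation_data=(X_val, Y_val),
    epochs=20,
    batch_size=128,                     # example value
    class_weight=class_weights_smooth,  # per-class weights applied to the loss during training
)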
Now here’s the architecture of my classifier:
from tensorflow.keras.layers import Dropout, Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(max_words, 50, mask_zero=True, input_length=40))  # 20k-word vocab, 50-dim vectors, padding masked
model.add(LSTM(40, return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(40))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))  # one probability per class
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
I don’t think this model is very complex. It has only 30k parameters if you exclude the 1,000,000 parameters of the embedding layer.
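Quick back-of-the-envelope count to back that up (n_classes = 50 is used here purely for illustration):

# LSTM parameter count: 4 * ((input_dim + units) * units + units)
lstm1 = 4 * ((50 + 40) * 40 + 40)                   # 14,560 (input: 50-dim embeddings)
lstm2 = 4 * ((40 + 40) * 40 + 40)                   # 12,960 (input: 40-dim output of the first LSTM)
n_classes_example = 50                              # illustrative value only
dense = 40 * n_classes_example + n_classes_example  # 2,050
embedding = 20000 * 50                              # 1,000,000 (the embedding layer alone)
print(lstm1 + lstm2 + dense)                        # ~29,570 parameters outside the embedding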
Training is pretty slow; however, it seems to quickly overfit to the training data:
Epoch 1/20
4999/4999 [==============================] - 943s 189ms/step - loss: 1.0499 - accuracy: 0.9229 - val_loss: 1.8765 - val_accuracy: 0.5728
Epoch 2/20
4999/4999 [==============================] - 941s 593ms/step - loss: 0.3578 - accuracy: 0.9748 - val_loss: 1.4938 - val_accuracy: 0.6635
It goes on, and the training accuracy eventually reaches >99% while the validation accuracy still lags behind. I have to train for almost 20 epochs before it reaches 95% validation accuracy, and by then the model is heavily overfitted.
I'm quite lost and don't really know what to do now. How can I improve this?