
Why does the training loss increase and the model predict everything as '1' or '0'?

Asked on Data Science, March 10, 2021

[Two images: training/validation curves from the two experiments]

Those two pictures are from two similar experiments using the same code.

I am fine-tuning a pretrained BERT model on a binary text classification task. The dataset is balanced (50% positive vs. 50% negative), so the classifier should not be predicting a single class for the whole validation set, yet that is what the pictures show.
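For context, the setup looks roughly like the sketch below. This is a hypothetical reconstruction using the Hugging Face transformers library, not my exact code, and "bert-base-uncased" is just an example checkpoint name:

```python
# Hypothetical sketch of the setup, assuming the Hugging Face transformers library
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # two-class classification head on top of the pretrained encoder
)
```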

I used the AdamW optimizer with a decaying learning rate.
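Something along these lines (a sketch only; I am assuming a linear warmup/decay schedule here, and `train_dataloader` / `num_epochs` are placeholder names for my actual data loader and epoch count):

```python
# Sketch: AdamW with a linearly decaying learning rate (schedule assumed, not exact)
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
num_training_steps = len(train_dataloader) * num_epochs  # placeholder names
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
```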

I applied gradient clipping.
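Roughly like this inside each training step (max_norm=1.0 is the common default, not necessarily the exact value I used):

```python
# Sketch of one training step with gradient clipping before the optimizer update
loss = model(**batch).loss  # `batch` assumed to hold input_ids, attention_mask, labels
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```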

When I decrease the learning rate from 5e-5 to 3e-5 or 2e-5, it works fine.

What might be the problem here?
