Data Science Asked on March 10, 2021
These two pictures are from two similar experiments using the same code.
I am fine-tuning a pretrained BERT model on a binary text classification task. The dataset is balanced (50% positive, 50% negative), so the classifier shouldn't predict everything as one class on the validation set, as the pictures show.
I used the AdamW optimizer with a decreased learning rate.
I applied gradient clipping.
When I decrease the learning rate further, from 5e-5 to 3e-5 or 2e-5, it works fine.
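For reference, a minimal sketch of the training step described above (AdamW at the lowered learning rate plus gradient clipping). A tiny linear layer stands in for the pretrained BERT classifier, and the batch and clipping threshold are hypothetical values, not taken from the original code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the BERT model with a binary classification head
model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 8)            # dummy batch of features
y = torch.tensor([0, 1, 0, 1])   # balanced binary labels, as in the dataset

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# Gradient clipping, as in the original setup (max_norm is an assumed value)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

With a balanced 50/50 dataset, a loss stuck near ln(2) ≈ 0.693 while validation accuracy sits at 50% is the usual symptom of the collapse described above.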
What might be the problem here?