Asked by PeterPaul on April 1, 2021
We are training a BERT model (using the Huggingface library) for a sequence labeling task with six labels: five labels indicate that a token belongs to a class that is interesting to us, and one label indicates that the token does not belong to any class.
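For context, a minimal sketch of this kind of setup with the Huggingface library; the checkpoint name is an assumption, since the question does not say which BERT variant is used:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical checkpoint; swap in whichever BERT variant you actually fine-tune.
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Six labels: five "interesting" classes plus one "no class" label.
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=6)
```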
Generally speaking, this works well: the loss decreases with each epoch, and we get good enough results. However, if we compute precision, recall, and F-score on a test set after each epoch, we see that they oscillate quite a bit. We train for 1,000 epochs, and performance seems to plateau after about 100 of them. During the remaining 900 epochs, precision jumps around between seemingly random values from 0.677 to 0.709, and recall between 0.729 and 0.798. The model does not seem to stabilize.
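The question does not show how the metrics are computed, but a typical per-epoch evaluation function for the Trainer looks roughly like the sketch below; the label ids assigned to the five "interesting" classes are assumptions:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Token-level precision/recall/F-score, ignoring positions labeled -100."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    # Drop padding / special-token positions before scoring.
    mask = labels != -100
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels[mask],
        preds[mask],
        labels=[1, 2, 3, 4, 5],  # assumed ids of the five entity classes; 0 = "no class"
        average="macro",
        zero_division=0,
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```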
To mitigate the problem, we have already tried the following:
Does anyone have any recommendations on what we could do here? How can we pick the “best model”? Currently, we pick the one that performs best on the test set, but we are unsure about this approach.
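One common way to make this model selection explicit, sketched below under the assumption that the Huggingface Trainer is used: evaluate on a separate validation split each epoch, keep the checkpoint with the best F-score, and hold the test set back for a single final evaluation. The `model`, dataset, and `compute_metrics` names are placeholders for your own objects.

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# `model`, `train_dataset`, `val_dataset`, `compute_metrics` are assumed to exist.
training_args = TrainingArguments(
    output_dir="bert-seq-labeling",   # placeholder path
    num_train_epochs=100,
    evaluation_strategy="epoch",      # called `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,      # restore the checkpoint with the best metric
    metric_for_best_model="f1",
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,         # validation split, not the test set
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```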
BERT-style fine-tuning is known for its instability. Some aspects to take into account when facing this kind of issue are:
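As an illustration, commonly used stabilizers include a small learning rate, a warmup phase over the first steps, and repeating the run with several random seeds; the values in the sketch below are illustrative assumptions, not prescriptions:

```python
from transformers import TrainingArguments, set_seed

# Illustrative, assumed values; tune them for your own data.
for seed in (13, 42, 1234):          # fine-tuning results can vary a lot across seeds
    set_seed(seed)
    args = TrainingArguments(
        output_dir=f"run-seed-{seed}",
        learning_rate=2e-5,          # small learning rates tend to be more stable
        warmup_ratio=0.1,            # linear warmup over the first 10% of steps
        weight_decay=0.01,
        num_train_epochs=20,
        evaluation_strategy="epoch", # called `eval_strategy` in newer transformers releases
        seed=seed,
    )
    # Build a Trainer with `args` as in the earlier sketch, train, and keep the best run.
```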
There are two main articles that study BERT-like fine-tuning instability and that may be of use to you. They describe in detail most of the aspects I mentioned above:
Correct answer by noe on April 1, 2021