Data Science Asked on June 7, 2021
So the official English BERT model is trained on Wikipedia and BookCorpus (source).
Now, for example, let's say I want to use BERT for movie tag recommendation. Is there any reason for me to pretrain a new BERT model from scratch on a movie-related dataset?
Could my model become more accurate because it was trained on movie-related texts rather than general texts? Is there an example of such usage?
To be clear, the question is about the importance of the dataset's domain (context), not its size.
Sure, if you have a large, good-quality in-domain dataset, the results may well be better than with the generic pretrained BERT.
This has already been done before: BioBERT is a BERT model pretrained on biomedical texts:
[...] a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement).
Of course, other factors may be taken into account in the decision to pretrain such a model, e.g. computational budget.
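As a rough illustration of what in-domain pretraining can look like in practice, here is a minimal sketch of the cheaper "continue pretraining" variant (initializing from the public BERT checkpoint rather than starting from scratch, which is also what BioBERT did). It assumes the Hugging Face transformers and datasets libraries and a hypothetical plain-text file movie_corpus.txt with one movie-related document per line; treat it as a sketch, not a full training recipe.

```python
# Sketch: continue masked-language-model pretraining of BERT on an in-domain
# corpus (domain-adaptive pretraining). `movie_corpus.txt` is a hypothetical
# file containing movie-related text, one document per line.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load the raw in-domain text and tokenize it.
raw = load_dataset("text", data_files={"train": "movie_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# The collator randomly masks 15% of tokens, as in the original BERT objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="movie-bert",
    num_train_epochs=1,              # in practice, much longer training is needed
    per_device_train_batch_size=16,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The resulting checkpoint can then be fine-tuned on the downstream tag-recommendation task like any other pretrained BERT model.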
Correct answer by noe on June 7, 2021
BERT is a fairly large model that requires a lot of data and a lot of training time to achieve its state-of-the-art performance. More often than not, there isn't enough data or compute to train BERT from scratch. That's where these pretrained models are useful. The weights learned from prior training serve as a useful starting point for training on your dataset -- a concept referred to as transfer learning.
As a silly example: to properly generate movie tag recommendations, the model first needs to learn how to "read" the text. Or with image classification, it first needs to learn how to "see" the image. Training these models from scratch forces them to learn how to read or see before learning how to classify. With pretraining, the model already knows how to see/read and can spend its training time and resources on optimizing task performance.
Many people freeze most layers during transfer learning and focus on training the tail end of the model as a way to reduce the training time needed; a sketch of this is shown below. How many layers you freeze -- if you freeze any at all -- depends on how much time you're willing to put into training the model. Play around and see what happens with BERT. Good luck!
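To make the freezing idea concrete, here is a minimal sketch (assuming PyTorch and the Hugging Face transformers library) that loads a pretrained BERT classifier, freezes the encoder except for its last layer, and trains only the remaining parameters. The 3-label tag set and the toy batch are hypothetical placeholders for a real movie-tag dataset.

```python
# Sketch: transfer learning with most of BERT frozen. Only the classification
# head and the last encoder layer are trained. The 3-label tag set and the
# example sentences are hypothetical stand-ins for real movie tags/texts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Freeze the entire BERT encoder; only the classifier head stays trainable.
for param in model.bert.parameters():
    param.requires_grad = False

# Optionally unfreeze the last encoder layer to give the model a bit more room.
for param in model.bert.encoder.layer[-1].parameters():
    param.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)

# One illustrative training step on a toy batch.
batch = tokenizer(
    ["A moody neo-noir thriller", "A lighthearted romantic comedy"],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([0, 1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Freezing fewer layers (or none) usually gives better accuracy at the cost of more training time and memory, so it is worth trying a few configurations.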
Answered by Robert Link on June 7, 2021