Are there any objections to using the same (unlabelled) data for pre-training a BERT-based model and for the downstream task?

Question

I'm looking to train an ELECTRA model on unlabelled data from a specific domain. Are there any objections to using the same data first for unsupervised pre-training and then again for the downstream supervised task?

Jindřich · Answer

Not at all. A recent ACL paper by AllenAI even says this is the best way. They recommend continuing pre-training on the task data and claim that doing so reduces the problems caused by domain mismatch. So, if you train the model on in-domain data from the very beginning, that is probably a good thing, provided you have enough data for it.
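To make the two uses of the data concrete, here is a minimal sketch in plain Python, not a real training script. The example sentences, the labels and the helper names (`pretraining_corpus`, `mask_tokens`) are all made up for illustration. The masking is only a toy stand-in for a self-supervised objective: ELECTRA itself is trained with replaced-token detection rather than plain masking, and a real setup would use a proper tokenizer and model.

```python
import random

# Hypothetical task data: (text, label) pairs for the downstream classifier.
labelled = [
    ("the patient shows elevated glucose levels", 1),
    ("no abnormalities were found in the scan", 0),
    ("blood pressure remained stable overnight", 0),
]

def pretraining_corpus(examples):
    """Drop the labels: the self-supervised stage only ever sees raw text."""
    return [text for text, _ in examples]

def mask_tokens(text, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Toy masking on whitespace tokens (illustrative, not ELECTRA's objective).

    Returns the corrupted token list and a dict {position: original token}
    holding the prediction targets.
    """
    if rng is None:
        rng = random.Random(0)
    corrupted, targets = [], {}
    for i, tok in enumerate(text.split()):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

# Stage 1: self-supervised view of the data, no labels involved.
rng = random.Random(0)
stage1 = [mask_tokens(text, rng=rng) for text in pretraining_corpus(labelled)]

# Stage 2: supervised view of the very same texts, now paired with labels.
stage2 = [(text.split(), label) for text, label in labelled]
```

The point of the sketch is only that the first stage never touches the labels, so reusing the same texts there does not leak the supervised targets into pre-training.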

Answered by Jindřich on December 9, 2020
