
Dealing with high-frequency tokens during masked language modelling?

Data Science · Asked on June 15, 2021

Suppose I am pre-training a masked language model on a specific dataset. In that dataset, most sequences contain a particular token with very high frequency.

Sample Sequence:-
<tok1>, <tok1>, <tok4>, <tok7>, <tok4>, <tok4> ---> here tok4 is very frequent in this sequence

So if I mask some tokens and train the model to predict them, the model will obviously become biased toward predicting <tok4>, simply because of its statistical frequency.
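A minimal sketch of what I mean, assuming sequences are already lists of integer ids (all names and ids below are hypothetical, and BERT's 80/10/10 replace-with-random/keep-original scheme is omitted for brevity):

import random
from collections import Counter

MASK_ID = 103        # placeholder id for the [MASK] token
MASK_PROB = 0.15     # standard BERT masking rate

def mask_sequence(token_ids):
    """Return (masked_input, targets); targets are -100 where not masked."""
    masked, targets = [], []
    for tok in token_ids:
        if random.random() < MASK_PROB:
            masked.append(MASK_ID)
            targets.append(tok)    # the model must predict the original token
        else:
            masked.append(tok)
            targets.append(-100)   # ignored by the loss
    return masked, targets

# Positions are masked uniformly at random, so the prediction targets
# mirror the corpus token distribution:
seq = [1, 1, 4, 7, 4, 4]           # <tok1>, <tok1>, <tok4>, <tok7>, <tok4>, <tok4>
counts = Counter()
for _ in range(10_000):
    _, targets = mask_sequence(seq)
    counts.update(t for t in targets if t != -100)
print(counts)   # roughly 4500 x tok4, 3000 x tok1, 1500 x tok7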

Since <tok4> represents important information, downsampling (i.e., removing occurrences of the frequent token) is not what I want, and I would like to keep my sequences as intact as possible.
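To make "downsampling" concrete: it would usually mean word2vec-style subsampling (Mikolov et al., 2013), where each occurrence of a token w is independently dropped with probability 1 - sqrt(t / f(w)), with f(w) its relative corpus frequency and t a small threshold (1e-5 in the original paper). A sketch of that approach (function names are mine, frequencies assumed precomputed):

import math
import random

def keep_prob(freq, t=1e-5):
    """Probability of keeping one occurrence of a token with relative frequency `freq`."""
    return min(1.0, math.sqrt(t / freq))

def subsample(token_ids, rel_freq):
    """Stochastically drop frequent tokens; rel_freq maps token id -> relative frequency."""
    return [tok for tok in token_ids if random.random() < keep_prob(rel_freq[tok])]

Since this stochastically deletes occurrences of <tok4>, it would break exactly the sequence structure I want to keep intact.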

How best should I deal with this? Is there an established method for countering this problem?
