Practical attention models

Data Science Asked by Valentas on December 25, 2020

"Attention Is All You Need" is a nice paper that suggests using
positional encodings together with self-attention as an alternative to RNNs in its Transformer architecture.
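For reference, the positional encodings in that paper are fixed sinusoids added to the token embeddings so the model can represent word order without recurrence; a minimal NumPy sketch of that formula:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Sinusoidal encodings from "Attention Is All You Need":
        #   PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
        #   PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
        positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]       # the 2i values
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe                                      # added to the token embeddings

    print(positional_encoding(seq_len=50, d_model=16).shape)  # (50, 16)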

GPT-2 and GPT-3 are examples of this architecture, trained on input data at a massive scale.

Is there a paper and a model that uses positional encodings
and outcompetes RNN/LSTM-based models on small-scale datasets (MBs of text data, not terabytes)?

If there are many, which ones are the leading ones in production applications?

One Answer

Is there a paper and a model that uses positional encodings and outcompetes RNN/LSTM-based models on small-scale datasets (MBs of text data, not terabytes)?

Yes, there are several. Like GPT, they are still pre-trained on terabytes of data, but the embeddings they learn generalize well, so you can then fine-tune on a much smaller dataset. It works much the same way as transfer learning with a CNN, where a model is first trained on ImageNet and then fine-tuned on a specific task. This tends to give better results than RNNs/LSTMs.
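A minimal sketch of that CNN-style recipe, assuming a recent torchvision (the two-class head and learning rate are placeholders):

    import torch.nn as nn
    from torch.optim import Adam
    from torchvision import models

    # Load ImageNet-pre-trained weights (the `weights` argument needs
    # torchvision >= 0.13), freeze the backbone, and train only a new head.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in model.parameters():
        param.requires_grad = False                   # keep pre-trained features
    model.fc = nn.Linear(model.fc.in_features, 2)     # new task-specific classifier
    optimizer = Adam(model.fc.parameters(), lr=1e-3)
    # ...then run an ordinary training loop on the small task-specific dataset.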

If there are many, which ones are the leading ones in production applications?

The one that sees the most use is definitely BERT. Here is a really nice explanation of how it works. The transformers library from Hugging Face makes it really easy to work with BERT and other transformers that have already been pre-trained.
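For example, a minimal fine-tuning sketch with transformers and the companion datasets library; the IMDb subset, epoch count, and batch size are just illustrative placeholders:

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    # Any small labelled text dataset works; a few thousand IMDb reviews
    # stand in for "MBs of text data" here.
    dataset = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    dataset = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-small-data",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
        eval_dataset=dataset["test"].select(range(500)),
    )
    trainer.train()

Only the fine-tuning step needs to run on your small dataset; the expensive pre-training has already been done for you.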

Answered by Simon Larsson on December 25, 2020
