Practical attention models

Data Science Asked by Valentas on December 25, 2020

"Attention Is All You Need" is a nice paper that suggests using
positional encodings together with self-attention as an alternative to RNNs in its Transformer architecture.
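For reference, the positional encodings in that paper are fixed sinusoids added to the token embeddings so the model can represent word order without recurrence; a minimal NumPy sketch of that formula:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Sinusoidal encodings from "Attention Is All You Need":
        #   PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
        #   PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
        positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]       # the 2i values
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe                                      # added to the token embeddings

    print(positional_encoding(seq_len=50, d_model=16).shape)  # (50, 16)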

GPT-2 and GPT-3 are examples of this architecture, trained on input data at a massive scale.

Is there a paper and a model that uses positional encodings
and outcompetes RNN/LSTM-based models on small-scale datasets (MBs of text data, not terabytes)?

If there are many, which ones are the leading ones in production applications?

One Answer

Is there a paper and a model that uses positional encodings and outcompetes RNN/LSTM-based models on small-scale datasets (MBs of text data, not terabytes)?

Yes, there are several. Like GPT, they are still pre-trained on terabytes of data, but the embeddings they learn generalize well, so you can then fine-tune on a much smaller dataset. It works much the same way as transfer learning with a CNN, where a model is first trained on ImageNet and then fine-tuned on a specific task. This tends to give better results than RNNs/LSTMs.
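A minimal sketch of that CNN-style recipe, assuming a recent torchvision (the two-class head and learning rate are placeholders):

    import torch.nn as nn
    from torch.optim import Adam
    from torchvision import models

    # Load ImageNet-pre-trained weights (the `weights` argument needs
    # torchvision >= 0.13), freeze the backbone, and train only a new head.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in model.parameters():
        param.requires_grad = False                   # keep pre-trained features
    model.fc = nn.Linear(model.fc.in_features, 2)     # new task-specific classifier
    optimizer = Adam(model.fc.parameters(), lr=1e-3)
    # ...then run an ordinary training loop on the small task-specific dataset.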

If there are many, which ones are the leading ones in production applications?

The one that sees the most use is definitely BERT. Here is a really nice explanation of how it works. The transformers library from Hugging Face makes it really easy to work with BERT and other transformers that have already been pre-trained.
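For example, a minimal fine-tuning sketch with transformers and the companion datasets library; the IMDb subset, epoch count, and batch size are just illustrative placeholders:

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    # Any small labelled text dataset works; a few thousand IMDb reviews
    # stand in for "MBs of text data" here.
    dataset = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    dataset = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-small-data",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
        eval_dataset=dataset["test"].select(range(500)),
    )
    trainer.train()

Only the fine-tuning step needs to run on your small dataset; the expensive pre-training has already been done for you.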

Answered by Simon Larsson on December 25, 2020
