Data Science Asked on February 25, 2021
I know GPT is a Transformer-based Neural Network, composed of several blocks. These blocks are based on the original Transformer’s Decoder blocks, but are they exactly the same?
In the original Transformer model, Decoder blocks have two attention mechanisms: the first is Multi-Head Self-Attention, the second attends to the Encoder’s output (encoder-decoder or cross-attention). In GPT there is no Encoder, so I assume its blocks only have one attention mechanism. That’s the main difference I found.
At the same time, since GPT is used to generate language, its self-attention must be masked, so that each token can only attend to previous tokens. (Just like in Transformer Decoders.)
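To make sure I’m describing the masking correctly, here is a tiny sketch of what I mean (PyTorch, just for illustration, not anyone’s actual implementation): each position ends up with non-zero attention weights only for itself and earlier positions.

```python
import torch

seq_len = 5
# Lower-triangular boolean mask: row i is True at columns 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)                    # stand-in for Q K^T / sqrt(d)
scores = scores.masked_fill(~causal_mask, float("-inf"))  # block future positions
weights = torch.softmax(scores, dim=-1)                   # future positions get weight 0
```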
Is that it? Is there anything else to add to the difference between GPT (1,2,3,…) and the original Transformer?
GPT uses an otherwise unmodified Transformer decoder block, except that it lacks the encoder-attention (cross-attention) sub-layer. You can see this by comparing the block diagrams in the original Transformer paper and the GPT paper: the middle sub-layer of the decoder is simply removed.
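As a rough sketch (my own, not taken from either paper), a GPT-1-style block would look something like this in PyTorch: the original post-layer-norm decoder block with the cross-attention sub-layer removed, leaving masked self-attention followed by the feed-forward network.

```python
import torch
import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    """Transformer decoder block with the encoder-attention sub-layer removed."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: True marks positions a token is NOT allowed to attend to
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.ln1(x + attn_out)    # post-layer-norm, as in the original decoder
        x = self.ln2(x + self.ff(x))  # no cross-attention sub-layer in between
        return x
```

For example, `DecoderOnlyBlock()(torch.randn(2, 16, 512))` returns a tensor of the same shape.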
For GPT-2, this is clarified by the authors in the paper: layer normalization is moved to the input of each sub-block (pre-normalization), an additional layer normalization is added after the final block, the weights of the residual layers are scaled at initialization by 1/√N (N being the number of residual layers), the vocabulary is expanded to 50,257 tokens, and the context size is increased from 512 to 1024 tokens.
There have been several lines of research studying the effect of placing the layer normalization before or after the attention; for instance, the "sandwich transformer" studies different combinations of sub-layer orderings.
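To make the two placements concrete, this is the difference in sub-layer ordering (a schematic sketch; the function names are mine, not from the papers):

```python
def post_ln(x, sublayer, norm):
    # Original Transformer / GPT-1: normalize after the residual addition
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # GPT-2 / GPT-3: normalize the input of the sub-layer, residual outside
    return x + sublayer(norm(x))
```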
For GPT-3, there are further modifications on top of GPT-2, also explained in the paper: the model and architecture are the same as GPT-2 (including the modified initialization, pre-normalization and tokenization), except that the layers alternate between dense and locally banded sparse attention patterns, similar to the Sparse Transformer.
Correct answer by noe on February 25, 2021