Data Science Asked by Kishkashta on August 14, 2021
In the "Attention Is All You Need" paper, the decoder consists of two attention sub-layers in each layer followed by a FF sub-layer.
The first is a masked self-attention sub-layer, which takes as input the decoder's output from the previous step (the first input being a special start token).
The second, the 'encoder-decoder' attention sub-layer, takes its queries from the lower self-attention sub-layer and its keys & values from the encoder.
I do not see where the output of the FF sub-layer in the encoder is used; can someone explain where it goes?
Thanks
We can see this in the original Transformer diagram.
The output of the last encoder FF sub-layer is added to that sub-layer's input (the residual connection), and layer normalization is applied to the sum. The result is the output of the whole encoder, and it is used as the keys and values in the encoder-decoder attention blocks of the decoder.
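Here is a minimal PyTorch sketch of that data flow, assuming hypothetical dimensions and sequence lengths; it only illustrates the residual + layer norm around the last FF sub-layer and how the encoder output then serves as keys and values in the decoder's cross-attention, not a full Transformer:

```python
import torch
import torch.nn as nn

d_model = 512  # assumed model dimension

# --- last encoder layer: FF sub-layer with residual connection + layer norm ---
ff = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
ff_norm = nn.LayerNorm(d_model)

x = torch.randn(1, 10, d_model)        # input to the FF sub-layer (from encoder self-attention)
encoder_output = ff_norm(x + ff(x))    # residual + layer norm -> output of the whole encoder

# --- decoder: encoder-decoder ("cross") attention ---
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

decoder_self_attn_out = torch.randn(1, 7, d_model)  # queries from the decoder's masked self-attention
attn_out, _ = cross_attn(
    query=decoder_self_attn_out,
    key=encoder_output,    # keys come from the encoder output
    value=encoder_output,  # values come from the encoder output
)
```

In the full model this same encoder output is fed to the encoder-decoder attention block of every decoder layer.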
Correct answer by noe on August 14, 2021