Data Science Asked by Kishkashta on August 14, 2021
In the "Attention Is All You Need" paper, the decoder consists of two attention sub-layers in each layer followed by a FF sub-layer.
The first is a masked self-attention sub-layer, which takes as input the decoder's output from the previous step (the first input being a special start token).
The second, the 'encoder-decoder' attention sub-layer, takes its queries from the lower self-attention sub-layer and its keys & values from the encoder.
I do not see where the output of the FF sub-layer in the encoder is used; can someone explain where it goes?
Thanks
We can see this in the original Transformer diagram.
The output of the last encoder FF sub-layer is added to that sub-layer's input (the residual connection), and layer normalization is applied to the sum. The result is the output of the whole encoder, and it is used as the keys and values in the encoder-decoder attention blocks of the decoder.
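Here is a minimal PyTorch sketch of that data flow, assuming hypothetical dimensions and sequence lengths; it only illustrates the residual + layer norm around the last FF sub-layer and how the encoder output then serves as keys and values in the decoder's cross-attention, not a full Transformer:

```python
import torch
import torch.nn as nn

d_model = 512  # assumed model dimension

# --- last encoder layer: FF sub-layer with residual connection + layer norm ---
ff = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
ff_norm = nn.LayerNorm(d_model)

x = torch.randn(1, 10, d_model)        # input to the FF sub-layer (from encoder self-attention)
encoder_output = ff_norm(x + ff(x))    # residual + layer norm -> output of the whole encoder

# --- decoder: encoder-decoder ("cross") attention ---
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

decoder_self_attn_out = torch.randn(1, 7, d_model)  # queries from the decoder's masked self-attention
attn_out, _ = cross_attn(
    query=decoder_self_attn_out,
    key=encoder_output,    # keys come from the encoder output
    value=encoder_output,  # values come from the encoder output
)
```

In the full model this same encoder output is fed to the encoder-decoder attention block of every decoder layer.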
Correct answer by noe on August 14, 2021