Asked on March 11, 2021
I understand the mechanics of the encoder-decoder architecture used in the Attention Is All You Need paper. My question is more high-level, about the role of the decoder. Say we have a sentence translation task: Je suis étudiant -> I am a student.
The encoder receives Je suis étudiant as the input and generates the encoder output, which ideally should embed the context/meaning of the sentence.
The decoder receives this encoder output and an input query (I, am, a, student), and outputs the next word (am, a, student, EOS). This is done step by step, one word at a time.
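Concretely, the step-by-step loop I have in mind looks roughly like this (a minimal sketch using PyTorch's nn.Transformer; the token ids, sizes and special-token ids are made up for illustration, and the model is untrained):

```python
import torch
import torch.nn as nn

# Sizes, token ids and special tokens below are made up for illustration.
vocab_size, d_model = 1000, 512
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8)
to_logits = nn.Linear(d_model, vocab_size)

# Encode the source sentence once; ids 3, 17, 42 stand in for "Je suis étudiant".
src = embed(torch.tensor([[3], [17], [42]]))   # (src_len, batch=1, d_model)
memory = transformer.encoder(src)              # the context of the whole sentence

bos, eos = 1, 2                                # assumed special-token ids
ys = torch.tensor([[bos]])                     # decoder input starts with <BOS>
for _ in range(10):                            # decode step by step
    tgt = embed(ys)
    tgt_mask = transformer.generate_square_subsequent_mask(ys.size(0))
    out = transformer.decoder(tgt, memory, tgt_mask=tgt_mask)
    next_token = to_logits(out[-1]).argmax(dim=-1)  # greedy choice of the next word
    ys = torch.cat([ys, next_token.unsqueeze(0)], dim=0)
    if next_token.item() == eos:
        break
```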
Now, do I understand correctly that the decoder is doing two things: consuming the encoder output as the context of the source sentence, and using the words generated so far to predict the next word?
If this is not the right way to think about it, can someone give a better explanation?
Also, if I have a classification or regression task for a time series, do I need the decoder? I would think the encoder alone would suffice, since the output of the model is a single value rather than a sequence with its own context.
Yes, you are right in your understanding of the role of the decoder.
However, your use of "query" here, while technically somewhat correct, is a bit unusual. You are referring to the partially decoded sentence as the "query". While the partially decoded sentence is indeed used as the query in the decoder's first multi-head attention block, people normally do not call it the "query" when describing the decoder at a conceptual level.
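To make the terminology concrete, here is roughly where the query appears inside a decoder layer (an illustrative sketch with random tensors; the causal mask, residual connections, layer norms and feed-forward block of a real decoder layer are omitted):

```python
import torch
import torch.nn as nn

# Illustrative sizes and random tensors only.
d_model, nhead = 512, 8
self_attn = nn.MultiheadAttention(d_model, nhead)
cross_attn = nn.MultiheadAttention(d_model, nhead)

partial_target = torch.randn(4, 1, d_model)   # "I am a student" decoded so far
encoder_output = torch.randn(3, 1, d_model)   # encoder output for "Je suis étudiant"

# Block 1: the partially decoded sentence is query, key AND value
# (masked self-attention over the target prefix).
x, _ = self_attn(partial_target, partial_target, partial_target)

# Block 2: the decoder states act as the query, while the encoder
# output provides the keys and values (encoder-decoder attention).
x, _ = cross_attn(x, encoder_output, encoder_output)
```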
About needing the decoder in classification or regression tasks: the decoder is used when the output of the model is a sequence. If the output is a single value, e.g. a class label or a single regression target, the encoder suffices. If you want to predict multiple future values of a time series, you should probably use a decoder, which lets you condition not only on the input but also on the partially generated output values.
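For the encoder-only case, a minimal sketch could look like this (sizes are placeholders; a real model would first project the raw input features to d_model and add positional encodings):

```python
import torch
import torch.nn as nn

# Placeholder sizes for illustration only.
d_model, nhead, n_classes = 64, 4, 3
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
head = nn.Linear(d_model, n_classes)          # Linear(d_model, 1) for regression

series = torch.randn(50, 1, d_model)          # 50 time steps, batch of 1
hidden = encoder(series)                      # (50, 1, d_model)
logits = head(hidden.mean(dim=0))             # pool over time, then classify
```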
Correct answer by noe on March 11, 2021
Further details regarding your last question: your problem is a sequence classification exercise, so decoders aren't needed. You can use a Dense layer to predict the labels. You can even add 'Attention'-like features by using all hidden-state outputs from the encoder, instead of only the last one, when predicting the labels.
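A rough sketch of that pooling idea (illustrative sizes; the single Linear scoring layer is just one simple way to weight the hidden states):

```python
import torch
import torch.nn as nn

# Illustrative sizes; 'score' assigns a learned weight to each time step.
d_model, n_classes, seq_len = 64, 3, 50
score = nn.Linear(d_model, 1)                  # one score per time step
head = nn.Linear(d_model, n_classes)

hidden = torch.randn(seq_len, 1, d_model)      # all encoder hidden states
weights = torch.softmax(score(hidden), dim=0)  # normalized over the time axis
pooled = (weights * hidden).sum(dim=0)         # weighted average: (1, d_model)
logits = head(pooled)                          # predict the labels from the summary
```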
Answered by Allohvk on March 11, 2021