Asked on March 11, 2021
I understand the mechanics of the encoder-decoder architecture used in the Attention Is All You Need paper. My question is more high-level, about the role of the decoder. Say we have a sentence translation task: Je suis étudiant -> I am a student.
The encoder receives Je suis étudiant as the input and generates the encoder output, which ideally should embed the context/meaning of the sentence.
The decoder receives this encoder output and an input query (I, am, a, student), and outputs the next word (am, a, student, EOS). This is done step by step, one word at a time.
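Concretely, the step-by-step loop I have in mind looks roughly like this (a minimal sketch using PyTorch's nn.Transformer; the token ids, sizes and special-token ids are made up for illustration, and the model is untrained):

```python
import torch
import torch.nn as nn

# Sizes, token ids and special tokens below are made up for illustration.
vocab_size, d_model = 1000, 512
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8)
to_logits = nn.Linear(d_model, vocab_size)

# Encode the source sentence once; ids 3, 17, 42 stand in for "Je suis étudiant".
src = embed(torch.tensor([[3], [17], [42]]))   # (src_len, batch=1, d_model)
memory = transformer.encoder(src)              # the context of the whole sentence

bos, eos = 1, 2                                # assumed special-token ids
ys = torch.tensor([[bos]])                     # decoder input starts with <BOS>
for _ in range(10):                            # decode step by step
    tgt = embed(ys)
    tgt_mask = transformer.generate_square_subsequent_mask(ys.size(0))
    out = transformer.decoder(tgt, memory, tgt_mask=tgt_mask)
    next_token = to_logits(out[-1]).argmax(dim=-1)  # greedy choice of the next word
    ys = torch.cat([ys, next_token.unsqueeze(0)], dim=0)
    if next_token.item() == eos:
        break
```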
Now, do I understand correctly that the decoder is doing two things: consuming the encoder output as the context of the source sentence, and using the words generated so far to predict the next word?
If this is not the right way to think about it, can someone give a better explanation?
Also, if I have a classification or regression task for a time series, do I need the decoder? I would think the encoder alone would suffice, since the output of the model is a single value rather than a sequence with its own context.
Yes, you are right in your understanding of the role of the decoder.
However, your use of "query" here, while technically somewhat correct, is a bit unusual. You are referring to the partially decoded sentence as the "query". While the partially decoded sentence is indeed used as the query in the decoder's first multi-head attention block, people normally do not call it the "query" when describing the decoder at a conceptual level.
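To make the terminology concrete, here is roughly where the query appears inside a decoder layer (an illustrative sketch with random tensors; the causal mask, residual connections, layer norms and feed-forward block of a real decoder layer are omitted):

```python
import torch
import torch.nn as nn

# Illustrative sizes and random tensors only.
d_model, nhead = 512, 8
self_attn = nn.MultiheadAttention(d_model, nhead)
cross_attn = nn.MultiheadAttention(d_model, nhead)

partial_target = torch.randn(4, 1, d_model)   # "I am a student" decoded so far
encoder_output = torch.randn(3, 1, d_model)   # encoder output for "Je suis étudiant"

# Block 1: the partially decoded sentence is query, key AND value
# (masked self-attention over the target prefix).
x, _ = self_attn(partial_target, partial_target, partial_target)

# Block 2: the decoder states act as the query, while the encoder
# output provides the keys and values (encoder-decoder attention).
x, _ = cross_attn(x, encoder_output, encoder_output)
```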
About needing the decoder in classification or regression tasks: the decoder is used when the output of the model is a sequence. If the output is a single value, e.g. a class label or a single regression target, the encoder suffices. If you want to predict multiple future values of a time series, you should probably use a decoder, which lets you condition not only on the input but also on the partially generated output values.
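For the encoder-only case, a minimal sketch could look like this (sizes are placeholders; a real model would first project the raw input features to d_model and add positional encodings):

```python
import torch
import torch.nn as nn

# Placeholder sizes for illustration only.
d_model, nhead, n_classes = 64, 4, 3
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
head = nn.Linear(d_model, n_classes)          # Linear(d_model, 1) for regression

series = torch.randn(50, 1, d_model)          # 50 time steps, batch of 1
hidden = encoder(series)                      # (50, 1, d_model)
logits = head(hidden.mean(dim=0))             # pool over time, then classify
```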
Correct answer by noe on March 11, 2021
Further details regarding your last question: your problem is a sequence classification exercise, so decoders aren't needed. You can use a Dense layer to predict the labels. You can even add 'Attention'-like features by using all hidden-state outputs from the encoder, instead of only the last one, when predicting the labels.
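A rough sketch of that pooling idea (illustrative sizes; the single Linear scoring layer is just one simple way to weight the hidden states):

```python
import torch
import torch.nn as nn

# Illustrative sizes; 'score' assigns a learned weight to each time step.
d_model, n_classes, seq_len = 64, 3, 50
score = nn.Linear(d_model, 1)                  # one score per time step
head = nn.Linear(d_model, n_classes)

hidden = torch.randn(seq_len, 1, d_model)      # all encoder hidden states
weights = torch.softmax(score(hidden), dim=0)  # normalized over the time axis
pooled = (weights * hidden).sum(dim=0)         # weighted average: (1, d_model)
logits = head(pooled)                          # predict the labels from the summary
```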
Answered by Allohvk on March 11, 2021