Attention network without hidden state?

Data Science Asked by JMRC on December 22, 2020

I was wondering how useful the encoder’s hidden state is to an attention network. When I looked into the structure of an attention model, this is roughly what I found such a model generally looks like:

x: Input.
h: Encoder's hidden state, which feeds forward into the next encoder hidden state.
s: Decoder's hidden state, which takes a weighted sum of all the encoder's hidden states as input and feeds forward into the next decoder hidden state.
y: Output.

For a process like translation, why is it important for the encoder’s hidden states to feed forward, or to exist in the first place? We already know what the next x is going to be. Therefore, the order of the input isn’t necessarily important for the order of the output, and neither is what has been memorized from the previous input, since the attention model looks at all inputs simultaneously. Couldn’t you just use attention directly on the embedding of x?
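
For concreteness, here is a minimal sketch of the structure listed above (NumPy, with dot-product scoring as one common choice; all names are illustrative): encoder hidden states h, one decoder state s, and a context vector built as a softmax-weighted sum of the h's.

    import numpy as np

    def softmax(z):
        z = z - z.max()              # numerical stability
        e = np.exp(z)
        return e / e.sum()

    T, d = 5, 8                      # input length, hidden size
    h = np.random.randn(T, d)        # encoder hidden states, one per input token x_t
    s = np.random.randn(d)           # current decoder hidden state

    scores = h @ s                   # alignment scores, shape (T,)
    alpha = softmax(scores)          # attention weights over the input positions
    context = alpha @ h              # weighted sum of encoder states, shape (d,)
    # 'context' is what feeds into the next decoder hidden state s.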

One Answer

Though translations are not done on a word-for-word basis, there is nonetheless strong merit in retaining the sequence of the words on the encoder side. There is a huge penalty to pay for this, since it forces serialization, but in spite of that LSTMs and GRUs became very popular, so one can imagine that sequence order matters. After the encoder is done processing the sequence, the final state it generates acts as a sort of sentence embedding and contains the essence of the sentence. This is a good starting point for the decoder to pick up and use. Unlike what you have assumed, the model does not look only at the context generated by the attention layer to make predictions. It also uses the previous LSTM state along with the context (and the last translated word) to make the next prediction. If you trace the previous LSTM state back to the very beginning of the decoder, you can see it has its origins in the final LSTM state of the encoder.
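
To make that concrete, here is a rough PyTorch sketch of one such decoder step. The exact wiring varies between papers (this follows a common Luong-style layout), and the module names and sizes here are illustrative, not taken from any particular model.

    import torch
    import torch.nn as nn

    d_emb, d_hid, vocab = 32, 64, 1000
    embed    = nn.Embedding(vocab, d_emb)
    cell     = nn.LSTMCell(d_emb + d_hid, d_hid)   # input: prev word embedding + attention context
    out_proj = nn.Linear(2 * d_hid, vocab)         # predict from [decoder state; context]

    def decoder_step(prev_word, prev_state, context):
        # One decoding step: uses the previous LSTM state, the attention
        # context, and the last translated word, as described above.
        h_prev, c_prev = prev_state                # at t=0 this is the encoder's final state
        inp = torch.cat([embed(prev_word), context], dim=-1)
        h, c = cell(inp, (h_prev, c_prev))
        logits = out_proj(torch.cat([h, context], dim=-1))
        return logits, (h, c)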

Having said that, your question is still very pertinent. The concept of attention is so powerful that, with self-attention and multi-head attention, it is now possible to do away with the RNN at the encoder end altogether and just use the representation generated purely by the process of 'attention'. But even here the authors of the landmark paper "Attention Is All You Need" add in a small hack to retain the order of the words in the input sentence. This seems to help the model make better predictions.
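
That "small hack" is positional encoding. Below is a short NumPy sketch of the sinusoidal version from the paper (the function name is just for illustration): fixed sin/cos signals are added to the token embeddings so the otherwise order-blind attention layers can tell positions apart.

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
        i = np.arange(d_model)[None, :]            # (1, d_model)
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        # even dimensions get sin, odd dimensions get cos
        return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

    # embeddings_with_order = token_embeddings + positional_encoding(T, d_model)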

Answered by Allohvk on December 22, 2020
