Asked by PKlumpp on December 30, 2020
Over the past few days, I have been reading up on the theory behind attention: when to apply it and what types there are. I think I have a decent first understanding of the concept, but now I would like to apply some of the insights I got to my own project, and I find myself stuck with the implementation of attention in TF. (Quick Link to TF Attention)
The attention layer requires me to provide at least the queries and values. Correct me if I am wrong already, but this is my idea of what they are:

- Query: the hidden state of the decoder, which attends to the encoder
- Value: the hidden states of the encoder, which are attended to
So far so good. The thing I am struggling with is that I have no idea where the hidden states of my decoder might come from. I would like to implement a self-attention mechanism, so my decoder hidden states are generated dynamically, and I cannot know them before actually applying the attention layer. The example provided in the docs was not helpful for me, because it focuses on a problem where I already have some query sequence.
Apart from whether the mentioned TF attention layer is applicable to self-attention at all, how do I interpret the different inputs?
In self-attention, it is not the decoder attending to the encoder; instead, the layer attends to itself, i.e., the queries and values are the same.
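For instance, with the TF layer linked in the question, you can get self-attention simply by passing the same tensor as both the query and the value. A minimal sketch, where the input shapes and layer sizes are illustrative assumptions:

```python
import tensorflow as tf

# Self-attention with tf.keras.layers.Attention: the same tensor is
# passed as both query and value (the key defaults to the value).
inputs = tf.keras.Input(shape=(None, 64))      # (batch, timesteps, features)
hidden = tf.keras.layers.Dense(64)(inputs)     # hidden states of the layer

# No separate decoder states are needed: the layer attends to itself.
context = tf.keras.layers.Attention()([hidden, hidden])  # (batch, timesteps, 64)

model = tf.keras.Model(inputs, context)
model.summary()
```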
In practice, this is usually done in a multi-head setup. You can view that as every head focusing on collecting a different kind of information from the hidden states. In multi-head attention with $H$ heads, you first linearly project the states into $H$ query vectors, $H$ key vectors, and $H$ value vectors, apply the attention, concatenate the resulting context vectors, and project them back into the original dimension.
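In recent TF versions, this whole recipe is packaged as a single layer. A minimal sketch of multi-head self-attention, where the head count and dimensions are assumptions for illustration:

```python
import tensorflow as tf

# Multi-head self-attention: the layer internally projects the input
# into per-head queries, keys, and values, attends, concatenates the
# per-head context vectors, and projects back to the model dimension.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=32)

x = tf.random.normal((2, 10, 256))   # (batch, timesteps, d_model)

# Self-attention: the same tensor serves as query and value
# (and, implicitly, as key).
context = mha(query=x, value=x)      # -> shape (2, 10, 256)
print(context.shape)
```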
Correct answer by Jindřich on December 30, 2020
The 'attention' terminology differs before and after a landmark 2017 paper: Attention Is All You Need.
Before 2017
Remember: the 'query' attends to all the 'values'.
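To make that concrete, here is a sketch of the classic pre-Transformer pattern, where one decoder hidden state (the query) attends over all encoder hidden states (the values); the tensor names, shapes, and plain dot-product scoring are illustrative assumptions:

```python
import tensorflow as tf

# One decoder state (query) attends over all encoder states (values).
encoder_states = tf.random.normal((1, 12, 128))  # (batch, src_timesteps, units)
decoder_state  = tf.random.normal((1, 1, 128))   # (batch, 1, units), current step

# Score the query against every value, then normalize with softmax.
scores  = tf.matmul(decoder_state, encoder_states, transpose_b=True)  # (1, 1, 12)
weights = tf.nn.softmax(scores, axis=-1)

# Context vector: attention-weighted sum of the encoder states.
context = tf.matmul(weights, encoder_states)     # (1, 1, 128)
```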
So far so good. Attention mechanisms were used widely from 2014 to 2017 to improve the performance of RNNs. Then, in 2017, a revolutionary paper came out: Attention Is All You Need. It means what its title says: basically, chuck out your RNNs and use just attention to encode sequences. By using self-attention, the model is able to build relationships between timesteps within an input sequence and encode it. An RNN is not needed.
So post-2017, you can use self-attention on the input sequences or the output sequences independently. You don't need the traditional RNN as encoder or decoder; the attention mechanism itself does the job of the RNN. So you have an encoder and a decoder that use attention models. In this scenario, the queries and the values (and also the keys) all come from the previous layer's output. The calculation goes something like this:

$$mathrm{Attention}(Q, K, V) = mathrm{softmax}left(frac{QK^T}{sqrt{d_k}}right)V$$
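A short sketch of that calculation in TF, where $Q$, $K$, and $V$ are all taken to be the previous layer's output (shapes are illustrative assumptions):

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # Scores of every query against every key, scaled by sqrt(d_k).
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    # Context vectors: attention-weighted sums of the values.
    return tf.matmul(weights, v)

x = tf.random.normal((2, 10, 64))            # previous layer's output
# Self-attention: Q, K, and V are (projections of) the same tensor.
out = scaled_dot_product_attention(x, x, x)  # -> shape (2, 10, 64)
```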
Answered by Allohvk on December 30, 2020