TransWikia.com

Attention mechanism in Tensorflow 2

Data Science Asked by PKlumpp on December 30, 2020

Over the past few days, I have read up on the theory behind attention: when to apply it and what types there are. I think I have a decent first understanding of the concept, but now I would like to apply some of these insights to my own project, and I find myself stuck with the implementation of attention in TF. (Quick Link to TF Attention)

The attention layer requires me to provide at least the queries and values. Correct me if I am wrong already, but this is my idea of what they are:

  1. Queries: These are the hidden states of my decoder
  2. Values: These are the hidden states of my encoder

So far so good. The thing I am struggling with is the fact that I have no idea where the hidden states of my decoder might come from. I would like to implement a self-attention mechanism. So my decoder hidden states are generated dynamically and I cannot know them before actually applying the attention layer. The example provided in the docs was not helpful for me, because it focused on a problem where I already have some query sequence.

Apart from whether the mentioned TF attention layer is applicable for self-attention, how do I interpret the different inputs?

2 Answers

In self-attention, it is not the decoder attending to the encoder; rather, the layer attends to itself, i.e., the queries and values are the same.
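This answers the question directly: with `tf.keras.layers.Attention`, you can pass the same tensor as both query and value, so no decoder states are needed. A minimal sketch (the toy shapes are illustrative assumptions, not from the original post):

```python
import tensorflow as tf

batch, timesteps, features = 2, 5, 8                # toy dimensions
x = tf.random.normal((batch, timesteps, features))  # e.g. encoder hidden states

attention = tf.keras.layers.Attention()  # dot-product (Luong-style) attention
context = attention([x, x])              # query = value = x  ->  self-attention

print(context.shape)                     # (2, 5, 8): one context vector per timestep
```

Each output position is a weighted sum of all positions of the same sequence, which is exactly the self-attention behavior asked about.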

In practice, this is usually done in a multi-head setup. You can view that as every head focusing on collecting a different kind of information from the hidden states. In multi-head attention with $H$ heads, you first linearly project the states into $H$ query vectors, $H$ key vectors, and $H$ value vectors, apply attention in each head separately, concatenate the resulting context vectors, and project them back into the original dimension.
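In TF 2 this whole pipeline is packaged in `tf.keras.layers.MultiHeadAttention`: the layer performs the per-head query/key/value projections, the attention itself, and the final output projection. A sketch with arbitrary illustrative choices of `num_heads` and `key_dim`:

```python
import tensorflow as tf

batch, timesteps, d_model = 2, 5, 16
x = tf.random.normal((batch, timesteps, d_model))

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)
out, weights = mha(query=x, value=x, return_attention_scores=True)

print(out.shape)      # (2, 5, 16): concatenated heads, projected back to d_model
print(weights.shape)  # (2, 4, 5, 5): one attention map per head
```

Passing the same tensor as `query` and `value` again gives self-attention; the per-head attention maps returned by `return_attention_scores=True` let you inspect what each head focuses on.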

Correct answer by Jindřich on December 30, 2020

The 'attention' terminology differs before and after a landmark paper in 2017 - Attention Is All You Need.

Before 2017

  • 'Query' - the hidden state of the decoder from the previous timestep.
  • 'Values' - all the hidden states of the encoder.

Remember - the 'query' attends to all the 'values'

So far so good. Attention mechanisms were used widely from 2014 to 2017 to improve the performance of RNNs. Then in 2017 a revolutionary paper came out - Attention Is All You Need. It means what its title says: basically, throw out your RNNs and use just attention to encode sequences. By using self-attention, the model is able to build relationships between timesteps within an input sequence and encode it. An RNN is not needed.

So post-2017, you can use self-attention on the input sequences or the output sequences independently. You don't need the traditional RNN as encoder or decoder; the attention mechanism itself does the job of the RNN. So you have an encoder and a decoder that use attention models. In this scenario, the queries, the values, and also the keys all come from the previous layer's output. The calculation goes something like this:

  • Get the word embedding for each word, then create three vectors from the embedding vector - key, query, and value. These are created from the weight matrices learned during training.
  • Now take each word and calculate scores by taking the dot product of the query vector (of the current word) with the key vector of each word we're scoring against.
  • Divide the scores by the square root of the dimension of the key vectors to get more stable gradients.
  • Apply a softmax to the scores.
  • Multiply each value vector by its softmax score to amplify relevant words and drown out irrelevant ones.
  • The final step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
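The steps above can be sketched in a few lines of NumPy. Here `W_q`, `W_k`, and `W_v` stand in for the trained weight matrices; the random numbers are purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_embed, d_k = 3, 4, 4

X = rng.normal(size=(seq_len, d_embed))   # word embeddings, one row per word

# Step 1: create query, key, and value vectors from the embeddings
W_q, W_k, W_v = (rng.normal(size=(d_embed, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Steps 2-3: dot-product scores, scaled by sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)

# Step 4: softmax over each row (numerically stabilized)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Steps 5-6: weight the value vectors and sum them up
output = weights @ V                      # one context vector per word

print(output.shape)                       # (3, 4)
```

Row `i` of `output` is the self-attention result for word `i`: a mixture of all value vectors, weighted by how strongly word `i`'s query matches each word's key.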

Answered by Allohvk on December 30, 2020
