Data Science: Asked by PhysicsPrincess on March 15, 2021
I am new to NLP and I just finished reading the paper "Attention Is All You Need".
I'm struggling to understand the interpretability of multi-headed attention, and specifically how the attention visualizations in the paper were produced.
I understand that the output of the self-attention sub-layer (for a single head) is, at each position, a vector of size d_v computed as a weighted sum of all the value vectors. How, then, do they use these vectors to calculate the strengths of the relations between positions?
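For reference, the paper defines the attention function as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where the softmax produces, for each query position, a distribution of weights over all key positions.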
Any help and insight would be appreciated, thanks a lot!
So the question is concerned with understanding the self-attention mechanism in greater detail, in particular how multi-head self-attention is used to compute the strength of relations between tokens.
I think it's best to look through this great tutorial on self-attention and see if it helps your understanding of multi-head self-attention: http://www.peterbloem.nl/blog/transformers
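The key point for those visualizations is that the "strengths" are not derived from the output vector at all: they are the attention weights themselves, i.e. the softmax(QK^T / sqrt(d_k)) matrix, one per head. Entry (i, j) of that matrix says how strongly position i attends to position j, and that is what gets plotted as line thickness or heatmap intensity. Here is a minimal NumPy sketch (the function names and toy dimensions are my own, not from the paper) that returns the weight matrix alongside the output:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token representations.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    # Output is (seq_len, d_v); weights is what the paper visualizes.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 4, 8, 8, 8  # toy sizes, chosen arbitrarily
X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_v))

out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.round(2))  # weights[i, j] = how strongly position i attends to position j
```

In a multi-head layer you get one such weight matrix per head, and the paper's figures juxtapose the matrices of different heads (e.g. in different colors) to show that distinct heads learn distinct relation patterns.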
Answered by shepan6 on March 15, 2021