Data Science: Asked by PhysicsPrincess on March 15, 2021
I am new to NLP and I just finished reading the paper "Attention Is All You Need".
I'm struggling to understand the interpretability of multi-headed attention, and specifically how the attention visualizations in the paper were produced.
I understand that the output of the self-attention sub-layer (for a single head) is, at each position, a vector of size d_v computed as a weighted sum of all the value vectors. How, then, do they use these vectors to calculate the strengths of the relations between positions?
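For reference, the paper defines the attention function as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where the softmax produces, for each query position, a distribution of weights over all key positions.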
Any help and insight would be appreciated, thanks a lot!
So the question is concerned with understanding the self-attention mechanism in greater detail, in particular how multi-head self-attention is used to compute the strength of relations between tokens.
I think it's best to look through this great tutorial on self-attention and see if it helps your understanding of multi-head self-attention: http://www.peterbloem.nl/blog/transformers
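The key point for those visualizations is that the "strengths" are not derived from the output vector at all: they are the attention weights themselves, i.e. the softmax(QK^T / sqrt(d_k)) matrix, one per head. Entry (i, j) of that matrix says how strongly position i attends to position j, and that is what gets plotted as line thickness or heatmap intensity. Here is a minimal NumPy sketch (the function names and toy dimensions are my own, not from the paper) that returns the weight matrix alongside the output:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token representations.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    # Output is (seq_len, d_v); weights is what the paper visualizes.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 4, 8, 8, 8  # toy sizes, chosen arbitrarily
X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_v))

out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.round(2))  # weights[i, j] = how strongly position i attends to position j
```

In a multi-head layer you get one such weight matrix per head, and the paper's figures juxtapose the matrices of different heads (e.g. in different colors) to show that distinct heads learn distinct relation patterns.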
Answered by shepan6 on March 15, 2021