Data Science: Asked by Ilya.K. on May 1, 2021
Hi, in the original paper the following scheme of the self-attention module appears (figure not shown here):

https://arxiv.org/pdf/1805.08318.pdf

In a later overview, a different scheme appears (also not shown here), referring back to the original paper:

https://arxiv.org/pdf/1906.01529.pdf
My understanding aligns more with the scheme in the second paper, in which there are two dot-product operations and three learned parameter matrices:
$$W_k, W_v, W_q$$
These correspond to $W_f, W_g, W_h$, but without the additional $W_v$ that appears in the original paper's explanation, which is as follows:
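(Since the image did not carry over, here are the equations from the published version of the original paper as I understand them:)

$$f(x) = W_f\,x, qquad g(x) = W_g\,x, qquad h(x) = W_h\,x$$
$$beta_{j,i} = frac{exp(s_{ij})}{sum_{i=1}^{N} exp(s_{ij})}, qquad s_{ij} = f(x_i)^{top}\, g(x_j)$$
$$o_j = W_v left( sum_{i=1}^{N} beta_{j,i}\, h(x_i) right), qquad y_i = gamma\, o_i + x_i$$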
Is this a mistake in the original paper?
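To make the comparison concrete, here is a minimal PyTorch sketch of the attention block as I read the original paper. The class and variable names are my own illustrative choices; the $C/8$ channel reduction follows the paper, and note the fourth matrix $W_v$ applied after the attention-weighted sum:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over the spatial positions of a feature map (my sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        # f, g, h, v are 1x1 convolutions, i.e. per-position linear maps.
        self.f = nn.Conv2d(channels, channels // 8, 1)  # ~ W_f (key-like)
        self.g = nn.Conv2d(channels, channels // 8, 1)  # ~ W_g (query-like)
        self.h = nn.Conv2d(channels, channels // 8, 1)  # ~ W_h (value-like)
        self.v = nn.Conv2d(channels // 8, channels, 1)  # the extra W_v in question
        self.gamma = nn.Parameter(torch.zeros(1))       # residual scale, starts at 0

    def forward(self, x):
        B, C, H, W = x.shape
        N = H * W
        f = self.f(x).view(B, -1, N)  # (B, C/8, N)
        g = self.g(x).view(B, -1, N)  # (B, C/8, N)
        h = self.h(x).view(B, -1, N)  # (B, C/8, N)
        # First dot product: s_ij = f(x_i)^T g(x_j); softmax over i gives beta_{j,i}.
        beta = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)  # (B, N, N)
        # Second dot product: attention-weighted sum of h, then W_v maps back to C channels.
        o = self.v(torch.bmm(h, beta).view(B, -1, H, W))          # (B, C, H, W)
        return self.gamma * o + x
```

In this sketch $W_v$ is not one of the three attention matrices at all but a fourth, output-side projection mapping the reduced $C/8$ channels back to $C$; whether that matches the intent of the original figure is exactly what I am asking about.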