Asked by kyc12 on January 25, 2021
Positional encoding using sine-cosine functions is often used in transformer models.
Assume that $X \in \mathbb{R}^{l \times d}$ is the embedding of an example, where $l$ is the sequence length and $d$ is the embedding size. This positional encoding layer encodes $X$'s position $P \in \mathbb{R}^{l \times d}$ and outputs $P + X$.
The position $P$ is a 2-D matrix, where $i$ refers to the position in the sentence and $j$ refers to the position along the embedding dimension. In this way, each position in the original sequence is encoded using the equations below:
$$P_{i, 2j} = \sin\left(\frac{i}{10000^{2j/d}}\right)$$
$$P_{i, 2j+1} = \cos\left(\frac{i}{10000^{2j/d}}\right)$$
for $i = 0, \ldots, l-1$ and $j = 0, \ldots, \lfloor (d-1)/2 \rfloor$.
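For reference, here is a small NumPy sketch of how I understand this encoding (the function name and shapes are just my own illustration):

```python
import numpy as np

def sinusoidal_positional_encoding(l, d):
    """Build the l x d matrix P with
    P[i, 2j] = sin(i / 10000^(2j/d)) and P[i, 2j+1] = cos(i / 10000^(2j/d))."""
    P = np.zeros((l, d))
    i = np.arange(l)[:, None]                 # positions 0 .. l-1, shape (l, 1)
    k = np.arange(0, d, 2)                    # even dimension indices k = 2j
    angle = i / np.power(10000.0, k / d)      # exponent k/d equals 2j/d in the formula
    P[:, 0::2] = np.sin(angle)
    P[:, 1::2] = np.cos(angle[:, : d // 2])   # slice handles odd d
    return P

# The layer then outputs P + X for an input X of shape (l, d).
```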
I understand the transformation across the time dimension $i$, but why do we need the transformation across the embedding size dimension $j$? Since we are adding the position, wouldn't a sine-cosine encoding over just the time dimension be sufficient to encode the position?
EDIT
Answer 1 – Making the embedding vector independent from the "embedding size dimension" would lead to having the same value in all positions, and this would reduce the effective embedding dimensionality to 1.
I still don’t understand how the embedding dimensionality will be reduced to 1 if the same positional vector is added. Say we have an input $X$ of zeros with 4 dimensions ($d_0, d_1, d_2, d_3$) and 3 time steps ($t_0, t_1, t_2$):
$$
\begin{matrix}
    & d_0 & d_1 & d_2 & d_3 \\
t_0 & 0 & 0 & 0 & 0 \\
t_1 & 0 & 0 & 0 & 0 \\
t_2 & 0 & 0 & 0 & 0
\end{matrix}
$$
If $d_0$ and $d_2$ are the same vectors $[0, 0, 0]$, and the meaning of position (i.e. the time step) is the same, why do they need to have different positional values? Why can’t $d_0$ and $d_2$ be the same after positional encoding if the input $d_0$ and $d_2$ are the same?
As for the embedding dimensionality reducing to 1, I don’t see why that would happen. Isn’t the embedding dimensionality dependent on the input matrix $X$? If I add constants to it, the dimensionality will not change, no?
I may be missing something more fundamental here and would like to know where I am going wrong.
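To make the toy example concrete, here is a sketch (my own illustrative NumPy code) of what actually gets added to the all-zero $X$ for $l = 3$, $d = 4$:

```python
import numpy as np

l, d = 3, 4
i = np.arange(l)[:, None]            # time steps t_0, t_1, t_2
k = np.arange(0, d, 2)               # even embedding indices 2j
angle = i / np.power(10000.0, k / d)
P = np.zeros((l, d))
P[:, 0::2] = np.sin(angle)
P[:, 1::2] = np.cos(angle)

X = np.zeros((l, d))                 # the all-zero input from the table above
print(np.round(X + P, 3))
# approximately:
# [[ 0.     1.     0.     1.   ]
#  [ 0.841  0.54   0.01   1.   ]
#  [ 0.909 -0.416  0.02   1.   ]]
```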
A multi-head attention layer of the Transformer architecture performs computations that are position-independent. This means that, if the same inputs are received at two different positions, the attention heads in the layer would return the same value at the two positions.
Note that this is different from LSTMs and other recurrent architectures which, apart from the input, receive the state from the previous time step.
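One way to see this position-independence concretely (a sketch using PyTorch's `nn.MultiheadAttention`; the shapes and names are only illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

# One sequence of length 4 in which positions 0 and 2 hold the SAME token vector.
tok = torch.randn(8)
others = torch.randn(2, 8)
x = torch.stack([tok, others[0], tok, others[1]]).unsqueeze(0)  # (batch=1, seq=4, dim=8)

out, _ = attn(x, x, x)   # self-attention with no positional information added
# Without positional encodings, identical inputs yield identical outputs:
print(torch.allclose(out[0, 0], out[0, 2]))   # True
```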
The role of positional embeddings is to supply information regarding the position of each token. This allows the attention layer to compute results that are position-dependent, that is, two tokens with the same value at different positions in the input sentence would get different representations.
Positional embeddings can be handled as "normal" embedding matrices and therefore can be trained with the rest of the network. These are "trainable positional embeddings". With this kind of positional embedding, after each training step, the positional embedding matrix is updated together with the rest of the parameters.
However, we can obtain the same level of performance (translation quality, perplexity, or whatever other measure being used) if, instead of training the positional embeddings, we used the formula proposed in the original transformer paper.
This saves us from having to train a very big embedding matrix.
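A minimal sketch of such a trainable positional embedding, assuming PyTorch (the class name and the `max_len` parameter are my own):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """One trainable d_model-dimensional vector per position, updated by
    backpropagation together with the rest of the network's parameters."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)   # the (potentially very big) matrix

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos(positions)              # broadcasts over the batch
```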
We need different values in each position of the embedded vector. Having the same value in each position of the vector would leave us with an "effective" embedding size of 1, as we are wasting the other $d-1$ positions.
In order to compute different values for each position of the embedded vector, we need an independent variable from which to compute the value at each position. We don't have any other suitable variable but the position itself; that's why it is used in the formula.
Making the embedding vector independent from the "embedding size dimension" would lead to having the same value in all positions, and this would reduce the effective embedding dimensionality to 1.
The formula uses the embedding size dimension to be able to provide different values within each embedded vector.
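To put a number on "effective embedding dimensionality" (using matrix rank as a rough proxy; this framing and the code are my own illustration):

```python
import numpy as np

l, d = 6, 4
i = np.arange(l)[:, None]
k = np.arange(0, d, 2)                      # even embedding indices 2j

# Sinusoidal encoding: values vary across BOTH the position i and the dimension j.
angle = i / np.power(10000.0, k / d)
P_sin = np.zeros((l, d))
P_sin[:, 0::2] = np.sin(angle)
P_sin[:, 1::2] = np.cos(angle)

# "Time-only" encoding: one number per position, copied across all d dimensions.
P_time_only = np.repeat(np.sin(i / 10000.0), d, axis=1)

print(np.linalg.matrix_rank(P_time_only))   # 1 -> effectively a single dimension
print(np.linalg.matrix_rank(P_sin))         # 4 -> all d dimensions carry information
```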
Correct answer by ncasas on January 25, 2021