
In a Transformer model, why does one sum positional encoding to the embedding rather than concatenate it?

Data Science Asked by FremyCompany on April 7, 2021

While reviewing the Transformer architecture, I realized something I didn't expect, which is that:

  • the positional encoding is summed with the word embeddings
  • rather than concatenated to them.

[Image: positional encoding summed to the word embedding (http://jalammar.github.io/images/t/transformer_positional_encoding_example.png)]

Based on the graphs I have seen of what the encoding looks like, that means that:

  • the first few bits of the embedding are completely unusable by the network because the position encoding will distort them a lot,
  • while a large number of positions in the embedding are only slightly affected by the positional encoding (as you move further towards the end).

[Image: graph showing that the positional encoding affects the first dimensions a lot and the last ones hardly at all (https://www.tensorflow.org/beta/tutorials/text/transformer_files/output_1kLCla68EloE_1.png)]

So, why not instead use smaller word embeddings (reducing memory usage) together with a smaller positional encoding that retains only the most important bits of the encoding, and concatenate the positional encoding to the word embeddings instead of summing them?
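For concreteness, here is a minimal NumPy sketch of the two options being compared (the sizes 512 and 64 below are just illustrative choices, not values taken from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as in "Attention Is All You Need"."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # sine on even indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # cosine on odd indices
    return pe

seq_len, d_model, d_pos = 10, 512, 64
word_embeddings = np.random.randn(seq_len, d_model)    # stand-in for learned embeddings

# What the Transformer paper does: element-wise sum, dimensionality stays d_model.
summed = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(summed.shape)                                    # (10, 512)

# What the question proposes: a smaller word embedding concatenated with a
# smaller positional encoding, for the same total width.
small_embeddings = np.random.randn(seq_len, d_model - d_pos)
concatenated = np.concatenate(
    [small_embeddings, sinusoidal_positional_encoding(seq_len, d_pos)], axis=-1)
print(concatenated.shape)                              # (10, 512)
```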

4 Answers

When you concatenate, you have to define a priori the size of each vector to be concatenated. This means that, if we were to concatenate the token embedding and the positional embedding, we would have to define two dimensionalities, $d_t$ for the token and $d_p$ for the position, with the total dimensionality $d = d_t + d_p$, so $d>d_t$ and $d>d_p$. We would be decreasing the total size we devote to tokens in favor of positional information.

However, adding them together is potentially a superset of concatenation: imagine that there is an ideal split of $d$ into $d_t$ and $d_p$ in terms of minimizing the loss; then, training could converge to token vectors that only use $d_t$ of the elements, leaving the rest at zero, and to position vectors that only use the complementary $d_p$ elements, likewise leaving the rest at zero. The sum of such vectors is exactly their concatenation.

Therefore, by adding them, we leave the decision of how to use the $d$ dimensions to the optimization process, instead of assuming there is an optimal partition of the vector components and introducing a new hyperparameter to tune. Moreover, the representation is not restricted by a hard split of the vector components; it can use the whole representation space.
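A toy NumPy sketch of that argument (the zero patterns below are a hypothetical outcome of training, not something the model is constrained to learn):

```python
import numpy as np

d_t, d_p = 5, 3                    # hypothetical split of d = 8 dimensions (toy sizes)

token_part = np.random.randn(d_t)  # the "useful" part of a token embedding
pos_part = np.random.randn(d_p)    # the "useful" part of a positional encoding

# Suppose training converged to vectors that occupy disjoint subspaces:
token_embedding = np.concatenate([token_part, np.zeros(d_p)])    # zeros in the last d_p slots
positional_encoding = np.concatenate([np.zeros(d_t), pos_part])  # zeros in the first d_t slots

# Then their sum is exactly the concatenation of the two parts:
assert np.allclose(token_embedding + positional_encoding,
                   np.concatenate([token_part, pos_part]))
```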

Correct answer by noe on April 7, 2021

So the question is about why positional embeddings are directly added to word embeddings instead of concatenated. This is a particularly interesting question. To answer it, I will first lay out the difference between sequential networks like RNNs and Transformers, which then introduces this problem nicely.

In RNNs, we feed the data (let's say a sequence of words) into the model sequentially. This means that, when inputting a sequence of words, the model arguably does obtain the order of the tokens, since they are fed in one by one.

With transformers, on the other hand, all of the words in the sequence are fed in all at once. This means that, so far, transformers do not have any notion of word ordering. Therefore, we need positional embeddings to tell the model where each word belongs in the sequence.


I believe the reason we add them to the word embeddings is that we want to keep the input to the model similar to that of an RNN, which also takes word embeddings as its input. I think your question is a very good one to ask; maybe you should experiment with concatenating a more compressed word embedding with its positional embedding, compare that against the more "traditional" approach, and see what results you get. I'll be excited to see them.
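For anyone who wants to run that comparison, a minimal PyTorch sketch of the two input layers could look like the following (learned positional embeddings for simplicity; the class and argument names are my own, not from any reference implementation):

```python
import torch
import torch.nn as nn

class SummedInput(nn.Module):
    """Standard Transformer-style input: token embedding + positional embedding."""
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, token_ids):                          # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)   # (batch, seq_len, d_model)

class ConcatenatedInput(nn.Module):
    """The variant proposed in the question: smaller embeddings, concatenated."""
    def __init__(self, vocab_size, max_len, d_token, d_pos):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_token)
        self.pos = nn.Embedding(max_len, d_pos)

    def forward(self, token_ids):                          # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        pos = self.pos(positions).expand(token_ids.size(0), -1, -1)
        return torch.cat([self.tok(token_ids), pos], dim=-1)  # (batch, seq_len, d_token + d_pos)
```

If d_model equals d_token + d_pos, both variants produce inputs of the same width, so the rest of the Transformer stack can stay unchanged for the comparison.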

Answered by shepan6 on April 7, 2021

It has been a while, but I think anyone ending up here might also be interested in reading the following paper:

What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding (Yu-An Wang, Yun-Nung Chen)

https://www.aclweb.org/anthology/2020.emnlp-main.555

I am not changing the accepted answer, as this article does not specifically address this question.

Answered by FremyCompany on April 7, 2021

"the first few bits of the embedding are completely unusable by the network because the position encoding will distort them a lot"

This confused me very much at first, because I was thinking of the model using a pre-trained word embedding, in which case an arbitrary initial chunk of that embedding would get severely tampered with by the positional encoding.

However, in the original transformer model at least, the embedding was trained from scratch, so this does not apply. An initial chunk of the overall embedding will be used for positional information, and the rest will be used for word information.

This still doesn't explain why we use this method instead of concatenation -- see the other answers for that -- but it does explain why the method isn't crazy.

That said, it may be that the method works well even with pre-trained word embeddings; I don't know. If so, it's hard to explain.

Answered by Denziloe on April 7, 2021
