
Weights shared by different parts of a transformer model

Data Science Asked by user105282 on July 5, 2021


Which parts of a Transformer share weights? For example, do all the encoders share the same weights, or do all the decoders share the same weights?

One Answer

Updated answer

The Transformer model has two parts: an encoder and a decoder.

Both the encoder and the decoder consist of a stack of attention layers. Each layer combines multi-head attention blocks, position-wise feed-forward layers, layer normalization, and residual connections.

The attention layers of the encoder and decoder are slightly different: the encoder only has self-attention blocks, while the decoder alternates self-attention with encoder-decoder attention blocks. Also, the decoder's self-attention blocks are masked to ensure causal predictions (i.e. the prediction for token N only depends on the previous N - 1 tokens, and not on future ones).

No parameters are shared between these attention blocks or across layers: each encoder layer and each decoder layer has its own weights.
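As an illustration, here is a minimal PyTorch sketch, not the code of any particular implementation; the sizes, the use of nn.TransformerEncoderLayer/nn.TransformerDecoderLayer, and the mask construction are assumptions made for the example. It shows the unmasked encoder self-attention, the masked decoder self-attention plus cross-attention over the encoder output, and the fact that every layer is instantiated with its own parameters:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6  # illustrative sizes

# Each constructor call creates a fresh set of weights, so no layer shares
# parameters with any other layer.
encoder_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
)
decoder_layers = nn.ModuleList(
    [nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
)

src = torch.randn(2, 10, d_model)  # (batch, source length, d_model), already embedded
tgt = torch.randn(2, 7, d_model)   # (batch, target length, d_model), already embedded

# Causal mask: position i in the target may only attend to positions <= i.
tgt_len = tgt.size(1)
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

memory = src
for layer in encoder_layers:       # encoder: unmasked self-attention only
    memory = layer(memory)

out = tgt
for layer in decoder_layers:       # decoder: masked self-attention + cross-attention over `memory`
    out = layer(out, memory, tgt_mask=causal_mask)

# Two different layers hold two different weight tensors (no sharing):
assert encoder_layers[0].linear1.weight.data_ptr() != encoder_layers[1].linear1.weight.data_ptr()
```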

Apart from that, there are other trainable elements that we have not mentioned: the source and target embeddings and the linear projection in the decoder before the final softmax.

The source and target embeddings can be shared or not. This is a design decision. They are normally shared if the token vocabulary is shared, which usually happens when the languages use the same script (e.g. the Latin alphabet). If your source and target languages are, for instance, English and Chinese, which have different writing systems, your token vocabularies would probably not be shared, and then the embeddings wouldn't be shared either.
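A minimal sketch of that design decision, assuming PyTorch-style nn.Embedding tables and made-up vocabulary sizes: with a joint vocabulary one table serves both sides, otherwise two independent tables are used.

```python
import torch.nn as nn

d_model = 512

def build_embeddings(src_vocab_size, tgt_vocab_size, shared_vocab):
    if shared_vocab:
        # e.g. two Latin-script languages with a joint subword vocabulary:
        # a single table is used for both source and target tokens.
        emb = nn.Embedding(src_vocab_size, d_model)
        return emb, emb
    # e.g. English-Chinese with separate vocabularies: independent tables.
    return nn.Embedding(src_vocab_size, d_model), nn.Embedding(tgt_vocab_size, d_model)

src_emb, tgt_emb = build_embeddings(32000, 32000, shared_vocab=True)
assert src_emb.weight is tgt_emb.weight  # literally the same parameter tensor
```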

Then, the linear projection before the softmax can share its weights with the output embedding matrix. This is also a design decision, and it is frequent to share them (this is often called weight tying).
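A sketch of that tying, again with assumed PyTorch modules and sizes: the pre-softmax linear projection simply reuses the target embedding matrix, so only one weight matrix is learned.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512

tgt_emb = nn.Embedding(vocab_size, d_model)
generator = nn.Linear(d_model, vocab_size, bias=False)
generator.weight = tgt_emb.weight          # tie the projection to the embedding matrix

decoder_output = torch.randn(2, 7, d_model)
logits = generator(decoder_output)         # (2, 7, vocab_size); softmax is applied afterwards
assert generator.weight is tgt_emb.weight  # one shared matrix, used in both places
```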

Finally, the positional embeddings (which can be either trainable or pre-computed) are shared for the source and target languages.
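For example (a pre-computed, sinusoidal variant; a trainable nn.Embedding over positions would be used the same way), one positional table is built once and added to both the source and the target embeddings:

```python
import math
import torch

def sinusoidal_positions(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = sinusoidal_positions(512, 512)                 # computed once, shared by both sides
src_embedded = torch.randn(2, 10, 512) + pe[:10]    # added to the source embeddings...
tgt_embedded = torch.randn(2, 7, 512) + pe[:7]      # ...and to the target embeddings
```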


Old Answer (mistakenly addressing multi-encoder/decoder transformers)

In general, parameters are not shared in multi-encoder/multi-decoder transformer architectures, in the sense that each encoder/decoder has its own parameters. This is because sharing parameters defeats the very purpose of having multiple encoders/decoders: if you have two encoders that share parameters, they are effectively the same encoder.

This, of course, is not a rule, and there may be cases where it makes sense to share some or all encoder/decoder parameters. An example could be a scenario where two sentences in the same language (or in two very similar languages, like dialects) are received as input; it may be beneficial to share parameters between both encoders if the amount of data for one of them is not sufficient.
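A minimal sketch of that special case, with assumed PyTorch modules: the same encoder object (hence the same weights) encodes both input sentences, in contrast to the usual setup where each input gets its own encoder.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6

shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=n_layers)

sent_a = torch.randn(2, 10, d_model)  # first input sentence, already embedded
sent_b = torch.randn(2, 12, d_model)  # second, similar-language input, already embedded

# Parameter sharing: both inputs go through the very same module and weights.
enc_a = shared_encoder(sent_a)
enc_b = shared_encoder(sent_b)
```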

Background: multi-encoder and/or multi-decoder Transformers are generally not used for "simple" sequence-to-sequence tasks like machine translation. There are certain special cases where this kind of architecture has been used for different purposes, e.g.:

  • Automatic Post-Editing (APE) consists of having a primary machine translation system whose output is refined by a secondary system that corrects the errors of the primary system. The secondary system normally receives as input both the original source sentence and the translation from the primary system, and generates the fixed translation as output. An option for this kind of scenario is a dual-encoder, single-decoder Transformer, where the outputs of both encoders are fed to the decoder, either by concatenating them (example) or by injecting them into different attention blocks (example); a minimal sketch of the concatenation variant is shown after this list. In this scenario, there is normally no parameter sharing at all.

  • Multimodal translation: in this case, we receive different data modalities (e.g. speech and text), so modality-specific architectures are needed and there is normally no parameter sharing.
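Below is a minimal sketch of the concatenation variant mentioned in the APE bullet, with assumed module names and sizes (not taken from any specific APE system): two independent encoders, one for the source and one for the primary system's translation, whose outputs are concatenated along the time axis before the decoder's cross-attention.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6

src_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
mt_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)  # no sharing with src_encoder
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)

src = torch.randn(2, 10, d_model)  # embedded source sentence
mt = torch.randn(2, 11, d_model)   # embedded translation from the primary system
tgt = torch.randn(2, 9, d_model)   # embedded (shifted) post-edited target

# Concatenate the two encoder memories so the decoder attends over both at once.
memory = torch.cat([src_encoder(src), mt_encoder(mt)], dim=1)  # (2, 21, d_model)

tgt_len = tgt.size(1)
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
out = decoder(tgt, memory, tgt_mask=causal_mask)
```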

Note that here I assume that the multiplicity of encoders and/or decoders is present both at training and at inference time. There are other cases where such multiplicity happens only at training time (as in multi-task learning), while at inference time only one encoder and one decoder are selected for each input, e.g.:

  • Multilingual translation: while many multilingual MT systems have a single model for all supported languages (normally supplying special tokens that identify the source/target language), it is possible to have language-specific encoders or decoders. You can see examples of this kind of architecture in works like this or this. In this scenario, each encoder/decoder has its own parameters and there is no sharing.
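As an illustration of this last point, a sketch with assumed module names and an illustrative language set: language-specific encoders and decoders live side by side, but only the pair needed for a given translation direction is selected at inference time, and no parameters are shared between them.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6
langs = ["en", "fr", "zh"]  # illustrative language set

encoders = nn.ModuleDict({
    lang: nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
    for lang in langs
})
decoders = nn.ModuleDict({
    lang: nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
    for lang in langs
})

def forward(src_states, tgt_states, src_lang, tgt_lang):
    # Only the selected encoder and decoder are used; each has its own parameters.
    memory = encoders[src_lang](src_states)
    return decoders[tgt_lang](tgt_states, memory)

out = forward(torch.randn(1, 10, d_model), torch.randn(1, 4, d_model), "en", "zh")
```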

Correct answer by noe on July 5, 2021
