Transformer-based architectures for regression tasks

Question

As far as I've seen, transformer-based architectures are always trained on classification tasks (one-hot text tokens, for example). Are you aware of any architectures that use attention and solve regression tasks? Could one build a regressive auto-encoder, for example? How would normalization fit into this, given that LayerNorm destroys some of the information from the input?

Jindřich · Answer

In the simplest case, doing regression with Transformers is just a matter of changing the loss function.
BERT-like models use the representation of the first technical token (`[CLS]` in BERT) as the input to the classifier. You can replace the classifier with a regressor and pretty much nothing else will change. The error from the regressor gets propagated to the rest of the network, so you can both train the regressor and fine-tune/train the underlying Transformer.
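A minimal sketch of that swap, assuming PyTorch. The `TransformerRegressor` class, its sizes, and the random batch are hypothetical stand-ins rather than any particular pretrained model. The only regression-specific parts are the one-output linear head and the mean squared error loss.

```python
import torch
import torch.nn as nn


class TransformerRegressor(nn.Module):
    """Hypothetical small encoder with a regression head on the first token."""

    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # A classifier would map d_model -> num_classes; a regressor maps d_model -> 1.
        self.head = nn.Linear(d_model, 1)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.encoder(self.tok_emb(token_ids) + self.pos_emb(positions))
        first_token = hidden[:, 0]  # position 0 plays the role of the technical token
        return self.head(first_token).squeeze(-1)


torch.manual_seed(0)
model = TransformerRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # takes the place of a classifier's cross-entropy loss

token_ids = torch.randint(0, 1000, (8, 16))  # random stand-in batch of token ids
targets = torch.randn(8)                     # random real-valued targets

optimizer.zero_grad()
predictions = model(token_ids)
loss = loss_fn(predictions, targets)
loss.backward()   # the regression error flows through the head into the encoder
optimizer.step()  # updates the head and the encoder together
```

Everything below the head is the same as it would be for classification, so the same pattern applies when the encoder is a pretrained model being fine-tuned.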
Also, I don't think that layer normalization causes severe information loss. It is already in place when the network is trained, so the rest of the network's parameters learn to work with it. That should not be a problem, because the gradients are computed through the normalization layer and so "know very well" that it is there.
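To make concrete what layer normalization discards, here is a small sketch, again assuming PyTorch, with made-up input values. Without its learnable affine parameters, LayerNorm removes the mean and scale of each vector, so a vector and a shifted, rescaled copy of it map to (almost) the same output. The relative pattern of the features is kept.

```python
import torch
import torch.nn as nn

# LayerNorm over the last dimension, without learnable gain and bias.
layer_norm = nn.LayerNorm(4, elementwise_affine=False)

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
shifted_scaled = 3.0 * x + 5.0  # same pattern, different offset and scale

out_x = layer_norm(x)
out_shifted_scaled = layer_norm(shifted_scaled)

# The two outputs agree up to the small epsilon used for numerical stability.
same_after_norm = torch.allclose(out_x, out_shifted_scaled, atol=1e-4)

# A vector with a different pattern stays distinguishable after normalization.
y = torch.tensor([[4.0, 1.0, 3.0, 2.0]])
different_after_norm = not torch.allclose(out_x, layer_norm(y), atol=1e-4)

print(same_after_norm, different_after_norm)
```

In a Transformer with residual connections, each normalization acts on an intermediate representation, and the layers around it are trained with that normalization in place.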
