Data Science — Asked by Damjan Dakic on July 27, 2020
As far as I’ve seen, transformer-based architectures are always trained with classification tasks (one-hot text tokens for example). Are you aware of any architectures using attention and solving regression tasks? Could one build a regressive auto-encoder for example? How would normalization fit into this (as LayerNorm destroys some of the information from the input)?
In the simplest case, doing regression with Transformers is just a matter of changing the loss function.
BERT-like models use the representation of the first special token ([CLS] in BERT) as the input to the classifier. You can replace the classifier with a regressor and pretty much nothing else changes. The error from the regressor gets propagated back through the rest of the network, so you can train the regressor and fine-tune (or train) the underlying Transformer at the same time.
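A minimal sketch of this idea (not code from the answer), assuming the Hugging Face `transformers` library; the model name, layer sizes, and targets are illustrative. The only conceptual change from classification is the scalar output head and the MSE loss:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertRegressor(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # Single scalar output instead of a layer of class logits.
        self.regressor = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0]   # representation of the first ([CLS]) token
        return self.regressor(cls_repr).squeeze(-1)  # one real value per example

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertRegressor()
loss_fn = nn.MSELoss()  # regression loss instead of cross-entropy

batch = tokenizer(["a short example", "another example"],
                  padding=True, return_tensors="pt")
targets = torch.tensor([0.7, 3.2])  # made-up continuous targets

preds = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(preds, targets)
loss.backward()  # gradients flow into both the regressor and the encoder
```

If you prefer the ready-made heads, `BertForSequenceClassification` with `num_labels=1` behaves the same way and uses an MSE loss internally.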
Also, I don't think that layer normalization causes severe information loss. It is already present while the network is trained, so the rest of the parameters learn to account for it; this should not be a problem because the gradients "know very well" that there was a normalization layer.
Answered by Jindřich on July 27, 2020