Data Science Asked by David Rein on January 20, 2021
I’m working through Attention Is All You Need, and I have a question about masking in the decoder. It’s stated that masking is used to ensure the model doesn’t attend to any tokens in the future (not yet predicted), so that it can be used autoregressively during inference.
I don’t understand how masking is used during inference. When the encoder is given an unseen sample with no ground truth output or prediction, it seems to me that there is nothing to mask, since there aren’t any output tokens beyond what the decoder has already produced. Is my understanding of masking correct?
Thanks!
The trick is that you do not need masking at inference time. The purpose of masking is to prevent the decoder states from attending to positions that correspond to tokens "in the future", i.e., tokens that will not be known at inference time because they will not have been generated yet. During training, the entire target sequence is fed to the decoder in parallel (teacher forcing), so the mask is what stops each position from peeking at later ones.
At inference time this is no longer a problem: decoding proceeds one token at a time, so at each step the decoder can only attend to the tokens that have already been generated. There are no future tokens to mask.
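For concreteness, here is a minimal PyTorch sketch contrasting the two regimes; the tensor shapes and the `causal_mask` helper are illustrative assumptions, not anything prescribed by the paper:

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # -inf strictly above the diagonal, 0 elsewhere: after adding this to
    # the scores, position i can only attend to positions j <= i.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

# Training: the whole target sequence is processed in parallel (teacher
# forcing), so the mask must hide the future positions.
seq_len = 5
scores = torch.randn(seq_len, seq_len)             # raw attention scores
train_weights = torch.softmax(scores + causal_mask(seq_len), dim=-1)
# train_weights is lower-triangular: zero weight above the diagonal.

# Inference: tokens are generated one at a time. At step t, the newest
# token's query only scores against the t tokens produced so far, so
# there is nothing in the future to mask.
t = 3                                              # tokens generated so far
step_scores = torch.randn(1, t)                    # query = newest token
step_weights = torch.softmax(step_scores, dim=-1)  # no mask needed
```

In other words, the triangular mask during training simply simulates the situation that arises naturally, one step at a time, during autoregressive decoding.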
Correct answer by Jindřich on January 20, 2021