Data Science Asked on July 31, 2021
I am new to attention-based models and wanted to understand more about the attention mask in NLP models.
attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying-length sentences.
So a normal attention mask, for a particular sequence of length 5 with the last 2 tokens padded, is supposed to look like this: [1, 1, 1, 0, 0].
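For reference, a padded batch from a Huggingface tokenizer yields exactly this kind of 0/1 mask. A minimal sketch (the checkpoint name and sentences are just examples):

```python
from transformers import AutoTokenizer

# The checkpoint name below is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["a short sentence", "a much longer example sentence than the first"],
    padding=True,
    return_tensors="pt",
)
# Real tokens are marked 1, padding positions 0:
print(batch["attention_mask"])
```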
But can we have an attention mask like this: [1, 0.8, 0.6, 0, 0], where values lie between 0 and 1 to indicate that we still want the model to attend to those tokens, but with their influence on the model's output reduced in proportion to their lower attention weights (kind of like dealing with class imbalance, where we weight certain classes to compensate)?
Is this approach possible? Is there some other way to keep the model from fully using the information carried by certain tokens?
In theory maybe yes, but you would probably need to reimplement the model yourself.
In practice, with the current implementations, probably not. (Judging from the documentation snippet, you are using Huggingface Transformers.) The documentation says it expects a LongTensor, i.e., a tensor with integer values. Internally, the attention mask is used to compute sequence lengths by summing the mask along dimension 1; that would need to be fixed, and there may be many other places in the code that simply assume the mask values are zeros and ones.
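To make the "in theory" part concrete: implementations typically turn the 0/1 mask into an additive bias on the attention scores (0 for kept tokens, -inf for padded ones) before the softmax. A soft mask could be applied the same way by adding its logarithm, which multiplies each token's post-softmax attention weight by its mask value. A minimal sketch of such a custom attention function, not part of any library (the function name and shapes are made up for illustration):

```python
import torch
import torch.nn.functional as F

def soft_masked_attention(q, k, v, soft_mask):
    """q, k, v: [batch, seq_len, d]; soft_mask: [batch, seq_len], values in [0, 1]."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # [batch, seq_len, seq_len]
    # log(1) = 0 leaves a score untouched, log(~0) ~= -inf masks a token out,
    # and log(0.8) scales that token's post-softmax weight by 0.8.
    bias = torch.log(soft_mask.clamp(min=1e-9))           # clamp to avoid log(0)
    scores = scores + bias.unsqueeze(1)                   # broadcast over query positions
    return F.softmax(scores, dim=-1) @ v
```

Since softmax(scores + log m) is proportional to m * exp(scores), each token's attention probability is scaled by its mask value and then renormalized, which is exactly the downweighting you describe. It also shows why the existing code would break: with fractional values, attention_mask.sum(dim=1) no longer yields integer sequence lengths.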
Answered by Jindřich on July 31, 2021