Data Science Asked on May 3, 2021
I am doing experiments on the BERT architecture and found that most fine-tuning tasks take the final hidden layer as the text representation and later pass it to other models for further downstream tasks.
BERT's last layer looks like this (image omitted), where we take the [CLS] token of each sentence:
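To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the standard bert-base-uncased checkpoint) of how the last hidden layer and the [CLS] vector are usually extracted:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(["This movie was great!"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, seq_len, hidden_size)
last_hidden = outputs.last_hidden_state
# The [CLS] token is always the first position in the sequence.
cls_vector = last_hidden[:, 0, :]  # (batch_size, hidden_size)
```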
I went through many discussions on a Hugging Face issue, a Data Science forum question, and a GitHub issue. Most data scientists give this explanation:
BERT is bidirectional, so the [CLS] token is encoded with representative information from all tokens through the multi-layer encoding procedure. The representation of [CLS] is therefore different for different sentences.
My question is: why did the authors ignore the other information (each token's vector) and use only the [CLS] token for classification, rather than taking the average, max pooling, or some other method that makes use of all the token vectors?
How does the [CLS] token help compared to the average of all token vectors?
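For reference, the alternative the question describes looks roughly like this (a continuation of the sketch above; the attention mask is used so that padding positions do not distort the average):

```python
# Masked mean over all token vectors instead of using only [CLS].
mask = inputs["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
summed = (last_hidden * mask).sum(dim=1)                  # sum over real tokens
mean_pooled = summed / mask.sum(dim=1).clamp(min=1e-9)    # (batch, hidden_size)

# Either cls_vector or mean_pooled can be fed to a downstream classifier.
```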
It's because you need to fine-tune BERT for your specific task anyway. You can train it to classify based on the [CLS] token, on the mean of the token outputs, or on whatever pooling you like.
In essence, the [CLS] token of the last layer attends to all of the other tokens in the previous layer, so does it make sense to average manually?
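Since the whole model is fine-tuned end to end, either pooling strategy can be learned. Here is a hedged sketch of a classification head supporting both (the class name BertClassifier and the pooling argument are my own illustration, not part of the original answer):

```python
import torch.nn as nn

class BertClassifier(nn.Module):
    """Illustrative head; `pooling` switches between [CLS] and masked mean pooling."""
    def __init__(self, encoder, num_labels, pooling="cls"):
        super().__init__()
        self.encoder = encoder
        self.pooling = pooling
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        if self.pooling == "cls":
            pooled = hidden[:, 0, :]                      # [CLS] vector
        else:
            mask = attention_mask.unsqueeze(-1).float()   # masked mean pooling
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.head(pooled)
```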
Answered by roman on May 3, 2021