Data Science Asked on May 3, 2021
I am doing experiments on the BERT architecture and found that most fine-tuning tasks take the final hidden layer as the text representation and later pass it to other models for further downstream tasks.
BERT's last layer looks like this (image omitted), where we take the [CLS] token of each sentence:
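To make this concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the standard bert-base-uncased checkpoint) of how the last hidden layer and the [CLS] vector are usually extracted:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(["This movie was great!"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, seq_len, hidden_size)
last_hidden = outputs.last_hidden_state
# The [CLS] token is always the first position in the sequence.
cls_vector = last_hidden[:, 0, :]  # (batch_size, hidden_size)
```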
I went through many discussions on a Hugging Face issue, a Data Science forum question, and a GitHub issue. Most data scientists give this explanation:
BERT is bidirectional, so the [CLS] token is encoded with representative information from all tokens through the multi-layer encoding procedure. The representation of [CLS] is therefore different for different sentences.
My question is: why did the authors ignore the other information (each token's vector) and use only the [CLS] token for classification, rather than taking the average, max pooling, or some other method that makes use of all the token vectors?
How does the [CLS] token help compared to the average of all token vectors?
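For reference, the alternative the question describes looks roughly like this (a continuation of the sketch above; the attention mask is used so that padding positions do not distort the average):

```python
# Masked mean over all token vectors instead of using only [CLS].
mask = inputs["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
summed = (last_hidden * mask).sum(dim=1)                  # sum over real tokens
mean_pooled = summed / mask.sum(dim=1).clamp(min=1e-9)    # (batch, hidden_size)

# Either cls_vector or mean_pooled can be fed to a downstream classifier.
```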
It's because you need to fine-tune BERT for your specific task anyway. You can train it to classify based on the [CLS] token, on the mean of the token outputs, or on whatever pooling you like.
In essence, the [CLS] token of the last layer attends to all of the other tokens in the previous layer, so does it make sense to average manually?
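Since the whole model is fine-tuned end to end, either pooling strategy can be learned. Here is a hedged sketch of a classification head supporting both (the class name BertClassifier and the pooling argument are my own illustration, not part of the original answer):

```python
import torch.nn as nn

class BertClassifier(nn.Module):
    """Illustrative head; `pooling` switches between [CLS] and masked mean pooling."""
    def __init__(self, encoder, num_labels, pooling="cls"):
        super().__init__()
        self.encoder = encoder
        self.pooling = pooling
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        if self.pooling == "cls":
            pooled = hidden[:, 0, :]                      # [CLS] vector
        else:
            mask = attention_mask.unsqueeze(-1).float()   # masked mean pooling
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.head(pooled)
```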
Answered by roman on May 3, 2021