
Next sentence prediction in RoBERTa

Data Science – Asked on August 16, 2021

I’m trying to wrap my head around the way next sentence prediction works in RoBERTa. Based on their paper, in section 4.2, I understand that in the original BERT they used a pair of text segments, which may contain multiple sentences, and the task is to predict whether the second segment is the direct successor of the first one. RoBERTa’s authors proceed to examine three more input formats – the first one is basically the same as BERT, only using two sentences instead of two segments, and you still predict whether the second sentence is the direct successor of the first one. But I can’t understand what the goal is in the other two. I will cite their explanation below:

• FULL-SENTENCES: Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When we reach the end of one document, we begin sampling sentences from the next document and add an extra separator token between documents. We remove the NSP loss.

• DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so we dynamically increase the batch size in these cases to achieve a similar number of total tokens as FULL-SENTENCES. We remove the NSP loss.

So, from what I understand, in these two training strategies they already sample consecutive sentences, or at least consecutive sentences from neighbouring documents, and I can’t see what they are trying to predict – it can’t be whether they’re consecutive text blocks, because all of their training examples seem to have been sampled contiguously, which would make such a task redundant. It would be of enormous help if someone could shed some light on the issue, thanks in advance!
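For concreteness, here is how I picture the packing described in the two quoted strategies – a rough sketch of my own, not code from the paper (the SEP id, MAX_LEN, pack_inputs and the toy documents are just illustrative assumptions):

```python
# Rough sketch of FULL-SENTENCES / DOC-SENTENCES packing (my own illustration).
# `docs` is a toy corpus: a list of documents, each a list of pre-tokenized
# sentences (lists of token ids); SEP and MAX_LEN are assumed values.

SEP = 2
MAX_LEN = 512

def pack_inputs(docs, cross_doc_boundaries):
    inputs, current = [], []
    for doc in docs:
        for sent in doc:
            if current and len(current) + len(sent) > MAX_LEN:
                inputs.append(current)          # emit one full ~512-token input
                current = []
            current.extend(sent)
        # end of a document
        if cross_doc_boundaries:
            current.append(SEP)                 # FULL-SENTENCES: add a separator, keep packing
        elif current:
            inputs.append(current)              # DOC-SENTENCES: never cross the boundary,
            current = []                        # so inputs near a document end may be < 512 tokens
    if current:
        inputs.append(current)
    return inputs

docs = [[[5, 6, 7], [8, 9]], [[10, 11], [12]]]  # toy token-id "documents"
full_sentences = pack_inputs(docs, cross_doc_boundaries=True)
doc_sentences = pack_inputs(docs, cross_doc_boundaries=False)
```

The only difference between the two strategies in this sketch is whether packing is allowed to continue past a document boundary.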

3 Answers

Similarly to BERT, they sample negative (i.e., non-adjacent) examples and train a classifier that predicts whether the sentences are consecutive or not.
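Roughly, in code (my own sketch, not BERT's actual pipeline; the toy segments, the 0/1 label convention and the NSPHead name are assumptions for illustration):

```python
import random
import torch
import torch.nn as nn

# Toy corpus of consecutive segments (assumed, for illustration only).
segments = ["He went home.", "He made dinner.", "The cat slept.", "It rained all day."]

def make_nsp_pair(segments, i):
    """Return (first, second, label); label 0 = true successor, 1 = random segment."""
    first = segments[i]
    if random.random() < 0.5 and i + 1 < len(segments):
        return first, segments[i + 1], 0       # positive: the direct successor
    j = random.randrange(len(segments))        # negative: a randomly drawn segment
    return first, segments[j], 1               # (a real pipeline would avoid j == i + 1)

# The classifier is just a binary head over the pooled [CLS] vector of the
# packed "first [SEP] second" input produced by the encoder.
class NSPHead(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)   # 2 classes: IsNext / NotNext

    def forward(self, pooled_cls, labels=None):
        logits = self.classifier(pooled_cls)
        if labels is None:
            return logits
        return nn.functional.cross_entropy(logits, labels), logits

first, second, label = make_nsp_pair(segments, 0)
head = NSPHead()
loss, logits = head(torch.randn(4, 768), torch.tensor([0, 1, 0, 1]))  # dummy pooled vectors
```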

Answered by Jindřich on August 16, 2021

BERT uses both the masked LM and NSP (Next Sentence Prediction) tasks to train its models. So one of the goals of section 4.2 in the RoBERTa paper is to evaluate the effectiveness of adding the NSP task and compare it to using masked LM training alone.

For the sake of completeness, I will briefly describe all the evaluations in the section.

First, they compare SEGMENT-PAIR+NSP and SENTENCE-PAIR+NSP. Both setups use masked LM + NSP training, and they find that

using individual sentences hurts performance on downstream tasks

i.e., SEGMENT-PAIR+NSP performs better than SENTENCE-PAIR+NSP.

Second, they remove the NSP task (and hence take contiguous sentences as input) and train the model with masked LM only. They add a small variation by allowing the sampled input sentences to cross document boundaries in one case (FULL-SENTENCES) but not in the other (DOC-SENTENCES). They report that

removing the NSP loss matches or slightly improves downstream task performance

by comparing DOC-SENTENCES and FULL-SENTENCES with SEGMENT-PAIR+NSP and SENTENCE-PAIR+NSP. They also find that

single document (DOC-SENTENCES) performs slightly better than packing sequences from multiple documents (FULL-SENTENCES)
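To make "removing the NSP loss" concrete at the objective level, here is a hedged sketch using the Hugging Face transformers classes rather than the authors' original training code (the toy tensors and checkpoint names are placeholders for illustration):

```python
import torch
from transformers import BertForPreTraining, RobertaForMaskedLM

batch, seq_len = 2, 16
input_ids = torch.randint(0, 30522, (batch, seq_len))        # toy packed inputs
token_type_ids = torch.zeros(batch, seq_len, dtype=torch.long)
mlm_labels = torch.full((batch, seq_len), -100)               # -100 = ignore unmasked positions
mlm_labels[:, 3] = input_ids[:, 3]                            # pretend position 3 was masked
nsp_labels = torch.tensor([0, 1])                             # 0 = IsNext, 1 = NotNext

# BERT-style objective: masked LM + NSP (the returned loss is their sum).
bert = BertForPreTraining.from_pretrained("bert-base-uncased")
bert_loss = bert(
    input_ids=input_ids,
    token_type_ids=token_type_ids,    # 0 for segment A, 1 for segment B
    labels=mlm_labels,
    next_sentence_label=nsp_labels,
).loss

# RoBERTa-style objective (FULL-SENTENCES / DOC-SENTENCES): masked LM only, no NSP head.
roberta = RobertaForMaskedLM.from_pretrained("roberta-base")
roberta_loss = roberta(input_ids=input_ids, labels=mlm_labels).loss
```

The comparison in the paper is essentially between these two objectives, applied to the different input formats described above.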

Answered by saiRegrefree on August 16, 2021

The NSP loss was introduced to improve the model's inter-sentence understanding, which is particularly observable on datasets that require reasoning across sentences, like SQuAD (question answering) or MNLI (sentence entailment). You can refer to the ablation studies in the original BERT paper for more details.

By removing the NSP loss and replacing it with the DOC-SENTENCES or FULL-SENTENCES strategies, RoBERTa's authors showed that MLM alone* (without NSP) is enough, and in fact slightly better, for the model to capture inter-sentence understanding. Again, you can refer to the ablation studies in the RoBERTa paper for more details.

*To be precise, RoBERTa performs dynamic MLM (the masked positions are re-sampled every time a sequence is fed to the model), whereas BERT uses static MLM (the masks are fixed once during preprocessing).
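As a hedged sketch of that difference, the Hugging Face data collator implements the dynamic variant: the masked positions are re-drawn every time a batch is built (the checkpoint name and example sentence are just for illustration):

```python
from transformers import RobertaTokenizerFast, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(["RoBERTa re-samples the masked positions on every pass over the data."])
features = [{"input_ids": encoded["input_ids"][0]}]

# Every call re-draws which tokens are masked, so the same sentence gets a
# different masking pattern each epoch; that is the "dynamic" part.
batch_1 = collator(features)
batch_2 = collator(features)
print(batch_1["input_ids"])
print(batch_2["input_ids"])
```

Static masking, by contrast, would apply this step once during preprocessing and reuse the same masked copy of each sentence for the whole training run.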

Answered by Mat A on August 16, 2021
