
What are some key strengths of BERT over ELMo/ULMFiT?

Asked by Akshay on Data Science Stack Exchange, December 12, 2020

I see the BERT family being used as a benchmark everywhere for NLP tasks. What are some key strengths of BERT over models like ELMo or ULMFiT?

2 Answers

The reason you're seeing BERT and its derivatives used as benchmarks is probably that it is newer than the other models mentioned and shows state-of-the-art performance on many NLP tasks. Thus, when researchers publish new models, they normally want to compare them to the current leading models (i.e., BERT). I don't know if there has been a systematic study of BERT's strengths compared to the other methods, but looking at their differences might give some insight:

Truly Bidirectional
BERT is deeply bidirectional thanks to its novel masked language modeling technique. ELMo, on the other hand, uses a concatenation of right-to-left and left-to-right LSTMs, and ULMFiT uses a unidirectional LSTM. Having bidirectional context should, in theory, generate more accurate word representations.
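As a quick illustration (a minimal sketch, assuming the Hugging Face `transformers` library is installed), the masked language modeling objective can be seen directly with the fill-mask pipeline:

```python
from transformers import pipeline

# The fill-mask pipeline reproduces BERT's masked-language-modeling
# objective: predict the [MASK] token from its surrounding context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Both the left context ("Paris is the") and the right context
# ("of France") inform the prediction simultaneously.
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```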

Model Input
BERT tokenizes words into subwords (using WordPiece), and those are then given as input to the model. ELMo uses character-based input, and ULMFiT is word-based. It has been claimed that character-level language models don't perform as well as word-based ones, but word-based models suffer from out-of-vocabulary words. BERT's subword approach enjoys the best of both worlds.
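A minimal sketch of the subword behaviour, again assuming the Hugging Face `transformers` library: rare words get split into known pieces instead of becoming out-of-vocabulary tokens.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Common words usually stay whole, while rarer words are split into
# sub-word pieces; the "##" prefix marks a continuation piece.
print(tokenizer.tokenize("running"))     # e.g. ['running']
print(tokenizer.tokenize("embeddings"))  # e.g. ['em', '##bed', '##ding', '##s']
```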

Transformer vs. LSTM
At its heart, BERT uses transformers, whereas ELMo and ULMFiT both use LSTMs. Besides the fact that these two approaches work differently, it should also be noted that using transformers enables the parallelization of training, which is an important factor when working with large amounts of data.
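To make the parallelization point concrete, here is a minimal PyTorch sketch (illustrative only, not either model's actual code): an LSTM must step through the sequence one position at a time, while self-attention processes all positions in one batched matrix operation.

```python
import torch
import torch.nn as nn

batch, seq_len, dim = 32, 128, 256
x = torch.randn(batch, seq_len, dim)

# LSTM: each hidden state depends on the previous timestep, so the
# 128 positions are processed sequentially under the hood.
lstm = nn.LSTM(dim, dim, batch_first=True)
out_lstm, _ = lstm(x)

# Self-attention (the core of a transformer layer): every position
# attends to every other position in one batched matrix multiply,
# so all 128 positions are computed at once and parallelize on GPUs.
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
out_attn, _ = attn(x, x, x)
```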

The list goes on, with things such as the corpus each model was trained on, the tasks used for pre-training, and more. So while it is true that BERT shows SOTA performance across a variety of NLP tasks, there are times when other models perform better. Therefore, when you're working on a problem, it is a good idea to test a few of them and see for yourself which one suits your needs better.
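A minimal sketch of what "test a few of them" can look like in practice, assuming the Hugging Face `transformers` library; the checkpoint names below are just illustrative sentiment models from the Hub, not a recommendation:

```python
from transformers import pipeline

candidates = [
    "distilbert-base-uncased-finetuned-sst-2-english",
    "nlptown/bert-base-multilingual-uncased-sentiment",
]

texts = ["The service was excellent.", "I would not recommend this."]

for name in candidates:
    classifier = pipeline("text-classification", model=name)
    print(name)
    for result in classifier(texts):
        # Label schemes differ between checkpoints (POSITIVE/NEGATIVE
        # vs. 1-5 stars), so judge each model on your own data.
        print(" ", result["label"], round(result["score"], 3))
```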

Answered by Gad on December 12, 2020

BERT uses the Transformer architecture, so parallelization can be very helpful, whereas the others (ELMo and ULMFiT) use LSTMs. BERT has state-of-the-art performance on many NLP tasks.

But I've heard that AraBERT is less performant than hULMounA when it comes to Arabic sentiment analysis; correct me if I'm wrong, please.

Answered by BOLICHE AHMED on December 12, 2020
