Data Science Asked by tanmay on May 16, 2021
I am working on sentiment analysis for Twitter data, for which I have used VADER to get an approximate sentiment label for each tweet. Along with that, I have used TF-IDF for feature extraction, and these feature words are what I use to train and test a Random Forest model. My dataset contains around 3K tweets, from which I extracted around 570 unique feature words using TF-IDF, and all of these features are used to train the Random Forest model. A rough sketch of the pipeline is shown below.
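A minimal sketch of that pipeline, assuming scikit-learn's TfidfVectorizer and RandomForestClassifier (the question does not name the libraries, and the tweets and labels here are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Placeholder data: in practice `tweets` would be ~3K raw tweet strings and
# `labels` the VADER-derived sentiment ("positive" / "negative" / "neutral").
tweets = ["great product, love it", "this is terrible", "just another day"]
labels = ["positive", "negative", "neutral"]

X_train_text, X_test_text, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, random_state=42
)

# Fit TF-IDF on the training tweets only; ~570 feature words as in the question.
vectorizer = TfidfVectorizer(max_features=570)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)  # reuse the fitted vocabulary

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```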
My query is regarding the usage of this trained model in the real world. What if new tweets, which this model has never seen, do not contain the feature words I used for training? Will the model fail to predict them correctly (in my case there are only 3 possible labels, viz. positive, negative and neutral)? If yes, then how should I handle this scenario?
Please let me know in case I am missing something or doing something wrong here.
TF-IDF is a vectorization technique used to convert documents (a single tweet, in your case, is a document) to vectors. After you fit the TF-IDF model, the only words/vocabulary it has learnt are those from the set of documents it was fitted on (aka the corpus, i.e. the entire set of 3K tweets).
Since you mentioned that there were 570 unique feature words after TF-IDF, that is the vocabulary your model has learnt. If you give this model a document whose words are all present in its vocabulary, it will vectorize it successfully. However, if one or more words in a new document are not present in the model's vocabulary, those words won't be included in the vectorization at all: words in a new sentence are given weights only if the model encountered them during fitting. In other words, every sentence is vectorized with respect to the model's vocabulary.
Ex -
Model's vocabulary - a, big, hat, have, I, mat
Input - "I have a big mat"
Vector - [TF-IDF weights in the order of the model's vocabulary; a vocabulary word that does not appear in the sentence gets weight 0, so "hat" is 0 here]
Input - "I have a dog"
Vector - [TF-IDF weights for "I", "have" and "a" in the same vocabulary order; every other position is 0]
Since "dog" wasn't part of the vocabulary, it's not included in the vectorization at all.
What if new tweets, which this model has never seen, do not contain the feature words I used for training? Will the model fail to predict them correctly (in my case there are only 3 possible labels, viz. positive, negative and neutral)?
I can't say much about the final prediction, but your feature vectors from TF-IDF can be way off if you expect it to vectorize documents containing many words the model hasn't been fitted on. Improper vectorization can affect your prediction accuracy.
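One practical way to gauge how severe this is (a hypothetical check, not part of the original answer) is to measure what fraction of tokens in incoming tweets falls outside the fitted vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def out_of_vocab_rate(vectorizer: TfidfVectorizer, texts: list) -> float:
    """Fraction of tokens in `texts` that the fitted vectorizer would drop."""
    analyzer = vectorizer.build_analyzer()  # same tokenisation the vectorizer uses
    vocab = set(vectorizer.vocabulary_)     # words learnt during fitting
    tokens = [tok for text in texts for tok in analyzer(text)]
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)

# Usage with a vectorizer already fitted on the training tweets:
# rate = out_of_vocab_rate(vectorizer, new_tweets)
# A high rate suggests the vocabulary no longer represents the incoming data well.
```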
If yes, then how should I handle this scenario?
My suggestion would be to consider the following -
Answered by Aishwarya A R on May 16, 2021