
Should I keep common stop-words when preprocessing for word embedding?

Data Science Asked on June 7, 2021

If I want to construct a word embedding by predicting a target word given context words, is it better to remove stop words or keep them?

the quick brown fox jumped over the lazy dog

or

quick brown fox jumped lazy dog

As a human, I feel like keeping the stop words makes the sentence easier to understand, even though they are superfluous.

So what about for a neural network?
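
For concreteness, here is a minimal sketch of how the choice plays out in a CBOW-style setup, where removing stop words changes which (context, target) training pairs the network sees. The window size and stop-word list are illustrative assumptions, not taken from any particular library:

    # Illustrative stop-word list and window size, chosen for this example only.
    STOP_WORDS = {"the", "over"}

    def cbow_pairs(tokens, window=2):
        # Yield one (context, target) pair per position in the token list.
        for i, target in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            yield context, target

    sentence = "the quick brown fox jumped over the lazy dog".split()
    for variant in (sentence, [t for t in sentence if t not in STOP_WORDS]):
        print(" ".join(variant))
        for context, target in cbow_pairs(variant):
            print(f"  {target!r} <- {context}")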

2 Answers

In general, stop-words can be omitted, since they do not contain any useful information about the content of your sentence or document.

The intuition is that stop-words are the most common words in a language and occur in nearly every document regardless of context. Therefore, they carry no valuable information that could hint at the content of a particular document.
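
In practice the removal step is a simple filter, for example against NLTK's standard English stop-word list (a minimal sketch; any comparable list works):

    # Requires NLTK; the stop-word list is downloaded on first use.
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    STOP_WORDS = set(stopwords.words("english"))

    tokens = "the quick brown fox jumped over the lazy dog".split()
    print([t for t in tokens if t not in STOP_WORDS])
    # -> ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']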

Correct answer by Tinu on June 7, 2021

It's not mandatory. Removing stopwords sometimes helps and sometimes doesn't; you should try both.

A case for keeping stopwords: they provide context for the user's intent. When you use a contextual model like BERT, all stopwords are kept so the model has enough context information; for example, the negation words (not, nor, never) appear on standard stop-word lists yet can invert the meaning of a sentence.

According to this paper:

Surprisingly, the stopwords received as much attention as non-stop words, but removing them has no effect in MRR performances.
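
To act on the "try both" advice, one rough sketch is to train the same word2vec model on both corpus variants and compare the resulting neighbours. Gensim is used here as one common choice, the corpus is a toy placeholder, and note how the second sentence loses its negation ("not") once stop words are removed:

    # Assumes gensim plus NLTK's English stop-word list; the toy corpus only
    # demonstrates the workflow, not meaningful embeddings.
    from gensim.models import Word2Vec
    from nltk.corpus import stopwords

    STOP_WORDS = set(stopwords.words("english"))
    corpus = [
        "the quick brown fox jumped over the lazy dog".split(),
        "the dog was not lazy at all".split(),
    ]
    corpus_no_stops = [[t for t in s if t not in STOP_WORDS] for s in corpus]

    for name, sents in (("with stops", corpus), ("without stops", corpus_no_stops)):
        model = Word2Vec(sents, vector_size=50, window=2, min_count=1, epochs=50)
        print(name, model.wv.most_similar("dog", topn=2))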

Answered by Soroush Faridan on June 7, 2021
