Is it a best practice to exclude retweets from the data set?

Question

I am going to build a machine learning model to identify fake tweets. The data set contains a large number of retweets, which I think might be an issue. Given that the focus is the original tweet, is it better to remove all the retweets?

Thank you,

machine learning pandas python supervised learning

Michael Hearn · Answer

No, I do not believe so, for a few reasons.

If an entity wants to make waves on Twitter with false tweets, retweets are probably a part of the plan.
If you want to detect tweets generated by bots, statistics on the tweets and their retweets, such as timestamps, could be relevant to detecting whether a tweet was generated by a bot.
If you have a way of checking for retweets by bots, then removing all retweets would also remove that data.
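
As a rough illustration of the timestamp point, here is a minimal pandas sketch that turns retweets into per-tweet timing features instead of discarding them. The column names (`tweet_id`, `retweet_of`, `created_at`) and the toy rows are assumptions for the example, not a real Twitter schema, and the chosen features are only one possible option.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions.
tweets = pd.DataFrame({
    "tweet_id": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "retweet_of": [None, "t1", "t1", "t1", None, "t5"],
    "created_at": pd.to_datetime([
        "2021-01-01 12:00:00",
        "2021-01-01 12:00:02",
        "2021-01-01 12:00:03",
        "2021-01-01 12:00:05",
        "2021-01-01 13:00:00",
        "2021-01-01 15:30:00",
    ]),
})

originals = tweets[tweets["retweet_of"].isna()]
retweets = tweets[tweets["retweet_of"].notna()]

# Attach to each retweet the timestamp of the tweet it repeats.
joined = retweets.merge(
    originals[["tweet_id", "created_at"]],
    left_on="retweet_of",
    right_on="tweet_id",
    suffixes=("", "_orig"),
)
joined["delay_s"] = (
    joined["created_at"] - joined["created_at_orig"]
).dt.total_seconds()

# One row per original tweet: how many retweets, and how fast they arrived.
features = joined.groupby("retweet_of")["delay_s"].agg(
    retweet_count="count",
    median_delay_s="median",
)
print(features)
```

A burst of retweets within a few seconds of the original, as for `t1` here, is the kind of signal that would be lost if all retweets were dropped up front. Whether it actually indicates bots would have to be checked against labeled data.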

You should remove retweets if:

The project is focused on analysis of the text to determine whether a tweet is from a bot or not.
There is no labeled human or bot retweet data.
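
If you do go that route, dropping retweets in pandas could look roughly like the sketch below. The `is_retweet` and `text` column names are assumptions about your data set, and the `"RT @"` prefix check is only a fallback heuristic for old-style retweets, so it may miss quote tweets or catch manual "RT" posts.

```python
import pandas as pd

def drop_retweets(df, flag_col="is_retweet", text_col="text"):
    """Return a copy of df without retweets.

    Uses a boolean flag column when present (assumed name: is_retweet),
    otherwise falls back to the "RT @" text prefix heuristic.
    """
    if flag_col in df.columns:
        mask = df[flag_col].astype(bool)
    else:
        mask = df[text_col].str.startswith("RT @", na=False)
    return df[~mask].reset_index(drop=True)

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "text": [
        "Vaccines contain microchips",
        "RT @someone: Vaccines contain microchips",
        "City council meets on Tuesday",
        "RT @other: City council meets on Tuesday",
    ],
})

originals_only = drop_retweets(df)
print(originals_only)
```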

Uday T · Answer

There might be a chance that the retweet has an entirely different context compared to the original tweet.
It is also possible that some retweets with a different opinion or comment gain more popularity than the original one.

In these cases, I don't think you can classify them as fake tweets.

You can classify tweets as fake when they are widely retweeted but with no context.
One such example is retweets driven by a giveaway or charity.

If you can figure out how to separate the spam retweets from the original tweets, it would help produce better analysis and more accurate results.

BeamsAdept · Answer

To me it depends on what you want to focus on. Do you want to create a model that deals with original posts that are fake news, and then build an algorithm that finds the original from a retweet and applies your model to it? Or do you just want a model that takes one tweet, without looking at whether it is a retweet or not, and tries to guess whether it is fake?
In the first case, you should remove them, because otherwise you will have a lot of information about the people retweeting fake news while you only want information about the original posters, which will make your model biased.
In the second case, of course, you should keep them, since that is exactly what your model aims to do.
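
For the first case, a minimal sketch of "find the original, then apply the model" could look like this. The `tweet_id` and `retweet_of` columns are assumed names, and `toy_model` is a hypothetical keyword rule standing in for whatever trained classifier you end up with.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions.
tweets = pd.DataFrame({
    "tweet_id": ["a", "b", "c", "d"],
    "retweet_of": [None, "a", "a", None],
    "text": [
        "Miracle cure found, doctors hate it",
        "RT @user_a: Miracle cure found, doctors hate it",
        "RT @user_a: Miracle cure found, doctors hate it",
        "City council meets on Tuesday",
    ],
})

def toy_model(text):
    # Placeholder for a real classifier: 1 = predicted fake, 0 = not fake.
    return int("miracle" in text.lower())

# Resolve every row to the id of the original post it stems from.
tweets["root_id"] = tweets["retweet_of"].fillna(tweets["tweet_id"])

# Train and predict on original posts only.
originals = tweets[tweets["retweet_of"].isna()].copy()
originals["pred_fake"] = originals["text"].map(toy_model)

# Propagate each original's prediction back to its retweets.
pred_by_id = originals.set_index("tweet_id")["pred_fake"]
tweets["pred_fake"] = tweets["root_id"].map(pred_by_id)
print(tweets[["tweet_id", "root_id", "pred_fake"]])
```

This way the model only ever sees one copy of each post, so a heavily retweeted tweet does not dominate training, while retweets still receive a label through their original.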
