Tweet Classification into topics - What to do with data

Question

Good evening,
First of all, I want to apologize if the title is misleading.
I have a dataset of around 60,000 tweets, with the date and time and the username for each. I need to classify them into topics. I am working on topic modelling with LDA, choosing the number of topics (correctly, I hope) with this R package, which calculates the values of three metrics ("CaoJuan2009", "Arun2010", "Deveaud2014"). Since I am very new to this, I have a few questions that might be obvious to some of you, but that I can't find answered online.

I have removed, before cleaning the data (removing mentions, stopwords, weird characters, numbers etc), all duplicate instances (having all three columns in common), in order to avoid them influencing the results of topic modelling. Is this right?

Should I, for the same reason mentioned before, remove also all retweets?

Until now, I thought about classifying using the "per-document-per-topic" probability. If I get rid of so many instances, do I have to classify them based on the "per-word-per-topic" probability?

Do I have to divide the dataset into testing and training? I thought that is a thing only in supervised training, since I cannot really use the testing dataset to measure quality of classification.

Another goal would be to classify twitterers based on the topic they are most passionate about. Do you have any idea how to implement this?

Thank you all very much in advance.

Erwan · Accepted Answer

As far as I'm aware, there is no correct/standard way to apply topic modelling; most decisions depend on the specifics of the case. So below I just give my opinion on these points:

I have removed, before cleaning the data (removing mentions, stopwords, weird characters, numbers etc), all duplicate instances (having all three columns in common), in order to avoid them influencing the results of topic modelling. Is this right?
Should I, for the same reason mentioned before, remove also all retweets?

In general there is no strict need to deduplicate the data; whether to do it depends on the goal. Duplicate documents inflate the proportion of the words which appear in them, and in turn the probability of the topics these documents are assigned to. If you want the model to integrate the notion of popularity/prominence of tweets/words/topics, it would probably make sense not to deduplicate and to keep retweets. However, if there is a large number of duplicates/retweets, the imbalance might make less frequent tweets/words less visible, possibly causing less diverse topics (the smallest topics might get merged together, for instance).
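If you do decide to deduplicate, the two filters discussed above could look roughly like the sketch below. It is only an illustration in plain Python: the sample rows are made up, and the `RT @` prefix test is an assumption about how retweets appear in your data (your export may have an explicit retweet flag instead).

```python
# Hypothetical rows: (text, timestamp, username). The values are made up.
tweets = [
    ("Big news about the election", "2020-01-01 10:00", "alice"),
    ("Big news about the election", "2020-01-01 10:00", "alice"),  # exact duplicate
    ("RT @alice: Big news about the election", "2020-01-01 10:05", "bob"),
    ("Football tonight!", "2020-01-01 11:00", "carol"),
]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each (text, timestamp, username) row."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def is_retweet(text):
    # Assumption: retweets are marked with a leading "RT @".
    return text.startswith("RT @")


deduplicated = drop_exact_duplicates(tweets)
without_retweets = [row for row in deduplicated if not is_retweet(row[0])]
```

Fitting the model once on `deduplicated` and once on `without_retweets` is a cheap way to see how much the retweets actually change the topics.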

Until now, I thought about classifying using the "per-document-per-topic" probability. If I get rid of so many instances, do I have to classify them based on the "per-word-per-topic" probability?

I'm not sure what the "per-document-per-topic" probability refers to in this package. The typical way to use LDA to cluster documents is to use the posterior probability of topic given document (this might be the same thing, I'm not sure): for any document $d$, the model can provide the conditional probability $p(t|d)$ of every topic $t$ given $d$. These values sum to 1 across topics (it's a distribution over topics for $d$), and for classification purposes one can just select the topic which has the highest probability given $d$.
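As a minimal sketch of that selection step, assuming you have already extracted $p(t|d)$ from the fitted model as one list of probabilities per tweet (the numbers and the names `doc_topic` and `most_likely_topic` are made up for illustration):

```python
# Hypothetical posterior p(t | d) for three tweets and three topics.
# Each list is a distribution over topics, so it sums to 1.
doc_topic = {
    "tweet_1": [0.70, 0.20, 0.10],
    "tweet_2": [0.05, 0.15, 0.80],
    "tweet_3": [0.30, 0.45, 0.25],
}


def most_likely_topic(distribution):
    """Return the index of the topic with the highest probability."""
    return max(range(len(distribution)), key=lambda t: distribution[t])


# Hard assignment: one topic index per tweet.
assignments = {doc: most_likely_topic(p) for doc, p in doc_topic.items()}
```

Note that a tweet like `tweet_3` has no clearly dominant topic, so it may be worth keeping the full distribution next to the hard label.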

Do I have to divide the dataset into testing and training? I thought that is a thing only in supervised training, since I cannot really use the testing dataset to measure quality of classification.

You're right, you don't need to split into training and test set since this is unsupervised learning.

Another goal would be to classify twitterers based on the topic they are most passionate about. Do you have any idea how to implement this?

The model gives you the posterior probability distribution over topics for every tweet. From these values I think you can obtain a similar distribution over topics for every tweeter, simply by marginalizing over the tweets by this author $a$: $p(t|a) = \sum_d p(t|d)\,p(d|a)$. If I'm not mistaken, when every tweet by $a$ is weighted equally this is simply the mean of $p(t|d)$ across all the documents/tweets $d$ by author $a$.
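The averaging could be sketched as follows. This is an illustration, not a definitive implementation: it assumes every tweet by an author counts equally, and the data and the names `author_topic_distribution` and `favourite_topic` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (author, p(t | d)) pairs, one per tweet.
tweets = [
    ("alice", [0.8, 0.1, 0.1]),
    ("alice", [0.6, 0.3, 0.1]),
    ("bob", [0.1, 0.2, 0.7]),
]


def author_topic_distribution(rows):
    """Average p(t | d) over each author's tweets to estimate p(t | a)."""
    sums = {}
    counts = defaultdict(int)
    for author, dist in rows:
        if author not in sums:
            sums[author] = [0.0] * len(dist)
        for t, p in enumerate(dist):
            sums[author][t] += p
        counts[author] += 1
    return {a: [s / counts[a] for s in total] for a, total in sums.items()}


author_topics = author_topic_distribution(tweets)

# The topic each author is "most passionate about" is the argmax of p(t | a).
favourite_topic = {
    a: max(range(len(p)), key=lambda t: p[t]) for a, p in author_topics.items()
}
```

Authors with only one or two tweets will get a noisy estimate, so you may want to require a minimum number of tweets per author before assigning a label.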
