
Using Machine Learning techniques for text-analysis

Asked on Data Science, March 22, 2021

I am analysing a bunch of tweets and I want to understand which political party the authors support.

I am using Mathematica, but I am thinking of rewriting my code in Python. If you have any suggestions, do not hesitate.

Let's say, for example, that I have five parties labelled by a number: Party1, Party2, ..., Party5, and I provide a list of tweets which are known to be associated with the parties. This list looks roughly like trainingList = {tweet1 -> "Party1", tweet2 -> "Party2", ..., tweet5 -> "Party5", tweetK -> "Neutral"}. I also added a class "Neutral" because some tweets are not classifiable. In practice I provide more than one tweet per party, so the trainingList is much longer.

Notice that I remove stop-words from the tweets and apply a word-stemming algorithm to simplify the expressions. After that, every tweet is a list of potentially significant words.
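In Python, that preprocessing step could look roughly like this (a sketch assuming NLTK's English stop-word list and Snowball stemmer; any tokenizer/stemmer pair would do):

import re

from nltk.corpus import stopwords              # requires nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def preprocess(tweet):
    # Lower-case, keep only letters, drop stop-words, stem what remains.
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("The candidates were debating taxes all evening"))
# e.g. ['candid', 'debat', 'tax', 'even']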

I create the classifier function, providing a prior (the explicit form of the prior is not important):

c = Classify[trainingList, ClassPriors-> prior]
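For what it's worth, a rough Python analogue of this (an assumption on my part: scikit-learn's MultinomialNB standing in for Classify, since it also accepts an explicit class prior) would be:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: each tweet is already a list of stemmed words.
tweets = [["tax", "cut", "economi"], ["healthcar", "reform"], ["weather", "nice"]]
labels = ["Party1", "Party2", "Neutral"]
prior = [0.2, 0.4, 0.4]   # class prior, in sorted class order: Neutral, Party1, Party2

c = make_pipeline(
    CountVectorizer(analyzer=lambda words: words),   # tweets are pre-tokenised
    MultinomialNB(class_prior=prior),
)
c.fit(tweets, labels)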

Then I want to do some checks: I take a bunch of tweets from other party supporters and try

c[{word1,word2,word3,word4,word5,word7},"Probabilities"]

which gives me a list of probabilities $p_i$. For example,

<| "Party1" -> p1, "Party2" -> p2, "Party3" -> p3, "Party4" -> p4, "Party5" -> p5, "Neutral" -> p6 |>
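Continuing the hypothetical scikit-learn pipeline c sketched above, the equivalent query would be predict_proba zipped with the class names:

# Class -> probability dictionary for one new (pre-tokenised) tweet.
new_tweet = ["tax", "reform", "economi"]
probs = dict(zip(c.classes_, c.predict_proba([new_tweet])[0]))
print(probs)   # e.g. {'Neutral': 0.05, 'Party1': 0.7, 'Party2': 0.25}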

Question 1: Is there a way to understand how the algorithm assigns these probabilities? More specifically, how is each word associated with a party?

Question 2: Is there a way to see which are the most frequent n-grams of words associated with a given party (class)?

3 Answers

Take a few words you know are linked with, e.g., Republicans and with Democrats, and extract their word embeddings, i.e. the high-dimensional vector representations of the words. This assumes you have trained word embeddings, either pretrained or, even better, trained yourself on your data. The embeddings could have been trained/extracted through, e.g., LSTM training or matrix factorisation.

On the word embeddings, do dimensionality reduction (PCA, for example) down to 2D and plot. Words that end up close together will be "similar" with respect to the objective the embeddings were trained for.

Pretrained word embeddings are available to download. They might not be the optimal embeddings for your objective, but they are still better than nothing.
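A concrete sketch of this idea (assuming gensim's downloadable GloVe vectors and scikit-learn's PCA; the word list is only illustrative):

import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Pretrained embeddings (an assumption: 50-dimensional GloVe vectors via gensim-data).
kv = api.load("glove-wiki-gigaword-50")

words = ["republican", "democrat", "conservative", "liberal", "tax", "healthcare"]
vectors = [kv[w] for w in words]

# Reduce the high-dimensional embeddings to 2D and plot them.
coords = PCA(n_components=2).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()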

Answered by Carl Rynegardh on March 22, 2021

I believe that "Sequence Classification using LSTM RNN" is the exact answer to your problem.

Check out this tutorial; it's one of the best:

https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
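A minimal Keras sketch of such a sequence classifier (the vocabulary size and layer widths below are placeholders) could look like this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 10_000   # placeholder: size of the tweet vocabulary
N_CLASSES = 6         # Party1..Party5 plus Neutral

model = Sequential([
    Embedding(VOCAB_SIZE, 64),               # learn word embeddings from the tweets
    LSTM(64),                                 # read each tweet as a word sequence
    Dense(N_CLASSES, activation="softmax"),   # one probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, ...) where X_train holds padded integer word indices
# and y_train holds integer class labels (0..5).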

Answered by pcko1 on March 22, 2021

I think you are asking, for an arbitrary trained classifier, how you find its mapping between tweet words and party probabilities. There are a couple of model-agnostic ways to do this, which amount to checking which features make the biggest difference. The following two Python libraries implement one or more of these algorithms, starting from a paper written by the LIME authors:

  1. LIME shows which words matter for a given classification. You could send it single-word texts to get their average importance, but of course in better text models, context will matter (see the sketch after this list).

  2. ELI5 includes an alternate LIME implementation, as well as several special-purpose functions for some common scikit-learn classifiers.

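For example, a minimal LIME sketch (the tiny training set and the CountVectorizer + MultinomialNB pipeline below are placeholders for your real classifier; the only requirement is a function that maps raw tweet strings to class probabilities):

from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical tiny training set; in practice this is the labelled tweet corpus.
texts = ["lower taxes now", "more public healthcare", "nice weather today"]
labels = [0, 1, 2]                        # 0 = Party1, 1 = Party2, 2 = Neutral
class_names = ["Party1", "Party2", "Neutral"]

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=class_names)
explanation = explainer.explain_instance(
    "lower taxes and smaller government",   # hypothetical tweet to explain
    pipeline.predict_proba,                 # raw strings -> class probabilities
    num_features=5,
    labels=[0],
)
print(explanation.as_list(label=0))         # (word, weight) pairs for "Party1"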

@CarlRynegardh also suggests a general method, given a (word x latent-feature) matrix that can be extracted from many classifiers, but since you seemed unfamiliar with that approach, I think you will get more value from exploring these libraries. They will also better account for context effects if you want to consider more than single words, and ELI5 at least takes advantage of some model-specific features.

For getting started in Python, you're likely to get a lot of opinions. I think most would recommend the Anaconda Python distribution and the SciPy ecosystem, including NumPy, matplotlib, and pandas. For general machine learning my go-to package is scikit-learn. For deep networks, the two main competitors seem to be Keras/TensorFlow and fastai/PyTorch. The second is more Pythonic. (I believe it also currently has the edge in classifier performance, but not as much support for deploying in production. Both are still evolving quickly.)

Coming from Mathematica, I expect you'll want to work in the Jupyter notebook environment. Both LIME and ELI5 have several example notebooks, and GitHub has a built-in notebook viewer, so you can read them in your browser when following the links above.

The fastest way to start may be to use a cloud service that provides instant access to a Jupyter notebook pre-configured with all the libraries. There are plenty of options from clouds big and small. I'm using Paperspace, but I hear Google Colab is free and Just Works (TM).

Answered by ctwardy on March 22, 2021
