Neural network based on twitter followers, what would be my features?

Cross Validated Asked by Sharki on December 6, 2020

I was thinking of training a neural network that would be able to classify twitter users according to their followers. For example, I would like to know if a user is “gamer” or not by the people they follow (not the number, but the list of the people they follow). I’d have a dataset of people who are gamers or not based on the accounts they’re following and train them.

The problem that I find is that the number of followers of each person will vary … One person could follow thousands of users while another could only follow hundreds, so since the list of followers of each person won’t be the same lenght, how could I determine my feature there?

This would be an example of my dataset:

enter image description here

Features would be every people an account is following, and the label would say if it’s a gamer or not.

As I said, the problem I can see is that the number of labels won’t be the same…

5 Answers

This is a very interesting problem that can be approached in a lot of different ways!
Even though you are asking for a neural network, I will apply my own experience with Bayesian networks as I think they are more than suitable for this task (they might be even the best choice).
You could start of with a small seed of highly popular users which you can label as gamers/not gamers. Afterwards, using some threshold (which should vary in proportion to the number of users which have been labelled P(θ = X | labelledUsers = 100)), you should decide whether a particular user is a gamer or not (I would start off with people that follow a lot of "twitterers" to maximize the chances of following one of your "seed".
To help you find the previously mentioned probability, you can also help yourself with a test set to validate the probability. I hope this was helpful even though it doesn't use a neural network. BTW, you could also model P(θ = X | labelledUsers = 100) with a beta distribution. Using this technique, the algorithm would choose how much to weigh each feature using a PDF and you could just feed it as many features as you like without bothering about how relevant they are. As an extra, I wholeheartedly encourage you to buy/download John Kruschke's book on Doing Bayesian Data Analysis (A tutorial with R and BUGS), it will give you an excellent insight into data analysis for solving problems like this one.

In case you still want to use ANNs, I would recommend learning the parameters (the weights) for the neural network in a special environment (which you have to manually label) and using features such as the number of people following, ratio: gamers/all, you can try to create a computer vision algorithm for the images he posts (gaming vs non gaming). The list is infinite! Still, you should be careful not to include too many random (meaningless) features otherwise it might cause underfitting!

If you would also like details on which programming language would be best for this, I would recommend R or Matlab (my personal choice even though R comes with more premade functions) for developing the algorithm and Fortran or C for a final exportable version (if you wished to do so).

Correct answer by david david on December 6, 2020

I would start by giving each Twitter user a unique ID - this will become the index of your dataset. Then for each unique ID (twitter user), do feature engineering and brainstorm different features that you think would be useful. I would highly recommend reading 'Hands On Machine Learning with Scikit-Learn and Tensorflow' if you don't have much experience with machine learning and neural networks for an implementation based approach. I'd also recommend researching and reading papers about Twitter user trends and the different category of users if you haven't already. Knowing the subject matter of what you are trying to model is very important.

Keep in mind, that you don't need to be very selective with the features you add. As long as you can come up with a reasonable suspicion, go ahead and add the data. Keep in mind, this will add to the amount of time you spend cleaning the data.

Some features I can think of:

  1. Parse their tweets and use a 'bag of words' model to assign a tone, negative or positive. Some category of users may have more positive or negative tone. For example, political tweets on Twitter.
  2. Twitter API may disclose what country the user is from, or even the city they are from. Some cities have a bigger population of certain people. For example, there are more farmers in Idaho than in NYC.
  3. Number of retweets and likes per unique ID.
  4. Number of tweets per user (as another commenter mentioned)

Another thought: a problem like this is ripe for a network analysis and graph based network model. Here is an example.

Answered by FifthCode on December 6, 2020

Your problem is that you are taking each followed user as an input feature and that the number of followed users can vary so you don't have a fixed length vector for your NN.

Two possible approaches (that I can think of):

  1. you want to extract on single "value" from each followed user (such as their own label) and in that case you could have an input vectors that aggregate these in counters: each dimension of the vector is one of your label and the value is just the number of followed users that have that label. Of course you need a fixed number of labels...

eg: the target account X has 3 followers:

  • MisterA being tagged a 'gamer', an 'influencer' and a 'youngster'
  • MissB being tagged a 'gamer', a 'leader' and a 'youngster'
  • MisterC being tagged a 'gamer' and 'political activist'

Feature vector for the target account should aggregate these values, for instance:

X = {'gamer': 3; 'influencer': 1; 'youngster': 2; 'leader': 1; 'political activist': 1}

You might want to normalize so all values are in [0,1].

  1. you extract several features from each followed users, in that case your inputs will be the aggregation of these features over all followed accounts. Here the form of the vector will depends on the features you have already in your data (tags, location, number of messages, numbers of followers...).

In both cases, you need to flip your view on the inputs and move from each followed user as a dimension to something else.

Answered by gdupont on December 6, 2020

If I correctly understand your question, your main problem here is the length of your features which are varying. So I think you need to use Long short-term memory (LSTM ) model or Gated Recurrent Units (GRU). These types of models are based on Recurrent neural networks(RNN)s architecture which are used widely in deep learning models specially when you need some memory in your model or where your models input sequence length are varying. Here is a good article about Sequence Classification with LSTM.

Answered by pouyan on December 6, 2020

I guess your question is: you want to use a neural network to predict whether a user is a "gamer" or not, and you want to find some features from his/her followers. Well here are some possible features you may want to use.

  1. If you find the number of their followers is varying, why not just use this "follower number" as a feature.
  2. Based on the first point: because some gamers are very active in their game community and like to discuss their common interests, you can check the absolute number of the "gamers number" in their followers. Even based on the percentage of the (gamers/followers), you can even give weights to these users.
  3. Another thing I found from my project is that: you may find some interesting features from the user pages. For example, the number of tweets or pictures they post, the contents of their tweets (you can do some simple sentiment analyses or more complex NLP to extract features from it). Then you can find more useful features to feed your model.

Hope these can help you.

Answered by Nick Chen on December 6, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP