Data Science Asked on August 11, 2021
I am trying to assign a different value to each sentence based on information about the presence of hashtags, upper-case letters/words (e.g. HATE), and some other features.
I created a data frame which includes some binary values (1 or 0):
Sentence               Upper case   Hashtags
I HATE migrants            1           0
I like cooking             0           0
#trump said he is ok       0           1
#blacklives SUPPORT        1           1
I would like to assign a value based on whether the binary conditions above are satisfied, for example:
- if Upper case = 1 and Hashtags = 1 then assign -10;
- if Upper case = 1 and Hashtags = 0 then assign -5;
- if Upper case = 0 and Hashtags = 1 then assign -5;
- if Upper case = 0 and Hashtags = 0 then assign 0;
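In code, the manual mapping would look something like this (a sketch using the column names from the table above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sentence": ["I HATE migrants", "I like cooking",
                 "#trump said he is ok", "#blacklives SUPPORT"],
    "Upper case": [1, 0, 0, 1],
    "Hashtags": [0, 0, 1, 1],
})

# one condition per rule; the first matching condition wins
conditions = [
    (df["Upper case"] == 1) & (df["Hashtags"] == 1),
    (df["Upper case"] == 1) & (df["Hashtags"] == 0),
    (df["Upper case"] == 0) & (df["Hashtags"] == 1),
]
df["value"] = np.select(conditions, [-10, -5, -5], default=0)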
This would be fine for a small number of rules and combinations, but with three variables to check, the number of combinations to handle manually grows quickly!
Do you know if there is a way to take all of these into account in an easy (and feasible) way?
Someone told me about using regression, but I have never used it for a task like this. The context is detecting fake tweets.
I understand that you are trying to derive a new informative feature from the available tweet texts, and you do it in two steps: first you calculate dummy binary features, then you aggregate all the binary features into one numerical feature.
Several aggregation rules come to mind. One option is a binary code: for three binary variables it works as follows (showing four of the eight combinations)
A) 0,0,0 -> 0
B) 0,0,1 -> 1
C) 0,1,0 -> 2
D) 0,1,1 -> 3
Basically, you multiply each binary variable by the corresponding power of 2 (1, 2, 4, ...) and then sum the results.
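A minimal sketch of this encoding with pandas (the feature names are placeholders for your binary columns):

import pandas as pd

# three hypothetical binary features per tweet
df = pd.DataFrame({"upper_case": [0, 1, 0, 1],
                   "hashtags":   [0, 0, 1, 1],
                   "feature3":   [0, 0, 0, 1]})

# multiply each column by its power of 2 and sum row-wise
weights = [2 ** i for i in range(df.shape[1])]   # [1, 2, 4]
df["code"] = (df * weights).sum(axis=1)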
The problem with this approach is that it implies the distance from A) to D) is three times larger than from A) to B), which might not be what you need. Furthermore, the distances depend on the order of your binary variables.
EDIT 1: from the tag unsupervised-learning I understand that you don't have a labeled dataset, i.e. you don't know which texts belong to the category "fake tweet". Without labeled data you cannot define an objective criterion that would tell you that one aggregation approach (e.g. one of those suggested above) is better than another.
What you could do:
- label some tweets manually, based on your gut feeling;
- apply both aggregation approaches to the labeled tweets and check whether you see any pattern. An aggregation approach can be judged successful/appropriate if tweets with the same label (say, "fake") have similar scores. This can be quantified using the correlation between score and label, or simply with a contingency table (see the sketch below).
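For example, a quick check along these lines (the column names and values are illustrative):

import pandas as pd

# hypothetical manually labeled sample: aggregated score vs. gut-feeling label
labeled = pd.DataFrame({"score": [3, 0, 2, 3, 1, 0],
                        "label": [1, 0, 1, 1, 0, 0]})  # 1 = "fake"

# correlation between the aggregated score and the manual label
print(labeled["score"].corr(labeled["label"]))

# contingency table of score vs. label
print(pd.crosstab(labeled["score"], labeled["label"]))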
Correct answer by aivanov on August 11, 2021
I suggest testing the sentence or tweet for polarity. This can be done with the textblob library, which can be installed with pip install -U textblob. Once the polarity of the text is found, it can be assigned to a separate column in the dataframe, and the sentence polarity can then be used for further analysis.
Polarity and subjectivity are defined as follows:
Polarity is a float in the range [-1.0, 1.0], where 0 indicates neutral, +1 a very positive sentiment, and -1 a very negative sentiment.
Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. A subjective sentence expresses personal feelings, views, beliefs, opinions, allegations, desires, suspicions, or speculations, whereas objective sentences are factual.
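A quick look at what textblob returns (the output tuple matches the values reported in the result table below):

from textblob import TextBlob

# .sentiment returns a (polarity, subjectivity) named tuple
print(TextBlob("i love apples").sentiment)
# Sentiment(polarity=0.5, subjectivity=0.6)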
Data
import pandas as pd
from textblob import TextBlob

# create a dictionary
data = {"Date": ["1/10/2020", "2/10/2020", "3/10/2020", "4/10/2020", "5/10/2020"],
        "ID": [1, 2, 3, 4, 5],
        "Tweet": ["I Hate Migrants",
                  "#trump said he is ok", "the sky is blue",
                  "the weather is bad", "i love apples"]}

# convert data to dataframe
df = pd.DataFrame(data)

# compute the (polarity, subjectivity) tuple for each tweet
df['sentiment'] = df['Tweet'].apply(lambda tweet: TextBlob(tweet).sentiment)
Notice that the sentiment column is a tuple, so we can split it into two columns with df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index). Now we can create a new dataframe to which I'll append the split columns, as shown:
df_new = df  # note: df_new is another reference to df, not a copy
df_new['polarity'] = df1['polarity']
df_new.polarity = df1.polarity.astype(float)
df_new['subjectivity'] = df1['subjectivity']
df_new.subjectivity = df1.subjectivity.astype(float)
Next, on the basis of the sentence polarity found earlier, we can add a label to the dataframe indicating whether the tweet/sentence is fake, not fake, or neutral.
import numpy as np
conditionList = [
df_new['polarity'] == 0,
df_new['polarity'] > 0,
df_new['polarity'] < 0]
choiceList = ['neutral', 'not_fake', 'fake']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
The result will look like this:
Result
        Date  ID                 Tweet      sentiment  polarity  subjectivity     label
0  1/10/2020   1       I Hate Migrants    (-0.8, 0.9)     -0.80          0.90      fake
1  2/10/2020   2  #trump said he is ok     (0.5, 0.5)      0.50          0.50  not_fake
2  3/10/2020   3       the sky is blue     (0.0, 0.1)      0.00          0.10   neutral
3  4/10/2020   4    the weather is bad  (-0.68, 0.66)    -0.68          0.66      fake
4  5/10/2020   5         i love apples     (0.5, 0.6)      0.50          0.60  not_fake
Complete Code
import pandas as pd
import numpy as np
from textblob import TextBlob
data = {"Date":["1/10/2020","2/10/2020","3/10/2020","4/10/2020","5/10/2020"],
"ID":[1,2,3,4,5],
"Tweet":["I Hate Migrants",
"#trump said he is ok", "the sky is blue",
"the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)
# print(df)
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
# print(df)
# split the sentiment column into two
# split the sentiment column into two
df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index)
# append cols to original dataframe
df_new = df  # note: df_new is another reference to df, not a copy
df_new['polarity'] = df1['polarity']
df_new.polarity = df1.polarity.astype(float)
df_new['subjectivity'] = df1['subjectivity']
df_new.subjectivity = df1.subjectivity.astype(float)
# print(df_new)
# add label to dataframe based on condition
conditionList = [
df_new['polarity'] == 0,
df_new['polarity'] > 0,
df_new['polarity'] < 0]
choiceList = ['neutral', 'not_fake', 'fake']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
Answered by mnm on August 11, 2021
Manually assigning a value to a feature level can be done. However, it is often better to allow the machine learning algorithm to learn the importance of different features during the training process.
The general machine learning process starts with labeled data. If the labels are numeric, it is a regression problem; in the case of fake tweets, a regression label could be how fake the tweet is (say, on a scale from 1 to 100). More typically, fake-tweet detection is framed as a classification problem: fake or not.
Then, encode the features. You have already done that partly by one-hot encoding the presence of different features; one way such binary features might be derived is sketched below.
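A rough sketch of deriving binary features like those in the question (the exact rules are assumptions for illustration):

import pandas as pd

tweets = pd.Series(["I HATE migrants", "I like cooking", "#trump said he is ok"])

features = pd.DataFrame({
    # 1 if the tweet contains an all-caps word of length > 1
    "upper_case": tweets.str.split().apply(
        lambda words: int(any(w.isupper() and len(w) > 1 for w in words))),
    # 1 if the tweet contains a hashtag
    "hashtags": tweets.str.contains("#").astype(int),
})
print(features)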
Next, feed both the features and the labels into a machine learning algorithm. The algorithm will learn the relative weights of the features in order to best predict the labels. For example, it might learn that upper case is not predictive and a hashtag is very predictive of fake tweets.
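A minimal sketch with scikit-learn, assuming a small manually labeled set (the column names and labels are illustrative):

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({"upper_case": [1, 0, 0, 1, 0, 1],
                      "hashtags":   [0, 0, 1, 1, 0, 1],
                      "fake":       [0, 0, 1, 1, 0, 1]})  # hypothetical labels

X, y = train[["upper_case", "hashtags"]], train["fake"]
model = LogisticRegression().fit(X, y)

# the learned weights take the place of hand-picked scores like -10 / -5 / 0
print(dict(zip(X.columns, model.coef_[0])))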
Answered by Brian Spiering on August 11, 2021