Data Science Asked on August 11, 2021
I am trying to assign a different value to each sentence based on information about the presence of hashtags, upper-case letters/words (e.g. HATE), and some other features.
I created a data frame which includes some binary values (1 or 0):
Sentence               Upper case   Hashtags
I HATE migrants            1           0
I like cooking             0           0
#trump said he is ok       0           1
#blacklives SUPPORT        1           1
I would like to assign a value based on whether the binary conditions above are satisfied, for example:
- if Upper case = 1 and Hashtags = 1 then assign -10;
- if Upper case = 1 and Hashtags = 0 then assign -5;
- if Upper case = 0 and Hashtags = 1 then assign -5;
- if Upper case = 0 and Hashtags = 0 then assign 0;
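In code, the manual mapping would look something like this (a sketch using the column names from the table above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sentence": ["I HATE migrants", "I like cooking",
                 "#trump said he is ok", "#blacklives SUPPORT"],
    "Upper case": [1, 0, 0, 1],
    "Hashtags": [0, 0, 1, 1],
})

# one condition per rule; the first matching condition wins
conditions = [
    (df["Upper case"] == 1) & (df["Hashtags"] == 1),
    (df["Upper case"] == 1) & (df["Hashtags"] == 0),
    (df["Upper case"] == 0) & (df["Hashtags"] == 1),
]
df["value"] = np.select(conditions, [-10, -5, -5], default=0)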
This would be fine for a small number of rules and combinations, but with three variables to check, the number of combinations to handle manually grows quickly!
Do you know if there is a way to take all of these into account in an easy (and feasible) way?
Someone told me about using regression, but I have never used it for a task like this. The context is detecting fake tweets.
I understand that you are trying to derive a new informative feature from the available tweet texts, and you do it in two steps: first you calculate dummy binary features, then you aggregate all the binary features into one numerical feature.
Several aggregation rules come to mind. One option is a binary code: for three binary variables it works as follows (showing four of the eight combinations)
A) 0,0,0 -> 0
B) 0,0,1 -> 1
C) 0,1,0 -> 2
D) 0,1,1 -> 3
Basically, you multiply each binary variable by the corresponding power of 2 (1, 2, 4, ...) and then sum the results.
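A minimal sketch of this encoding with pandas (the feature names are placeholders for your binary columns):

import pandas as pd

# three hypothetical binary features per tweet
df = pd.DataFrame({"upper_case": [0, 1, 0, 1],
                   "hashtags":   [0, 0, 1, 1],
                   "feature3":   [0, 0, 0, 1]})

# multiply each column by its power of 2 and sum row-wise
weights = [2 ** i for i in range(df.shape[1])]   # [1, 2, 4]
df["code"] = (df * weights).sum(axis=1)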
The problem with this approach is that it implies the distance from A) to D) is three times larger than from A) to B), which might not be what you need. Furthermore, the distances depend on the order of your binary variables.
EDIT 1: from the tag unsupervised-learning I understand that you don't have a labeled dataset, i.e. you don't know which texts belong to the category "fake tweet". Without labeled data you cannot define an objective criterion that would tell you that one aggregation approach (e.g. one of those suggested above) is better than another.
What you could do:
- label some tweets manually, based on your gut feeling;
- apply both aggregation approaches to the labeled tweets and check whether you see any pattern. An aggregation approach can be judged successful/appropriate if tweets with the same label (say, "fake") have similar scores. This can be quantified using the correlation between score and label, or simply with a contingency table (see the sketch below).
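For example, a quick check along these lines (the column names and values are illustrative):

import pandas as pd

# hypothetical manually labeled sample: aggregated score vs. gut-feeling label
labeled = pd.DataFrame({"score": [3, 0, 2, 3, 1, 0],
                        "label": [1, 0, 1, 1, 0, 0]})  # 1 = "fake"

# correlation between the aggregated score and the manual label
print(labeled["score"].corr(labeled["label"]))

# contingency table of score vs. label
print(pd.crosstab(labeled["score"], labeled["label"]))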
Correct answer by aivanov on August 11, 2021
I suggest testing the sentence or tweet for polarity. This can be done with the textblob library, which can be installed with pip install -U textblob. Once the polarity of the text is found, it can be assigned to a separate column in the dataframe, and the sentence polarity can then be used for further analysis.
Polarity and subjectivity are defined as follows:
Polarity is a float in the range [-1.0, 1.0], where 0 indicates neutral, +1 a very positive sentiment, and -1 a very negative sentiment.
Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. A subjective sentence expresses personal feelings, views, beliefs, opinions, allegations, desires, suspicions, or speculations, whereas objective sentences are factual.
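A quick look at what textblob returns (the output tuple matches the values reported in the result table below):

from textblob import TextBlob

# .sentiment returns a (polarity, subjectivity) named tuple
print(TextBlob("i love apples").sentiment)
# Sentiment(polarity=0.5, subjectivity=0.6)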
Data
import pandas as pd
from textblob import TextBlob

# create a dictionary
data = {"Date": ["1/10/2020", "2/10/2020", "3/10/2020", "4/10/2020", "5/10/2020"],
        "ID": [1, 2, 3, 4, 5],
        "Tweet": ["I Hate Migrants",
                  "#trump said he is ok", "the sky is blue",
                  "the weather is bad", "i love apples"]}

# convert data to dataframe
df = pd.DataFrame(data)

# compute the (polarity, subjectivity) tuple for each tweet
df['sentiment'] = df['Tweet'].apply(lambda tweet: TextBlob(tweet).sentiment)
Notice that the sentiment column is a tuple, so we can split it into two columns with df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index). Now we can create a new dataframe to which I'll append the split columns, as shown:
df_new = df  # note: df_new is another reference to df, not a copy
df_new['polarity'] = df1['polarity']
df_new.polarity = df1.polarity.astype(float)
df_new['subjectivity'] = df1['subjectivity']
df_new.subjectivity = df1.subjectivity.astype(float)
Next, on the basis of the sentence polarity found earlier, we can add a label to the dataframe indicating whether the tweet/sentence is fake, not fake, or neutral.
import numpy as np
conditionList = [
df_new['polarity'] == 0,
df_new['polarity'] > 0,
df_new['polarity'] < 0]
choiceList = ['neutral', 'not_fake', 'fake']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
The result will look like this:
Result
        Date  ID                 Tweet      sentiment  polarity  subjectivity     label
0  1/10/2020   1       I Hate Migrants    (-0.8, 0.9)     -0.80          0.90      fake
1  2/10/2020   2  #trump said he is ok     (0.5, 0.5)      0.50          0.50  not_fake
2  3/10/2020   3       the sky is blue     (0.0, 0.1)      0.00          0.10   neutral
3  4/10/2020   4    the weather is bad  (-0.68, 0.66)    -0.68          0.66      fake
4  5/10/2020   5         i love apples     (0.5, 0.6)      0.50          0.60  not_fake
Complete Code
import pandas as pd
import numpy as np
from textblob import TextBlob
data = {"Date":["1/10/2020","2/10/2020","3/10/2020","4/10/2020","5/10/2020"],
"ID":[1,2,3,4,5],
"Tweet":["I Hate Migrants",
"#trump said he is ok", "the sky is blue",
"the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)
# print(df)
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
# print(df)
# split the sentiment column into two
# split the sentiment column into two
df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index)
# append cols to original dataframe
df_new = df  # note: df_new is another reference to df, not a copy
df_new['polarity'] = df1['polarity']
df_new.polarity = df1.polarity.astype(float)
df_new['subjectivity'] = df1['subjectivity']
df_new.subjectivity = df1.subjectivity.astype(float)
# print(df_new)
# add label to dataframe based on condition
conditionList = [
df_new['polarity'] == 0,
df_new['polarity'] > 0,
df_new['polarity'] < 0]
choiceList = ['neutral', 'not_fake', 'fake']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
Answered by mnm on August 11, 2021
Manually assigning a value to a feature level can be done. However, it is often better to allow the machine learning algorithm to learn the importance of different features during the training process.
The general machine learning process starts with labeled data. If the labels are numeric, it is a regression problem; in the case of fake tweets, a regression label could be how fake the tweet is (say, on a scale from 1 to 100). More typically, fake-tweet detection is framed as a classification problem: fake or not.
Then, encode the features. You have already done that partly by one-hot encoding the presence of different features; one way such binary features might be derived is sketched below.
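A rough sketch of deriving binary features like those in the question (the exact rules are assumptions for illustration):

import pandas as pd

tweets = pd.Series(["I HATE migrants", "I like cooking", "#trump said he is ok"])

features = pd.DataFrame({
    # 1 if the tweet contains an all-caps word of length > 1
    "upper_case": tweets.str.split().apply(
        lambda words: int(any(w.isupper() and len(w) > 1 for w in words))),
    # 1 if the tweet contains a hashtag
    "hashtags": tweets.str.contains("#").astype(int),
})
print(features)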
Next, feed both the features and the labels into a machine learning algorithm. The algorithm will learn the relative weights of the features in order to best predict the labels. For example, it might learn that upper case is not predictive and a hashtag is very predictive of fake tweets.
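A minimal sketch with scikit-learn, assuming a small manually labeled set (the column names and labels are illustrative):

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({"upper_case": [1, 0, 0, 1, 0, 1],
                      "hashtags":   [0, 0, 1, 1, 0, 1],
                      "fake":       [0, 0, 1, 1, 0, 1]})  # hypothetical labels

X, y = train[["upper_case", "hashtags"]], train["fake"]
model = LogisticRegression().fit(X, y)

# the learned weights take the place of hand-picked scores like -10 / -5 / 0
print(dict(zip(X.columns, model.coef_[0])))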
Answered by Brian Spiering on August 11, 2021