Stack Overflow Asked by user12907213 on January 1, 2021
I have a dataframe with many rows, each containing a tweet. I would like to classify them using a machine learning technique (supervised or unsupervised).
Since the dataset is unlabelled, I thought I would select some of the rows (50%) to label manually (+1 positive, -1 negative, 0 neutral), then use machine learning to assign labels to the remaining rows.
To do this, I proceeded as follows:
Original Dataset
Date ID Tweet
01/20/2020 4141 The cat is on the table
01/20/2020 4142 The sky is blue
01/20/2020 53 What a wonderful day
...
05/12/2020 532 In this extraordinary circumstance we are together
05/13/2020 12 It was a very bad decision
05/22/2020 565 I know you are the best
I split the dataset into 50% train and 50% test, and manually labelled the training half as follows:
Date ID Tweet PosNegNeu
01/20/2020 4141 The cat is on the table 0
01/20/2020 4142 The weather is bad today -1
01/20/2020 53 What a wonderful day 1
...
05/12/2020 532 In this extraordinary circumstance we are together 1
05/13/2020 12 It was a very bad decision -1
05/22/2020 565 I know you are the best 1
Then I extracted word frequencies (after removing stopwords):
Frequency
bad 2
circumstance 1
best 1
day 1
today 1
wonderful 1
….
I would like to try to assign labels to the other data based on:
I know that there are several ways to do this, some of them better, but I am having trouble classifying/assigning labels to my data, and I cannot label it all manually.
My expected output, e.g. with the following test dataset
Date ID Tweet
06/12/2020 43 My cat 'Sylvester' is on the table
07/02/2020 75 Laura's pen is black
07/02/2020 763 It is such a wonderful day
...
11/06/2020 1415 No matter what you need to do
05/15/2020 64 I disagree with you: I think it is a very bad decision
12/27/2020 565 I know you can improve
should be something like
Date ID Tweet PosNegNeu
06/12/2020 43 My cat 'Sylvester' is on the table 0
07/02/2020 75 Laura's pen is black 0
07/02/2020 763 It is such a wonderful day 1
...
11/06/2020 1415 No matter what you need to do 0
05/15/2020 64 I disagree with you: I think it is a very bad decision -1
12/27/2020 565 I know you can improve 0
A better approach would probably be to consider n-grams rather than single words, or to build a corpus/vocabulary to assign a score and then a sentiment. Any advice would be greatly appreciated, as this is my first exercise in machine learning. I think that k-means clustering could also be applied, to try to group similar sentences.
If you could provide a complete example (with my data would be great, but other data would be fine as well), I would really appreciate it.
I propose analysing each sentence or tweet for polarity. This can be done using the textblob library, which can be installed with pip install -U textblob. Once the polarity of the text is found, it can be assigned to a separate column in the dataframe and then used for further analysis.
Initial Code
from textblob import TextBlob
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)
Intermediate Result
Date ... sentiment
0 1/1/2020 ... (0.0, 0.0)
1 2/1/2020 ... (0.0, 0.0)
2 3/2/2020 ... (0.0, 0.1)
3 4/2/2020 ... (-0.6999999999999998, 0.6666666666666666)
4 5/2/2020 ... (0.5, 0.6)
[5 rows x 4 columns]
From the output above, we can see that each entry in the sentiment column has two components: polarity and subjectivity.
Polarity is a float value within the range [-1.0, 1.0], where 0 indicates neutral, +1 indicates a very positive sentiment and -1 represents a very negative sentiment.
Subjectivity is a float value within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. Subjective sentences express personal feelings, views, beliefs, opinions, allegations, desires, suspicions and speculations, whereas objective sentences are factual.
Notice that the sentiment column holds tuples, so we can split it into two columns with df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index). Now we can add the split columns to the dataframe, as shown:
df_new = df
# copy each component into its own float column
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)
Finally, based on the sentence polarity found earlier, we can add a label to the dataframe that indicates whether the tweet is positive, negative or neutral.
import numpy as np
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
The result will look like this:
Final Result
Date ID Tweet ... polarity subjectivity label
0 1/1/2020 1 the weather is sunny ... 0.0 0.000000 neutral
1 2/1/2020 2 tom likes harry ... 0.0 0.000000 neutral
2 3/2/2020 3 the sky is blue ... 0.0 0.100000 neutral
3 4/2/2020 4 the weather is bad ... -0.7 0.666667 negative
4 5/2/2020 5 i love apples ... 0.5 0.600000 positive
[5 rows x 7 columns]
Data
import pandas as pd
# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
"ID":[1,2,3,4,5],
"Tweet":["the weather is sunny",
"tom likes harry", "the sky is blue",
"the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)
Full Code
# create some dummy data
import pandas as pd
import numpy as np
# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
"ID":[1,2,3,4,5],
"Tweet":["the weather is sunny",
"tom likes harry", "the sky is blue",
"the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)
from textblob import TextBlob
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)
# split the sentiment column into two
df1=pd.DataFrame(df['sentiment'].tolist(), index= df.index)
# append cols to original dataframe
df_new = df
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)
print(df_new)
# add label to dataframe based on condition
conditionList = [
df_new['polarity'] == 0,
df_new['polarity'] > 0,
df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
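The question asked for numeric labels (+1, -1, 0) rather than strings, and polarity scores very close to zero may be better treated as neutral. The sketch below shows one way to do both. The threshold of 0.1 and the helper name label_from_polarity are assumptions for illustration, not part of textblob, and the polarity values are the rounded ones from the Intermediate Result above, so textblob is not needed to run it.

```python
import numpy as np
import pandas as pd

# Rounded polarity values taken from the Intermediate Result above
df_demo = pd.DataFrame({
    "Tweet": ["the weather is sunny", "tom likes harry", "the sky is blue",
              "the weather is bad", "i love apples"],
    "polarity": [0.0, 0.0, 0.0, -0.7, 0.5],
})

def label_from_polarity(polarity, threshold=0.1):
    """Map polarity to +1 / -1 / 0, treating |polarity| <= threshold as neutral.

    The threshold is a hypothetical tuning knob, not a textblob setting.
    """
    conditions = [polarity > threshold, polarity < -threshold]
    return np.select(conditions, [1, -1], default=0)

df_demo["PosNegNeu"] = label_from_polarity(df_demo["polarity"])
print(df_demo)
```

A wider threshold sends more borderline tweets to the neutral class; a suitable value would have to be checked against the manually labelled rows.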
Correct answer by maverick on January 1, 2021
IIUC, you have a percentage of the data labelled and require labelling the remaining data. I would recommend reading about Semi-Supervised machine learning.
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data).
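For comparison, the purely supervised end of that spectrum, training only on the manually labelled half and predicting the rest, could look like the sketch below. It assumes scikit-learn's TfidfVectorizer with unigrams and bigrams and a LogisticRegression classifier, neither of which was named in the question. The tweets are the ones from the question, and with so few training examples the predictions should not be taken seriously.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually labelled half (+1 positive, -1 negative, 0 neutral), from the question
train_tweets = [
    "The cat is on the table",
    "The weather is bad today",
    "What a wonderful day",
    "In this extraordinary circumstance we are together",
    "It was a very bad decision",
    "I know you are the best",
]
train_labels = [0, -1, 1, 1, -1, 1]

# Unlabelled half that still needs a label
test_tweets = [
    "My cat 'Sylvester' is on the table",
    "Laura's pen is black",
    "It is such a wonderful day",
    "I disagree with you: I think it is a very bad decision",
]

# Unigram + bigram TF-IDF features feeding a linear classifier (assumed choices)
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_tweets, train_labels)

predicted = clf.predict(test_tweets)
for tweet, label in zip(test_tweets, predicted):
    print(label, tweet)
```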
Sklearn provides quite an extensive variety of algorithms that can assist with semi-supervised learning. Do check this out.
If you need more insight into this topic I would highly recommend checking this article out as well.
Here is an example with the iris data set:
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelPropagation
#Init
label_prop_model = LabelPropagation()
iris = datasets.load_iris()
#Randomly create unlabelled samples
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
labels = np.copy(iris.target)
labels[random_unlabeled_points] = -1
#propagate labels over the remaining unlabelled data
label_prop_model.fit(iris.data, labels)
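The same idea can be sketched on tweets instead of iris. One caveat: scikit-learn's semi-supervised estimators reserve -1 for unlabelled samples (as in the iris code above), which clashes with the question's use of -1 for negative, so this sketch encodes sentiment as 0/1/2 instead. The TF-IDF features, the RBF kernel settings and the tiny tweet list are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation

# -1 means "unlabelled" to scikit-learn, so sentiment classes use 0/1/2 here
NEG, NEU, POS, UNLABELLED = 0, 1, 2, -1

tweets = [
    "The cat is on the table",            # labelled neutral
    "The weather is bad today",           # labelled negative
    "What a wonderful day",               # labelled positive
    "It was a very bad decision",         # labelled negative
    "It is such a wonderful day",         # unlabelled
    "I think it is a very bad decision",  # unlabelled
]
labels = np.array([NEU, NEG, POS, NEG, UNLABELLED, UNLABELLED])

# Dense TF-IDF matrix over unigrams and bigrams, English stop words removed
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(tweets).toarray()

# Propagate the known labels to the unlabelled tweets
model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X, labels)

# transduction_ holds one label per row, including the rows that were labelled
predicted = model.transduction_
for tweet, label in zip(tweets, predicted):
    print(label, tweet)
```

After fitting, model.predict can be called on the TF-IDF vectors of new tweets, provided they are transformed with the same fitted vectorizer.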
Answered by Akshay Sehgal on January 1, 2021