Data Science Asked by CopyOfA on February 14, 2021
I am new to NLP and have a dataset that has a bunch of (social media) messages on which I would like to try some methods like latent Dirichlet allocation (LDA). First, I need to clean the data of things like punctuation, emojis, etc. I’m not sure how to go about doing this in the most efficient and accurate manner. My code right now is this:
import pandas as pd
import re

class TopicModel():
    def __init__(self, data_path="data.csv"):
        self.data_path = data_path
        self.data = pd.read_csv(self.data_path, low_memory=False)

    def clean_data(self):
        self.remove_message_na()
        self.remove_emojis()
        self.remove_url()
        self.remove_punctuation_and_lower()
        self.remove_empty_messages()

    def remove_message_na(self):
        # drop rows with a missing message
        self.data = self.data.loc[~pd.isna(self.data['message'])]

    def remove_emojis(self):
        # dropping non-ASCII characters also drops emojis
        self.data['message'] = self.data['message'].str.encode("ascii", "ignore").str.decode("utf8")

    def remove_url(self):
        self.data['message'] = [re.sub(r"http\S+", "", ii) for ii in self.data['message'].tolist()]

    def remove_punctuation_and_lower(self):
        p = re.compile('''[!#?,.:";]''')
        self.data['cleaned_data'] = [p.sub("", ii).lower() for ii in self.data['message'].tolist()]

    def remove_empty_messages(self):
        self.data = self.data.loc[self.data['cleaned_data'] != ""]
I don’t want to remove contractions, which is why I left out the apostrophe (') from my punctuation list, though ideally the contractions would be expanded into two separate words. I also wonder about other punctuation marks that carry meaning in social media data, e.g., the # in hashtags. I know this question is a bit general, but I’m wondering if there is a good Python library for performing the kind of data-cleaning operations I want prior to performing topic analysis, sentiment analysis, etc. I’d also like to know which libraries can perform these data-cleaning operations efficiently on a pandas data frame.
I will summarize your questions and then try to answer each one below:
The first go-to is a regular expression, which is used very frequently in data preprocessing. If you want to remove all punctuation from the text, you can use one of these two approaches:
import string
sentence = "hi; how*, @are^ you? wow!!!"
sentence = sentence.translate(sentence.maketrans('', '', string.punctuation))
print(sentence)
output: 'hi how are you wow'
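Since you mentioned wanting to keep contractions, a small variation (just a sketch) is to drop the apostrophe from the punctuation set before translating:
import string
# keep apostrophes so contractions like "don't" survive
punct = string.punctuation.replace("'", "")
sentence = "don't; remove* the @apostrophe!"
print(sentence.translate(str.maketrans('', '', punct)))
output: "don't remove the apostrophe"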
or use a regular expression, for example:
import re
s = "string. With. Punctuation? 3.2"
s = re.sub(r'[^\w\s]', '', s)
print(s)
output: 'string With Punctuation 32'
Generally, NLTK and SciPy are very handy, but for specific purposes other libraries also exist, for example the contractions and inflect libraries.
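For instance, a quick sketch of those two (assuming both packages are installed from PyPI):
import contractions
import inflect

# expand contractions into separate words
print(contractions.fix("you're happy, don't worry"))
output: 'you are happy, do not worry'

# spell numbers out as words
p = inflect.engine()
print(p.number_to_words(42))
output: 'forty-two'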
You can apply any function from any library to your pandas data frame using the apply method. Here is an example:
import pandas as pd

# squeeze=True returns a Series when the CSV has a single column
s = pd.read_csv("stock.csv", squeeze=True)

# adding 5 to each value
new = s.apply(lambda num: num + 5)
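To tie this back to your data, here is a rough sketch of applying a cleaning function to the message column (the toy data frame below is just a stand-in for your data.csv):
import re
import string

import contractions
import pandas as pd

# toy data standing in for data.csv
df = pd.DataFrame({"message": ["I'm so happy!! http://example.com", "Check #this out"]})

# keep apostrophes until contractions are expanded
punct = string.punctuation.replace("'", "")

def clean(text):
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = contractions.fix(text)         # "I'm" -> "I am"
    text = text.translate(str.maketrans('', '', punct))
    return text.lower().strip()

df['cleaned_data'] = df['message'].apply(clean)
print(df['cleaned_data'].tolist())
output: ['i am so happy', 'check this out']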
Answered by Fatemeh Rahimi on February 14, 2021